The following article comes from 菜J学Python, written by J哥.
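Preface
Not long ago, friends in Guangzhou and Shenzhen were probably still in short sleeves, envying the snowy scenes up north. Then, just last week, Guangzhou and Shenzhen cooled down as well, and everyone joined the "cold snap group chat".
To help everyone get through the cold, I scraped JD.com's down jacket listings. Why not Tmall? The reason is simple: its slider CAPTCHA is a bit of a hassle.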
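Data Acquisition
JD.com loads its content dynamically through Ajax, so it can only be scraped by parsing the underlying API endpoints or by driving a browser with Selenium, an automated testing tool. Earlier original articles on this public account covered scraping dynamic pages; interested readers can look them up.
This time I used Selenium. Because my Chrome browser updates frequently, the ChromeDriver I had installed stopped working. I therefore turned off the browser's automatic updates and downloaded the driver that matches my browser version.
Next, I used Selenium to search JD.com for down jackets, logged in by scanning the QR code with my phone, and collected each product's name, price, shop name, and comment count.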
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from lxml import etree
import random
import json
import csv
import time

# Selenium 3 style call; newer Selenium releases take the driver path through a Service object
browser = webdriver.Chrome("/菜J学Python/京东/chromedriver")
wait = WebDriverWait(browser, 50)  # explicit-wait timeout, in seconds
url = "https://www.jd.com/"
data_list = []  # global list that stores the scraped rows
keyword = "羽绒服"  # search keyword (down jacket)

def page_click(page_number):
    try:
        # scroll to the bottom of the page
        browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(random.randint(1, 3))  # random delay
        button = wait.until(
            EC.element_to_be_clickable((By.CSS_SELECTOR, "#J_bottomPage > span.p-num > a.pn-next > em"))
        )  # the "next page" button
        button.click()  # click it
        wait.until(
            EC.presence_of_all_elements_located((By.CSS_SELECTOR, "#J_goodsList > ul > li:nth-child(30)"))
        )  # wait until the first 30 products have loaded
        # scroll to the bottom again to load the remaining 30 products
        browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        wait.until(
            EC.presence_of_all_elements_located((By.CSS_SELECTOR, "#J_goodsList > ul > li:nth-child(60)"))
        )  # wait until all 60 products have loaded
        wait.until(
            EC.text_to_be_present_in_element((By.CSS_SELECTOR, "#J_bottomPage > span.p-num > a.curr"), str(page_number))
        )  # the page turn succeeded when the highlighted page button shows the requested number
        html = browser.page_source  # grab the rendered page
        prase_html(html)  # call the data-extraction function (its definition is not part of this excerpt)
    except TimeoutException:  # Selenium waits raise TimeoutException, not the built-in TimeoutError
        return page_click(page_number)  # retry the same page
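The excerpt above does not show how the collected rows end up in the CSV file that is loaded in the next section. A minimal sketch of that step follows, assuming each row is a dict with the keys `title`, `price`, `shop_name`, and `comment` (the column names that get renamed during cleaning). The helper name `save_rows`, the output file name, and the sample row are hypothetical and only illustrate the shape of the data.

```python
import csv
import os
import tempfile

# Assumed column names, taken from the columns renamed in the cleaning step
FIELDNAMES = ["title", "price", "shop_name", "comment"]


def save_rows(rows, path):
    """Write a list of row dicts to a UTF-8 CSV file with a header line."""
    # newline="" lets the csv module control line endings itself
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDNAMES)
        writer.writeheader()
        writer.writerows(rows)


# Illustrative row only, not real scraped data
sample_rows = [
    {"title": "示例羽绒服 加厚 修身", "price": 399.0, "shop_name": "示例旗舰店", "comment": "2万+"},
]

# Hypothetical output location in the system temp directory
sample_path = os.path.join(tempfile.gettempdir(), "down_jackets_sample.csv")
save_rows(sample_rows, sample_path)
```

In the scraper itself, the list passed in would presumably be the global `data_list`, written once after the last page has been processed.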
Data Cleaning
Importing the data
import pandas as pd
import numpy as np

df = pd.read_csv("/菜J学Python/京东/羽绒服.csv")
df.sample(10)
Renaming the columns
# 商品名称 = product name, 商品价格 = product price, 店铺名称 = shop name, 评论人数 = comment count
df = df.rename(columns={"title": "商品名称", "price": "商品价格", "shop_name": "店铺名称", "comment": "评论人数"})
Inspecting the data
df.info()
"""
1. There may be duplicate rows
2. The shop name column has a missing value
3. The comment count column needs cleaning
"""

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4950 entries, 0 to 4949
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   商品名称    4950 non-null   object
 1   商品价格    4950 non-null   float64
 2   店铺名称    4949 non-null   object
 3   评论人数    4950 non-null   object
dtypes: float64(1), object(3)
memory usage: 154.8+ KB
Removing duplicate rows
df = df.drop_duplicates()
Handling missing values
# fill the missing shop name with the placeholder "无名氏" (anonymous)
df["店铺名称"] = df["店铺名称"].fillna("无名氏")
Cleaning the product names
Thickness
tmp = []
for i in df["商品名称"]:
    if "厚" in i:  # title mentions "thick"
        tmp.append("厚款")
    elif "薄" in i:  # title mentions "thin"
        tmp.append("薄款")
    else:
        tmp.append("其他")  # other
df["厚度"] = tmp
Fit
tmp = []  # reset the list; otherwise it still holds the labels from the previous loop
for i in df["商品名称"]:
    if "修身" in i:  # slim fit
        tmp.append("修身型")
    elif "宽松" in i:  # loose fit
        tmp.append("宽松型")
    else:
        tmp.append("其他")  # other
df["版型"] = tmp
Style
tmp = []
for i in df["商品名称"]:
    if "韩" in i:  # Korean style
        tmp.append("韩版")
    elif "商务" in i:  # business
        tmp.append("商务风")
    elif "休闲" in i:  # casual
        tmp.append("休闲风")
    elif "简约" in i:  # minimalist
        tmp.append("简约风")
    else:
        tmp.append("其他")  # other
df["风格"] = tmp
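The three labelling loops in this section share one pattern: the first keyword found in the title decides the label. A minimal sketch of that pattern as a single reusable function is shown below. The function name `tag_by_keyword`, the rule tables, and the sample title are hypothetical; the keywords and labels themselves are the ones used in the loops.

```python
def tag_by_keyword(title, rules, default="其他"):
    """Return the label of the first rule whose keyword occurs in the title."""
    for keyword, label in rules:
        if keyword in title:
            return label
    return default


# (keyword, label) pairs in priority order, copied from the loops in this section
THICKNESS_RULES = [("厚", "厚款"), ("薄", "薄款")]
FIT_RULES = [("修身", "修身型"), ("宽松", "宽松型")]
STYLE_RULES = [("韩", "韩版"), ("商务", "商务风"), ("休闲", "休闲风"), ("简约", "简约风")]

# Illustrative title only, not taken from the scraped data
sample_title = "示例 韩版 宽松 加厚 羽绒服"
sample_labels = (
    tag_by_keyword(sample_title, THICKNESS_RULES),
    tag_by_keyword(sample_title, FIT_RULES),
    tag_by_keyword(sample_title, STYLE_RULES),
)
print(sample_labels)
```

With pandas, the same function could be applied per column, for example `df["厚度"] = df["商品名称"].apply(tag_by_keyword, rules=THICKNESS_RULES)`. Because every call builds its own result, there is no shared `tmp` list to forget to reset.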
Cleaning the product prices
# right=False makes every bin left-closed: [0, 100), [100, 300), ...
df["价格区间"] = pd.cut(
    df["商品价格"],
    [0, 100, 300, 500, 700, 1000, 1000000],
    labels=["100元以下", "100元-300元", "300元-500元", "500元-700元", "700元-1000元", "1000元以上"],
    right=False,
)
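To make the boundary behaviour of `right=False` concrete, here is a small sketch that applies the same left-closed rule in plain Python with the standard `bisect` module. It is not how pandas implements `pd.cut`; the function name `price_band` is hypothetical, while the edges and labels are the ones used above.

```python
from bisect import bisect_right

EDGES = [0, 100, 300, 500, 700, 1000, 1000000]
LABELS = ["100元以下", "100元-300元", "300元-500元", "500元-700元", "700元-1000元", "1000元以上"]


def price_band(price):
    """Map a price to its left-closed band [low, high)."""
    if not EDGES[0] <= price < EDGES[-1]:
        return None  # outside the outer edges, where pd.cut would give NaN
    # bisect_right counts the edges that are <= price, so a price equal to an
    # edge falls into the band that starts at that edge
    return LABELS[bisect_right(EDGES, price) - 1]


for sample_price in (99.9, 100, 699, 1000):
    print(sample_price, price_band(sample_price))
```

The point to notice is that a jacket priced at exactly 100 lands in "100元-300元", not in "100元以下", which matches what the labels suggest.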
Cleaning the comment counts
import re

df["数字"] = [re.findall(r"(\d+\.{0,1}\d*)", i)[0] for i in df["评论人数"]]  # extract the numeric part
df["数字"] = df["数字"].astype("float")  # convert to a numeric type
df["单位"] = ["".join(re.findall(r"(万)", i)) for i in df["评论人数"]]  # extract the unit 万 (ten thousand), if present
df["单位"] = df["单位"].apply(lambda x: 10000 if x == "万" else 1)  # turn the unit into a multiplier
df["评论人数"] = df["数字"] * df["单位"]  # compute the comment count
df["评论人数"] = df["评论人数"].astype("int")
df.drop(["数字", "单位"], axis=1, inplace=True)  # drop the helper columns
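The same conversion can be checked on individual strings before it is applied to the whole column. The sketch below assumes the raw values look like "2万+" or "500+"; the excerpt does not show the raw column, so that format is an assumption, and the function name `parse_comment_count` is hypothetical. It uses `round` instead of integer truncation so that floating-point error cannot push a result one below the intended value.

```python
import re

# one or more digits, optionally followed by a decimal part
_NUMBER = re.compile(r"\d+(?:\.\d+)?")


def parse_comment_count(text):
    """Convert a comment-count string such as "2万+" into an integer."""
    match = _NUMBER.search(text)
    if match is None:
        return 0  # assumed fallback for strings that contain no number
    multiplier = 10000 if "万" in text else 1  # 万 means ten thousand
    return round(float(match.group()) * multiplier)


# Assumed example formats, not values from the scraped data
for raw in ("2万+", "1.5万+", "500+"):
    print(raw, parse_comment_count(raw))
```

Returning 0 for text without digits is a design choice of this sketch; the list comprehension in the article would raise an `IndexError` on such a value instead.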
Cleaning the shop names
# use the last three characters of the shop name as the shop type
df["店铺类型"] = df["店铺名称"].str[-3:]
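Taking the last three characters works when a shop name ends in a type suffix, but any other name yields an arbitrary fragment, including the "无名氏" placeholder filled in earlier. A minimal sketch of a stricter variant follows. The list of suffixes and the sample shop names are assumptions for illustration, and the function name `shop_type` is hypothetical.

```python
# Assumed set of shop-type suffixes; adjust it to what the data actually contains
KNOWN_TYPES = ("旗舰店", "专卖店", "专营店")


def shop_type(shop_name, default="其他"):
    """Return the three-character suffix if it is a known shop type, else a default."""
    suffix = shop_name[-3:]
    return suffix if suffix in KNOWN_TYPES else default


# Illustrative names only, not taken from the scraped data
for name in ("示例官方旗舰店", "示例京东自营专区", "无名氏"):
    print(name, shop_type(name))
```

With pandas this could be applied as `df["店铺类型"] = df["店铺名称"].apply(shop_type)`, which keeps unexpected suffixes from turning into their own categories.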
Visualization
Importing the visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
plt.rcParams["font.sans-serif"] = ["SimHei"]  # font used to render Chinese text
plt.rcParams["axes.unicode_minus"] = False  # keep the minus sign "-" from showing as a box in saved figures
import jieba
import re
from pyecharts.charts import *
from pyecharts import options as opts
from pyecharts.globals import ThemeType
import stylecloud
from IPython.display import Image
Descriptive statistics
Correlation analysis
Histogram of product prices
sns.set_style("white")
fig, axes = plt.subplots(figsize=(15, 8))
# distplot is deprecated in recent seaborn releases; histplot(..., kde=True) is its suggested successor
sns.distplot(df["商品价格"], color="salmon", bins=10)
plt.xticks(fontsize=16)
plt.yticks(fontsize=16)
axes.set_title("商品价格分布直方图")  # title: histogram of product prices
Histogram of comment counts
sns.set_style("white")
fig, axes = plt.subplots(figsize=(15, 8))
# rug=True adds a small tick on the x-axis for every observation
sns.distplot(df["评论人数"], color="green", bins=10, rug=True)
plt.xticks(fontsize=16)
plt.yticks(fontsize=16)
axes.set_title("评论人数分布直方图")  # title: histogram of comment counts
Relationship between comment count and product price
fig, axes = plt.subplots(figsize=(15, 8))
# scatter plot with a fitted linear regression line
sns.regplot(x="评论人数", y="商品价格", data=df, color="orange", marker="*")
plt.xticks(fontsize=16)
plt.yticks(fontsize=16)
Distribution of down jacket prices
df2 = df["价格区间"].astype("str").value_counts()  # number of products per price band
print(df2)
df2 = df2.sort_values(ascending=False)
regions = df2.index.to_list()
values = df2.to_list()
c = (
    Pie(init_opts=opts.InitOpts(theme=ThemeType.DARK))
    .add("", list(zip(regions, values)))
    .set_global_opts(
        legend_opts=opts.LegendOpts(is_show=False),
        # the subtitle credits the data source (京东, JD.com) and the chart author
        title_opts=opts.TitleOpts(title="羽绒服价格区间分布", subtitle="数据来源:京东 制图:菜J学Python", pos_top="0.5%", pos_left="left"),
    )
    .set_series_opts(label_opts=opts.LabelOpts(formatter="{b}:{d}%", font_size=14))
)
c.render_notebook()
Top 10 shops by comment count
df5 = df.groupby("店铺名称")["评论人数"].mean()  # average comment count per shop
df5 = df5.sort_values(ascending=True)
df5 = df5.tail(10)  # the ten shops with the highest averages
print(df5.index.to_list())
print(df5.to_list())
c = (
    Bar(init_opts=opts.InitOpts(theme=ThemeType.DARK, width="1100px", height="600px"))
    .add_xaxis(df5.index.to_list())
    .add_yaxis("", df5.to_list()).reversal_axis()  # swap the x and y axes
    .set_global_opts(
        title_opts=opts.TitleOpts(title="评论人数TOP10", subtitle="数据来源:京东 制图:J哥", pos_left="left"),
        xaxis_opts=opts.AxisOpts(axislabel_opts=opts.LabelOpts(font_size=11)),  # x-axis label font size
        # yaxis_opts=opts.AxisOpts(axislabel_opts=opts.LabelOpts(font_size=12)),
        yaxis_opts=opts.AxisOpts(axislabel_opts={"rotate": 30}),  # rotate the y-axis labels by 30 degrees
    )
    .set_series_opts(label_opts=opts.LabelOpts(font_size=16, position="right"))
)
c.render_notebook()
Fit
df5 = df.groupby("版型")["商品价格"].mean()  # average price per fit
df5 = df5.sort_values(ascending=True)[:2]  # keep the two categories with the lowest average price
# df5 = df5.tail(10)
df5 = df5.round(2)
print(df5.index.to_list())
print(df5.to_list())
c = (
    Bar(init_opts=opts.InitOpts(theme=ThemeType.DARK, width="1000px", height="500px"))
    .add_xaxis(df5.index.to_list())
    .add_yaxis("", df5.to_list()).reversal_axis()  # swap the x and y axes
    .set_global_opts(
        title_opts=opts.TitleOpts(title="各版型羽绒服均价", subtitle="数据来源:京东 制图:J哥", pos_left="left"),
        xaxis_opts=opts.AxisOpts(axislabel_opts=opts.LabelOpts(font_size=11)),  # x-axis label font size
        # yaxis_opts=opts.AxisOpts(axislabel_opts=opts.LabelOpts(font_size=12)),
        yaxis_opts=opts.AxisOpts(axislabel_opts={"rotate": 30}),  # rotate the y-axis labels by 30 degrees
    )
    .set_series_opts(label_opts=opts.LabelOpts(font_size=16, position="right"))
)
c.render_notebook()
Thickness
df5 = df.groupby("厚度")["商品价格"].mean()  # average price per thickness
df5 = df5.sort_values(ascending=True)[:2]  # keep the two categories with the lowest average price
# df5 = df5.tail(10)
df5 = df5.round(2)
print(df5.index.to_list())
print(df5.to_list())
c = (
    Bar(init_opts=opts.InitOpts(theme=ThemeType.DARK, width="1000px", height="500px"))
    .add_xaxis(df5.index.to_list())
    .add_yaxis("", df5.to_list()).reversal_axis()  # swap the x and y axes
    .set_global_opts(
        title_opts=opts.TitleOpts(title="各厚度羽绒服均价", subtitle="数据来源:京东 制图:J哥", pos_left="left"),
        xaxis_opts=opts.AxisOpts(axislabel_opts=opts.LabelOpts(font_size=11)),  # x-axis label font size
        # yaxis_opts=opts.AxisOpts(axislabel_opts=opts.LabelOpts(font_size=12)),
        yaxis_opts=opts.AxisOpts(axislabel_opts={"rotate": 30}),  # rotate the y-axis labels by 30 degrees
    )
    .set_series_opts(label_opts=opts.LabelOpts(font_size=16, position="right"))
)
c.render_notebook()
Style
df5 = df.groupby("风格")["商品价格"].mean()  # average price per style
df5 = df5.sort_values(ascending=True)[:4]  # keep the four categories with the lowest average price
# df5 = df5.tail(10)
df5 = df5.round(2)
print(df5.index.to_list())
print(df5.to_list())
c = (
    Bar(init_opts=opts.InitOpts(theme=ThemeType.DARK, width="1000px", height="500px"))
    .add_xaxis(df5.index.to_list())
    .add_yaxis("", df5.to_list()).reversal_axis()  # swap the x and y axes
    .set_global_opts(
        title_opts=opts.TitleOpts(title="各风格羽绒服均价", subtitle="数据来源:京东 制图:J哥", pos_left="left"),
        xaxis_opts=opts.AxisOpts(axislabel_opts=opts.LabelOpts(font_size=11)),  # x-axis label font size
        # yaxis_opts=opts.AxisOpts(axislabel_opts=opts.LabelOpts(font_size=12)),
        yaxis_opts=opts.AxisOpts(axislabel_opts={"rotate": 30}),  # rotate the y-axis labels by 30 degrees
    )
    .set_series_opts(label_opts=opts.LabelOpts(font_size=16, position="right"))
)
c.render_notebook()