
Scraping JD Down Jacket Data with Python and Using Visualization to Help You Pick the Right One

python · 搞java代码 · 2022-05-21

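The text and images in this article come from the internet and are for study and exchange only, with no commercial use. If there is any problem, please contact us promptly so we can deal with it.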

The following article comes from 菜J学Python, written by J哥.

Newcomers to Python can copy the link below to watch free introductory Python video tutorials:

https://v.douyu.com/author/y6AZ4jn9jwKW

 

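Introduction

Not long ago, friends in Guangzhou and Shenzhen were probably still wearing short sleeves and envying the snowy scenes up north. Then, just last week, Guangzhou and Shenzhen cooled down too, and everyone joined the "cold snap group chat".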

 

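To help everyone fend off the cold, I scraped JD's down jacket listings. Why not Tmall? The reason is simple: its slider CAPTCHA is a bit of a hassle.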

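Data Acquisition

JD loads its pages dynamically via AJAX, so the data can only be scraped either by working out the underlying API endpoints or by driving a real browser with the Selenium automation tool. Earlier original posts on this official account have covered scraping dynamic pages; interested readers can look them up.

This time I used Selenium. Because my Chrome browser updates frequently, the ChromeDriver I already had stopped working. So I disabled automatic browser updates and downloaded the driver that matches my Chrome version.

Next, I used Selenium to search JD for down jackets, logged in by scanning the QR code with my phone, and collected each product's name, price, shop name, and number of comments.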

from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from lxml import etree
import random
import json
import csv
import time

# Selenium 4 takes the driver path through a Service object
browser = webdriver.Chrome(service=Service("/菜J学Python/京东/chromedriver"))
wait = WebDriverWait(browser, 50)  # maximum wait time, in seconds
url = "https://www.jd.com/"
data_list = []  # global list that stores the scraped records
keyword = "羽绒服"  # search keyword (down jacket)

def page_click(page_number):
    try:
        # scroll to the bottom of the page
        browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(random.randint(1, 3))  # random delay
        button = wait.until(
            EC.element_to_be_clickable((By.CSS_SELECTOR, "#J_bottomPage > span.p-num > a.pn-next > em"))
        )  # the "next page" button
        button.click()  # click it
        wait.until(
            EC.presence_of_all_elements_located((By.CSS_SELECTOR, "#J_goodsList > ul > li:nth-child(30)"))
        )  # wait until the first 30 products have loaded
        # scroll to the bottom again to load the remaining 30 products
        browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        wait.until(
            EC.presence_of_all_elements_located((By.CSS_SELECTOR, "#J_goodsList > ul > li:nth-child(60)"))
        )  # wait until all 60 products have loaded
        wait.until(
            EC.text_to_be_present_in_element((By.CSS_SELECTOR, "#J_bottomPage > span.p-num > a.curr"), str(page_number))
        )  # the page turn succeeded if the highlighted page number equals page_number
        html = browser.page_source  # grab the page HTML
        prase_html(html)  # call the data-extraction function (defined elsewhere in the script, not shown here)
    except TimeoutException:
        # WebDriverWait raises Selenium's TimeoutException, not the built-in TimeoutError
        return page_click(page_number)
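Note that page_click retries itself recursively on every timeout, so a page that never loads would recurse until Python's recursion limit is hit. As a minimal sketch of one alternative (the helper name retry and its parameters are my own, not part of the original script), a bounded retry loop could look like this:

```python
import time


def retry(action, attempts=3, delay=0.0, retry_on=(Exception,)):
    """Call action() until it succeeds or the attempts run out.

    Hypothetical helper: re-raises the last error after the final attempt.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except retry_on as error:
            last_error = error
            if attempt < attempts:
                time.sleep(delay)  # pause before the next try
    raise last_error
```

Under that assumption, the except branch in page_click would simply re-raise, and the caller would write something like retry(lambda: page_click(2), attempts=3, delay=2, retry_on=(TimeoutException,)).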

 

Data Cleaning

Loading the data

import pandas as pd
import numpy as np
df = pd.read_csv("/菜J学Python/京东/羽绒服.csv")
df.sample(10)

 

 

Renaming columns

# 商品名称 = product name, 商品价格 = product price, 店铺名称 = shop name, 评论人数 = comment count
df = df.rename(columns={"title": "商品名称", "price": "商品价格", "shop_name": "店铺名称", "comment": "评论人数"})

 

Inspecting the data

df.info()
"""
1. There may be duplicate rows
2. The shop name column has a missing value
3. The comment count column needs cleaning
"""
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4950 entries, 0 to 4949
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   商品名称    4950 non-null   object 
 1   商品价格    4950 non-null   float64
 2   店铺名称    4949 non-null   object 
 3   评论人数    4950 non-null   object 
dtypes: float64(1), object(3)
memory usage: 154.8+ KB

 

Removing duplicate rows

df = df.drop_duplicates()

 

Handling missing values

df["店铺名称"] = df["店铺名称"].fillna("无名氏")  # fill missing shop names with a placeholder ("anonymous")

 

Cleaning product names

Thickness

tmp = []
for i in df["商品名称"]:
    if "厚" in i:  # "thick"
        tmp.append("厚款")
    elif "薄" in i:  # "thin"
        tmp.append("薄款")
    else:
        tmp.append("其他")  # "other"
df["厚度"] = tmp

 

Fit

tmp = []  # reset the list; otherwise it still holds the thickness labels and the column assignment fails
for i in df["商品名称"]:
    if "修身" in i:  # "slim"
        tmp.append("修身型")
    elif "宽松" in i:  # "loose"
        tmp.append("宽松型")
    else:
        tmp.append("其他")  # "other"
df["版型"] = tmp

 

Style

tmp = []
for i in df["商品名称"]:
    if "韩" in i:  # "Korean"
        tmp.append("韩版")
    elif "商务" in i:  # "business"
        tmp.append("商务风")
    elif "休闲" in i:  # "casual"
        tmp.append("休闲风")
    elif "简约" in i:  # "minimalist"
        tmp.append("简约风")
    else:
        tmp.append("其他")  # "other"
df["风格"] = tmp
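The three loops above all follow the same pattern: scan the product name for keywords in order and take the label of the first hit. A minimal sketch of that pattern as one reusable function follows; the function name tag_by_keywords and the rule-list constants are my own, while the keywords and labels are copied from the loops above.

```python
def tag_by_keywords(title, rules, default="其他"):
    """Return the label of the first (keyword, label) rule whose keyword occurs in title."""
    for keyword, label in rules:
        if keyword in title:
            return label
    return default


# Rule order matters: earlier rules win, exactly like the if/elif chains.
THICKNESS_RULES = [("厚", "厚款"), ("薄", "薄款")]
FIT_RULES = [("修身", "修身型"), ("宽松", "宽松型")]
STYLE_RULES = [("韩", "韩版"), ("商务", "商务风"), ("休闲", "休闲风"), ("简约", "简约风")]
```

With pandas this could be applied as df["厚度"] = df["商品名称"].apply(tag_by_keywords, rules=THICKNESS_RULES), which also avoids sharing a tmp list between steps.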

 

Cleaning product prices

# right=False makes every bin left-closed: [0, 100), [100, 300), ...
df["价格区间"] = pd.cut(df["商品价格"], [0, 100, 300, 500, 700, 1000, 1000000], labels=["100元以下", "100元-300元", "300元-500元", "500元-700元", "700元-1000元", "1000元以上"], right=False)
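To make the effect of right=False concrete, here is a small standard-library sketch of the same left-closed binning. The bin edges are copied from the pd.cut call; the function name price_bucket and the English labels are illustrative assumptions of mine.

```python
from bisect import bisect_right

# Bin edges copied from the pd.cut call; the English labels are illustrative.
EDGES = [0, 100, 300, 500, 700, 1000, 1000000]
LABELS = ["under 100", "100-300", "300-500", "500-700", "700-1000", "1000 and up"]


def price_bucket(price):
    """Return the label of the left-closed bin [EDGES[i], EDGES[i + 1]) that contains price."""
    index = bisect_right(EDGES, price) - 1
    if 0 <= index < len(LABELS):
        return LABELS[index]
    return None  # outside every bin, where pd.cut would produce NaN
```

A price of exactly 100 therefore lands in the 100-300 bin rather than the bin below it, which is what the label "100元以下" (under 100 yuan) implies.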

 

Cleaning comment counts

import re
df["数字"] = [re.findall(r"(\d+\.{0,1}\d*)", i)[0] for i in df["评论人数"]]  # extract the numeric part
df["数字"] = df["数字"].astype("float")  # convert to a numeric type
df["单位"] = ["".join(re.findall(r"(万)", i)) for i in df["评论人数"]]  # extract the unit (万 = 10,000)
df["单位"] = df["单位"].apply(lambda x: 10000 if x == "万" else 1)
df["评论人数"] = df["数字"] * df["单位"]  # compute the comment count
df["评论人数"] = df["评论人数"].astype("int")
df.drop(["数字", "单位"], axis=1, inplace=True)
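The same conversion can be tried on individual strings without a DataFrame. The sketch below assumes the raw values look like "2万+" or "500+" (the exact raw format is not shown in the article), and the function name parse_comment_count is my own. Unlike the list comprehension above, it returns None instead of raising when no digits are found.

```python
import re

# A number with an optional decimal part, optionally followed by the unit 万 (10,000).
_COUNT_PATTERN = re.compile(r"(\d+(?:\.\d+)?)(万?)")


def parse_comment_count(text):
    """Convert a string such as '2万+' or '500+' to an integer count, or None if no number is found."""
    match = _COUNT_PATTERN.search(text)
    if match is None:
        return None
    number, unit = match.groups()
    multiplier = 10000 if unit == "万" else 1
    return int(round(float(number) * multiplier))
```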

 

Cleaning shop names

df["店铺类型"] = df["店铺名称"].str[-3:]  # shop type = last three characters of the shop name
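Taking the last three characters works when a shop name ends in a type word, but any other name yields an arbitrary fragment; the placeholder "无名氏" filled in earlier, for example, becomes its own shop type. A minimal sketch of a stricter variant follows. The suffix list is an assumption of mine, not something taken from the data, and the function name shop_type is hypothetical.

```python
# Assumed list of shop-type suffixes (flagship store, exclusive store, franchise store).
KNOWN_SHOP_TYPES = ("旗舰店", "专卖店", "专营店")


def shop_type(shop_name, known_types=KNOWN_SHOP_TYPES, default="其他"):
    """Return the known suffix that shop_name ends with, or default when none matches."""
    for suffix in known_types:
        if shop_name.endswith(suffix):
            return suffix
    return default
```

With pandas it could be applied as df["店铺类型"] = df["店铺名称"].apply(shop_type).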

 

Visualization

Importing the visualization libraries

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
plt.rcParams["font.sans-serif"] = ["SimHei"]  # font used to render Chinese text
plt.rcParams["axes.unicode_minus"] = False  # keep the minus sign "-" from rendering as a box in saved figures
import jieba
import re
from pyecharts.charts import *
from pyecharts import options as opts
from pyecharts.globals import ThemeType
import stylecloud
from IPython.display import Image

 

Descriptive statistics

Correlation analysis

Histogram of product prices

sns.set_style("white")
fig, axes = plt.subplots(figsize=(15, 8))
# distplot is deprecated since seaborn 0.11; histplot(..., kde=True) is the current equivalent
sns.distplot(df["商品价格"], color="salmon", bins=10)
plt.xticks(fontsize=16)
plt.yticks(fontsize=16)
axes.set_title("商品价格分布直方图")  # "Histogram of product prices"

 

 

 

Histogram of comment counts

sns.set_style("white")
fig, axes = plt.subplots(figsize=(15, 8))
sns.distplot(df["评论人数"], color="green", bins=10, rug=True)
plt.xticks(fontsize=16)
plt.yticks(fontsize=16)
axes.set_title("评论人数分布直方图")  # "Histogram of comment counts"

 

 

 

Relationship between comment count and product price

fig, axes = plt.subplots(figsize=(15, 8))
sns.regplot(x="评论人数", y="商品价格", data=df, color="orange", marker="*")  # scatter plot with a fitted regression line
plt.xticks(fontsize=16)
plt.yticks(fontsize=16)
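The regression plot shows the trend visually. To put a number on how strongly the two columns move together, one option is the Pearson correlation coefficient. The article does not compute it, so the sketch below is an addition of mine, written with the standard library only; the function name pearson_r is hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length with at least 2 items")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return covariance / math.sqrt(var_x * var_y)
```

On the DataFrame itself, df["评论人数"].corr(df["商品价格"]) computes the same quantity, since pandas defaults to the Pearson method.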

 

 

 

Down jacket price distribution

df2 = df["价格区间"].astype("str").value_counts()
print(df2)
df2 = df2.sort_values(ascending=False)
regions = df2.index.to_list()
values = df2.to_list()
c = (
    Pie(init_opts=opts.InitOpts(theme=ThemeType.DARK))
    .add("", list(zip(regions, values)))
    # subtitle: "Data source: JD / Chart by: 菜J学Python"
    .set_global_opts(legend_opts=opts.LegendOpts(is_show=False), title_opts=opts.TitleOpts(title="羽绒服价格区间分布", subtitle="数据来源:京东\n制图:菜J学Python", pos_top="0.5%", pos_left="left"))
    .set_series_opts(label_opts=opts.LabelOpts(formatter="{b}:{d}%", font_size=14))
)
c.render_notebook()

 

 

 

Top 10 shops by comment count

df5 = df.groupby("店铺名称")["评论人数"].mean()
df5 = df5.sort_values(ascending=True)
df5 = df5.tail(10)  # the 10 shops with the highest mean comment count
print(df5.index.to_list())
print(df5.to_list())
c = (
    Bar(init_opts=opts.InitOpts(theme=ThemeType.DARK, width="1100px", height="600px"))
    .add_xaxis(df5.index.to_list())
    .add_yaxis("", df5.to_list()).reversal_axis()  # swap the x and y axes
    .set_global_opts(title_opts=opts.TitleOpts(title="评论人数TOP10", subtitle="数据来源:京东\t制图:J哥", pos_left="left"),
                     xaxis_opts=opts.AxisOpts(axislabel_opts=opts.LabelOpts(font_size=11)),  # x-axis label font size
                     # yaxis_opts=opts.AxisOpts(axislabel_opts=opts.LabelOpts(font_size=12)),
                     yaxis_opts=opts.AxisOpts(axislabel_opts={"rotate": 30})  # rotate the y-axis labels by 30 degrees
                     )
    .set_series_opts(label_opts=opts.LabelOpts(font_size=16, position="right"))
)
c.render_notebook()

 

 

 

Average price by fit

df5 = df.groupby("版型")["商品价格"].mean()
df5 = df5.sort_values(ascending=True)[:2]
# df5 = df5.tail(10)
df5 = df5.round(2)
print(df5.index.to_list())
print(df5.to_list())
c = (
    Bar(init_opts=opts.InitOpts(theme=ThemeType.DARK, width="1000px", height="500px"))
    .add_xaxis(df5.index.to_list())
    .add_yaxis("", df5.to_list()).reversal_axis()  # swap the x and y axes
    .set_global_opts(title_opts=opts.TitleOpts(title="各版型羽绒服均价", subtitle="数据来源:京东\t制图:J哥", pos_left="left"),
                     xaxis_opts=opts.AxisOpts(axislabel_opts=opts.LabelOpts(font_size=11)),  # x-axis label font size
                     # yaxis_opts=opts.AxisOpts(axislabel_opts=opts.LabelOpts(font_size=12)),
                     yaxis_opts=opts.AxisOpts(axislabel_opts={"rotate": 30})  # rotate the y-axis labels by 30 degrees
                     )
    .set_series_opts(label_opts=opts.LabelOpts(font_size=16, position="right"))
)
c.render_notebook()

 

 

 

Average price by thickness

df5 = df.groupby("厚度")["商品价格"].mean()
df5 = df5.sort_values(ascending=True)[:2]
# df5 = df5.tail(10)
df5 = df5.round(2)
print(df5.index.to_list())
print(df5.to_list())
c = (
    Bar(init_opts=opts.InitOpts(theme=ThemeType.DARK, width="1000px", height="500px"))
    .add_xaxis(df5.index.to_list())
    .add_yaxis("", df5.to_list()).reversal_axis()  # swap the x and y axes
    .set_global_opts(title_opts=opts.TitleOpts(title="各厚度羽绒服均价", subtitle="数据来源:京东\t制图:J哥", pos_left="left"),
                     xaxis_opts=opts.AxisOpts(axislabel_opts=opts.LabelOpts(font_size=11)),  # x-axis label font size
                     # yaxis_opts=opts.AxisOpts(axislabel_opts=opts.LabelOpts(font_size=12)),
                     yaxis_opts=opts.AxisOpts(axislabel_opts={"rotate": 30})  # rotate the y-axis labels by 30 degrees
                     )
    .set_series_opts(label_opts=opts.LabelOpts(font_size=16, position="right"))
)
c.render_notebook()

 

 

 

Average price by style

df5 = df.groupby("风格")["商品价格"].mean()
df5 = df5.sort_values(ascending=True)[:4]
# df5 = df5.tail(10)
df5 = df5.round(2)
print(df5.index.to_list())
print(df5.to_list())
c = (
    Bar(init_opts=opts.InitOpts(theme=ThemeType.DARK, width="1000px", height="500px"))
    .add_xaxis(df5.index.to_list())
    .add_yaxis("", df5.to_list()).reversal_axis()  # swap the x and y axes
    .set_global_opts(title_opts=opts.TitleOpts(title="各风格羽绒服均价", subtitle="数据来源:京东\t制图:J哥", pos_left="left"),
                     xaxis_opts=opts.AxisOpts(axislabel_opts=opts.LabelOpts(font_size=11)),  # x-axis label font size
                     # yaxis_opts=opts.AxisOpts(axislabel_opts=opts.LabelOpts(font_size=12)),
                     yaxis_opts=opts.AxisOpts(axislabel_opts={"rotate": 30})  # rotate the y-axis labels by 30 degrees
                     )
    .set_series_opts(label_opts=opts.LabelOpts(font_size=16, position="right"))
)
c.render_notebook()

 

 

 

Down jacket word cloud

