The text and images in this article come from the internet and are for learning and exchange only, with no commercial use; if there is any problem, please contact us promptly so we can handle it.
The following article comes from Python干货铺子, by 不正经的kimol君.
I. Simulating the login
I typed the keyword "显卡" (graphics card) into the search bar and briskly tapped the Enter key (heh, watch this).
In high spirits, I waited expectantly for a screen full of product listings. What my patient waiting earned me instead was a 302, and I unexpectedly found myself on the login page:
So that's basically how things stood...
A quick check confirmed it: as Taobao keeps tightening its anti-scraping measures, many of you will have noticed that Taobao search now requires the user to be logged in!
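You can reproduce the behavior with a bare requests call. This is a minimal sketch: the query string simply URL-encodes "显卡", and the exact redirect chain is whatever Taobao happens to serve:

import requests

res = requests.get('https://s.taobao.com/search?q=%E6%98%BE%E5%8D%A1')
print([r.status_code for r in res.history])  # the 302 hop(s), if any
print(res.url)  # without valid cookies, this typically lands on the login page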
As for simulating the Taobao login, some experts have already pulled it off with requests alone. That route, however, means analyzing every request in Taobao's login flow and generating the matching parameters, which is fairly demanding. So I chose a different approach: selenium plus a QR code:
# Imports this part relies on
import time
import base64
import threading
from PIL import Image
from selenium import webdriver

# Open the image
def Openimg(img_location):
    img = Image.open(img_location)
    img.show()

# Log in and fetch the cookies
def Login():
    driver = webdriver.PhantomJS()
    driver.get("https://login.taobao.com/member/login.jhtml")
    try:
        # switch to the QR-code login panel if it isn't active yet
        driver.find_element_by_xpath('//*[@id="login"]/div[1]/i').click()
    except:
        pass
    time.sleep(3)
    # run JS to pull the QR code out of the canvas element
    JS = 'return document.getElementsByTagName("canvas")[0].toDataURL("image/png");'
    im_info = driver.execute_script(JS)      # image data as a data URL
    im_base64 = im_info.split(',')[1]        # base64-encoded image data
    im_bytes = base64.b64decode(im_base64)   # decode to bytes
    time.sleep(2)
    with open('./login.png', 'wb') as f:
        f.write(im_bytes)
    # show the QR code in a separate thread so we can poll for cookies meanwhile
    t = threading.Thread(target=Openimg, args=('./login.png',))
    t.start()
    print('Logging in... please scan the QR code!')
    while True:
        c = driver.get_cookies()
        if len(c) > 20:  # login succeeded, cookies are available
            cookies = {}
            for i in range(len(c)):
                cookies[c[i]['name']] = c[i]['value']
            driver.close()
            print('Logged in successfully!')
            return cookies
        time.sleep(1)
The webdriver opens the Taobao login page, downloads the QR code to a local file, and opens it so the user can scan it (the relevant page elements are easy to locate with the browser's F12 element inspector). Once the scan succeeds, the cookies held by the webdriver are converted into a dict and returned.
These cookies are what requests will use later when scraping the product data.
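As a quick usage sketch (the query URL here is only a placeholder; the real request is assembled in the next section):

cookies = Login()
res = requests.get('https://s.taobao.com/search?q=显卡', cookies=cookies)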
II. Scraping the product information
With the cookies in hand, scraping the product information is a piece of cake. (Here I come~)
1. Define the parameters
Define the request URL, request headers, and so on. Note that the s parameter in list_url is the result offset; Taobao shows 44 items per page, which is why the scraping loop later uses page = i*44:
# Imports used by the scraping step
import re
import time
import requests
import pandas as pd

# Define the parameters
headers = {'Host': 's.taobao.com',
           'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64; rv:63.0) Gecko/20100101 Firefox/63.0',
           'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
           'Accept-Language': 'zh-CN,zh;q=0.8,zh-TW;q=0.7,zh-HK;q=0.5,en-US;q=0.3,en;q=0.2',
           'Accept-Encoding': 'gzip, deflate, br',
           'Connection': 'keep-alive'}
list_url = 'http://s.taobao.com/search?q=%(key)s&ie=utf8&s=%(page)d'
2. Analyze the page and define the regexes
Once the request returns an HTML page, the data we want has to be extracted from it, and here I went with regular expressions. Looking at the page source:
Being lazy, I only marked two fields above, but the others follow the same pattern, which gives the following regexes:
# Regex patterns
p_title = '"raw_title":"(.*?)"'          # title
p_location = '"item_loc":"(.*?)"'        # seller location
p_sale = '"view_sales":"(.*?)人付款"'     # sales volume ("人付款" = "people paid")
p_comment = '"comment_count":"(.*?)"'    # number of comments
p_price = '"view_price":"(.*?)"'         # price
p_nid = '"nid":"(.*?)"'                  # unique product ID
p_img = '"pic_url":"(.*?)"'              # image URL
(P.S. The sharp-eyed among you will have noticed that the product information is actually stored in the g_page_config variable, so another option is to extract that variable first (a dict) and then read the data out of it. That works too!)
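Here is a minimal sketch of that alternative. The regex for the assignment and the dict path down to the item list are assumptions based on how the page looked at the time, so adjust them against the actual page source:

import re
import json

def parse_g_page_config(html):
    # assumed pattern: "g_page_config = {...};" somewhere in the page source
    m = re.search(r'g_page_config = (\{.*?\});', html, re.S)
    if m is None:
        return []
    config = json.loads(m.group(1))
    # assumed path to the item list, mirroring the fields the regexes capture
    return config.get('mods', {}).get('itemlist', {}).get('data', {}).get('auctions', [])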
3. Scrape the data
Everything is in place; all that's missing is the east wind. And here it comes, right on cue:
# Scrape the data
key = input('请输入关键字：')  # the product keyword
N = 20                        # number of pages to scrape
data = []
cookies = Login()
for i in range(N):
    try:
        page = i * 44  # result offset: Taobao shows 44 items per page
        url = list_url % {'key': key, 'page': page}
        res = requests.get(url, headers=headers, cookies=cookies)
        html = res.text
        title = re.findall(p_title, html)
        location = re.findall(p_location, html)
        sale = re.findall(p_sale, html)
        comment = re.findall(p_comment, html)
        price = re.findall(p_price, html)
        nid = re.findall(p_nid, html)
        img = re.findall(p_img, html)
        for j in range(len(title)):
            data.append([title[j], location[j], sale[j], comment[j], price[j], nid[j], img[j]])
        print('-------Page%s complete!--------' % (i + 1))
        time.sleep(3)
    except:
        pass
data = pd.DataFrame(data, columns=['title', 'location', 'sale', 'comment', 'price', 'nid', 'img'])
data.to_csv('%s.csv' % key, encoding='utf-8', index=False)
The code above scrapes 20 pages of product listings and saves them to a local CSV file, which looks like this:
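If you want to spot-check the file before moving on, a quick pandas preview does the job (assuming the keyword was 显卡):

import pandas as pd
print(pd.read_csv('显卡.csv', encoding='utf-8').head())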
III. Simple data analysis
Having the data and just letting it sit there would be a waste, and as an upstanding young citizen I could never allow that. So let's run a quick, simple analysis of it:
(The data set is small, of course, so treat this as entertainment rather than serious analysis.)
1. Import the libraries
# Import the libraries
import jieba
import operator
import pandas as pd
from wordcloud import WordCloud
from matplotlib import pyplot as plt
How to install them (pip handles basically all of it, e.g. pip install jieba pandas wordcloud matplotlib):
- jieba
- pandas
- wordcloud
- matplotlib
2. Chinese text display
# Make matplotlib render Chinese characters
plt.rcParams['font.family'] = ['sans-serif']
plt.rcParams['font.sans-serif'] = ['SimHei']
Without these settings, Chinese labels may come out garbled, which is no fun~
3. Read the data
# Read the data
key = '显卡'  # the keyword used when scraping
data = pd.read_csv('%s.csv' % key, encoding='utf-8', engine='python')
4. Analyze the price distribution
# Price distribution
plt.figure(figsize=(16, 9))
plt.hist(data['price'], bins=20, alpha=0.6)
plt.title('价格频率分布直方图')  # "price frequency histogram"
plt.xlabel('价格')              # "price"
plt.ylabel('频数')              # "frequency"
plt.savefig('价格分布.png')
The price frequency histogram:
5. Analyze the seller location distribution
# Seller location distribution
group_data = list(data.groupby('location'))
loc_num = {}  # location -> number of listings
for i in range(len(group_data)):
    loc_num[group_data[i][0]] = len(group_data[i][1])
plt.figure(figsize=(19, 9))
plt.title('销售地')  # "seller location"
plt.scatter(list(loc_num.keys())[:20], list(loc_num.values())[:20], color='r')
plt.plot(list(loc_num.keys())[:20], list(loc_num.values())[:20])
plt.savefig('销售地.png')
sorted_loc_num = sorted(loc_num.items(), key=operator.itemgetter(1), reverse=True)  # sort by count
loc_num_10 = sorted_loc_num[:10]  # keep the top 10
loc_10 = []
num_10 = []
for i in range(10):
    loc_10.append(loc_num_10[i][0])
    num_10.append(loc_num_10[i][1])
plt.figure(figsize=(16, 9))
plt.title('销售地TOP10')  # "top 10 seller locations"
plt.bar(loc_10, num_10, facecolor='lightskyblue', edgecolor='white')
plt.savefig('销售地TOP10.png')
The location distribution:
The top 10 locations:
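Incidentally, the groupby loop above can be condensed: pandas' value_counts returns the same per-location counts, already sorted in descending order, so the top 10 falls out in one line:

loc_counts = data['location'].value_counts()
print(loc_counts.head(10))  # equivalent to loc_num_10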
6. Word cloud analysis
# Build the word cloud
content = ''
for i in range(len(data)):
    content += data['title'][i]
wl = jieba.cut(content, cut_all=True)  # full-mode segmentation
wl_space_split = ' '.join(wl)
wc = WordCloud('simhei.ttf',              # a font that supports Chinese
               background_color='white',  # background color
               width=1000,
               height=600).generate(wl_space_split)
wc.to_file('%s.png' % key)
The word cloud for Taobao "显卡" listings: