This article originally appeared on Cloud+ Community (云+社区); author: 深雾.
Preface
Want to know how your favorite idol has been doing in the entertainment world lately? In this article we scrape the popularity-ranking data for the entertainment circle and see how your idol's position has changed: how many times they hit No. 1, and how many times they made the top ten.
I. The original site
Let's first take a look at the original page of the site.
If we copied these data by hand one at a time before analyzing them, it would probably take a whole day to organize every star's per-issue ranking data. It would be mind-numbing work, and manual errors would be hard to avoid.
With a crawler, the whole job takes less than half an hour. Let's see how to pull these data down with Python.
II. Partial screenshots of the scraped data
1 Male-star popularity ranking data
2 Female-star popularity ranking data
III. How to get the crawler parameters from 123fans.cn
Here are the steps for finding the information the code needs:
- step1: Open 123fans.cn in a browser (Firefox and Chrome are the usual choices; I used the 360 browser)
- step2: Press F12, then Ctrl+R
- step3: Click results.php and find the parameters the code needs under Headers
IV. Step-by-step crawler code walkthrough
1 Fetching the page with Python's Requests library
<code><span class="hljs-comment">#爬取当前页信息,并用BeautifulSoup解析成标准格式 <span class="hljs-keyword">import requests <span class="hljs-comment">#导入requests模块 <span class="hljs-keyword">import bs4 url = <span class="hljs-string">"https://123fans.cn/lastresults.php?c=1" headers = {<span class="hljs-string">"User-Agent":<span class="hljs-string">"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36", <span class="hljs-string">"Request Method":<span class="hljs-string">"Get"} req = requests.get(url, timeout=<span class="hljs-number">30, headers=headers) soup = bs4.BeautifulSoup(req.text, <span class="hljs-string">"html.parser")</span></span></span></span></span></span></span></span></span></span></span></code>
Code walkthrough:
url =: the URL of the page to crawl, i.e. the path the request is sent to; here, fill in the Request URL value highlighted in step 3 above.
headers =: the request headers for the page; fill in the values listed after each key under Headers in step 3.
req =: fetch the full page with a GET request.
soup: parse the fetched content into a standard format with BeautifulSoup, for easier processing.
Note 1: Some sites refuse requests that do not carry browser information and return an error if no headers are passed, which is why this example includes a few header fields. For this particular URL, though, I found the request also works without any headers and returns exactly the same result.
Note 2: If you are not familiar with the Requests library, see the earlier article on this official account: 【Python】【爬虫】Requests库详解.
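Before moving on, it is worth checking that the fetch and parse actually succeeded. The sketch below is my own addition, not part of the original post; it also verifies the claim in Note 1, assuming the site is reachable and serves identical static content with or without headers:

```python
# Sanity check: did the request succeed, and did the parser find the expected cells?
print(req.status_code)                              # expect 200
print(len(soup.findAll("td", {"class": "name"})))   # number of name cells found

# Verify Note 1: the same URL also answers without custom headers
req_no_headers = requests.get(url, timeout=30)
print(req_no_headers.status_code, req_no_headers.text == req.text)  # True if identical
```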
2 Assembling the scraped data into one DataFrame
<code><span class="hljs-comment">#把爬取的数据整合到数据框中 <span class="hljs-keyword">import re <span class="hljs-comment">#正则表达式库 <span class="hljs-keyword">import numpy <span class="hljs-keyword">as np <span class="hljs-keyword">import pandas <span class="hljs-keyword">as pd period_data = pd.DataFrame(np.zeros((<span class="hljs-number">400,<span class="hljs-number">5))) <span class="hljs-comment">#构造400行5列的全0矩阵备用 period_data.columns = [<span class="hljs-string">"name", <span class="hljs-string">"popularity_value", <span class="hljs-string">"period_num", <span class="hljs-string">"end_time",<span class="hljs-string">"rank"] <span class="hljs-comment">#给0矩阵列命名 <span class="hljs-comment">#把当期的数据填入表格中 <span class="hljs-comment">#姓名信息 i = <span class="hljs-number">0 name = soup.findAll(<span class="hljs-string">"td", {<span class="hljs-string">"class":<span class="hljs-string">"name"}) <span class="hljs-keyword">for each <span class="hljs-keyword">in name: period_data[<span class="hljs-string">"name"][i]=each.a.text <span class="hljs-comment">#依次加入姓名 i += <span class="hljs-number">1 <span class="hljs-comment">#人气信息 j = <span class="hljs-number">0 popularity = soup.findAll(<span class="hljs-string">"td", {<span class="hljs-string">"class":<span class="hljs-string">"ballot"}) <span class="hljs-keyword">for each <span class="hljs-keyword">in popularity: period_data[<span class="hljs-string">"popularity_value"][j]=float(each.text.replace(<span class="hljs-string">",",<span class="hljs-string">"")) <span class="hljs-comment">#依次加入人气值 j += <span class="hljs-number">1 <span class="hljs-comment">#期数信息 period_num = int(re.findall(<span class="hljs-string">"[0-9]+", str(soup.h2.text))[<span class="hljs-number">0]) period_data[<span class="hljs-string">"period_num"] = period_num <span class="hljs-comment">#截止日期 end_time_0 = str(re.findall(<span class="hljs-string">"结束日期.+[0-9]+", str(soup.findAll(<span class="hljs-string">"div", {<span class="hljs-string">"class":<span class="hljs-string">"results"})))).split(<span class="hljs-string">".") end_time = <span class="hljs-string">"" <span class="hljs-keyword">for str_1 <span class="hljs-keyword">in end_time_0: end_time = end_time + re.findall(<span class="hljs-string">"[0-9]+",str_1)[<span class="hljs-number">0] period_data[<span class="hljs-string">"end_time"] = end_time <span class="hljs-comment">#有序数,方便截取前多少位 period_data_1 = period_data.sort_values(<span class="hljs-keyword">by=<span class="hljs-string">"popularity_value",ascending=False) period_data_1[<span class="hljs-string">"rank"] = range(period_data_1.shape[<span class="hljs-number">0])</span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></code>
Code walkthrough:
period_data: a 400-row, 5-column frame for one issue's ranking data (recent issues store popularity values for the top 341 stars; I used 400 rows as headroom in case older issues contain a few more).
period_data.columns: names the columns.
name: findAll extracts every name cell.
for each in name: the loop stores the names in period_data.
popularity: findAll extracts every popularity-value cell.
for each in popularity: the loop stores the popularity values in period_data.
period_num: extracts the issue number.
end_time: extracts the issue's end date.
period_data_1["rank"]: appends an ordinal rank column, which makes it easy to slice off the top N (see the sketch after this list).
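For instance, once the rank column is in place, pulling the current issue's top ten is a single filter. A minimal sketch using the `period_data_1` frame built above (rank 0 is the No. 1 spot):

```python
# Top 10 of the current issue by popularity value
top10 = period_data_1[period_data_1["rank"] < 10]
print(top10[["rank", "name", "popularity_value"]])
```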
Next, let's look at the batch-crawling code.
V. Batch crawler code walkthrough
1 Defining the crawler function
<code><span class="hljs-keyword">import requests <span class="hljs-comment">#导入requests模块 <span class="hljs-keyword">import bs4 <span class="hljs-keyword">import re <span class="hljs-comment">#正则表达式库 <span class="hljs-keyword">import numpy <span class="hljs-keyword">as np <span class="hljs-keyword">import pandas <span class="hljs-keyword">as pd <span class="hljs-keyword">import warnings <span class="hljs-keyword">import time <span class="hljs-keyword">import random warnings.filterwarnings(<span class="hljs-string">"ignore") <span class="hljs-comment">#忽视ignore <span class="hljs-comment">#headers的内容在Headers里面都可以找到 headers = {<span class="hljs-string">"User-Agent":<span class="hljs-string">"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36", <span class="hljs-string">"Request Method":<span class="hljs-string">"Get"} <span class="hljs-function"><span class="hljs-keyword">def <span class="hljs-title">crawler<span class="hljs-params">(url): req = requests.get(url, timeout=<span class="hljs-number">30, headers=headers) <span class="hljs-comment"># 获取网页信息 soup = bs4.BeautifulSoup(req.text, <span class="hljs-string">"html.parser") <span class="hljs-comment">#用soup库解析 period_data = pd.DataFrame(np.zeros((<span class="hljs-number">400,<span class="hljs-number">5))) <span class="hljs-comment">#构造400行5列的全0矩阵备用 period_data.columns = [<span class="hljs-string">"name", <span class="hljs-string">"popularity_value", <span class="hljs-string">"period_num", <span class="hljs-string">"end_time",<span class="hljs-string">"rank"] <span class="hljs-comment">#给0矩阵列命名 <span class="hljs-comment">#把当期的数据填入表格中 <span class="hljs-comment">#姓名信息 i = <span class="hljs-number">0 name = soup.findAll(<span class="hljs-string">"td", {<span class="hljs-string">"class":<span class="hljs-string">"name"}) <span class="hljs-keyword">for each <span class="hljs-keyword">in name: period_data[<span class="hljs-string">"name"][i]=each.a.text <span class="hljs-comment">#依次加入姓名 i += <span class="hljs-number">1 <span class="hljs-comment">#人气信息 j = <span class="hljs-number">0 popularity = soup.findAll(<span class="hljs-string">"td", {<span class="hljs-string">"class":<span class="hljs-string">"ballot"}) <span class="hljs-keyword">for each <span class="hljs-keyword">in popularity: period_data[<span class="hljs-string">"popularity_value"][j]=float(each.text.replace(<span class="hljs-string">",",<span class="hljs-string">"")) <span class="hljs-comment">#依次加入人气值 j += <span class="hljs-number">1 <span class="hljs-comment">#期数信息 period_num = int(re.findall(<span class="hljs-string">"[0-9]+", str(soup.h2.text))[<span class="hljs-number">0]) period_data[<span class="hljs-string">"period_num"] = period_num <span class="hljs-comment">#截止日期 end_time_0 = str(re.findall(<span class="hljs-string">"结束日期.+[0-9]+", str(soup.findAll(<span class="hljs-string">"div", {<span class="hljs-string">"class":<span class="hljs-string">"results"})))).split(<span class="hljs-string">".") end_time = <span class="hljs-string">"" <span class="hljs-keyword">for str_1 <span class="hljs-keyword">in end_time_0: end_time = end_time + re.findall(<span class="hljs-string">"[0-9]+",str_1)[<span class="hljs-number">0] period_data[<span class="hljs-string">"end_time"] = end_time <span class="hljs-comment">#有序数,方便截取前多少位 period_data_1 = period_data.sort_values(by=<span class="hljs-string">"popularity_value",ascending=<span class="hljs-literal">False) period_data_1[<span class="hljs-string">"rank"] = 
range(period_data_1.shape[<span class="hljs-number">0]) <span class="hljs-keyword">return period_data_1</span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></code>
This section simply wraps the step-by-step code from part IV into a single function, crawler(), so it can be called repeatedly.
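A quick usage example, calling the function on the latest issue's URL from part IV:

```python
# One call fetches, parses, and ranks a single issue
latest = crawler("https://123fans.cn/lastresults.php?c=1")
print(latest.head())  # rows sorted by popularity, highest first
```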
2 Calling the function in a loop for batch crawling
```python
period_data_final = pd.DataFrame(np.zeros((1, 5)))  # 1x5 all-zero placeholder row
period_data_final.columns = ["name", "popularity_value", "period_num", "end_time", "rank"]

for qi in range(538, 499, -1):  # issues 538 down to 500, newest first
    print("Now crawling issue", qi)
    if qi == 538:
        url = "https://123fans.cn/lastresults.php?c=1"  # the latest issue has its own URL
    else:
        url = "https://123fans.cn/results.php?qi={}&c=1".format(qi)
    time.sleep(random.uniform(1, 2))  # short random pause to go easy on the server
    date = crawler(url)
    period_data_final = pd.concat([period_data_final, date])

period_data_final_1 = period_data_final.iloc[1:, :]  # drop the placeholder first row
```
This code calls the crawler function once per issue, pausing briefly between requests, and concatenates each issue's results into one DataFrame with pd.concat.
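With every issue in one DataFrame, the questions from the preface (how many times each star topped the chart, and how many times they made the top ten) reduce to a groupby. The sketch below is my own addition over `period_data_final_1`; it assumes rank 0 is the No. 1 spot and first drops the unfilled zero-padding rows:

```python
# Per-star counts of No. 1 finishes and top-10 appearances across all crawled issues
real_rows = period_data_final_1[period_data_final_1["name"] != 0]  # drop unused zero rows
stats = real_rows.assign(
    is_no1=real_rows["rank"] == 0,
    in_top10=real_rows["rank"] < 10,
).groupby("name")[["is_no1", "in_top10"]].sum()
print(stats.sort_values(by="is_no1", ascending=False).head(10))
```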