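Preface
The following article comes from 数据分析和Python, written by 冈坂日川.
Today's post is a Python crawler that scrapes the Chinese university rankings and saves them to Excel. The code is simple; I wrote it in about half an hour. The overall structure is clear, so you can use it as is, and I hope beginners can pick up some crawling basics from it. I am still learning myself, so if anything is off, corrections are very welcome. Thanks!
Crawling the Chinese University Rankings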
URL : http://m.gaosan.com/gaokao/265440.html
request (urllib.request): fetch the HTML
Beautiful Soup: parse the page
re: match the content with regular expressions
xlwt: create and save the Excel file
```python
from bs4 import BeautifulSoup  # parse the HTML and extract data
import re  # regular expressions for text matching
import urllib.request, urllib.error  # request the URL and fetch the page
import xlwt  # write .xls files


def main():
    baseurl = "http://m.gaosan.com/gaokao/265440.html"
    # 1. Crawl and parse the page
    datalist = getData(baseurl)
    # 2. Save the data
    savepath = "中国大学排名.xls"
    saveData(datalist, savepath)


# Regular expressions: each pattern captures one of the five <td> cells in a table row
paiming = re.compile(r"<td>(.*)</td><td>.*</td><td>.*</td><td>.*</td><td>.*</td>")  # rank
xuexiao = re.compile(r"<td>.*</td><td>(.*)</td><td>.*</td><td>.*</td><td>.*</td>")  # school name
defen = re.compile(r"<td>.*</td><td>.*</td><td>(.*)</td><td>.*</td><td>.*</td>")  # score
xingji = re.compile(r"<td>.*</td><td>.*</td><td>.*</td><td>(.*)</td><td>.*</td>")  # star rating
cengci = re.compile(r"<td>.*</td><td>.*</td><td>.*</td><td>.*</td><td>(.*)</td>")  # tier


# Crawl the page and extract one record per table row
def getData(baseurl):
    datalist = []
    html = askURL(baseurl)  # raw page source
    soup = BeautifulSoup(html, "html.parser")  # soup is the parsed tree
    for item in soup.find_all("tr"):  # every table row on the page
        item = str(item)
        paiming1 = re.findall(paiming, item)  # pattern first, text to search second
        if not paiming1:
            continue  # skip rows that do not contain five <td> cells
        print(paiming1[0])
        data = [  # all fields for one school
            paiming1[0],                    # rank
            re.findall(xuexiao, item)[0],   # school name
            re.findall(defen, item)[0],     # score
            re.findall(xingji, item)[0],    # star rating
            re.findall(cengci, item)[0],    # tier
        ]
        datalist.append(data)  # add the finished record to datalist
    return datalist


# Fetch the content of a single URL
def askURL(url):
    # Send a browser-like User-Agent so the server treats the request as an ordinary browser visit
    head = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.116 Safari/537.36"
    }
    request = urllib.request.Request(url, headers=head)  # request object carrying the headers
    html = ""
    try:
        response = urllib.request.urlopen(request)  # send the prepared request
        html = response.read().decode("utf-8")  # decode the bytes to avoid garbled text
    except urllib.error.URLError as e:
        if hasattr(e, "code"):
            print(e.code)  # HTTP status code
        if hasattr(e, "reason"):
            print(e.reason)  # reason for the failure
    return html


# Save the data to an .xls workbook
def saveData(datalist, savepath):
    book = xlwt.Workbook(encoding="utf-8", style_compression=0)  # create the workbook
    sheet = book.add_sheet("中国大学排名", cell_overwrite_ok=True)  # create a sheet; allow overwriting cells
    for i, data in enumerate(datalist):  # one spreadsheet row per scraped record
        print("Row %d" % (i + 1))
        for j in range(0, 5):  # write the five fields of this record
            sheet.write(i, j, data[j])
    book.save(savepath)  # write the workbook to disk


# Entry point
if __name__ == "__main__":
    main()
    print("Crawl finished!")
```
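The five separate patterns can also be folded into one pattern with five capture groups, so each row is matched once. The following is only a minimal sketch under stated assumptions: the cells hold plain text, the `<td>` tags have no attributes, and there is no whitespace between cells. `ROW_PATTERN`, `parse_rows`, and the sample rows are hypothetical names and invented placeholder data, not part of the original script or the real page.

```python
import re

# One pattern, five capture groups: rank, school, score, stars, tier.
# [^<]* keeps every group inside its own <td> cell, so a match
# can never spill over into a neighbouring cell or row.
ROW_PATTERN = re.compile(
    r"<td>([^<]*)</td><td>([^<]*)</td><td>([^<]*)</td>"
    r"<td>([^<]*)</td><td>([^<]*)</td>"
)

# Invented placeholder rows, NOT real ranking data.
SAMPLE_HTML = (
    "<tr><td>1</td><td>School A</td><td>100</td><td>8</td><td>Tier X</td></tr>"
    "<tr><td>only one cell</td></tr>"
    "<tr><td>2</td><td>School B</td><td>99.5</td><td>7</td><td>Tier Y</td></tr>"
)


def parse_rows(html):
    """Return one [rank, school, score, stars, tier] list per five-cell row."""
    return [list(groups) for groups in ROW_PATTERN.findall(html)]


if __name__ == "__main__":
    for row in parse_rows(SAMPLE_HTML):
        print(row)
```

Rows with fewer than five cells simply do not match, so no separate emptiness check is needed before building a record.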
The result is as follows: a little over 600 rows of data in total.
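Since the row count is only approximate and may differ from run to run, it is probably safer to derive it from the scraped data than to rely on a fixed number. A minimal sketch, assuming each record is a list of five fields; `clean_rows` is a hypothetical helper and the sample rows are invented placeholders, not real ranking data.

```python
def clean_rows(rows, width=5):
    """Keep rows with exactly `width` fields and strip surrounding whitespace."""
    cleaned = []
    for row in rows:
        if len(row) != width:
            continue  # drop empty or malformed rows
        cleaned.append([str(cell).strip() for cell in row])
    return cleaned


if __name__ == "__main__":
    # Invented placeholder rows, not real ranking data.
    scraped = [
        ["1", " School A ", "100", "8", "Tier X"],
        [],
        ["2", "School B", "99.5", "7", "Tier Y"],
    ]
    rows = clean_rows(scraped)
    print("%d rows ready to save" % len(rows))
```

The length of the cleaned list can then drive the save loop, so the spreadsheet always has exactly as many rows as were scraped.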
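Each step is annotated with comments in the code. If anything is unclear, leave a comment; if you see something that could be improved, please point it out. Thanks!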