
Scraping a Chinese University Ranking with Python and Saving It to Excel

python · 搞java代码 · 2022-05-21

Preface

The following article is from 数据分析和Python (Data Analysis and Python), by 冈坂日川.

 

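Today's post is a Python crawler that scrapes a ranking of Chinese universities and saves it to Excel. The code is simple and took me about half an hour to write. Its overall structure is clear, so you can use it as is, and I hope beginners can pick up some crawling basics from it. I am still learning myself, so if you spot anything that could be better, please point it out. Thanks!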

 

Scraping the Chinese university ranking

URL : http://m.gaosan.com/gaokao/265440.html

urllib.request: fetch the HTML
BeautifulSoup: parse the page
re: match the content with regular expressions
xlwt: create and save the Excel file
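The parsing step listed above can also be done without regular expressions. The following is only a minimal sketch, not the article's method: it swaps BeautifulSoup for the standard-library `html.parser.HTMLParser` so that it runs without third-party packages. The class name `TableRowParser` and the sample HTML are hypothetical and are not taken from the target page.

```python
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row being read, None outside <tr>
        self._cell = None  # text pieces of the cell being read, None outside <td>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:  # rows without <td> cells (e.g. a <th> header) are skipped
                self.rows.append(self._row)
            self._row = None


# Hypothetical sample table, for illustration only.
sample_html = """
<table>
  <tr><th>Rank</th><th>School</th><th>Score</th><th>Stars</th><th>Tier</th></tr>
  <tr><td>1</td><td>Example University A</td><td>100</td><td>8</td><td>Tier one</td></tr>
  <tr><td>2</td><td>Example University B</td><td>99.5</td><td>8</td><td>Tier one</td></tr>
</table>
"""

parser = TableRowParser()
parser.feed(sample_html)
rows = parser.rows
print(rows)
```

Reading cells by tag keeps working when a cell contains extra whitespace or line breaks, which is where a single-line regex tends to fail.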


from bs4 import BeautifulSoup  # parse the HTML and extract data
import re  # regular expressions for text matching
import urllib.request  # request the URL and fetch the page
import urllib.error  # errors raised while fetching
import xlwt  # write .xls files

def main():
    baseurl = "http://m.gaosan.com/gaokao/265440.html"
    # 1. Crawl the page
    datalist = getData(baseurl)
    savepath = "中国大学排名.xls"
    saveData(datalist, savepath)

# Regular expressions: one compiled pattern per column of a five-cell table row
paiming = re.compile(r"<td>(.*)</td><td>.*</td><td>.*</td><td>.*</td><td>.*</td>")  # rank
xuexiao = re.compile(r"<td>.*</td><td>(.*)</td><td>.*</td><td>.*</td><td>.*</td>")  # school name
defen = re.compile(r"<td>.*</td><td>.*</td><td>(.*)</td><td>.*</td><td>.*</td>")  # score
xingji = re.compile(r"<td>.*</td><td>.*</td><td>.*</td><td>(.*)</td><td>.*</td>")  # star rating
cengci = re.compile(r"<td>.*</td><td>.*</td><td>.*</td><td>.*</td><td>(.*)</td>")  # tier

# Crawl the page
def getData(baseurl):
    datalist = []
    html = askURL(baseurl)  # raw HTML source of the page
    # Parse the page once and walk its table rows
    soup = BeautifulSoup(html, "html.parser")  # soup is the parsed tree
    for item in soup.find_all("tr"):  # every table row
        item = str(item)
        # Rank; rows that do not match the five-cell pattern are skipped
        paiming1 = re.findall(paiming, item)
        if not paiming1:
            continue
        print(paiming1[0])
        data = [paiming1[0]]  # all fields of one school, starting with the rank
        data.append(re.findall(xuexiao, item)[0])  # school name
        data.append(re.findall(defen, item)[0])  # score
        data.append(re.findall(xingji, item)[0])  # star rating
        data.append(re.findall(cengci, item)[0])  # tier
        datalist.append(data)  # store the finished record
    return datalist


# Fetch the content of a single URL
def askURL(url):
    head = {  # browser-like request headers sent to the server
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.116 Safari/537.36"
    }
    # The User-Agent tells the server what kind of client is making the request
    request = urllib.request.Request(url, headers=head)  # request carrying the headers
    html = ""
    try:
        response = urllib.request.urlopen(request)  # send the prepared request
        html = response.read().decode("utf-8")  # decode the bytes to avoid garbled text
    except urllib.error.URLError as e:
        if hasattr(e, "code"):
            print(e.code)  # HTTP status code
        if hasattr(e, "reason"):
            print(e.reason)  # reason for the failure
    return html


# 3. Save the data
def saveData(datalist, savepath):
    book = xlwt.Workbook(encoding="utf-8", style_compression=0)  # create the workbook
    sheet = book.add_sheet("中国大学排名", cell_overwrite_ok=True)  # create a worksheet; cells may be overwritten
    for i, data in enumerate(datalist):  # one worksheet row per record, however many there are
        print("Record %d" % (i + 1))
        for j, value in enumerate(data):  # write each field of the record
            sheet.write(i, j, value)
    book.save(savepath)  # save the workbook


# Entry point
if __name__ == "__main__":  # runs when the script is executed directly
    main()
    print("Crawl finished!")
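The script above compiles five patterns and scans each row once per field. As a possible simplification, a single pattern with five non-greedy groups can return all fields in one pass. This is only a sketch under stated assumptions: `ROW_PATTERN`, `parse_row`, and both sample rows are hypothetical, and the real page markup may differ.

```python
import re

# One pattern with five non-greedy groups: rank, school, score, stars, tier.
# \s* tolerates whitespace between cells; re.S lets a cell span several lines.
ROW_PATTERN = re.compile(
    r"<td>(.*?)</td>\s*<td>(.*?)</td>\s*<td>(.*?)</td>\s*<td>(.*?)</td>\s*<td>(.*?)</td>",
    re.S,
)


def parse_row(row_html):
    """Return the five cell texts of a <tr> string, or None if it does not match."""
    match = ROW_PATTERN.search(row_html)
    return list(match.groups()) if match else None


# Hypothetical sample rows, for illustration only.
sample_row = "<tr><td>1</td><td>Example University</td><td>100</td><td>8</td><td>Tier one</td></tr>"
header_row = "<tr><th>Rank</th><th>School</th></tr>"

fields = parse_row(sample_row)
skipped = parse_row(header_row)
print(fields, skipped)
```

Returning `None` for non-matching rows gives the caller one explicit check for skipping headers, instead of testing each field separately.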

 


The final result:

 

A little over 600 records in total.
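Because the number of records depends on the page, the count is best taken from the data itself rather than fixed in the code. The sketch below illustrates the idea with the standard-library `csv` module in place of xlwt, so that it runs without third-party packages. `write_rows`, its `width` parameter, and the sample records are hypothetical.

```python
import csv
import io


def write_rows(rows, out, width=5):
    """Write every complete record to `out` as CSV and return how many were written."""
    writer = csv.writer(out)
    written = 0
    for row in rows:
        if len(row) != width:  # skip empty or partial records
            continue
        writer.writerow(row)
        written += 1
    return written


# Hypothetical records, for illustration only.
records = [
    ["1", "Example University A", "100", "8", "Tier one"],
    [],  # e.g. a row that produced no fields
    ["2", "Example University B", "99.5", "8", "Tier one"],
]

buffer = io.StringIO()
count = write_rows(records, buffer)
print(count)
```

The returned count can then be compared with the number of rows expected from the page as a quick sanity check.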

 

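Each step is annotated in the code comments. If anything is unclear, leave a comment, and if you see something that could be improved, please point it out. Thanks!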

 

