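Preface
It has been a while since I last used Python, and I'm not sure why my interest in it keeps fading. Maybe I just like Android more; after all, Android was my first contact with programming. I learned Java for Android, then came across Python, and this time it is Android again that has me using Python to scrape some data.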
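Main text
Target site: https://divnil.com
First, look at how the site loads its data. After opening it, I found a next-page button at the bottom. OK, that makes this site easy to scrape.
Our goal is to get the high-resolution source address of each image and download the images locally. Start by opening any image to look at its detail page. Emmm, there is only one image.
It looks fairly sharp. Click to open the image in a new window.
Then download the image. To be honest, the file is quite small, and I suspect it is not the high-resolution original (never mind).
PS: Be sure to disable any ad-blocking extension, otherwise the images won't load. That is exactly where I got stuck T_T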
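Next, work out where to start
1. Get the link to each image's detail page from the list page
These links are easy to get: press F12 to inspect the element, or right-click and view the page source. In Chrome and Firefox on a phone, prefix the URL with "view-source:".
For example: view-source:https://www.baidu.com/
2. Get the large image address from the detail page
Open any image detail page, as shown in the figure:
Then press F12 to inspect elements. We need to locate the image's link, so first click the element picker in the top-left corner of the developer tools, the icon that looks like a mouse pointer:
Then just click the image on the page to jump to its code: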
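Once the tag is located, it can help to check which attributes it carries before writing the scraper. Below is a minimal sketch of that check. It uses the standard library's html.parser rather than BeautifulSoup so it runs without extra installs. SAMPLE_HTML is a made-up fragment, not the site's real markup; only the names id="main_content", alt and original are taken from the code in this post, and the class name ImgAttrFinder is hypothetical.

```python
from html.parser import HTMLParser


class ImgAttrFinder(HTMLParser):
    """Collect the attributes of the first <img> tag with a given id."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.attrs = None  # filled in once the tag is found

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        attr_map = dict(attrs)
        if (self.attrs is None and tag == "img"
                and attr_map.get("id") == self.target_id):
            self.attrs = attr_map


# Hypothetical fragment, for illustration only (not copied from the site).
SAMPLE_HTML = """
<div id="contents">
  <img id="main_content" alt="sample wallpaper"
       original="../img/sample.jpg" src="placeholder.gif">
</div>
"""

finder = ImgAttrFinder("main_content")
finder.feed(SAMPLE_HTML)
print(finder.attrs)
```

In this made-up fragment the src attribute is only a placeholder and the real address sits in original, which is the attribute the code later in this post reads.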
3. Download the image from the large image address
This part is simple; see the code.
First install the Requests and BeautifulSoup libraries
pip install requests bs4
Import the libraries
import requests
from bs4 import BeautifulSoup
import sys
Request the page source
url = "https://divnil.com/wallpaper/iphone8/%E3%82%A2%E3%83%8B%E3%83%A1%E3%81%AE%E5%A3%81%E7%B4%99_2.html"
headers = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:58.0) Gecko/20100101 Firefox/58.0",
}
resp = requests.get(url, headers=headers)
# Stop early if the list page could not be fetched
if resp.status_code != requests.codes.OK:
    print("Request Error, Code: %d" % resp.status_code)
    sys.exit()
Then parse out the detail-page address of every image
soup = BeautifulSoup(resp.text, "html.parser")
# All wallpaper links live inside the div with id "contents"
contents = soup.find("div", id="contents")
wallpapers = contents.find_all("a", rel="wallpaper")
links = []
for wallpaper in wallpapers:
    links.append(wallpaper["href"])
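The href values collected here are joined to a base address by plain string concatenation in the next step. As an alternative, here is a small sketch that resolves them with urllib.parse.urljoin from the standard library. The helper name absolutize and the sample hrefs are hypothetical; HEAD is the same base address the post uses.

```python
from urllib.parse import urljoin

HEAD = "https://divnil.com/wallpaper/iphone8/"


def absolutize(hrefs, base=HEAD):
    """Resolve each href against base; absolute URLs come back unchanged."""
    return [urljoin(base, href) for href in hrefs]


# Made-up hrefs, for illustration only.
sample = ["example_wallpaper.html", "https://divnil.com/other.html"]
for link in absolutize(sample):
    print(link)
```

For a plain relative file name this gives the same result as head + url. Note that urljoin resolves "../" segments by URL rules, which is not the same as the replace("../", "") used for the image address in this post, so this sketch is only meant for the page links.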
Next, on the detail page, get the link of that seemingly high-resolution image (I'm not sure it really is the original) and download it (/funny)
import os

head = "https://divnil.com/wallpaper/iphone8/"
# Create the output folder on first run
if not os.path.exists("./Divnil"):
    os.mkdir("./Divnil")
for url in links:
    url = head + url
    resp = requests.get(url, headers=headers)
    if resp.status_code != requests.codes.OK:
        print("URL: %s REQUESTS ERROR. CODE: %d" % (url, resp.status_code))
        continue
    soup = BeautifulSoup(resp.text, "html.parser")
    # Locate the main image inside the contents div
    img = soup.find("div", id="contents").find("img", id="main_content")
    # The image address is stored in the "original" attribute as a relative path
    img_url = head + img["original"].replace("../", "")
    img_name = img["alt"]
    print("start download %s ..." % img_url)
    resp = requests.get(img_url, headers=headers)
    if resp.status_code != requests.codes.OK:
        print("IMAGE %s DOWNLOAD FAILED." % img_name)
        continue  # skip writing a file for a failed download
    with open("./Divnil/" + img_name + ".jpg", "wb") as f:
        f.write(resp.content)
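The file name comes straight from the image's alt text, so a "/" or another reserved character in it would make open() fail or write to an unexpected path. Here is one possible way to guard against that, as a sketch only: the helper name safe_filename, the fallback name and the exact set of replaced characters are my own assumptions, not something the site requires.

```python
import re


def safe_filename(name, fallback="wallpaper"):
    """Replace characters that are unsafe in file names with underscores."""
    # Path separators, characters reserved on Windows, and control characters
    cleaned = re.sub(r'[\\/:*?"<>|\x00-\x1f]', "_", name)
    # Leading or trailing spaces and dots also cause trouble on some systems
    cleaned = cleaned.strip(" .")
    return cleaned or fallback


print(safe_filename("anime/wallpaper: 01"))
```

With this helper, the file could be opened as "./Divnil/" + safe_filename(img_name) + ".jpg".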
Done. Here is the full code
import requests
from bs4 import BeautifulSoup
import sys
import os


class Divnil:
    def __init__(self):
        self.url = "https://divnil.com/wallpaper/iphone8/%E3%82%A2%E3%83%8B%E3%83%A1%E3%81%AE%E5%A3%81%E7%B4%99.html"
        self.head = "https://divnil.com/wallpaper/iphone8/"
        self.headers = {
            "User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:58.0) Gecko/20100101 Firefox/58.0",
        }

    def getImageInfoUrl(self):
        # Fetch the list page and collect the detail-page links
        resp = requests.get(self.url, headers=self.headers)
        if resp.status_code != requests.codes.OK:
            print("Request Error, Code: %d" % resp.status_code)
            sys.exit()
        soup = BeautifulSoup(resp.text, "html.parser")
        contents = soup.find("div", id="contents")
        wallpapers = contents.find_all("a", rel="wallpaper")
        self.links = []
        for wallpaper in wallpapers:
            self.links.append(wallpaper["href"])

    def downloadImage(self):
        # Create the output folder on first run
        if not os.path.exists("./Divnil"):
            os.mkdir("./Divnil")
        for url in self.links:
            url = self.head + url
            resp = requests.get(url, headers=self.headers)
            if resp.status_code != requests.codes.OK:
                print("URL: %s REQUESTS ERROR. CODE: %d" % (url, resp.status_code))
                continue
            soup = BeautifulSoup(resp.text, "html.parser")
            img = soup.find("div", id="contents").find("img", id="main_content")
            img_url = self.head + img["original"].replace("../", "")
            img_name = img["alt"]
            print("start download %s ..." % img_url)
            resp = requests.get(img_url, headers=self.headers)
            if resp.status_code != requests.codes.OK:
                print("IMAGE %s DOWNLOAD FAILED." % img_name)
                continue
            # A "/" in the name would be treated as a path separator
            if "/" in img_name:
                img_name = img_name.split("/")[1]
            with open("./Divnil/" + img_name + ".jpg", "wb") as f:
                f.write(resp.content)

    def main(self):
        self.getImageInfoUrl()
        self.downloadImage()


if __name__ == "__main__":
    divnil = Divnil()
    divnil.main()
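The script above fetches a single list page. The two list URLs that appear in this post differ only by a "_2" suffix before ".html", so further pages could perhaps be generated as in the sketch below. That the pattern continues for page 3 and beyond is my assumption and has not been checked against the site; the function name list_page_url is hypothetical.

```python
from urllib.parse import quote

BASE = "https://divnil.com/wallpaper/iphone8/"
# Decoded form of the percent-encoded page name used in this post
CATEGORY = "アニメの壁紙"


def list_page_url(page):
    """Build the list-page URL for a 1-based page number."""
    # Page 1 has no suffix; later pages are assumed to use "_<n>"
    suffix = "" if page == 1 else "_%d" % page
    return BASE + quote(CATEGORY + suffix) + ".html"


for page in range(1, 4):
    print(list_page_url(page))
```

Each generated address could then be assigned to self.url before calling getImageInfoUrl(), stopping when a request no longer returns status 200.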
Author | zckun
Source | Jianshu