The text and images in this article come from the internet and are provided for learning and exchange only, not for any commercial use; copyright belongs to the original authors. If there is any problem, please contact us promptly so it can be handled.
This article originally appeared on 凌晨安全 (Lingchen Security); author: MoLing.
1. A crawler simply imitates a browser to fetch content. Crawling has three steps: data fetching, data parsing, and data storage.
Data fetching: from mobile or PC pages
Data parsing: with regular expressions
Data storage: to files or to a database
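A minimal sketch of the three steps in one place (the URL, regex, and output file here are placeholders, not from the original article):

import requests
import re

# 1. fetch: download the page (placeholder URL)
html = requests.get("https://example.com/list").text

# 2. parse: pull out the pieces we want with a regular expression (placeholder pattern)
titles = re.findall(r'<a.*?title="(.*?)"', html)

# 3. store: write the results to a local file
with open("titles.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(titles))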
2. Related Python libraries
A crawler needs two library modules: requests and re.
1. The requests library
requests is a simple, easy-to-use HTTP library, much more concise than urllib. Because it is a third-party library it has to be installed; an installation tutorial link is attached at the end of the article (all links are gathered at the end, which should make them easier to find).
HTTP features supported by the requests library:
keep-alive and connection pooling, cookie persistence across sessions, multipart file uploads, chunked requests, and more.
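For example, a requests.Session() reuses the underlying connection (keep-alive and connection pooling) and carries cookies across requests. A small sketch with placeholder URLs:

import requests

s = requests.Session()                      # one session = one connection pool + shared cookies
s.get("https://example.com/login")          # any cookies set by this response are remembered
r = s.get("https://example.com/profile")    # sent together with the cookies from the previous request
print(r.status_code)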
The requests library exposes many methods, but under the hood they are all implemented through the request() method, so strictly speaking the library only has request(); in practice, however, request() is rarely called directly. The seven main methods of the library are introduced below:
① requests.request()
Constructs a request; it is the base method that underpins all of the methods below.
Form: requests.request(method, url, **kwargs)  (a short sketch follows the parameter list)
method: the request method, corresponding to GET, POST, PUT, etc.
url: the URL of the page to fetch
**kwargs: optional parameters that control the request
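A quick sketch of calling request() directly; the URL and the keyword arguments shown are illustrative only:

import requests

# equivalent to requests.get("https://example.com/search", params={"q": "python"}, timeout=5)
res = requests.request("GET", "https://example.com/search",
                       params={"q": "python"},   # **kwargs: query-string parameters
                       timeout=5)                # **kwargs: timeout in seconds
print(res.url)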
② requests.get()
The main method for fetching an HTML page, corresponding to HTTP GET. It constructs a Request object that asks the server for a resource and returns a Response object containing that resource.
Attributes of the Response object:
Form: res = requests.get(url)
code = res.text  (res.text is the body as text; res.content is the binary body; res.json() parses a JSON body)
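The common Response attributes side by side, assuming res comes from a successful GET against a placeholder URL:

import requests

res = requests.get("https://example.com")   # placeholder URL
print(res.status_code)    # HTTP status code, e.g. 200
print(res.encoding)       # encoding guessed from the response headers
print(res.text[:100])     # body decoded as text
print(res.content[:20])   # body as raw bytes (binary)
# print(res.json())       # parse the body as JSON (only for JSON responses)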
③ requests.head()
Fetches only the headers of an HTML page, corresponding to HTTP HEAD.
Form: res = requests.head(url)
④ requests.post()
Submits a POST request to a page, corresponding to HTTP POST.
Form: res = requests.post(url)
⑤ requests.put()
Submits a PUT request to a page, corresponding to HTTP PUT.
⑥ requests.patch()
Submits a partial-modification request to a page, corresponding to HTTP PATCH.
⑦ requests.delete()
Submits a delete request to a page, corresponding to HTTP DELETE. (A combined sketch of methods ③-⑦ follows below.)
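A hedged sketch of these remaining methods against a public test endpoint; httpbin.org is used only as an example and is not part of the original article:

import requests

base = "https://httpbin.org"
print(requests.head(base + "/get").headers["Content-Type"])     # HEAD: headers only, no body
print(requests.post(base + "/post", data={"k": "v"}).status_code)
print(requests.put(base + "/put", data={"k": "v"}).status_code)
print(requests.patch(base + "/patch", data={"k": "v2"}).status_code)
print(requests.delete(base + "/delete").status_code)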
# requests practice
import requests
import re

# fetch the data
h = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36"
}
response = requests.get("https://movie.douban.com/chart", headers=h)
html_str = response.text

# parse the data; the target HTML looks like:
# <a href="https://movie.XX.com/subject/34961898/" title="汉密尔顿">
pattern = re.compile('<a.*?title="(.*?)">')   # .*? matches any characters, as few as possible (non-greedy)
result = re.findall(pattern, html_str)
print(result)
2. re regular expressions (Regular Expression)
A regular expression is a special string built from letters and symbols; its job is to find the pieces of text that match the format you want.
About .*? : the trailing ? makes .* non-greedy, so it matches any characters but as few of them as possible.
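A small demonstration of the difference between the greedy .* and the non-greedy .*? :

import re

html = '<a title="first"> ... <a title="second">'
print(re.findall('title="(.*)"', html))    # greedy: ['first"> ... <a title="second']
print(re.findall('title="(.*?)"', html))   # non-greedy: ['first', 'second']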
3. Parsing the source with XPath
import requests
import re
from bs4 import BeautifulSoup
from lxml import etree

# fetch the data (with some HTTP header information)
h = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36"
}
response = requests.get("https://movie.XX.com/chart", headers=h)
html_str = response.text

# parse the data
# regular-expression parsing
def re_parse(html_str):
    pattern = re.compile('<a.*?title="(.*?)"')
    results = re.findall(pattern, html_str)
    print(results)
    return results

# bs4 parsing
def bs4_parse(html_str):
    soup = BeautifulSoup(html_str, "lxml")
    items = soup.find_all(class_="nbg")
    for item in items:
        print(item.attrs["title"])

# lxml / XPath parsing
def lxml_parse(html_str):
    html = etree.HTML(html_str)
    results = html.xpath('//a[@class="nbg"]/@title')
    print(results)
    return results

re_parse(html_str)
bs4_parse(html_str)
lxml_parse(html_str)
4. The architecture of a Python crawler
As the figure shows, a basic crawler architecture consists of five parts: the crawler scheduler, the URL manager, the HTML downloader, the HTML parser, and the data store.
The function of each of the five parts is introduced below (a minimal skeleton sketch follows the list):
① Crawler scheduler: coordinates the other four modules; "scheduling" here simply means calling the other modules.
② URL manager: manages URL links, which are split into already-crawled and not-yet-crawled sets, and provides an interface for obtaining new URLs.
③ HTML downloader: downloads the HTML of the pages to be crawled.
④ HTML parser: extracts the wanted data from the HTML source, sends newly found URLs to the URL manager, and passes the processed data to the data store.
⑤ Data store: saves the data handed over by the HTML parser to local storage.
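A minimal skeleton of how the five parts might fit together; the class and method names are illustrative assumptions, not from the original article:

import requests

class UrlManager:
    """Keeps not-yet-crawled and already-crawled URL sets."""
    def __init__(self):
        self.new_urls, self.old_urls = set(), set()
    def add(self, url):
        if url and url not in self.old_urls:
            self.new_urls.add(url)
    def has_new(self):
        return bool(self.new_urls)
    def get(self):
        url = self.new_urls.pop()
        self.old_urls.add(url)
        return url

class HtmlDownloader:
    def download(self, url):
        return requests.get(url).text

class HtmlParser:
    def parse(self, html):
        # return (new_urls, data) extracted from the page; trivial placeholder logic
        return set(), {"html_len": len(html)}

class DataStore:
    def save(self, data):
        with open("out.txt", "a", encoding="utf-8") as f:
            f.write(str(data) + "\n")

class Scheduler:
    """Drives the other four modules."""
    def run(self, seed_url):
        urls, downloader = UrlManager(), HtmlDownloader()
        parser, store = HtmlParser(), DataStore()
        urls.add(seed_url)
        while urls.has_new():
            html = downloader.download(urls.get())
            new_urls, data = parser.parse(html)
            for u in new_urls:
                urls.add(u)
            store.save(data)

# usage example: Scheduler().run("https://example.com")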
0x01 Crawling whois data
Every year millions of individuals, companies, organizations, and government agencies register domain names. Every registrant must provide identifying and contact information: name, address, email, phone number, and administrative and technical contacts. This information is commonly known as whois data.
import requests
import re

h = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36"
}
response = requests.get("http://whois.chinaz.com/" + input("Enter a domain: "), headers=h)
print(response.status_code)
html = response.text
# print(html)

# parse the data
pattern = re.compile('class="MoreInfo".*?>(.*?)</p>', re.S)
result = re.findall(pattern, html)
# method 1:
# s = re.sub("\n", ",", result[0])
# print(s)
# method 2:
print(result[0].replace("\n", ","))
0x02 Crawling movie information
import requests
import re
import time

# count = [0,10,20,30,40,50,60,70,80,90]
h = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36"
}
response = requests.get("https://XX.com/board/4?offset=0", headers=h)
response.encoding = "utf-8"
html = response.text

# parse the data (主演 = starring, 上映时间 = release date: literal strings on the target page)
# time.sleep(2)
patter = re.compile('class="name">.*?title="(.*?)".*?主演:(.*?)</p>.*?上映时间:(.*?)</p>', re.S)
result = re.findall(patter, html)
print(result)

with open("maoyan.txt", "a", encoding="utf-8") as f:
    for item in result:      # each item in result is a tuple (title, cast, release date)
        for i in item:
            f.write(i.strip().replace(" ", ","))
        # print(" ")
0x03 Crawling images
import requests
import re
import time

# fetch the img URLs
def get_urls():
    response = requests.get("http://XX.com/png/")
    html_str = response.text
    # parse the data to get the image URLs; the target HTML looks like:
    # <img data-original="http://XX.616pic.com/ys_img/00/06/20/64dXxVfv6k.jpg">
    pattern = re.compile('<img data-original="(.*?)"')
    results = re.findall(pattern, html_str)
    print(results)
    return results

# download the images
def down_load_img(urls):
    for url in urls:
        response = requests.get(url)
        with open("temp/" + url.split("/")[-1], "wb") as f:
            f.write(response.content)
            print(url.split("/")[-1], "downloaded")

if __name__ == "__main__":
    urls = get_urls()
    down_load_img(urls)
0x04 Thread pools
A thread pool is a form of multithreading in which tasks are added to a queue and are started automatically once threads have been created. Thread-pool threads are background threads; each runs at the default priority, with the default stack size, in the multithreaded apartment.
<span>"""</span><span>线程池</span><span>"""</span> <span>from</span> concurrent.futures <span>import</span><span> ThreadPoolExecutor </span><span>import</span><span> time </span><span>import</span><span> threading </span><span>def</span><span> ban_zhuang(i): </span><span>print</span>(threading.current_thread().name,<span>"</span><span>**开始搬砖{}**</span><span>"</span><span>.format(i)) time.sleep(</span>2<span>) </span><span>print</span>(<span>"</span><span>**员工{}搬砖完成**一共搬砖:{}</span><span>"</span>.format(i,12**2)) <span>#</span><span>将format里的内容输出到{}</span> <span>if</span> <span>__name__</span> == <span>"</span><span>__main__</span><span>"</span>: <span>#</span><span>主线程</span> start_time =<span> time.time() </span><span>print</span>(threading.current_thread().name,<span>"</span><span>开始搬砖</span><span>"</span><span>) with ThreadPoolExecutor(max_workers</span>=5<span>) as pool: </span><span>for</span> i <span>in</span> range(10<span>): p </span>=<span> pool.submit(ban_zhuang,i) end_time </span>=<span>time.time() </span><span>print</span>(<span>"</span><span>一共搬砖{}秒</span><span>"</span>.format(end_time-start_time))
A crawler combined with multithreading:
<span>"""</span><span>美女爬取</span><span>"""</span> <span>import</span><span> requests </span><span>import</span><span> re </span><span>from</span> urllib.parse <span>import</span><span> urlencode </span><span>import</span><span> time </span><span>import</span><span> threading </span><span>#</span><span>https://www.toutiao.com/api/search/content/?aid=24&app_name=web_search&offset=0&format=json&keyword=%E7%BE%8E%E5%A5%B3&autoload=true&count=20</span> <span>def</span><span> get_urls(page): keys </span>=<span> { </span><span>"</span><span>aid</span><span>"</span>:<span>"</span><span>24</span><span>"</span><span>, </span><span>"</span><span>app_name</span><span>"</span>:<span>"</span><span>web_search</span><span>"</span><span>, </span><span>"</span><span>offset</span><span>"</span>:20*<span>page, </span><span>"</span><span>keyword</span><span>"</span>:<span>"</span><span>美女</span><span>"</span><span>, </span><span>"</span><span>count</span><span>"</span>:<span>"</span><span>20</span><span>"</span><span> } keys_word </span>=<span> urlencode(keys) url </span>= <span>"</span><span>https://www.toutiao.com/api/search/content/?</span><span>"</span>+<span>keys_word response </span>=<span> requests.get(url) </span><span>print</span><span>(response.status_code) html_str </span>=<span> response.text </span><span>#</span><span> 解析"large_image_url":"(.*?)"</span> pattern = re.compile(<span>"</span><span>"large_image_url":"(.*?)"</span><span>"</span><span>,re.S) urls </span>=<span> re.findall(pattern, html_str) </span><span>return</span><span> urls </span><span>#</span><span>下载图片</span> <span>def</span><span> download_imags(urls): </span><span>for</span> url <span>in</span><span> urls: </span><span>try</span><span>: response </span>=<span> requests.get(url) with open(</span><span>"</span><span>pic/</span><span>"</span>+url.split(<span>"</span><span>/</span><span>"</span>)[-1]+<span>"</span><span>.jpg</span><span>"</span>,<span>"</span><span>wb</span><span>"</span><span>) as f: f.write(response.content) </span><span>print</span>(url.split(<span>"</span><span>/</span><span>"</span>)[-1]+<span>"</span><span>.jpg</span><span>"</span>,<span>"</span><span>已下载~~</span><span>"</span><span>) </span><span>except</span><span> Exception as err: </span><span>print</span>(<span>"</span><span>An exception happened: </span><span>"</span><span>) </span><span>if</span> <span>__name__</span> == <span>"</span><span>__main__</span><span>"</span><span>: start </span>=<span> time.time() thread </span>=<span> [] </span><span>for</span> page <span>in</span> range(3<span>): urls </span>=<span> get_urls(page) </span><span>#</span><span>print(urls)</span> <span>#</span><span>多线程</span> <span>for</span> url <span>in</span><span> urls: th </span>= threading.Thread(target=download_imags,args=<span>(url,)) </span><span>#</span><span>download_imags(urls)</span> <span> thread.append(th) </span><span>for</span> t <span>in</span><span> thread: t.start() </span><span>for</span> t <span>in</span><span> thread: t.join() end </span>=<span> time.time() </span><span>print</span>(<span>"</span><span>耗时:</span><span>"</span>,end-start)