
Getting Started with Python Web Scraping: A Detailed Walkthrough of the Crawling Process


The text and images in this article come from the internet and are for learning and exchange only; they have no commercial use, and copyright remains with the original authors. If there is any problem, please contact us promptly so we can handle it.

The following article originally appeared on 凌晨安全; author: MoLing.

 

1. A web crawler simulates a browser to fetch content. Scraping is a three-step process: data crawling, data parsing, and data storage.

Data crawling: mobile and PC pages
Data parsing: regular expressions
Data storage: save to a file or to a database
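As a quick illustration, here is a minimal sketch of the three steps against a placeholder page (the URL and the regex below are illustrative, not from the original article):

import re
import requests

# 1. Data crawling: fetch the raw HTML, as a browser would
html = requests.get("https://example.com").text

# 2. Data parsing: extract the pieces you want, here with a regular expression
titles = re.findall(r"<title>(.*?)</title>", html)

# 3. Data storage: persist the results to a file
with open("result.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(titles))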

2. Related Python libraries

A basic crawler needs two modules: requests and re.

1. The requests library

requests is an easy-to-use HTTP library, much more concise than urllib. Since it is a third-party library, it must be installed first; installation links are collected at the end of the article (keeping all the links together makes them easier to find).

HTTP features supported by the requests library:

Keep-alive and connection pooling, sessions with cookie persistence, multipart file uploads, chunked requests, and more.
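For instance, cookie persistence comes from requests.Session, which also reuses the underlying connection (a small sketch; the URLs are placeholders):

import requests

s = requests.Session()                     # one connection, shared cookie jar
s.get("https://example.com/login")         # the server may set a cookie here
r = s.get("https://example.com/profile")   # the cookie is sent back automatically
print(s.cookies.get_dict())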

The requests library exposes many methods, all of which are implemented on top of the request() method under the hood. Strictly speaking, request() is the library's only method, but it is rarely called directly. The seven main methods are introduced below:

① requests.request()

Constructs a request; the base method that underpins all the methods below.

Form: requests.request(method, url, **kwargs)

method: the request method, corresponding to GET, POST, PUT, etc.

url: the URL of the page to fetch

**kwargs: optional parameters controlling access
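Since the helper methods below are shorthand for request(), the two calls in this sketch are equivalent (placeholder URL; params is one of the **kwargs):

import requests

r1 = requests.request("get", "https://example.com/search", params={"q": "python"})
r2 = requests.get("https://example.com/search", params={"q": "python"})
print(r1.url)   # https://example.com/search?q=python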

② requests.get()

The main method for fetching an HTML page, corresponding to HTTP GET. It constructs a Request object that asks the server for a resource and returns a Response object containing that resource.

Attributes of the Response object (the original post shows a table here; the key ones are status_code, text, content, and encoding):

Form: res = requests.get(url)

code = res.text (text gives the body as text; content gives the raw bytes; json() parses a JSON body)
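A small sketch of those Response attributes in use (placeholder URL):

import requests

res = requests.get("https://example.com")
print(res.status_code)   # HTTP status code, e.g. 200
print(res.encoding)      # encoding inferred from the response headers
text = res.text          # body decoded as text
raw = res.content        # body as raw bytes
# data = res.json()      # parses the body as JSON, when the server returns JSON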

③ requests.head()

Fetches only the header information of an HTML page, corresponding to HTTP HEAD.

Form: res = requests.head(url)
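Because HEAD returns headers without a body, it is a cheap way to probe a page, for example:

import requests

res = requests.head("https://example.com")
print(res.headers.get("Content-Type"))   # header metadata arrives as usual
print(len(res.text))                     # 0: a HEAD response carries no body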

④ requests.post()

Submits a POST request to a page, corresponding to HTTP POST.

Form: res = requests.post(url)
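For example, submitting form fields with POST (the URL and fields are placeholders):

import requests

payload = {"username": "test", "password": "123456"}
res = requests.post("https://example.com/login", data=payload)    # form-encoded body
# res = requests.post("https://example.com/login", json=payload)  # JSON body instead
print(res.status_code)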

⑤ requests.put()

Submits a PUT request to a page, corresponding to HTTP PUT.

⑥ requests.patch()

Submits a partial-modification request, corresponding to HTTP PATCH.

⑦ requests.delete()

Submits a delete request, corresponding to HTTP DELETE.
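All three map directly onto their HTTP verbs; a small sketch against a placeholder REST endpoint:

import requests

base = "https://example.com/api/users/1"
requests.put(base, data={"name": "Tom", "age": 20})   # replace the whole resource
requests.patch(base, data={"age": 21})                # change only the given field
requests.delete(base)                                 # remove the resource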

# requests practice
import requests
import re

# Crawl the data
h = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36"
}
response = requests.get("https://movie.douban.com/chart", headers=h)
html_str = response.text

# Parse the data, e.g. <a href="https://movie.XX.com/subject/34961898/" title="汉密尔顿">
pattern = re.compile('<a.*?title="(.*?)">')   # .*? matches lazily: as few characters as possible
result = re.findall(pattern, html_str)
print(result)


 

2. The re module: regular expressions

A regular expression is a special string built from letters and symbols; its job is to find the text in a document that matches the format you want.

About .*? (the original post explains this with an image; in short, .* matches greedily while .*? matches lazily):

 
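A short demonstration of greedy .* versus lazy .*? :

import re

html = '<a title="first"><a title="second">'
print(re.findall('title="(.*)"', html))    # greedy: ['first"><a title="second']
print(re.findall('title="(.*?)"', html))   # lazy:   ['first', 'second']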

3. Parsing the source with XPath

import requests
import re
from bs4 import BeautifulSoup
from lxml import etree

# Crawl the data (with some HTTP header information)
h = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36"
}
response = requests.get("https://movie.XX.com/chart", headers=h)
html_str = response.text

# Parse the data
# 1) regular-expression parsing
def re_parse(html_str):
    pattern = re.compile('<a.*?title="(.*?)"')
    results = re.findall(pattern, html_str)
    print(results)
    return results

# 2) bs4 parsing
def bs4_parse(html_str):
    soup = BeautifulSoup(html_str, "lxml")
    items = soup.find_all(class_="nbg")
    for item in items:
        print(item.attrs["title"])

# 3) lxml (XPath) parsing
def lxml_parse(html_str):
    html = etree.HTML(html_str)
    results = html.xpath('//a[@class="nbg"]/@title')
    print(results)
    return results

re_parse(html_str)
bs4_parse(html_str)
lxml_parse(html_str)

 

4. The architecture of a Python crawler

 

As the architecture diagram (omitted here) shows, a basic crawler breaks down into five components: the crawler scheduler, the URL manager, the HTML downloader, the HTML parser, and the data store.

Here is what each of the five components does, in turn:

① Crawler scheduler: coordinates the other four modules; "scheduling" here simply means calling into the other modules.

② URL manager: keeps track of URLs, split into those already crawled and those not yet crawled, and provides an interface for handing out new URLs.

③ HTML downloader: downloads the HTML of the pages to be crawled.

④ HTML parser: extracts the target data from the HTML source, feeds newly discovered URLs back to the URL manager, and passes the processed data to the data store.

⑤ Data store: persists the data handed over by the HTML parser to local storage.
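A minimal sketch of how the five components might fit together (all class and method names here are illustrative, not from the original article):

import re
import requests

class UrlManager:
    # Tracks crawled vs. not-yet-crawled URLs
    def __init__(self):
        self.new_urls, self.old_urls = set(), set()
    def add(self, url):
        if url not in self.old_urls:
            self.new_urls.add(url)
    def has_new(self):
        return bool(self.new_urls)
    def get(self):
        url = self.new_urls.pop()
        self.old_urls.add(url)
        return url

class HtmlDownloader:
    # Downloads the HTML of a page
    def download(self, url):
        return requests.get(url).text

class HtmlParser:
    # Extracts new URLs and target data from the HTML source
    def parse(self, html):
        new_urls = re.findall(r'href="(http.*?)"', html)
        data = re.findall(r"<title>(.*?)</title>", html)
        return new_urls, data

class DataStore:
    # Persists parsed data locally
    def save(self, data):
        with open("out.txt", "a", encoding="utf-8") as f:
            f.writelines(line + "\n" for line in data)

class Scheduler:
    # Coordinates the other four components
    def crawl(self, seed, limit=10):
        urls, downloader = UrlManager(), HtmlDownloader()
        parser, store = HtmlParser(), DataStore()
        urls.add(seed)
        crawled = 0
        while urls.has_new() and crawled < limit:
            html = downloader.download(urls.get())
            new_urls, data = parser.parse(html)
            for u in new_urls:
                urls.add(u)
            store.save(data)
            crawled += 1

Scheduler().crawl("https://example.com")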

0x01 Scraping whois data

Every year, millions of individuals, businesses, organizations, and government agencies register domain names. Each registrant must provide identifying and contact information: name, address, email, phone number, and administrative and technical contacts. This kind of information is commonly called whois data.

import requests
import re

h = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36"
}
response = requests.get("http://whois.chinaz.com/" + input("Enter a domain: "), headers=h)
print(response.status_code)
html = response.text
# print(html)

# Parse the data
pattern = re.compile('class="MoreInfo".*?>(.*?)</p>', re.S)
result = re.findall(pattern, html)

# Option 1:
# s = re.sub("\n", ",", result[0])
# print(s)
# Option 2 (the original wrote "/n", a typo for the newline escape "\n"):
print(result[0].replace("\n", ","))

 

0x02 Scraping movie information

import requests
import re
import time

# count = [0,10,20,30,40,50,60,70,80,90]
h = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36"
}
response = requests.get("https://XX.com/board/4?offset=0", headers=h)
response.encoding = "utf-8"
html = response.text

# Parse the data
# time.sleep(2)
pattern = re.compile('class="name">.*?title="(.*?)".*?主演:(.*?)</p>.*?上映时间:(.*?)</p>', re.S)
result = re.findall(pattern, html)
print(result)

with open("maoyan.txt", "a", encoding="utf-8") as f:
    for item in result:   # each item in result is a tuple (title, cast, release date)
        for i in item:
            f.write(i.strip().replace("\n", ","))
        # print("\n")

 

0x03 Scraping images

import requests
import re

# Crawl the img URLs
def get_urls():
    response = requests.get("http://XX.com/png/")
    html_str = response.text
    # Parse the data to get the URLs, which appear as
    # <img data-original="http://XX.616pic.com/ys_img/00/06/20/64dXxVfv6k.jpg">
    pattern = re.compile('<img data-original="(.*?)"')
    results = re.findall(pattern, html_str)
    print(results)
    return results

# Download the images (assumes a temp/ directory already exists)
def down_load_img(urls):
    for url in urls:
        response = requests.get(url)
        with open("temp/" + url.split("/")[-1], "wb") as f:
            f.write(response.content)
        print(url.split("/")[-1], "downloaded successfully")

if __name__ == "__main__":
    urls = get_urls()
    down_load_img(urls)

 

0x04 Thread pools

A thread pool is a form of multithreaded processing: tasks are added to a queue, and they start automatically as threads are created. Pool threads are background threads; each runs with the default stack size and at default priority.

<span>"""</span><span>线程池</span><span>"""</span>
<span>from</span> concurrent.futures <span>import</span><span> ThreadPoolExecutor
</span><span>import</span><span> time
</span><span>import</span><span> threading

</span><span>def</span><span> ban_zhuang(i):
    </span><span>print</span>(threading.current_thread().name,<span>"</span><span>**开始搬砖{}**</span><span>"</span><span>.format(i))
    time.sleep(</span>2<span>)
    </span><span>print</span>(<span>"</span><span>**员工{}搬砖完成**一共搬砖:{}</span><span>"</span>.format(i,12**2))   <span>#</span><span>将format里的内容输出到{}</span>

<span>if</span> <span>__name__</span> == <span>"</span><span>__main__</span><span>"</span>:             <span>#</span><span>主线程</span>
    start_time =<span> time.time()
    </span><span>print</span>(threading.current_thread().name,<span>"</span><span>开始搬砖</span><span>"</span><span>)
    with ThreadPoolExecutor(max_workers</span>=5<span>) as pool:
        </span><span>for</span> i <span>in</span> range(10<span>):
            p </span>=<span> pool.submit(ban_zhuang,i)
    end_time </span>=<span>time.time()
    </span><span>print</span>(<span>"</span><span>一共搬砖{}秒</span><span>"</span>.format(end_time-start_time))

 

A crawler that combines multithreading:

<span>"""</span><span>美女爬取</span><span>"""</span>
<span>import</span><span> requests
</span><span>import</span><span> re
</span><span>from</span> urllib.parse <span>import</span><span> urlencode
</span><span>import</span><span> time

</span><span>import</span><span> threading
</span><span>#</span><span>https://www.toutiao.com/api/search/content/?aid=24&app_name=web_search&offset=0&format=json&keyword=%E7%BE%8E%E5%A5%B3&autoload=true&count=20</span>

<span>def</span><span> get_urls(page):
    keys </span>=<span> {
        </span><span>"</span><span>aid</span><span>"</span>:<span>"</span><span>24</span><span>"</span><span>,
        </span><span>"</span><span>app_name</span><span>"</span>:<span>"</span><span>web_search</span><span>"</span><span>,
        </span><span>"</span><span>offset</span><span>"</span>:20*<span>page,
        </span><span>"</span><span>keyword</span><span>"</span>:<span>"</span><span>美女</span><span>"</span><span>,
        </span><span>"</span><span>count</span><span>"</span>:<span>"</span><span>20</span><span>"</span><span>
    }
    keys_word </span>=<span> urlencode(keys)
    url </span>= <span>"</span><span>https://www.toutiao.com/api/search/content/?</span><span>"</span>+<span>keys_word
    response </span>=<span> requests.get(url)
    </span><span>print</span><span>(response.status_code)
    html_str </span>=<span> response.text
    </span><span>#</span><span> 解析"large_image_url":"(.*?)"</span>
    pattern = re.compile(<span>"</span><span>"large_image_url":"(.*?)"</span><span>"</span><span>,re.S)
    urls </span>=<span> re.findall(pattern, html_str)
    </span><span>return</span><span> urls

</span><span>#</span><span>下载图片</span>
<span>def</span><span> download_imags(urls):
    </span><span>for</span> url <span>in</span><span> urls:
        </span><span>try</span><span>:
            response </span>=<span> requests.get(url)
            with open(</span><span>"</span><span>pic/</span><span>"</span>+url.split(<span>"</span><span>/</span><span>"</span>)[-1]+<span>"</span><span>.jpg</span><span>"</span>,<span>"</span><span>wb</span><span>"</span><span>) as f:
                f.write(response.content)
                </span><span>print</span>(url.split(<span>"</span><span>/</span><span>"</span>)[-1]+<span>"</span><span>.jpg</span><span>"</span>,<span>"</span><span>已下载~~</span><span>"</span><span>)
        </span><span>except</span><span> Exception as err:
            </span><span>print</span>(<span>"</span><span>An exception happened: </span><span>"</span><span>)


</span><span>if</span> <span>__name__</span> == <span>"</span><span>__main__</span><span>"</span><span>:
    start </span>=<span> time.time()
    thread </span>=<span> []
    </span><span>for</span> page <span>in</span> range(3<span>):
        urls </span>=<span> get_urls(page)
        </span><span>#</span><span>print(urls)</span>
        <span>#</span><span>多线程</span>
        <span>for</span> url <span>in</span><span> urls:
            th </span>= threading.Thread(target=download_imags,args=<span>(url,))
            </span><span>#</span><span>download_imags(urls)</span>
<span>            thread.append(th)
    </span><span>for</span> t <span>in</span><span> thread:
        t.start()
    </span><span>for</span> t <span>in</span><span> thread:
        t.join()

    end </span>=<span> time.time()
    </span><span>print</span>(<span>"</span><span>耗时:</span><span>"</span>,end-start)

 

