Advanced Python Web Scraping: A Detailed Guide to the Beautiful Soup Library

Contents

I. Introduction to the Beautiful Soup Library
II. Installing the Beautiful Soup Library
III. Beautiful Soup Parsers

I. Introduction to the Beautiful Soup Library

BeautifulSoup4 is an HTML/XML parser. Like the lxml library, its main job is to parse HTML/XML documents and extract data from them.

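lxml can traverse a document incrementally, whereas BeautifulSoup4 is based on the HTML DOM: it loads the whole document and builds the complete DOM tree, so its memory overhead is higher and its performance is lower than lxml's.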

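BeautifulSoup4 makes parsing HTML fairly simple. Its API is very user-friendly and it supports CSS selectors. It is not itself part of the Python standard library, but it can use the standard library's HTML parser (html.parser) and it also supports the lxml parser.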
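As a rough illustration of the CSS selector support mentioned above, the sketch below parses a small made-up snippet with the standard-library parser and selects links by class. The markup, the `sister` class name, and the URLs are assumptions chosen for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical markup used only for this illustration.
snippet = """
<p class="story">
  <a class="sister" id="link1" href="http://example.com/elsie">Elsie</a>
  <a class="sister" id="link2" href="http://example.com/lacie">Lacie</a>
  <a id="other" href="http://example.com/other">Other</a>
</p>
"""

soup = BeautifulSoup(snippet, "html.parser")

# select() takes a CSS selector and returns a list of matching tags.
sister_ids = [a["id"] for a in soup.select("a.sister")]
print(sister_ids)

# select_one() returns the first match, or None if nothing matches.
first = soup.select_one("p.story > a")
print(first.get_text())
```

`select()` always returns a list, so an empty result is simply `[]`; `select_one()` is the more convenient choice when only a single node is expected.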

II. Installing the Beautiful Soup Library

At the time of writing, the latest version of Beautiful Soup is 4.x, and earlier versions are no longer developed. Installing with pip is recommended:

pip install beautifulsoup4

To check that Beautiful Soup was installed successfully, run:

from bs4 import BeautifulSoup
soup = BeautifulSoup('<p>Hello</p>', 'html.parser')
print(soup.p.string)

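Note:
□ Although the package you install is called beautifulsoup4, it is imported as bs4. The package directory in the source code is named bs4, so after installation that directory is placed in your local Python 3 library path and is recognized under the name bs4.
□ In other words, the name of a package as installed and the name used to import it are not necessarily the same.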
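The difference between the two names can be checked with the standard library alone. The sketch below is one possible way to do it; `describe_install` is a hypothetical helper written for this illustration, not something Beautiful Soup provides.

```python
from importlib import metadata
from importlib.util import find_spec

# Distribution name (what pip installs) vs. import name (what Python imports).
DIST_NAME = "beautifulsoup4"
IMPORT_NAME = "bs4"


def describe_install(dist_name, import_name):
    """Hypothetical helper: return (installed version or None, importable?)."""
    try:
        version = metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        version = None
    # find_spec() returns None when no top-level module of that name exists.
    importable = find_spec(import_name) is not None
    return version, importable


version, importable = describe_install(DIST_NAME, IMPORT_NAME)
print(f"pip distribution {DIST_NAME!r}: version {version}")
print(f"import name {IMPORT_NAME!r} importable: {importable}")
```

Looking up the version under the import name, or importing under the distribution name, would both fail, which is exactly the mismatch the note describes.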

III. Beautiful Soup Parsers

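Beautiful Soup relies on a parser to do the actual parsing. Besides the HTML parser in the Python standard library, it supports several third-party parsers such as lxml. The supported parsers are the standard library's HTML parser (html.parser), lxml's HTML parser (lxml), lxml's XML parser (xml), and html5lib; each is selected by passing its name as the second argument to BeautifulSoup.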
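Because the parser is chosen by name, one possible way to prefer lxml but fall back to the standard-library parser when lxml is not installed is sketched below. `make_soup` and its `preferred` parameter are hypothetical, introduced only for this example; `FeatureNotFound` is the exception bs4 raises when a requested parser is unavailable.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html.parser")):
    """Hypothetical helper: try each parser name in order, return (soup, name)."""
    for name in preferred:
        try:
            return BeautifulSoup(markup, name), name
        except FeatureNotFound:
            # This parser is not installed; try the next one.
            continue
    raise RuntimeError("none of the requested parsers is available")


soup, parser_used = make_soup("<p>Hello</p>")
print(parser_used)
print(soup.p.string)
```

Different parsers can build slightly different trees from the same malformed input, so a fallback like this is best suited to markup that is already reasonably well formed.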

The following example parses a sample HTML document and reads its title node:

from bs4 import BeautifulSoup

html = '''
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title" name="dromouse"><b>The Dormouse's story</b></p>
    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>
    <p class="story">...</p>
    </body>
    </html>
'''

# Create the BeautifulSoup object: html is the content to parse, 'lxml' selects the parser
soup = BeautifulSoup(html, 'lxml')
print(soup.prettify())
print(soup.title.string)

The last line printed is the title text (the prettified HTML that precedes it is omitted here):

The Dormouse's story

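The example first declares the variable html, which holds an HTML string. It then passes that string as the first argument to BeautifulSoup; the second argument is the parser type (lxml here). This completes the initialization of the BeautifulSoup object.
After that, the methods and attributes of soup can be used to work with the parsed HTML.
Calling prettify() outputs the parsed string with standard indentation. Note that the output contains the html and body nodes: for non-standard HTML strings, Beautiful Soup (through the chosen parser) automatically corrects the markup.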
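To make the auto-correction concrete, here is a small sketch that feeds deliberately broken markup (an assumption made for illustration) to the standard-library parser. Note that html.parser closes the open tags but, unlike lxml, does not add html or body wrappers around a fragment.

```python
from bs4 import BeautifulSoup

# Deliberately malformed: neither <p> nor <b> is closed.
broken = "<p>Hello <b>world"

soup = BeautifulSoup(broken, "html.parser")

# str(soup) shows the repaired markup on one line;
# prettify() shows the same tree with one node per line, indented.
repaired = str(soup)
print(repaired)
print(soup.prettify())
```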
Calling soup.title.string outputs the text content of the title node. That is, soup.title selects the title node in the HTML, and its string attribute returns the text inside it.
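The types involved in that last step can be inspected directly. The following is a minimal sketch using a shortened version of the sample document and the standard-library parser, so that it runs even without lxml installed.

```python
from bs4 import BeautifulSoup

html = "<html><head><title>The Dormouse's story</title></head><body></body></html>"
soup = BeautifulSoup(html, "html.parser")

title = soup.title
print(type(title))         # the node itself is a Tag object
print(title.name)          # the tag name
print(type(title.string))  # the text is a NavigableString (a str subclass)
print(str(title.string))   # convert to a plain Python str when needed

# Attribute-style access returns None for a tag that does not exist.
print(soup.h1)
```

Because a missing tag yields None, chaining such as `soup.h1.string` raises AttributeError on documents that lack the tag, so a None check is worth adding when the input is not under your control.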
