Installing the Beautiful Soup library
pip install beautifulsoup4
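Understanding the Beautiful Soup library
Beautiful Soup is a library for parsing, traversing, and maintaining the "tag tree" of a document.
Importing the Beautiful Soup library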
from bs4 import BeautifulSoup
import bs4
The BeautifulSoup class
A BeautifulSoup object corresponds to the entire content of an HTML/XML document.
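As a minimal sketch of this idea, the snippet below parses a small inline HTML string and inspects the resulting object. The string `html_doc` is an illustrative stand-in chosen for this example, not the demo page itself, so the snippet runs without network access.

```python
from bs4 import BeautifulSoup

# Illustrative inline document; an assumption for this sketch, not demo.html.
html_doc = "<html><head><title>Sample page</title></head><body><p>Hello</p></body></html>"

# Parse with Python's built-in parser, as elsewhere in this article.
soup = BeautifulSoup(html_doc, "html.parser")

# The soup object stands for the whole document; its special name is "[document]".
print(soup.name)

# For this well-formed input, converting back to a string should reproduce the markup.
print(str(soup) == html_doc)

# prettify() renders the tag tree with indentation, one node per line.
print(soup.prettify())
```

Because the whole document lives in one object, every tag is reached by navigating from `soup`.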
Reviewing demo.html
import requests
r = requests.get("http://python123.io/ws/demo.html")
demo = r.text
print(demo)
<html><head><title>This is a python demo page</title></head>
<body>
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and <a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p>
</body></html>
Tag
Basic element | Description |
---|---|
Tag | A tag, the most basic unit of information organization; its start and end are marked with <> and </> |
import requests
from bs4 import BeautifulSoup
r = requests.get("http://python123.io/ws/demo.html")
demo = r.text
soup = BeautifulSoup(demo, "html.parser")
# soup.title is the <title> tag of the document
print(soup.title)
# soup.a is the first <a> tag in the document
tag = soup.a
print(tag)
<title>This is a python demo page</title>
<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a>
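Any tag that exists in HTML syntax can be accessed with soup.<tag>. When the HTML document contains several tags with the same name, soup.<tag> returns the first one.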
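The "first one only" behaviour can be checked with a short sketch. The markup and the example.com URLs below are made up for illustration; `find_all` is shown only as a contrast with attribute access.

```python
from bs4 import BeautifulSoup

# Illustrative markup with two <a> tags; an assumption for this sketch.
html_doc = (
    '<p>Courses: '
    '<a href="http://example.com/basic" id="link1">Basic Python</a> and '
    '<a href="http://example.com/advanced" id="link2">Advanced Python</a></p>'
)
soup = BeautifulSoup(html_doc, "html.parser")

first_link = soup.a             # attribute access returns only the first <a>
all_links = soup.find_all("a")  # find_all returns every matching tag
missing = soup.table            # a tag absent from the document yields None

print(first_link["id"])
print(len(all_links))
print(missing)
```

Since a missing tag yields None rather than raising an error, it is worth checking the result before reading attributes from it.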
The name of a Tag
Basic element | Description |
---|---|
Name | The name of the tag; the name of <p>…</p> is 'p'. Format: <tag>.name |
import requests
from bs4 import BeautifulSoup
r = requests.get("http://python123.io/ws/demo.html")
demo = r.text
soup = BeautifulSoup(demo, "html.parser")
# name of the first <a> tag, then of its parent and grandparent
print(soup.a.name)
print(soup.a.parent.name)
print(soup.a.parent.parent.name)
a
p
body
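Chaining .parent by hand gets tedious for deeper trees. As a rough sketch, the same walk can be written as a loop over the tag's `.parents` generator; the inline markup here is a made-up example rather than the demo page.

```python
from bs4 import BeautifulSoup

# Illustrative markup; an assumption for this sketch.
html_doc = "<html><body><p>See <a href='http://example.com/'>this link</a>.</p></body></html>"
soup = BeautifulSoup(html_doc, "html.parser")

tag = soup.a
names = [tag.name]
# .parents yields each ancestor in turn, ending with the BeautifulSoup object itself.
for ancestor in tag.parents:
    names.append(ancestor.name)

print(names)  # ['a', 'p', 'body', 'html', '[document]']
```

The last entry, "[document]", is the name of the BeautifulSoup object at the root of the tag tree.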