Python Web Scraping: bs4

Contents
Overview
Installation
Import
Basic usage
Parsing objects
Getting text
Tag objects
Getting tag content from the HTML
find parameters
Getting tag attributes
Getting all tags
Getting the tag name
Nested lookups
Child and parent nodes

Overview
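BeautifulSoup is an HTML/XML parser. Its main job is to parse HTML/XML documents and extract data from them.
Scraping projects often run into malformed and extremely complex HTML.
BeautifulSoup4 provides methods for traversing a document's nodes and for searching and filtering its elements by a variety of conditions. You can locate and extract the data you need with CSS selectors, regular expressions, and other flexible approaches.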
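As a small illustration of the CSS-selector and regular-expression searches mentioned above, here is a minimal sketch. The markup is hypothetical, and it uses the built-in "html.parser" so that no extra parser package is assumed.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup, used only for this illustration.
html = """
<div class="article">
  <h2>Article Title</h2>
  <p>First paragraph.</p>
  <a href="https://www.example.com">Link to Example Website</a>
</div>
"""

# "html.parser" ships with Python, so no extra parser package is needed here.
soup = BeautifulSoup(html, "html.parser")

# CSS selector: every <p> that is a descendant of <div class="article">.
paragraphs = soup.select("div.article p")
print([p.get_text() for p in paragraphs])

# Regular expression on the tag name: matches h1 through h6.
headings = soup.find_all(re.compile(r"^h[1-6]$"))
print([h.name for h in headings])

# Regular expression on an attribute: links whose href starts with https://.
links = soup.find_all("a", href=re.compile(r"^https://"))
print([a["href"] for a in links])
```

select() takes a CSS selector string, while find_all() accepts a compiled pattern either as the tag name or as an attribute value.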
Installation
pip install beautifulsoup4

Import
from bs4 import BeautifulSoup

Basic usage
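Parsing objects
soup = BeautifulSoup(target_data, parser)

Here target_data is the markup to parse and parser is the parser's name as a string. Three parsers are in common use:
html.parser
lxml (recommended)
html5lib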
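The parser is chosen by the second argument. The sketch below uses "html.parser" only because it is bundled with Python. The broken fragment is a hypothetical input, and swapping in "lxml" or "html5lib" assumes those packages are installed separately.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with an unclosed <p>, to show the parser repairing it.
broken = "<p>Unclosed paragraph"

# "html.parser" is built into Python; "lxml" and "html5lib" are separate
# packages (pip install lxml / pip install html5lib).
soup = BeautifulSoup(broken, "html.parser")

print(soup.p)       # the <p> tag, with the missing closing tag supplied
print(soup.p.text)  # only the text inside it
```

Different parsers can build slightly different trees from the same broken markup, so it is worth naming the parser explicitly rather than relying on whichever one happens to be installed.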
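Getting text
Two attributes are used to read what a node holds: text and contents. contents returns a list of the node's direct children, while text returns the text inside it as a single string.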
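A compact side-by-side comparison of the two attributes can be sketched as follows. The markup is hypothetical, and "html.parser" is used so that the sketch runs without an extra parser package.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for this comparison.
html = "<div><h2>Article Title</h2><p>Some text.</p></div>"
soup = BeautifulSoup(html, "html.parser")
div = soup.div

# .contents: a list of the direct children (two Tag objects here).
print(div.contents)

# .text: all text inside the tag, concatenated into one string.
print(div.text)

# get_text() does the same job and also accepts a separator.
print(div.get_text(separator=" | "))
```

Use contents when you need the child nodes themselves, and text or get_text() when only the readable string matters.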
contents
from bs4 import BeautifulSoup

data = """
<h1>Welcome to BeautifulSoup Practice</h1>
<div class="article">
<h2>Article Title</h2>
<p>This is a paragraph of text for practicing BeautifulSoup.</p>
<a href="https://www.example.com">Link to Example Website</a>
</div>"""

soup = BeautifulSoup(data, "lxml")
print(soup.contents)
# Output:
# [<html><body><h1>Welcome to BeautifulSoup Practice</h1>
# <div class="article">
# <h2>Article Title</h2>
# <p>This is a paragraph of text for practicing BeautifulSoup.</p>
# <a href="https://www.example.com">Link to Example Website</a>
# </div></body></html>]

text
print(soup.text)
# Output:
# Welcome to BeautifulSoup Practice
# Article Title
# This is a paragraph of text for practicing BeautifulSoup.
# Link to Example Website

Tag objects
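Getting tag content from the HTML
A tag such as p or div can be accessed directly as an attribute of the soup.
Example:
print(soup.h2)
# <h2>Article Title</h2>
print(soup.h2.text)
# Article Title

find parameters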
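To match on class, add a trailing underscore (class_), because class is a reserved keyword in Python. Besides class_, any other attribute name can be passed as a keyword argument.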
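The keyword-argument rule above can be sketched like this. The markup and the id values are invented for illustration, and "html.parser" stands in for whichever parser you use.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the id values are invented for this sketch.
html = """
<div class="article" id="main"><p>First.</p></div>
<div class="ex2" id="side"><p>Second.</p></div>
"""
soup = BeautifulSoup(html, "html.parser")

# class is a Python keyword, so the keyword argument is spelled class_.
article = soup.find("div", class_="article")
print(article.p.text)

# Other attribute names work as keyword arguments without the underscore.
side = soup.find("div", id="side")
print(side.p.text)

# The attrs dictionary is an equivalent spelling that avoids keyword clashes.
same_article = soup.find("div", attrs={"class": "article"})
print(same_article is article)

# find() returns None when nothing matches.
print(soup.find("div", class_="missing"))
```

Because find() returns None on a miss, check the result before chaining further attribute access in real scraping code.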
data = """
<h1>Welcome to BeautifulSoup Practice</h1>
<div class="article">
<p>This is a paragraph of text for practicing BeautifulSoup.</p>
</div>
<div class="ex2">
<p>This is a abcd.</p>
</div>
"""
soup = BeautifulSoup(data, "lxml")
print(soup.find("div", class_="article"))

Getting tag attributes
data = '<p id="apple">This is a paragraph of text for practicing BeautifulSoup.</p>'
soup = BeautifulSoup(data, "lxml")
tag = soup.find("p")
print(tag.get("id"))
# apple

Getting all tags
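# data is the two-div markup from the find parameters section
soup = BeautifulSoup(data, "lxml")
print(soup.find_all("p"))
# [<p>This is a paragraph of text for practicing BeautifulSoup.</p>, <p>This is a abcd.</p>]
print(len(soup.find_all("p")))
# 2

If the parentheses are left empty, find_all() returns every tag in the document.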
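The empty-parentheses behaviour can be checked with a short sketch. The markup is hypothetical and "html.parser" is used so that nothing beyond bs4 is assumed. The limit argument and the list-of-names form are shown as related options, not as something the text above relies on.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for this sketch.
html = '<div class="article"><h2>Title</h2><p>One.</p><p>Two.</p></div>'
soup = BeautifulSoup(html, "html.parser")

# No arguments: every tag in the document, in document order.
all_tags = soup.find_all()
print([tag.name for tag in all_tags])

# limit caps the number of results returned.
first_p = soup.find_all("p", limit=1)
print(len(first_p))

# A list of names matches any of them.
h2_or_p = soup.find_all(["h2", "p"])
print([tag.name for tag in h2_or_p])
```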
Getting the tag name
print(soup.div.name)
# div

Nested lookups
The example HTML is as follows:
<div class="article">
<h2>Article Title</h2>
<p>This is a paragraph of text for practicing BeautifulSoup.</p>
<p>This is a abcd.</p>
<a href="https://www.example.com">Link to Example Website</a>
</div>

Goal: get the content of every p tag under the div.
print(soup.find("div", class_="article").find_all("p"))

Child and parent nodes
soup = BeautifulSoup(data, "lxml")
# Iterate over every ancestor (parent) node
for item in soup.p.parents:
    print(item)
# Iterate over every child node
for i in soup.p.children:
    print(i)
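To make the direction of each traversal concrete, here is a minimal sketch on hypothetical markup. It uses "html.parser", so no html or body wrapper is added and the only ancestors are the div and the document object itself.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for this sketch.
html = '<div class="article"><p>Hello <b>world</b>!</p></div>'
soup = BeautifulSoup(html, "html.parser")
p = soup.p

# .parents walks upward: the enclosing <div>, then the BeautifulSoup object,
# whose name is the special string "[document]".
parent_names = [parent.name for parent in p.parents]
print(parent_names)

# .children yields direct children only: a string, the <b> tag, another string.
children = list(p.children)
print(children)

# .parent is the single immediate parent.
print(p.parent.name)
```

Note that children includes plain text nodes as well as tags, so filter on the node type or name if only tags are wanted.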