Python beautiful soup解析html获得数据

您所在的位置：网站首页 › dormouse怎么读 › Python beautiful soup解析html获得数据

Python beautiful soup解析html获得数据

2023-12-01 07:06| 来源: 网络整理| 查看: 265

1. 用beautiful soup 解析网页的HTML的信息

https://blog.csdn.net/i_chaoren/article/details/63282877

1.1 BeautifulSoup的安装及介绍

官方给出的几点介绍： Beautiful Soup提供一些简单的、python式的函数用来处理导航、搜索、修改分析树等功能。它是一个工具箱，通过解析文档为用户提供需要抓取的数据，因为简单，所以不需要多少代码就可以写出一个完整的应用程序。 Beautiful Soup自动将输入文档转换为Unicode编码，输出文档转换为utf-8编码。你不需要考虑编码方式，除非文档没有指定一个编码方式，这时，Beautiful Soup就不能自动识别编码方式了。然后，你仅仅需要说明一下原始编码方式就可以了。 Beautiful Soup已成为和lxml、html6lib一样出色的python解释器，为用户灵活地提供不同的解析策略或强劲的速度。

Beautiful Soup的安装

pip install beautifulsoup4

1.2 BeautifulSoup中的HTML解析器对比

https://blog.csdn.net/IMW_MG/article/details/78220979

解析器使用方法优势劣势Python标准库BeautifulSoup(markup, “html.parser”)Python的内置标准库；执行速度适中；文档容错能力强Python 2.7.3 or 3.2.2)前的版本中文档容错能力差lxml HTML解析器BeautifulSoup(markup, “lxml”)速度快；文档容错能力强需要安装C语言库lxml XML 解析器BeautifulSoup(markup, [“lxml”, “xml”])速度快；唯一支持XML的解析器需要安装C语言库html5libBeautifulSoup(markup, “html5lib”)最好的容错性；以浏览器的方式解析文档；生成HTML5格式的文档速度慢；不依赖外部扩展

1.3 BeautifulSoup4解析器

解析html获得数据

以beautifulsoup为例，包含获取标签、链接，以及根据html层次结构遍历等方法。

REF: https://blog.csdn.net/i_chaoren/article/details/63282877

https://blog.csdn.net/qq_33689414/article/details/78585304

#!/usr/bin/python # -*- coding: UTF-8 -*- import re from bs4 import BeautifulSoup html_doc = """ The Dormouse's story

The Dormouse's story

Once upon a time there were three little sisters; and their names were Elsie, Lacie and Tillie; and they lived at the bottom of a well.

...

""" #创建一个BeautifulSoup解析对象 soup = BeautifulSoup(html_doc,"html.parser",from_encoding="utf-8") #获取所有的链接 links = soup.find_all('a') print ("所有的链接") for link in links: print (link.name,link['href'],link.get_text()) print ("获取特定的URL地址") link_node = soup.find('a',href="http://example.com/elsie") print (link_node.name,link_node['href'],link_node['class'],link_node.get_text()) print ("正则表达式匹配") link_node = soup.find('a',href=re.compile(r"ti")) print (link_node.name,link_node['href'],link_node['class'],link_node.get_text()) print ("获取P段落的文字") p_node = soup.find('p',class_='story') print (p_node.name,p_node['class'],p_node.get_text()) 2. Beautiful Soup 4.2.0 文档¶ https://www.crummy.com/software/BeautifulSoup/bs4/doc.zh/ https://blog.csdn.net/i_chaoren/article/details/63282877

1、Beautiful Soup库基础知识

（1）Beautiful Soup库的理解

Beautiful Soup库是解析、遍历、维护“标签树”的功能库。

BeautifulSoup对应一个HTML/XML文档的全部内容。

（2）Beautiful Soup库解析器

（3）BeautifulSoup类的基本元素

a. Tag 标签.

任何存在于HTML语法中的标签都可以用soup.访问获得。

当HTML文档中存在多个相同对应内容时，soup.返回第一个。

示例代码如下：

b. Tag的name.

每个都有自己的名字，通过.name获取，字符串类型。

示例代码如下：

c. Tag的attrs.

一个可以有0或多个属性，字典类型

示例代码如下：

d. Tag的NavigableString.

NavigableString可以跨越多个层次。

示例代码如下：

(4)基于bs4库的HTML内容遍历方法

a. HTML基本格式

b. 标签树的下行遍历

遍历儿子节点：

遍历子孙节点：

c. 标签树的上行遍历

d. 标签树的上行遍历

总结如下图：

（5）bs4库的prettify()方法--让HTML内容更加“友好”的显示

prettify()为HTML文本及其内容增加更加'\n'

显示效果如下：

2、信息标记与提取方法

（1）信息标记的三种方式

a. XML实例

b. JSON实例

c. YAML实例

三种信息标记形式的比较：

XML最早的通用信息标记语言，可扩展性好，但繁琐。Internet上的信息交互与传递。

JSON信息有类型，适合程序处理(js)，较XML简洁。移动应用云端和节点的信息通信，无注释。

YAML信息无类型，文本信息比例最高，可读性好。各类系统的配置文件，有注释易读。

（2）信息提取

Beautiful Soup库提供了.find_all()函数，返回一个列表类型，存储查找的结果。

详细介绍如下：

示例代码如下：

扩展方法如下：

3. Beautiful Soup库实例--&&中文对齐问题的解决

功能描述：

输入：大学排名URL链接

输出：大学排名信息的屏幕输出（排名，大学名称，总分）

技术路线：requests‐bs4

定向爬虫：仅对输入URL进行爬取，不扩展爬取

程序的结构设计：

步骤1：从网络上获取大学排名网页内容 getHTMLText()

步骤2：提取网页内容中信息到合适的数据结构 fillUnivList()

步骤3：利用数据结构展示并输出结果 printUnivList()

输出结果如下：

源代码如下：

import requests from bs4 import BeautifulSoup import bs4 def getHTMLText(url): try: r = requests.get(url, timeout=30) r.raise_for_status() r.encoding = r.apparent_encoding return r.text except: return "" def fillUnivList(ulist, html): soup = BeautifulSoup(html, "html.parser") for tr in soup.find('tbody').children: #遍历表签树 if isinstance(tr, bs4.element.Tag): tds = tr('td') #简写，等价于下一行代码 #tds = tr.find_all('td') ulist.append([tds[0].string, tds[1].string, tds[2].string,tds[3].string]) def printUnivList(ulist, num): tplt = "{0:^10}\t{1:{4}^8}\t{2:6}\t{3:10}" print(tplt.format("排名","学校名称","城市","总分",chr(12288))) for i in range(num): u=ulist[i] print(tplt.format(u[0],u[1],u[2],u[3],chr(12288))) def main(): uinfo = [] url = 'http://www.zuihaodaxue.cn/zuihaodaxuepaiming2016.html' html = getHTMLText(url) fillUnivList(uinfo, html) printUnivList(uinfo, 20) # 20 univs main()

中文对齐问题的原因：

当中文字符宽度不够时，采用西文字符填充；中西文字符占用宽度不同。

解决方法：

【本文地址】

Python beautiful soup解析html获得数据

Python beautiful soup解析html获得数据

今日新闻

推荐新闻