Scraping Baidu Tieba images and basic post info with Python


This post shares how to scrape images and basic post information from Baidu Tieba with Python. The techniques involved aren't especially advanced, but beyond organizing my own notes I'm glad to share them with fellow learners, and I welcome corrections or better approaches from more experienced readers; staying humble and eager to learn is my motto!

The goal:

Enter the name of the tieba (forum) you want to scrape, plus a start page and an end page, and the program automatically crawls all the images and basic post info within that page range.

The complete code is at the end! First, let's analyze the URL pattern across different pages, using the 周杰伦 (Jay Chou) tieba as an example:

Page 1: https://tieba.baidu.com/f?kw=周杰伦&ie=utf-8&pn=0

Page 2: https://tieba.baidu.com/f?kw=周杰伦&ie=utf-8&pn=50

Page 3: https://tieba.baidu.com/f?kw=周杰伦&ie=utf-8&pn=100

(Note: you have to actually flip through the pages before the address bar shows the URL in this form.)

Notice that the only part of the URL that changes from page to page is the trailing pn= parameter, so we can reach any page just by changing pn; likewise, kw= holds the tieba name.

You might wonder why the URLs above were transcribed from address-bar screenshots rather than copied directly; that's because copying the URL out of the browser gives this instead:

https://tieba.baidu.com/f?kw=%E5%91%A8%E6%9D%B0%E4%BC%A6&ie=utf-8&pn=0

Here kw= is not 周杰伦 (the tieba name) but an encoded string, so we need to URL-encode the tieba name ourselves.

The URL rules are:

pn = (page number - 1) * 50

kw = the URL-encoded tieba name

Dropping &ie=utf-8 from the URL changes nothing, so from here on we'll assume every URL has &ie=utf-8 removed.
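As a quick sanity check, here is a minimal sketch of building a page URL by these rules with urllib.parse (the name 周杰伦 and page 3 are just example inputs):

from urllib import parse

def build_page_url(name, page):
    '''Build the tieba list URL for a forum name and a 1-based page number.'''
    kw = parse.urlencode({'kw': name})  # percent-encodes the name, e.g. kw=%E5%91%A8...
    pn = (page - 1) * 50                # the page offset steps by 50
    return 'https://tieba.baidu.com/f?' + kw + '&pn=' + str(pn)

print(build_page_url('周杰伦', 3))
# -> https://tieba.baidu.com/f?kw=%E5%91%A8%E6%9D%B0%E4%BC%A6&pn=100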

 

def tiebaSpider(url):
    name = input('Enter the tieba name: ')
    beginPage = int(input('Enter the start page: '))
    endPage = int(input('Enter the end page: '))
    kw = {'kw': name}
    ret = parse.urlencode(kw)  # URL-encode the tieba name
    print(ret)
    url = url + ret + '&pn='
    for page in range(beginPage, endPage + 1):
        pn = (page - 1) * 50   # each page is offset by 50 posts
        fullurl = url + str(pn)
        print(fullurl)
        html = loadPage(fullurl)
        filename = '%s page %s.html' % (name, page)
        writePage(html, filename)
        tiebaInfo(html)

The url passed in here is the bare tieba base URL: https://tieba.baidu.com/f?

ret is the URL-encoded form of the tieba name.

def loadPage(url):
    # headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36'}
    # req = request.Request(url, headers=headers)  # build a request object carrying the headers
    response = request.urlopen(url)  # send the request and get the response object
    html = response.read()           # read the response body (bytes)
    return html

def writePage(html, filename):
    html = html.decode('utf-8')
    with open(filename, 'w', encoding='utf-8') as f:
        f.write(html)
    print('Saving %s ...' % filename)

loadPage fetches the page's HTML content, and writePage saves it to a local .html file.
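The commented-out lines in loadPage show where a browser User-Agent would go; if Baidu starts rejecting the bare urlopen calls, a minimal variant that actually sends that header (the header string is the one from the comment above) would look like this:

def loadPage(url):
    '''Fetch a URL while sending a browser-like User-Agent.'''
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                             'AppleWebKit/537.36 (KHTML, like Gecko) '
                             'Chrome/75.0.3770.142 Safari/537.36'}
    req = request.Request(url, headers=headers)  # attach the header to the request
    response = request.urlopen(req)              # send the request
    return response.read()                       # return the raw response bytes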

 

def tiebaInfo(html):
    # Parse the HTML document
    content = etree.HTML(html)
    print(content)
    # Match the data fields with XPath rules
    title_list = content.xpath("//div[@class='t_con cleafix']/div/div/div/a/text()")
    link_list = content.xpath("//div[@class='t_con cleafix']/div/div/div/a/@href")
    replies_list = content.xpath("//div[@class='t_con cleafix']/div/span/text()")
    writer_list = content.xpath("//div[@class='t_con cleafix']/div[2]/div[1]/div[2]/span[1]/@title")
    introduce_list = content.xpath("//div[@class='t_con cleafix']/div[2]/div[2]/div/div/text()")
    lastResponer_list = content.xpath("//div[@class='t_con cleafix']/div[2]/div[2]/div[2]/span[1]/@title")
    lastResponTime_list = content.xpath("//div[@class='t_con cleafix']/div[2]/div[2]/div[2]/span[2]/text()")
    for title, link, replies, writer, introduce, lastResponer, lastResponTime in zip(
            title_list, link_list, replies_list, writer_list,
            introduce_list, lastResponer_list, lastResponTime_list):
        fulllink = 'https://tieba.baidu.com' + link
        info = (' Title: %s\n Link: %s\n Replies: %s\n Author: %s\n'
                ' Last reply by: %s\n Last reply time: %s\n Summary: %s\n'
                % (title, fulllink, replies, writer, lastResponer, lastResponTime, introduce))
        print(info)
        loadImage(fulllink)
        filename = 'tiebaInfo'
        writeInfo(info, filename)

This takes the HTML we just fetched, parses it, and uses XPath rules to match each field: the title, the thread link (from which the images are fetched later), the reply count, the author, the last replier, the last reply time, and the summary.
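One caveat: zipping seven separate lists silently misaligns everything if a single field is missing from one post. A hedged alternative sketch (the relative paths below just mirror the absolute ones above and are untested against Tieba's current markup) scopes the queries to one post row at a time:

def tiebaInfoPerRow(html):
    '''Sketch: extract fields row by row so one missing field cannot shift the rest.'''
    content = etree.HTML(html)
    for row in content.xpath("//div[@class='t_con cleafix']"):
        # Queries starting with './' stay inside this single post row
        titles = row.xpath("./div/div/div/a/text()")
        links = row.xpath("./div/div/div/a/@href")
        replies = row.xpath("./div/span/text()")
        title = titles[0] if titles else '(missing)'
        link = ('https://tieba.baidu.com' + links[0]) if links else '(missing)'
        print(title, link, replies[0] if replies else '0')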

 

loadImage() requests each thread and matches the image URLs inside it, writeInfo() appends the info above to a file, and writeImage() saves each image to disk.

def writeInfo(info, filename):
    with open(filename, 'a', encoding='utf-8') as f:
        f.write(info)

def loadImage(url):
    '''Match the image URLs inside a thread.'''
    html = loadPage(url)        # request the thread page
    content = etree.HTML(html)  # parse the HTML document
    imgUrl_list = content.xpath("//img[@class='BDE_Image']/@src")
    for imgUrl in imgUrl_list:
        print(imgUrl)
        writeImage(imgUrl)

def writeImage(url):
    '''Save an image to disk.'''
    img = loadPage(url)
    global i
    i += 1
    filename = str(i) + '.jpg'  # number the images sequentially
    with open('E:\\Pycharm\\workSpace\\day2\\image\\%s' % filename, 'wb') as f:
        f.write(img)
    print('Downloading image %s' % filename)
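The hard-coded E:\Pycharm\... path only exists on my machine; a sketch of a more portable writeImage (the directory name 'image' here is just an example) creates an output folder next to the script instead:

import os

IMAGE_DIR = 'image'  # example output directory, created on first use

def writeImage(url):
    '''Save an image into IMAGE_DIR, creating the directory if needed.'''
    img = loadPage(url)
    global i
    i += 1
    os.makedirs(IMAGE_DIR, exist_ok=True)         # no-op if it already exists
    path = os.path.join(IMAGE_DIR, '%d.jpg' % i)  # portable path joining
    with open(path, 'wb') as f:
        f.write(img)
    print('Downloading image %d.jpg' % i)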

 

The complete code:

from urllib import request, parse
from lxml import etree

i = 0  # global counter used to number the downloaded images

def loadPage(url):
    # headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36'}
    # req = request.Request(url, headers=headers)  # build a request object carrying the headers
    response = request.urlopen(url)  # send the request and get the response object
    html = response.read()           # read the response body (bytes)
    return html

def writePage(html, filename):
    html = html.decode('utf-8')
    with open(filename, 'w', encoding='utf-8') as f:
        f.write(html)
    print('Saving %s ...' % filename)

def tiebaSpider(url):
    name = input('Enter the tieba name: ')
    beginPage = int(input('Enter the start page: '))
    endPage = int(input('Enter the end page: '))
    kw = {'kw': name}
    ret = parse.urlencode(kw)  # URL-encode the tieba name
    print(ret)
    url = url + ret + '&pn='
    for page in range(beginPage, endPage + 1):
        pn = (page - 1) * 50
        fullurl = url + str(pn)
        print(fullurl)
        html = loadPage(fullurl)
        filename = '%s page %s.html' % (name, page)
        writePage(html, filename)
        tiebaInfo(html)

def writeInfo(info, filename):
    with open(filename, 'a', encoding='utf-8') as f:
        f.write(info)

def loadImage(url):
    '''Match the image URLs inside a thread.'''
    html = loadPage(url)
    content = etree.HTML(html)
    imgUrl_list = content.xpath("//img[@class='BDE_Image']/@src")
    for imgUrl in imgUrl_list:
        print(imgUrl)
        writeImage(imgUrl)

def writeImage(url):
    '''Save an image to disk.'''
    img = loadPage(url)
    global i
    i += 1
    filename = str(i) + '.jpg'
    with open('E:\\Pycharm\\workSpace\\day2\\image\\%s' % filename, 'wb') as f:
        f.write(img)
    print('Downloading image %s' % filename)

def tiebaInfo(html):
    # Parse the HTML document
    content = etree.HTML(html)
    print(content)
    # Match the data fields with XPath rules
    title_list = content.xpath("//div[@class='t_con cleafix']/div/div/div/a/text()")
    link_list = content.xpath("//div[@class='t_con cleafix']/div/div/div/a/@href")
    replies_list = content.xpath("//div[@class='t_con cleafix']/div/span/text()")
    writer_list = content.xpath("//div[@class='t_con cleafix']/div[2]/div[1]/div[2]/span[1]/@title")
    introduce_list = content.xpath("//div[@class='t_con cleafix']/div[2]/div[2]/div/div/text()")
    lastResponer_list = content.xpath("//div[@class='t_con cleafix']/div[2]/div[2]/div[2]/span[1]/@title")
    lastResponTime_list = content.xpath("//div[@class='t_con cleafix']/div[2]/div[2]/div[2]/span[2]/text()")
    for title, link, replies, writer, introduce, lastResponer, lastResponTime in zip(
            title_list, link_list, replies_list, writer_list,
            introduce_list, lastResponer_list, lastResponTime_list):
        fulllink = 'https://tieba.baidu.com' + link
        info = (' Title: %s\n Link: %s\n Replies: %s\n Author: %s\n'
                ' Last reply by: %s\n Last reply time: %s\n Summary: %s\n'
                % (title, fulllink, replies, writer, lastResponer, lastResponTime, introduce))
        print(info)
        loadImage(fulllink)
        filename = 'tiebaInfo'
        writeInfo(info, filename)

if __name__ == '__main__':
    url = 'https://tieba.baidu.com/f?'
    tiebaSpider(url)

 

Now let's test it: crawling pages 2 through 5 of the 周杰伦 tieba.

Run output: (screenshot)

Downloaded images: (screenshot)

The tiebaInfo file: (screenshot)

The saved HTML pages: (screenshot)

At this point, we can crawl the posts and images from any specified page range of any specified tieba.

 

If anything here can still be optimized or improved, please let me know; I'll gladly learn from your suggestions.

This is my first write-up, so please forgive any rough spots!

 


