Python Crawler Series (3): Extracting JSON Data


In this installment, we take a JSON endpoint from the Xiaoheihe (小黑盒) web site (before its redesign) as an example and walk through how to extract a JSON response and save it to an Excel file. Saving to a database works much the same way; only the output format differs.

We will skip over how this link was found. The link is: https://api.xiaoheihe.cn/bbs/web/link/list?limit=20&offset=0&topic_id=55058&heybox_id=17864741&sort_filter=reply&type_filter=all&os_type=web&version=999.0.0&hkey=a3af1bddb204cadd9420b0869ee0cc90&_time=1567065575
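
As an aside, instead of hand-assembling the query string, the same request can be built with the params argument of requests.get. This is only a sketch: the parameter values are copied from the URL above, and hkey and _time look session-specific, so they may no longer be accepted by the API.

import requests

base_url = 'https://api.xiaoheihe.cn/bbs/web/link/list'
params = {                    # values copied from the URL above
    'limit': 20,
    'offset': 0,
    'topic_id': 55058,
    'heybox_id': 17864741,
    'sort_filter': 'reply',
    'type_filter': 'all',
    'os_type': 'web',
    'version': '999.0.0',
    'hkey': 'a3af1bddb204cadd9420b0869ee0cc90',  # likely session-bound
    '_time': 1567065575,                          # likely a stale timestamp
}
response = requests.get(base_url, params=params)  # requests encodes the query string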

First, as usual, set up the headers and the url:

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36'}
url = 'https://api.xiaoheihe.cn/bbs/web/link/list?limit=20&offset=0&topic_id=55058&heybox_id=17864741&sort_filter=reply&type_filter=all&os_type=web&version=999.0.0&hkey=a3af1bddb204cadd9420b0869ee0cc90&_time=1567065575'
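
Before writing any extraction code, it is worth fetching one page and inspecting the structure of the response. A minimal sketch (the result/links field names are taken from the extraction code further down):

import json
import requests

response = requests.get(url, headers=headers)
json_page = json.loads(response.text)          # equivalently: response.json()
print(json_page.keys())                        # expect a 'result' key
print(json_page['result']['links'][0].keys())  # fields of the first post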

Next, we look for a pattern in the URL and find that the parameter controlling pagination is offset, which gives us:

for j in range(0, 141):
    url = 'https://api.xiaoheihe.cn/bbs/web/link/list?limit=20&offset={0}&topic_id=55058&heybox_id=17864741&sort_filter=reply&type_filter=all&os_type=web&version=999.0.0&hkey=a3af1bddb204cadd9420b0869ee0cc90&_time=1567065575'.format(j*20)
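
The same pagination can also be expressed with the params dict from the earlier sketch: only offset changes between pages (141 pages of 20 posts each, as above).

for j in range(0, 141):
    params['offset'] = j * 20   # page j starts at item j*20
    response = requests.get(base_url, headers=headers, params=params)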

This way we can control the page turning ourselves. Then, following the structure of the JSON, we save the data to Excel:

dataes = []
analyse = ['标题', '点击量', '点赞量', '评论量', '内容']  # column headers: title, clicks, upvotes, comments, description
dataes.append(analyse)
# json_page is the parsed response for the current page (see the full code below)
for i in range(0, 18):  # take the first 18 posts on each page
    sentences = []
    sentences.append(json_page['result']['links'][i]['title'])
    sentences.append(json_page['result']['links'][i]['click'])
    sentences.append(json_page['result']['links'][i]['up'])
    sentences.append(json_page['result']['links'][i]['comment_num'])
    sentences.append(json_page['result']['links'][i]['description'])
    dataes.append(sentences)
    print(sentences)
print('第{}页'.format(j))  # progress: current page number

workbook = xlsxwriter.Workbook('loldata2.xlsx')
worksheet = workbook.add_worksheet()
for row in range(len(dataes)):  # was a hard-coded range(0, 2450), which silently drops rows
    worksheet.write_row('A' + str(row + 1), dataes[row])
workbook.close()
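
As noted at the start, other storage targets work much the same way. For example, the same rows can be dumped as CSV with the standard library, which is often the easier intermediate format when the final destination is a database. A sketch (loldata2.csv is a hypothetical filename; utf-8-sig lets Excel display the Chinese headers correctly):

import csv

with open('loldata2.csv', 'w', newline='', encoding='utf-8-sig') as f:
    csv.writer(f).writerows(dataes)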

Next comes the complete code:

#-*- coding: utf-8 -*-
import requests
import time
import re
import json
import xlsxwriter

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36'}
h = 'https://api.xiaoheihe.cn'

# Fetch every page of the post list and save each post's
# title/clicks/upvotes/comments/description to an Excel file.
def get_selist():
    dataes = []
    analyse = ['标题', '点击量', '点赞量', '评论量', '内容']  # title, clicks, upvotes, comments, description
    dataes.append(analyse)
    for j in range(0, 141):
        url = 'https://api.xiaoheihe.cn/bbs/web/link/list?limit=20&offset={0}&topic_id=55058&heybox_id=17864741&sort_filter=reply&type_filter=all&os_type=web&version=999.0.0&hkey=a3af1bddb204cadd9420b0869ee0cc90&_time=1567065575'.format(j*20)
        sentences = []
        response = requests.get(url, headers=headers)  # fetch one page of the post list
        json_page = json.loads(response.text)
        for i in range(0, 18):  # take the first 18 posts on each page
            sentences.append(json_page['result']['links'][i]['title'])
            sentences.append(json_page['result']['links'][i]['click'])
            sentences.append(json_page['result']['links'][i]['up'])
            sentences.append(json_page['result']['links'][i]['comment_num'])
            sentences.append(json_page['result']['links'][i]['description'])
            dataes.append(sentences)
            print(sentences)
            sentences = []
        print('第{}页'.format(j))  # progress: current page number
    workbook = xlsxwriter.Workbook('loldata2.xlsx')
    worksheet = workbook.add_worksheet()
    for row in range(len(dataes)):  # was a hard-coded range(0, 2450), which silently drops rows
        worksheet.write_row('A' + str(row + 1), dataes[row])
    workbook.close()

if __name__ == '__main__':
    get_selist()
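
One practical note: the script imports time but never uses it. If the API starts refusing rapid-fire requests, the usual remedy is a short pause inside the page loop. A minimal sketch (the one-second delay is an arbitrary choice, not something the API documents):

        # inside the for j loop in get_selist(), right after the request:
        response = requests.get(url, headers=headers)
        json_page = json.loads(response.text)
        time.sleep(1)  # arbitrary one-second pause between pages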

With that, the post data from Xiaoheihe has been crawled in full and saved.


