Python Web Scraping Analysis Report

2024-04-24 00:29 | Source: compiled from the web

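This was an assignment for a Python course and my first attempt at web scraping. I took plenty of wrong turns and learned a lot along the way, so I am writing it down here.

1. Fetching the XuetangX partner institutions page

Requirements:

Scrape the computer science course listing on XuetangX and save each course's name, teachers, school, and enrollment count to a CSV file. Link: https://www.xuetangx.com/search?query=&org=&classify=1&type=&status=&page=1

1. Identifying the target

Opening the page and viewing its source shows none of the course data, which suggests the front end fetches it from the back end through an AJAX request. The browser's developer tools captured the following JSON data:

[Image: JSON data captured in the developer tools]

This is exactly the JSON data we need. To work out how to fetch and unpack it, I inspected the browser's request and converted its cURL command into a Python request: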

    import requests

    cookies = {
        'provider': 'xuetang',
        'django_language': 'zh',
    }

    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:82.0) Gecko/20100101 Firefox/82.0',
        'Accept': 'application/json, text/plain, */*',
        'Accept-Language': 'zh',
        'Content-Type': 'application/json',
        'django-language': 'zh',
        'xtbz': 'xt',
        'x-client': 'web',
        'Origin': 'https://www.xuetangx.com',
        'Connection': 'keep-alive',
        'Referer': 'https://www.xuetangx.com/search?query=&org=&classify=1&type=&status=&page=1',
        'Pragma': 'no-cache',
        'Cache-Control': 'no-cache',
        'TE': 'Trailers',
    }

    params = (
        ('page', '1'),
    )

    # JSON request body (the search filter sent by the page)
    data = '{"query":"","chief_org":[],"classify":["1"],"selling_type":[],"status":[],"appid":10000}'

    response = requests.post('https://www.xuetangx.com/api/v1/lms/get_product_list/',
                             headers=headers, params=params, cookies=cookies, data=data)

    #NB. Original query string below. It seems impossible to parse and
    #reproduce query strings 100% accurately so the one below is given
    #in case the reproduced version is not "correct".
    # response = requests.post('https://www.xuetangx.com/api/v1/lms/get_product_list/?page=1', headers=headers, cookies=cookies, data=data)

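The request was converted with the web page https://curl.trillworks.com/. In the browser's developer tools, open the Network tab (Chrome or Firefox), right-click a request, choose Copy → Copy as cURL, then paste the command into that page to get the equivalent Python requests code.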
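Once the request succeeds, the four fields the assignment asks for can be read out of the decoded JSON. The sketch below assumes the payload layout that the spider in the next section relies on (data, then org_list, with name, org.name, count, and a teacher list). The function `extract_rows` and the sample payload are hypothetical and made up for illustration; they are not a real server response.

```python
def extract_rows(payload):
    """Flatten a decoded response into (name, school, count, teachers) tuples."""
    rows = []
    for course in payload["data"]["org_list"]:
        # A course may have several teachers; join them into one field
        teachers = ",".join(t["name"] for t in course["teacher"])
        rows.append((course["name"], course["org"]["name"], course["count"], teachers))
    return rows


# Hand-made sample in the assumed shape (not real data from the site)
sample = {
    "data": {
        "org_list": [
            {
                "name": "Course A",
                "org": {"name": "School X"},
                "count": 123,
                "teacher": [{"name": "T1"}, {"name": "T2"}],
            }
        ]
    }
}

rows = extract_rows(sample)
print(rows)
```

With a live response from requests, `extract_rows(response.json())` would be the corresponding call, provided the real payload matches the assumed layout.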

2. Designing the spider

The data to collect is the course name, teachers, school, and enrollment count. items.py is designed as follows:

    # items.py
    import scrapy


    class XuetangItem(scrapy.Item):
        name = scrapy.Field()      # course name
        teachers = scrapy.Field()  # teacher names, comma-separated
        school = scrapy.Field()    # school offering the course
        count = scrapy.Field()     # enrollment count

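Next comes the main part, the spider.py file. Because the target is JSON data rather than a static HTML page, the spider needs a start_requests method to send the request. Building on the Python request analyzed earlier, the code is: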

    import json

    import scrapy

    from xuetang.items import XuetangItem


    class mySpider(scrapy.spiders.Spider):
        name = "xuetang"
        allowed_domains = ["www.xuetangx.com"]  # domain only, no trailing slash
        url = 'https://www.xuetangx.com/api/v1/lms/get_product_list/?page={}'
        # Request body taken from the analysis above
        data = '{"query":"","chief_org":[],"classify":["1"],"selling_type":[],"status":[],"appid":10000}'
        # Copied straight from the browser so the server does not reject
        # the request as coming from a non-browser client
        headers = {
            'Host': 'www.xuetangx.com',
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:82.0) Gecko/20100101 Firefox/82.0',
            'authority': 'www.xuetangx.com',
            'Accept': 'application/json,text/plain,*/*',
            'Accept-Language': 'zh',
            'Accept-Encoding': 'gzip, deflate, br',
            'django-language': 'zh',
            'xtbz': 'xt',
            'content-type': 'application/json',  # leaving this out may make the crawl fail
            'x-client': 'web',
            'Connection': 'keep-alive',
            'Referer': 'https://www.xuetangx.com/university/all',
            'Cookie': 'provider=xuetang; django_language=zh',
            'Pragma': 'no-cache',
            'Cache-Control': 'no-cache'
        }

        def start_requests(self):
            for page in range(1, 6):
                yield scrapy.Request(
                    url=self.url.format(page),
                    headers=self.headers,
                    method='POST',  # the browser sends POST, and the response headers allow only POST
                    body=self.data,
                    callback=self.parse
                )

        def parse(self, response):
            j = json.loads(response.body)
            for each in j['data']['org_list']:
                item = XuetangItem()
                item['name'] = each['name']
                item['school'] = each['org']['name']
                item['count'] = each['count']
                # Some courses have several teachers; collect them all into one record
                teacher_list = []
                for teacher in each['teacher']:
                    teacher_list.append(teacher['name'])
                # The key must match the field declared in items.py
                item['teachers'] = ','.join(teacher_list)
                yield item
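The request body in the spider is a hand-written JSON string, which is easy to mistype. One alternative is to build it from a dictionary with json.dumps. The sketch below does that with the same keys and values and also expands the page URLs; `build_body` and `page_url` are hypothetical helper names, not part of the original spider.

```python
import json

URL_TEMPLATE = "https://www.xuetangx.com/api/v1/lms/get_product_list/?page={}"


def build_body():
    """Serialize the search filter to the JSON string sent as the POST body."""
    payload = {
        "query": "",
        "chief_org": [],
        "classify": ["1"],
        "selling_type": [],
        "status": [],
        "appid": 10000,
    }
    return json.dumps(payload)


def page_url(page):
    """Fill the page number into the API URL."""
    return URL_TEMPLATE.format(page)


body = build_body()
urls = [page_url(p) for p in range(1, 6)]  # pages 1 to 5, as in start_requests
print(body)
print(urls[0])
```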

Then comes pipelines.py, which saves the scraped data to a CSV file:

    import csv


    class XuetangPipeline(object):
        def open_spider(self, spider):
            # newline='' lets the csv module control line endings itself
            self.file = open('data.csv', "w", encoding="utf-8", newline='')
            self.writer = csv.writer(self.file)

        def process_item(self, item, spider):
            # One row per course, in the order the fields were set in the spider
            self.writer.writerow(list(item.values()))
            return item

        def close_spider(self, spider):
            self.file.close()
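Scrapy only calls a pipeline that is enabled in the project's settings.py. A minimal sketch follows, assuming the project package is named xuetang (as the spider's `from xuetang.items import XuetangItem` suggests) and that the pipeline lives in its pipelines.py. The priority 300 is an arbitrary choice within Scrapy's customary 0 to 1000 range.

```python
# settings.py (fragment)
# Maps each pipeline's import path to its priority; lower numbers run first.
ITEM_PIPELINES = {
    "xuetang.pipelines.XuetangPipeline": 300,
}
```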

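With that in place the crawl is ready to run, provided ITEM_PIPELINES is also set in settings.py. The spider can be started from the command line, or by running a Python file that issues the command: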

    from scrapy import cmdline

    cmdline.execute("scrapy crawl xuetang".split())

3. Presenting the data

The saved CSV file looks like this. It contains exactly 50 rows; only the first few are shown here:

    C++语言程序设计基础,清华大学,424718,"郑莉,李超,徐明星"
    数据结构(上),清华大学,411298,邓俊辉
    数据结构（下）,清华大学,358804,邓俊辉
    ……

2. Fetching Lianjia second-hand housing listings

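Requirements:

Scrape second-hand housing data from the Lianjia website, https://bj.lianjia.com/ershoufang/. Cover four Beijing districts, Dongcheng, Xicheng, Haidian, and Chaoyang (5 pages per district), and save each listing's name, total price, floor area, and unit price to a JSON file.

1. Identifying the target

Opening the page and viewing its source shows that the listing data is already embedded in the HTML, so the page is rendered on the back end before it reaches the browser and can be scraped with XPath. Here is how the information for one listing is laid out: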

    槐柏树街南里南北通透两居室精装修必看好房
    槐柏树街南里 - 长椿街
    2室1厅 | 60.81平米 | 南北 | 精装 | 中楼层(共6层) | 1991年建 | 板楼
    226人关注 / 1个月以前发布
    近地铁 VR看装修 房本满两年 随时看房
    600万
    单价98668元/平米
    关注 加入对比

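The listing name sits in the a tag under the div with class="title". The floor area is stored in the div with class="houseInfo", though the string has to be split to isolate it. The unit price and total price are both in the div with class="priceInfo". Interestingly, some listings show no unit price, that is, the span is empty, but its parent div carries a data-price attribute whose value equals the unit price, so the spider reads that attribute instead.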
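The string handling described above can be sketched without any scraping library. The houseInfo text is taken from the listing shown earlier; `parse_area` and `format_unit_price` are hypothetical helper names used only for this illustration.

```python
def parse_area(house_info):
    """Return the floor-area field from a '|'-separated houseInfo string."""
    parts = [p.strip() for p in house_info.split("|")]
    return parts[1]  # the second field is the area, e.g. "60.81平米"


def format_unit_price(data_price):
    """Build the unit-price text from the value of the data-price attribute."""
    return data_price + "元/平米"


info = "2室1厅 | 60.81平米 | 南北 | 精装 | 中楼层(共6层) | 1991年建 | 板楼"
area = parse_area(info)
price = format_unit_price("98668")
print(area, price)
```

Reading data-price rather than the span text means listings with an empty span still yield a unit price.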

2. Designing the spider

The data to save is the listing name, floor area, total price, and unit price. items.py is as follows:

    import scrapy


    class YijiaItem(scrapy.Item):
        # define the fields for your item here like:
        name = scrapy.Field()    # listing name
        square = scrapy.Field()  # floor area
        price = scrapy.Field()   # unit price
        total = scrapy.Field()   # total price

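Looking at the pages to scrape, the site offers a district filter. After clicking "西城区" (Xicheng District) the URL becomes https://bj.lianjia.com/ershoufang/xicheng/, so the varying parts of the URL can be filled in with format. spider.py is as follows: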

    import scrapy

    from yijia.items import YijiaItem


    class mySpider(scrapy.spiders.Spider):
        name = 'lianjia'
        allowed_domains = ["bj.lianjia.com"]  # domain only, no trailing slash
        # First placeholder is the district, second is the page number
        url = "https://bj.lianjia.com/ershoufang/{}/pg{}/"
        # Headers copied from the browser
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:82.0) Gecko/20100101 Firefox/82.0',
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
            'Accept-Language': 'zh-CN,zh;q=0.8,zh-TW;q=0.7,zh-HK;q=0.5,en-US;q=0.3,en;q=0.2',
            'Connection': 'keep-alive',
            'Upgrade-Insecure-Requests': '1',
            'Pragma': 'no-cache',
            'Cache-Control': 'no-cache',
        }

        def start_requests(self):
            positions = ["dongcheng", "xicheng", "chaoyang", "haidian"]
            for position in positions:
                for page in range(1, 6):
                    yield scrapy.Request(
                        url=self.url.format(position, page),
                        method="GET",
                        headers=self.headers,
                        callback=self.parse
                    )

        def parse(self, response):
            for each in response.xpath("/html/body/div[4]/div[1]/ul/li"):
                item = YijiaItem()
                item['name'] = each.xpath("div[1]/div[1]/a/text()").extract()[0]
                # houseInfo is '|'-separated; the second field is the floor area
                house_info = each.xpath("div[1]/div[3]/div[1]/text()").extract()[0].split('|')
                item['square'] = house_info[1].strip()
                item['total'] = each.xpath("div[1]/div[6]/div[1]/span/text()").extract()[0] + "万元"
                # Read data-price because the visible unit-price span is sometimes empty
                item['price'] = each.xpath("div[1]/div[6]/div[2]/@data-price").extract()[0] + "元/平米"
                yield item
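To see what the URL template expands to, the nested loop in start_requests can be reproduced on its own. This is a small sketch; `listing_urls` is a hypothetical helper, and the district slugs are the pinyin names as they appear in the site's URLs.

```python
URL_TEMPLATE = "https://bj.lianjia.com/ershoufang/{}/pg{}/"
DISTRICTS = ["dongcheng", "xicheng", "chaoyang", "haidian"]


def listing_urls(districts=DISTRICTS, pages=5):
    """Return one URL per (district, page) pair, districts in the given order."""
    return [
        URL_TEMPLATE.format(district, page)
        for district in districts
        for page in range(1, pages + 1)
    ]


urls = listing_urls()
print(len(urls))
print(urls[0])
```

Four districts at five pages each gives 20 requests in total.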

Next is the pipeline file, which saves the content to a single JSON file:

    import json


    class YijiaPipeline(object):
        def open_spider(self, spider):
            # Collect every item here and write them all out when the spider closes
            self.dict_data = {'data': []}
            self.file = open('data.json', "w", encoding="utf-8")

        def process_item(self, item, spider):
            dict_item = dict(item)
            self.dict_data['data'].append(dict_item)
            return item

        def close_spider(self, spider):
            # ensure_ascii=False keeps the Chinese text readable in the file
            self.file.write(json.dumps(self.dict_data, ensure_ascii=False,
                                       indent=4, separators=(',', ':')))
            self.file.close()
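The pipeline passes ensure_ascii=False to json.dumps. The short comparison below shows why that matters for Chinese text; the record is a made-up example in the same shape as the scraped items.

```python
import json

# Made-up record for illustration, not scraped data
record = {"name": "示例房源", "square": "60.81平米"}

escaped = json.dumps(record)                       # default: non-ASCII becomes \uXXXX escapes
readable = json.dumps(record, ensure_ascii=False)  # keeps the characters as written

print(escaped)
print(readable)
```

Both strings decode to the same data; only the readability of the file differs.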

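Finally, run the crawl the same way as in the previous example.

3. Presenting the data

The saved JSON file looks like this; the first two entries are shown: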

    {
        "data":[
            {
                "name":"此房南北通透格局，采光视野无遮挡，交通便利",
                "square":"106.5平米",
                "total":"1136万元",
                "price":"106667元/平米"
            },
            {
                "name":"新安南里南北通透 2层本房满五年唯一",
                "square":"55.08平米",
                "total":"565万元",
                "price":"102579元/平米"
            }
            /* remaining N entries omitted */
        ]
    }
