Lianjia Nationwide Housing Price Analysis: Data Acquisition




I have been buried in papers lately and haven't coded for quite a while, so my scraping skills feel a bit rusty. Even when I'm not actively working on anything in this area, I think scraping something from time to time is important for keeping the skills fresh. So this time I plan to scrape Lianjia's housing price data, mainly to consolidate my crawler and Python skills, and then do some analysis on top of it.

Take Lianjia Guangzhou as an example and inspect the page structure; it looks like the figure below. The content elements are laid out very clearly and well categorized, and everything we want is right there. Lianjia is also fairly tolerant of crawlers: it doesn't ban IPs and doesn't require login, which makes it good practice material (but please show some restraint; their servers are not a bottomless pit).

My environment: Python 3.6, Jupyter Notebook.

The crawler has two main parts: a download module and a parsing module.

Download module

When writing crawlers before, I noticed that the download module's code is highly repetitive. Whatever the target URL, there are roughly three things to deal with:

1. User-Agent, used to imitate a browser. A crawler is essentially a downloader, and adding some browser identification to the request makes the server believe it came from a real browser.
2. IP proxies. Most anti-crawler strategies limit scraping by blocking IP addresses: when the same IP visits too frequently within a short time, it is treated as a crawler and gets a 403 Forbidden response. Generally speaking, free IP proxies are terrible, either painfully slow or simply dead; there is no free lunch, so use a paid proxy if you need one.
3. Cookies. For pages that require login to view (Weibo, Douban, etc.), you need to take the cookie from a previous successful login in your browser and send it along with the request in order to get through.

I have wrapped this common code into a class (that code is here), but we don't need most of it this time, so I only extracted part of it; a short sketch of how the proxy and cookie pieces would be wired in follows the download code below.

import requests
from lxml import etree
import random
import json
import pandas as pd
from pandas.io.json import json_normalize
import math
import re

# Pick a random User-Agent to attach to each request
def getUserAgent():
    UA_list = [
        "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36",
        "Mozilla/5.0 (iPhone; CPU iPhone OS 7_1_2 like Mac OS X) AppleWebKit/537.51.2 (KHTML, like Gecko) Version/7.0 Mobile/11D257 Safari/9537.53",
        "Mozilla/5.0 (Windows; U; Windows NT 5.2) AppleWebKit/525.13 (KHTML, like Gecko) Chrome/0.2.149.27 Safari/525.13",
        "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1) ; QIHU 360EE)",
        "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1) ; Maxthon/3.0)",
        "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50",
        "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50",
        "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:51.0) Gecko/20100101 Firefox/51.0",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
        "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24"]
    return random.choice(UA_list)

# Fetch an HTML page with requests and return its text
def getHTML(url):
    global invalid_ip_count  # left over from the fuller downloader class; unused here
    headers = {
        'User-Agent': getUserAgent(),
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    }
    try:
        web_data = requests.get(url, headers=headers, timeout=20)  # 20-second timeout
        status_code = web_data.status_code
        retry_count = 0
        # Retry on non-200 responses (the retry limit is assumed; the original value was cut off in the source)
        while str(status_code) != '200' and retry_count < 5:
            web_data = requests.get(url, headers=headers, timeout=20)
            status_code = web_data.status_code
            retry_count += 1
        web_data.encoding = 'utf-8'
        return web_data.text
    except Exception as e:
        print(e)
        return None
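Since Lianjia needs neither proxies nor cookies, that handling stays out of the code above. Purely for reference, here is a minimal sketch of how the two would plug into the same kind of request, reusing getUserAgent from above; the proxy address and the cookie string are placeholders I made up, not working values.

# Minimal sketch (not part of the crawler below): attaching an IP proxy and a login cookie to a request.
# The proxy endpoint and the cookie value are placeholders.
def getHTML_with_proxy_and_cookie(url):
    headers = {
        'User-Agent': getUserAgent(),
        'Cookie': 'your_logged_in_cookie_copied_from_the_browser',  # needed only for login-gated sites
    }
    proxies = {
        'http': 'http://123.45.67.89:8888',   # placeholder paid-proxy endpoint
        'https': 'http://123.45.67.89:8888',
    }
    try:
        web_data = requests.get(url, headers=headers, proxies=proxies, timeout=20)
        web_data.encoding = 'utf-8'
        return web_data.text
    except Exception as e:
        print(e)
        return None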
Parsing module

The flow for each city is: enter that city's listings homepage -> get the maximum page number -> scrape every page's listing names, links, house info and so on -> assemble a dataframe and save it to a file.

The code is as follows:

def get_ershoufang(city):
    # city is a (name, url) pair, e.g. ('北京', 'https://bj.lianjia.com/ershoufang/')
    print('getting city:', city[0])
    print('url: ', city[1])
    # One list per output column
    names = []
    links = []
    positions = []
    houseInfo = []
    floor = []
    age = []
    district = []
    concern = []
    tags = []
    total_price = []
    unit_price = []
    city_link = city[1]
    html = getHTML(city_link)
    selector = etree.HTML(html)
    # The pagination div's page-data attribute holds the total page count
    try:
        maxpage_str = selector.xpath('//div[@class = "page-box house-lst-page-box"]/@page-data')
        maxpage = maxpage_str[0].split(',')[0].split(':')[1]
        maxpage = int(maxpage)
    except:
        maxpage = 1
    print('max page is : ', maxpage)
    for page in range(1, maxpage + 1):  # +1 so the last page (and the maxpage == 1 case) is included
        print('fetching page ', page, '...')
        link = city_link + 'pg' + str(page) + '/'
        html = getHTML(link)
        selector = etree.HTML(html)
        # Each listing on the page is an li inside the sellListContent list
        lis = selector.xpath('//ul[@class = "sellListContent"]//li[@class = "clear LOGCLICKDATA"]')
        for li in lis:
            names.append(li.xpath('./div[@class="info clear"]/div[@class="title"]/a/text()')[0])
            links.append(li.xpath('./div[@class="info clear"]/div[@class="title"]/a/@href')[0])
            positions.append(li.xpath('./div[@class="info clear"]/div[@class="address"]/div[@class="houseInfo"]/a/text()')[0])
            try:
                houseInfo.append(' '.join(li.xpath('./div[@class="info clear"]/div[@class="address"]/div[@class="houseInfo"]//text()'))[1:-1])
            except:
                houseInfo.append('None')
            try:
                floor.append(li.xpath('./div[@class="info clear"]/div[@class="flood"]/div[@class="positionInfo"]//text()')[0])
            except:
                floor.append('None')
            try:
                age.append(li.xpath('./div[@class="info clear"]/div[@class="flood"]/div[@class="positionInfo"]//text()')[2])
            except:
                age.append('None')
            try:
                district.append(li.xpath('./div[@class="info clear"]/div[@class="flood"]/div[@class="positionInfo"]//text()')[-1])
            except:
                district.append('None')
            try:
                concern.append(li.xpath('./div[@class="info clear"]/div[@class="followInfo"]//text()')[0])
            except:
                concern.append('None')
            try:
                tags.append('_'.join(li.xpath('./div[@class="info clear"]/div[@class="followInfo"]/div[@class="tag"]//text()')))
            except:
                tags.append('None')
            # The Beijing page nests priceInfo under followInfo, hence the special case
            try:
                if city[0] == '北京二手房':
                    total_price.append(' '.join(li.xpath('./div[@class="info clear"]/div[@class="followInfo"]/div[@class="priceInfo"]/div[@class="totalPrice"]//text()')))
                else:
                    total_price.append(' '.join(li.xpath('./div[@class="info clear"]/div[@class="priceInfo"]/div[@class="totalPrice"]//text()')))
            except:
                total_price.append('None')
            try:
                if city[0] == '北京二手房':
                    unit_price.append(' '.join(li.xpath('./div[@class="info clear"]/div[@class="followInfo"]/div[@class="priceInfo"]/div[@class="unitPrice"]//text()')))
                else:
                    unit_price.append(li.xpath('./div[@class="info clear"]/div[@class="priceInfo"]/div[@class="unitPrice"]//text()')[0])
            except:
                unit_price.append('None')
    # Assemble everything into a dataframe and write one CSV per city
    df_data = {'name': names, 'link': links, 'position': positions, 'house_info': houseInfo,
               'floor': floor, 'age': age, 'district': district, 'concern': concern,
               'tags': tags, 'total_price': total_price, 'unit_price': unit_price}
    df = pd.DataFrame(df_data)
    df.to_csv('./ershoufang/' + city[0] + '.csv', index=False, encoding='utf-8')
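A side note on the max-page extraction above: the page-data attribute on the pagination div holds a small JSON string (something like {"totalPage":100,"curPage":1}; the numbers here are illustrative). The string splitting works, but parsing it as JSON is a little sturdier. A quick sketch:

import json

# Sample value for illustration; the real string comes from the pagination div's page-data attribute.
page_data = '{"totalPage":100,"curPage":1}'
maxpage = json.loads(page_data)['totalPage']
print(maxpage)  # 100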

The code that collects the second-hand listing links for every city nationwide and crawls them:

# Get every city nationwide from the city list on the Guangzhou page
url = 'https://gz.lianjia.com/ershoufang/'
html = getHTML(url)
selector = etree.HTML(html)
city_name = selector.xpath('//div[@class="link-list"]/div[1]/dd//a/text()')
city_links = selector.xpath('//div[@class="link-list"]/div[1]/dd//a/@href')
cities = list(zip(city_name, city_links))  # e.g. ('北京', 'https://bj.lianjia.com/ershoufang/')
for city in cities:
    get_ershoufang(city)

Since there is quite a lot of data, this part takes noticeably longer to run; a bit over an hour, I think? It was interrupted partway through, so I don't remember the total time. The results look like the figure below:
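Each city ends up as its own CSV under ./ershoufang/. As a small usage sketch for the follow-up analysis, the files can be stitched back into a single dataframe like this (the glob pattern and the added city column are my own choices, not part of the crawler above):

import glob
import os
import pandas as pd

# Read every per-city CSV written by get_ershoufang and concatenate them,
# tagging each row with the city name taken from the file name.
frames = []
for path in glob.glob('./ershoufang/*.csv'):
    df = pd.read_csv(path)
    df['city'] = os.path.splitext(os.path.basename(path))[0]
    frames.append(df)
all_data = pd.concat(frames, ignore_index=True)
print(all_data.shape)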

That is the crawler part; the data analysis part is in Lianjia Nationwide Housing Price Analysis: Data Analysis and Visualization.


