Scraping data from China's Air Quality Online Monitoring and Analysis Platform with Python [Updated]


2024-06-12 03:12 | Source: compiled from the web | Views: 265

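This post shows how to scrape air pollutant concentration data for cities such as Beijing, with complete code included, for anyone whose research is stuck for lack of data.

Update, January 12, 2021: After reading many of your comments, I found that my code had been "sanctioned" by the site (anti-scraping measures were added). I was busy with autumn recruitment and my graduation project for a while and had no time to work around them, but after an afternoon of tinkering I finally got it working again. The output of the run I just finished looks like this:

[Image: output of the scraper run]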

Here is a screenshot of my local database, which I just updated:

[Image: screenshot of the local database]

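Now, back to the topic. The data you need should look like this:

[Image: sample of the data table]

This site has the air pollutant data we need, covering January 2014 up to the latest date. Perfect! So how do we get this data with minimal effort? (In fact, I only succeeded on my third attempt at scraping this site >_<) [...]

The anti-detection setup ends with the script injection below; the full browser.execute_cdp_cmd call appears in the complete code at the end of this post.

    'source': '''Object.defineProperty(navigator, 'webdriver', { get: () => false })''' })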

If you are interested in the details, see https://zhuanlan.zhihu.com/p/191033198

Next, let's walk through the code.

Step 1:

    def get_date(url):
        """Return the months listed on a city's page as 'YYYY-MM' strings."""
        dates = []
        try:
            response = requests.get(url)
            if response.status_code == 200:
                soup = BeautifulSoup(response.text, 'lxml')
                for li in soup.find_all('li'):
                    if li.a:  # skip <li> items that contain no link
                        text = li.a.text  # text of the <a> inside the <li>
                        # digits, one separator character, digits: year and month
                        match = re.match('([0-9]+)[^0-9]([0-9]+)', text)
                        if match:  # drop entries that are not dates
                            year, month = match.groups()
                            dates.append('-'.join([year, month]))
        except requests.RequestException:
            print('Failed to fetch the data!')
        return dates  # always a list, empty if nothing could be fetched
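The date extraction in get_date can be tried out without touching the network. The sketch below is only an illustration: the helper name `month_label_to_key` is hypothetical, and the label format (a four-digit year, one separator character, then the month, as in '2019年12月') is an assumption inferred from the comment in the original code. It also zero-pads the month so the result matches the `month=2020-07` form used in the site's URLs.

```python
import re


def month_label_to_key(label):
    """Convert a label such as '2019年12月' to '2019-12'; return None if it is not a date."""
    # Four digits, exactly one non-digit separator, then one or two digits.
    match = re.match(r'([0-9]{4})[^0-9]([0-9]{1,2})', label)
    if match is None:
        return None
    year, month = match.groups()
    return f"{year}-{int(month):02d}"  # zero-pad the month: 7 -> 07


# Made-up link texts: two dates and two entries that should be ignored.
labels = ['2019年12月', '2020年7月', 'PM2.5', '首页']
keys = [key for key in map(month_label_to_key, labels) if key is not None]
print(keys)
```

Anchoring the pattern at the start of the label is what keeps navigation links such as 'PM2.5' from being mistaken for dates.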

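The get_date function returns every month currently available for a given city. Its result looks like this:

[Image: list of month strings returned by get_date]

The dates use this format because of how the site's URLs are built:

https://www.aqistudy.cn/historydata/daydata.php?city=北京&month=2020-07

The URL breaks down into base_url + city + month, where base_url = 'https://www.aqistudy.cn/historydata/daydata.php?city=' and the month is appended as '&month=' followed by the date.

Step 2:

With the URL for each month in hand, the next step is to scrape the data. The function I want to share here, and the core of the whole script, is pandas.read_html(). It is built for extracting tabular data: it parses the HTML tables on a page and returns them as a list of DataFrames, and it has worked reliably for me. The code is as follows: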

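    def spider(url):
        """Load a page in the browser and return its first table as a DataFrame."""
        browser.get(url)
        # Wrap the HTML in StringIO (from io import StringIO): newer pandas
        # versions deprecate passing a raw HTML string to read_html
        df = pd.read_html(StringIO(browser.page_source), header=0)[0]  # first table on the page
        time.sleep(1)
        if not df.empty:
            # print(df)
            # df.to_csv('data.csv', mode='a', index=None)
            return df
        else:
            return spider(url)  # table not rendered yet: reload instead of moving to the next URL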
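One caveat with the recursive retry in spider: if a page never produces a non-empty table, the function keeps calling itself until Python's recursion limit is hit. A bounded retry loop is one possible alternative. The sketch below is not part of the original script; `fetch_with_retry`, `fake_fetch`, and the parameter names are hypothetical, and the fake fetcher with its dummy row stands in for the real browser call so the example runs anywhere.

```python
import time


def fetch_with_retry(fetch, url, is_empty=lambda result: not result,
                     max_attempts=5, delay=1.0):
    """Call fetch(url) until the result is non-empty or max_attempts is reached."""
    for _ in range(max_attempts):
        result = fetch(url)
        if not is_empty(result):
            return result
        time.sleep(delay)  # give the page time to finish rendering
    raise RuntimeError(f"no data from {url} after {max_attempts} attempts")


# Stand-in for the real scraper: returns nothing on the first two calls.
calls = {"n": 0}


def fake_fetch(url):
    calls["n"] += 1
    return [] if calls["n"] < 3 else [["2020-07-01", 45]]  # dummy row


rows = fetch_with_retry(fake_fetch, "https://example.invalid/page", delay=0)
print(calls["n"], rows)
```

With a real DataFrame, the emptiness check would be passed explicitly, for example `is_empty=lambda df: df.empty`, because a DataFrame cannot be used directly in a boolean test.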

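The code overall is simple. If you only need to fetch the data and save it to a CSV file, you are basically done at this point: just print the DataFrame or uncomment the to_csv line. Here, however, I need to store the data in a local database and update and maintain it on a schedule. (Doesn't this sound like a small project you could put on your résumé?) The main code is as follows: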

    sql = ('insert ignore into aqidata (DATE,AQI,GRADE,PM25,PM10,SO2,CO,NO2,O3_8h,CITY)'
           ' VALUES (%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)')

    for ct in range(len(city)):
        list_data = []  # rows collected for the current city
        for date in dates:
            url = base_url + city[ct] + '&month=' + date
            df = spider(url)
            time.sleep(1)
            df['city'] = city[ct]  # add a city column
            list_data.extend(df.values.tolist())  # one list per table row

        for row in list_data:
            # Columns 0, 2 and 9 (date, grade, city) are text; the rest are numeric
            cursor.execute(sql, (row[0], float(row[1]), row[2], float(row[3]),
                                 float(row[4]), float(row[5]), float(row[6]),
                                 float(row[7]), float(row[8]), row[9]))
        conn.commit()

    cursor.close()   # close the cursor
    conn.close()     # close the connection
    browser.close()
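The loop above builds each URL by plain string concatenation, which works here because the browser percent-encodes the Chinese city name itself. As a small sketch of a more explicit alternative, the URL can be assembled with the standard library's `urlencode`, which performs the encoding up front. The helper name `build_day_url` and the constant `DAY_DATA_URL` are hypothetical; only the endpoint and the `city` and `month` parameters come from the post.

```python
from itertools import product
from urllib.parse import urlencode

DAY_DATA_URL = 'https://www.aqistudy.cn/historydata/daydata.php'


def build_day_url(city, month):
    """Return the daily-data URL for one city and one 'YYYY-MM' month."""
    # urlencode percent-encodes non-ASCII values as UTF-8.
    return DAY_DATA_URL + '?' + urlencode({'city': city, 'month': month})


cities = ['北京']
months = ['2020-06', '2020-07']
urls = [build_day_url(c, m) for c, m in product(cities, months)]
for u in urls:
    print(u)
```

The encoded city matches the `city=%E5%8C%97%E4%BA%AC` form that already appears in the month-list URL of the complete code.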

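The code above scrapes and stores the data for every city and every period available on the site; anything the site publishes can be fetched. With these three pieces of code you can pull the full dataset up to the latest date.

[Image: scraped data]

Note: if you want to store the data in a database, you must create the database and the table beforehand, otherwise the script will fail.

With this code, there is no need to pay for data.

Bonus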

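(1) Now that the data has been collected, how do you update and maintain it on a schedule? See the second half of this post of mine: [link]

(2) Now that we have pollutant concentration data, is there a way to get meteorological data as well? Yes. See this post of mine: [link]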

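That covers everything in this post. If anything in the code is unclear, feel free to leave a comment. If the post helped you, please like and bookmark it. If you need the full dataset, leave a comment or send me a private message (paid).

Complete code: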

    # coding=utf-8
    import re
    import time
    from io import StringIO

    import pandas as pd
    import pymysql
    import requests
    from bs4 import BeautifulSoup
    from selenium import webdriver


    def get_date(url):
        """Return the months listed on a city's page as 'YYYY-MM' strings."""
        dates = []
        try:
            response = requests.get(url)
            if response.status_code == 200:
                soup = BeautifulSoup(response.text, 'lxml')
                for li in soup.find_all('li'):
                    if li.a:  # skip <li> items that contain no link
                        text = li.a.text  # text of the <a> inside the <li>
                        # digits, one separator character, digits: year and month
                        match = re.match('([0-9]+)[^0-9]([0-9]+)', text)
                        if match:  # drop entries that are not dates
                            year, month = match.groups()
                            dates.append('-'.join([year, month]))
        except requests.RequestException:
            print('Failed to fetch the data!')
        return dates


    def spider(url):
        """Load a page in the browser and return its first table as a DataFrame."""
        browser.get(url)
        # StringIO avoids the pandas deprecation of raw HTML strings in read_html
        df = pd.read_html(StringIO(browser.page_source), header=0)[0]
        time.sleep(1.5)
        if not df.empty:
            # print(df)
            # df.to_csv('data.csv', mode='a', index=None)
            print(url + ' scraped')
            return df
        else:
            return spider(url)  # table not rendered yet: reload instead of moving on


    if __name__ == '__main__':
        url = 'https://www.aqistudy.cn/historydata/monthdata.php?city=%E5%8C%97%E4%BA%AC'
        base_url = 'https://www.aqistudy.cn/historydata/daydata.php?city='

        # Create the browser and hide the usual automation fingerprints
        option = webdriver.ChromeOptions()
        option.add_argument("start-maximized")
        option.add_argument("--disable-blink-features=AutomationControlled")
        option.add_experimental_option("excludeSwitches", ["enable-automation"])
        option.add_experimental_option("useAutomationExtension", False)
        browser = webdriver.Chrome(options=option)
        # The injected JavaScript must be a complete statement (closing "})" included)
        browser.execute_cdp_cmd("Page.addScriptToEvaluateOnNewDocument", {
            'source': '''Object.defineProperty(navigator, 'webdriver', { get: () => false })'''
        })

        city = ['北京']
        # Connect to the database (it must already exist)
        conn = pymysql.connect(host='localhost', user='root', database='weatherdata',
                               password='12345678', charset='utf8')
        cursor = conn.cursor()

        dates = get_date(url)[1:]  # skip the first entry
        print(dates)

        list_data = []
        for ct in range(len(city)):
            for date in dates:
                url = base_url + city[ct] + '&month=' + date
                df = spider(url)
                time.sleep(1.5)
                df['city'] = city[ct]  # add a city column
                list_data.extend(df.values.tolist())  # one list per table row

        sql = ('insert ignore into aqidata (DATE,AQI,GRADE,PM25,PM10,SO2,CO,NO2,O3_8h,CITY)'
               ' VALUES (%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)')
        for row in list_data:
            try:
                cursor.execute(sql, (row[0], float(row[1]), row[2], float(row[3]),
                                     float(row[4]), float(row[5]), float(row[6]),
                                     float(row[7]), float(row[8]), row[9]))
            except pymysql.err.IntegrityError:  # raised by pymysql, not sqlalchemy
                print('IntegrityError happened!')
        conn.commit()
        cursor.close()   # close the cursor
        conn.close()     # close the connection
        browser.close()

        aqidata = pd.DataFrame(list_data, columns=['date', 'AQI', 'grade', 'PM2.5', 'PM10',
                                                   'SO2', 'CO', 'NO2', 'O3_8h', 'city'])
        print('All data scraped!\n', aqidata)
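The script assumes that the aqidata table already exists, and its `insert ignore` statement only skips duplicates when the table has a primary or unique key to collide on. The post does not show the table definition, so the sketch below is a guess: the column names come from the insert statement, while the column types and the (DATE, CITY) primary key are assumptions. To keep the example runnable without a MySQL server, it uses Python's built-in sqlite3 module, where `INSERT OR IGNORE` plays the role of MySQL's `insert ignore`; the row values are dummy data.

```python
import sqlite3

# Column names follow the insert statement in the script; types and key are assumed.
CREATE_TABLE = """
CREATE TABLE IF NOT EXISTS aqidata (
    DATE  VARCHAR(10) NOT NULL,
    AQI   FLOAT,
    GRADE VARCHAR(20),
    PM25  FLOAT,
    PM10  FLOAT,
    SO2   FLOAT,
    CO    FLOAT,
    NO2   FLOAT,
    O3_8h FLOAT,
    CITY  VARCHAR(20) NOT NULL,
    PRIMARY KEY (DATE, CITY)
)
"""

conn = sqlite3.connect(':memory:')  # throwaway in-memory database
conn.execute(CREATE_TABLE)

insert_sql = 'INSERT OR IGNORE INTO aqidata VALUES (?,?,?,?,?,?,?,?,?,?)'
row = ('2020-07-01', 45.0, '优', 20.0, 40.0, 3.0, 0.6, 25.0, 90.0, '北京')  # dummy values
conn.execute(insert_sql, row)
conn.execute(insert_sql, row)  # same (DATE, CITY): skipped instead of duplicated
conn.commit()

count = conn.execute('SELECT COUNT(*) FROM aqidata').fetchone()[0]
print(count)
conn.close()
```

Keying on date plus city is what makes re-running the scraper safe: rows that were already stored are skipped and only new days are added. The DDL is written to be plain enough to adapt to MySQL, but it has not been run against the author's database.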
