Tokyo 2020 Olympic Games medal rankings
Scraping the data, part 1
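1. Data source: https://2020.cctv.com/medal_list/index.shtml
The data is the table shown in the image below.

2.2 Get the full page source

    from selenium import webdriver
    import lxml.html
    import csv

    # Launch Chrome automatically and fetch the rendered page source.
    # Note: this is the Selenium 3 call style; Selenium 4 passes the driver
    # path through a Service object instead.
    driver = webdriver.Chrome('D:\\数据分析\\chromedriver_win32\\chromedriver.exe')
    driver.get('https://2020.cctv.com/medal_list/index.shtml')
    content = driver.page_source
    driver.quit()  # close Chrome

2.3 Create a local .csv file to store the scraped data

    # newline='' stops the csv module from writing blank rows on Windows
    f = open('nation_data.csv', 'w', encoding='utf-8', newline='')
    csv_writer = csv.writer(f)
    # Header row: rank, country, gold, silver, bronze, total
    csv_writer.writerow(["排名", "国家", "金牌", "银牌", "铜牌", "总数"])

2.4 Locate the rows with XPath and write them to the file

    metree = lxml.html.etree
    parser = metree.HTML(content)
    # Each <tr> under tbody#medal_list1 is one country's row
    td_list = parser.xpath("/html/body/div[3]/div/div/div/div[1]/div[3]/div[2]/div/div[3]/table/tbody[@id='medal_list1']//tr")
    for td_item in td_list:
        num_item = td_item.xpath('.//text()')  # all text nodes in the row
        csv_writer.writerow(num_item)
    f.close()  # close the file

2.5 For converting the .csv without garbled characters, see my other article: https://blog.csdn.net/weixin_44394124/article/details/120097063?spm=1001.2014.3001.5501

1. Scrape the gold, silver, and bronze medal lists for each country, as in the image below (China and the United States).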
2. The three-letter country codes come from http://api.cntv.cn/olympic/getOlyMedals?serviceId=pcocean&itemcode=GEN-------------------------------&t=jsonp&cb=banomedals, as in the image below. Copy nat_list into nat.txt, so that the file contains:

    'USA', 'CHN', 'JPN', 'GBR', 'ROC', 'AUS', 'NED', 'FRA', 'GER', 'ITA', 'CAN', 'BRA', 'NZL', 'CUB', 'HUN', 'KOR', 'POL', 'CZE', 'KEN', 'NOR', 'JAM', 'ESP', 'SWE', 'SUI', 'DEN', 'CRO', 'IRI', 'SRB', 'BEL', 'BUL', 'SLO', 'UZB', 'GEO', 'TPE', 'TUR', 'GRE', 'UGA', 'ECU', 'ISR', 'IRL', 'QAT', 'KOS', 'BAH', 'UKR', 'BLR', 'ROU', 'VEN', 'IND', 'HKG', 'PHI', 'SVK', 'RSA', 'AUT', 'EGY', 'INA', 'POR', 'ETH', 'TUN', 'EST', 'THA', 'FIJ', 'LAT', 'BER', 'PUR', 'MAR', 'COL', 'AZE', 'DOM', 'ARM', 'KGZ', 'MGL', 'ARG', 'SMR', 'JOR', 'MAS', 'NGR', 'TKM', 'MKD', 'NAM', 'LTU', 'BRN', 'KSA', 'KAZ', 'MEX', 'FIN', 'KUW', 'CIV', 'GHA', 'SYR', 'BUR', 'GRN', 'MDA', 'BOT'

Also insert these codes into nation_data.csv, as shown in the image.

3. Combine each country's gold, silver, and bronze lists into three tables

    from selenium import webdriver
    import lxml.html
    import csv

    # Create gold.csv, silver.csv, and bronze.csv first, each with a header row
    # (date, event, winner, rank, countryid)
    for name in ['gold.csv', 'silver.csv', 'bronze.csv']:
        g = open(name, 'w', encoding='utf-8')
        csv_writer = csv.writer(g)
        csv_writer.writerow(["日期", "项目", "获得者", '名次', 'countryid'])
        g.close()

    # Read the country codes; backslashes are doubled so that "\n" in the
    # path is not read as a newline
    file1 = open('D:\\Olympic_Games\\nat.txt', 'r', encoding='utf-8')
    file1_list = file1.read().strip()
    file1.close()
    p = file1_list.replace("'", '')  # drop the quotes
    c = p.replace(' ', '')           # drop the spaces
    ls = c.split(',')

    for i in ls:
        print(i)
        driver = webdriver.Chrome('D:\\数据分析\\chromedriver_win32\\chromedriver.exe')
        # Medal winners page for this country
        driver.get('https://2020.cctv.com/medal_list/details/index.shtml?countryid={}'.format(i))
        content = driver.page_source
        driver.quit()
        metree = lxml.html.etree
        parser = metree.HTML(content)
        # Gold medals (rank 1)
        gold_list = parser.xpath("/html/body/div[3]/div/div/div/div[1]/div[3]/div[2]/div/div[2]/div[1]/table/tbody[@id='gold']//tr")
        f1 = open('gold.csv', 'a+', encoding='utf-8')
        csv_writer = csv.writer(f1)
        for td_item in gold_list:
            num_item = td_item.xpath('.//text()')
            csv_writer.writerow([num_item[0], num_item[1], num_item[2], 1, i])
        f1.close()
        # Silver medals (rank 2)
        silver_list = parser.xpath("/html/body/div[3]/div/div/div/div[1]/div[3]/div[2]/div/div[2]/div[2]/table/tbody[@id='silver']//tr")
        f2 = open('silver.csv', 'a+', encoding='utf-8')
        csv_writer = csv.writer(f2)
        for td_item in silver_list:
            num_item = td_item.xpath('.//text()')
            csv_writer.writerow([num_item[0], num_item[1], num_item[2], 2, i])
        f2.close()
        # Bronze medals (rank 3)
        bronze_list = parser.xpath("/html/body/div[3]/div/div/div/div[1]/div[3]/div[2]/div/div[2]/div[3]/table/tbody[@id='bronze']//tr")
        f3 = open('bronze.csv', 'a+', encoding='utf-8')
        csv_writer = csv.writer(f3)
        for td_item in bronze_list:
            num_item = td_item.xpath('.//text()')
            csv_writer.writerow([num_item[0], num_item[1], num_item[2], 3, i])
        f3.close()

Table 1: gold medals, gold.csv

1. Concatenate the three medal tables into one large table (the .csv files are converted to .xlsx so the data can be processed directly in Excel).
2. Left-join the large table with nations.xlsx (nation_data.csv converted to nations.xlsx).

    import pandas as pd

    gold = pd.read_excel('golds.xlsx')
    silver = pd.read_excel('silvers.xlsx')
    bronze = pd.read_excel('bronzes.xlsx')
    # Drop rows that contain any empty cell
    gold = gold.dropna(axis=0, how='any')
    silver = silver.dropna(axis=0, how='any')
    bronze = bronze.dropna(axis=0, how='any')
    # Write the cleaned tables back out
    gold.to_excel('gold.xlsx')
    silver.to_excel('silver.xlsx')
    bronze.to_excel('bronze.xlsx')
    # Stack the three tables row-wise; pd.concat replaces DataFrame.append,
    # which was removed in pandas 2.0
    su = pd.concat([gold, silver, bronze])
    # Save the combined table
    su.to_excel('./su.xlsx')
    # Country medal table
    nations = pd.read_excel('D:\\Olympic_Games\\nations.xlsx')
    # Left outer join: left table, right table, left key, right key, join type
    itemsum = pd.merge(nations, su, left_on="countryid", right_on="countryid", how="left")
    # Save everything into one large table
    itemsum.to_excel('./itemtoge.xlsx')

The result is shown in the image.
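To make the left join above concrete, here is a minimal sketch of the same idea in plain Python rather than pandas. The `left_join` helper and the sample rows are hypothetical and only for illustration. Every row of the left table (the nations) is kept, and a nation with no matching medal rows still appears once with empty medal fields (pandas would fill those with NaN rather than None).

```python
from collections import defaultdict


def left_join(nations, medals, key="countryid"):
    """Left-join two lists of dicts on `key`, keeping every row of `nations`."""
    # Index the right-hand table by the join key
    by_key = defaultdict(list)
    for row in medals:
        by_key[row[key]].append(row)

    # Columns that only exist on the right-hand side
    medal_fields = sorted({f for row in medals for f in row if f != key})

    joined = []
    for nation in nations:
        matches = by_key.get(nation[key])
        if not matches:
            # No medal rows for this nation: keep it, with empty medal fields
            joined.append({**nation, **{f: None for f in medal_fields}})
            continue
        # One output row per matching medal row
        for medal in matches:
            joined.append({**nation, **medal})
    return joined


# Hypothetical sample rows, not real scraped data
nations = [
    {"countryid": "AAA", "country": "Country A"},
    {"countryid": "BBB", "country": "Country B"},
]
medals = [
    {"countryid": "AAA", "event": "Event 1", "rank": 1},
    {"countryid": "AAA", "event": "Event 2", "rank": 3},
]

rows = left_join(nations, medals)
for row in rows:
    print(row)
```

Because the join is one-to-many, the per-country columns (rank, totals) repeat on every medal row for that country, which is worth keeping in mind before summing anything in the combined table.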
Generating nation-eng.xlsx

1. World map

Preparation:

    from pyecharts import options as opts
    from pyecharts.charts import Map
    import pandas as pd
    import os

    datas = pd.read_excel('nation-eng.xlsx')
    datas['总数'] = datas['总数'].astype('float')  # total medal count
    # Base data: English short name and total per country
    value = datas['总数']
    attr = datas['英文简称']
    data = []
    for index in range(len(attr)):
        city_info = [attr[index], value[index]]
        data.append(city_info)
    # Build the map and render it to render.html
    c = (
        Map()
        .add("世界地图", data, "world")
        .set_series_opts(label_opts=opts.LabelOpts(is_show=False))
        .set_global_opts(
            title_opts=opts.TitleOpts(title="奖牌总数"),
            visualmap_opts=opts.VisualMapOpts(max_=200),
        )
        .render()
    )
    # Open the generated html file
    os.system("render.html")

Map display
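The pair-building loop in the map code can be sketched on its own, without pandas or pyecharts. This is only an illustration: `SAMPLE` is a hypothetical stand-in for nation-eng.xlsx inlined as CSV text, and `load_pairs` is a made-up helper name. It also derives the largest total from the data, which could presumably be passed as `max_` to `VisualMapOpts` in place of the hard-coded 200.

```python
import csv
import io

# Hypothetical stand-in for nation-eng.xlsx, inlined as CSV text.
# Columns: English short name, total medals.
SAMPLE = """英文简称,总数
Country A,12
Country B,7
Country C,30
"""


def load_pairs(text, name_col="英文简称", value_col="总数"):
    """Return [name, total] pairs in the shape that Map().add() is given above."""
    reader = csv.DictReader(io.StringIO(text))
    return [[row[name_col], float(row[value_col])] for row in reader]


pairs = load_pairs(SAMPLE)
# Upper bound for the colour scale, taken from the data itself
visual_max = max(value for _, value in pairs)

print(pairs)
print(visual_max)
```

The names in the first column presumably have to match the country names used by the world map, which would explain why a separate English-name table is generated first.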