[Python Web Scraping Course Project] Qunar Travel (去哪儿旅游网)


I. Background

As living standards keep rising, so do people's non-material needs. After a week of busy workdays, where to go on holiday, and how to get there, is a question on many people's minds, and a host of travel-guide sites has sprung up in response. Their dazzling array of titles and pages is overwhelming to browse; by scraping the key information and then aggregating and analyzing the data, we can distill the options that most travelers consider best, producing a convenient result that others can consult as a reference.

II. Design of the Topic-Oriented Web Crawler

First, crawl the paginated list and save each travelogue's URL (ID) to a txt file; then read the txt file back, crawl every detail page, and save the page information to a csv file. BeautifulSoup parses the fetched HTML and extracts the required fields, such as destination, short description, departure date, trip length, cost per person, companions, play styles, and view count. Finally, pandas is used to analyze the file and visualize the data.
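The hand-off between the two crawl stages is a plain text file with one travelogue ID per line. A minimal sketch of that round trip (the two IDs here are placeholders):

# Stage 1 writes one travelogue ID per line; stage 2 reads them back
# and builds the detail-page URLs.
ids = ['7470570', '7562814']

with open('url_1.txt', 'w') as fb:
    for book_id in ids:
        fb.write(book_id + '\n')

with open('url_1.txt', 'r') as f:
    urls = ['http://travel.qunar.com/youji/' + line.strip() for line in f]

print(urls)  # ['http://travel.qunar.com/youji/7470570', ...]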

Technical difficulty: some pages use irregular markup, so some place names cannot be extracted.

III. Structural Analysis of the Target Pages

1. Structure and features of the target pages

 

As can be seen, the key information, such as the date and the cost, is displayed in a single list at the top of the page.

2. HTML page parsing

The key information is contained in ul and li elements identified by class, so only a few key fields need to be extracted: departure date, trip length, cost per person, companions, play styles, and so on. Because some pages use irregular markup, some place names cannot be extracted, and the generic crumb 攻略 is used in their place.
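For instance, each field on a detail page can be reached by class name with two find calls. A minimal sketch; the sample markup is an assumption, trimmed down from the selectors the crawler in section IV relies on:

from bs4 import BeautifulSoup

# Trimmed-down sample of the detail page's summary list (assumed markup)
html = '''
<ul class="foreword_list">
  <li class="f_item when"><p class="txt">出发日期/2019/11/30</p></li>
  <li class="f_item howlong"><p class="txt">天数/5天</p></li>
</ul>
'''
soup = BeautifulSoup(html, 'lxml')
info_list = soup.find('ul', attrs={'class': 'foreword_list'})
when = info_list.find('li', attrs={'class': 'f_item when'})
print(when.find('p', attrs={'class': 'txt'}).text.replace('出发日期', '').strip())  # /2019/11/30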

3. Node (tag) lookup and traversal

Nodes are looked up with find/find_all and traversed with a for loop.
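On the list pages this looks as follows: find_all collects every travelogue entry, and a for loop pulls the data-bookid attribute out of each. A minimal sketch with assumed sample markup, reconstructed from the crawler's selectors:

from bs4 import BeautifulSoup

# Trimmed-down sample of two list-page entries (assumed markup)
html = '''
<li class="list_item"><h2 data-bookid="7470570">Travelogue A</h2></li>
<li class="list_item"><h2 data-bookid="7562814">Travelogue B</h2></li>
'''
soup = BeautifulSoup(html, 'lxml')
# find_all returns every matching li; find drills into each one
for item in soup.find_all('li', attrs={'class': 'list_item'}):
    print(item.find('h2')['data-bookid'])  # 7470570, then 7562814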

IV. Crawler Implementation

Crawl the URL of every travelogue into url_1, then crawl the content of each page into Travel_first.csv; the comments used to build the word cloud are stored in travel_text.

import requests
from bs4 import BeautifulSoup
import re
import time
import csv
import random

# Stage 1: crawl the travelogue IDs from every list page
fb = open(r'url_1.txt', 'w')
url = 'http://travel.qunar.com/travelbook/list.htm?page={}&order=hot_heat&avgPrice=1_2'
# Request headers; the cookie string can be copied from the browser's dev
# tools (note that the standard header name is Cookie)
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.87 Safari/537.360',
    'Cookie': 'JSESSIONID=5E9DCED322523560401A95B8643B49DF; QN1=00002b80306c204d8c38c41b; QN300=s%3Dbaidu; QN99=2793; QN205=s%3Dbaidu; QN277=s%3Dbaidu; QunarGlobal=10.86.213.148_-3ad026b5_17074636b8f_-44df|1582508935699; QN601=64fd2a8e533e94d422ac3da458ee6e88; _i=RBTKSueZDCmVnmnwlQKbrHgrodMx; QN269=D32536A056A711EA8A2FFA163E642F8B; QN48=6619068f-3a3c-496c-9370-e033bd32cbcc; fid=ae39c42c-66b4-4e2d-880f-fb3f1bfe72d0; QN49=13072299; csrfToken=51sGhnGXCSQTDKWcdAWIeIrhZLG86cka; QN163=0; Hm_lvt_c56a2b5278263aa647778d304009eafc=1582513259,1582529930,1582551099,1582588666; viewdist=298663-1; uld=1-300750-1-1582590496|1-300142-1-1582590426|1-298663-1-1582590281|1-300698-1-1582514815; _vi=6vK5Gry4UmXDT70IFohKyFF8R8Mu0SvtUfxawwaKYRTq9NKud1iKUt8qkTLGH74E80hXLLVOFPYqRGy52OuTFnhpWvBXWEbkOJaDGaX_5L6CnyiQPPOYb2lFVxrJXsVd-W4NGHRzYtRQ5cJmiAbasK8kbNgDDhkJVTC9YrY6Rfi2; viewbook=7562814|7470570|7575429|7470584|7473513; QN267=675454631c32674; Hm_lpvt_c56a2b5278263aa647778d304009eafc=1582591567; QN271=c8712b13-2065-4aa7-a70b-e6156f6fc216',
    'referer': 'http://travel.qunar.com/travelbook/list.htm?page=1&order=hot_heat&avgPrice=1'
}
count = 1
# 200 list pages in total
for i in range(1, 201):
    url_ = url.format(i)
    try:
        response = requests.get(url=url_, headers=headers)
        response.encoding = 'utf-8'
        html = response.text
        soup = BeautifulSoup(html, 'lxml')
        # Each travelogue sits in an <li class="list_item">; its ID is the
        # data-bookid attribute on the <h2>
        all_url = soup.find_all('li', attrs={'class': 'list_item'})
        print('Crawling list page %s' % count)
        for each in all_url:
            each_url = each.find('h2')['data-bookid']
            fb.write(each_url)
            fb.write('\n')
        # Pause between requests to avoid hammering the site
        time.sleep(random.randint(3, 5))
        count += 1
    except Exception as e:
        print(e)
fb.close()

The crawled IDs are saved to url_1.txt.

 

Crawling the key fields of each detail page

import requests
from bs4 import BeautifulSoup
import time
import csv
import random

# Stage 2: read the travelogue IDs collected in stage 1
url_list = []
with open('url_1.txt', 'r') as f:
    for i in f.readlines():
        i = i.strip()
        url_list.append(i)

# Build the full detail-page URLs
the_url_list = []
for i in range(len(url_list)):
    url = 'http://travel.qunar.com/youji/'
    the_url = url + str(url_list[i])
    the_url_list.append(the_url)

last_list = []

def spider():
    headers = {
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.87 Safari/537.360',
        'Cookie': 'QN1=00002b80306c204d8c38c41b; QN300=s%3Dbaidu; QN99=2793; QN205=s%3Dbaidu; QN277=s%3Dbaidu; QunarGlobal=10.86.213.148_-3ad026b5_17074636b8f_-44df|1582508935699; QN601=64fd2a8e533e94d422ac3da458ee6e88; _i=RBTKSueZDCmVnmnwlQKbrHgrodMx; QN269=D32536A056A711EA8A2FFA163E642F8B; QN48=6619068f-3a3c-496c-9370-e033bd32cbcc; fid=ae39c42c-66b4-4e2d-880f-fb3f1bfe72d0; QN49=13072299; csrfToken=51sGhnGXCSQTDKWcdAWIeIrhZLG86cka; QN163=0; Hm_lvt_c56a2b5278263aa647778d304009eafc=1582513259,1582529930,1582551099,1582588666; viewdist=298663-1; uld=1-300750-1-1582590496|1-300142-1-1582590426|1-298663-1-1582590281|1-300698-1-1582514815; viewbook=7575429|7473513|7470584|7575429|7470570; QN267=67545462d93fcee; _vi=vofWa8tPffFKNx9MM0ASbMfYySr3IenWr5QF22SjnOoPp1MKGe8_-VroXhkC0UNdM0WdUnvQpqebgva9VacpIkJ3f5lUEBz5uyCzG-xVsC-sIV-jEVDWJNDB2vODycKN36DnmUGS5tvy8EEhfq_soX6JF1OEwVFXk2zow0YZQ2Dr; Hm_lpvt_c56a2b5278263aa647778d304009eafc=1582603181; QN271=fc8dd4bc-3fe6-4690-9823-e27d28e9718c',
        'Host': 'travel.qunar.com'
    }
    count = 1
    for i in range(len(the_url_list)):
        try:
            print('Crawling page %s' % count)
            response = requests.get(url=the_url_list[i], headers=headers)
            response.encoding = 'utf-8'
            html = response.text
            soup = BeautifulSoup(html, 'lxml')
            # The breadcrumb holds the place name and the travelogue title
            information = soup.find('p', attrs={'class': 'b_crumb_cont'}).text.strip().replace(' ', '')
            info = information.split('>')
            if len(info) > 2:
                location = info[1].replace('\xa0', '').replace('旅游攻略', '')
                introduction = info[2].replace('\xa0', '')
            else:
                # Irregular pages lack the place-name crumb; fall back to the
                # first two fields (this is why some places come out as 攻略)
                location = info[0].replace('\xa0', '')
                introduction = info[1].replace('\xa0', '')
            # Extract the key fields from the summary list at the top of the page
            other_information = soup.find('ul', attrs={'class': 'foreword_list'})
            when = other_information.find('li', attrs={'class': 'f_item when'})
            time1 = when.find('p', attrs={'class': 'txt'}).text.replace('出发日期', '').strip()
            howlong = other_information.find('li', attrs={'class': 'f_item howlong'})
            day = howlong.find('p', attrs={'class': 'txt'}).text.replace('天数', '').replace('/', '').replace('天', '').strip()
            howmuch = other_information.find('li', attrs={'class': 'f_item howmuch'})
            money = howmuch.find('p', attrs={'class': 'txt'}).text.replace('人均费用', '').replace('/', '').replace('元', '').strip()
            who = other_information.find('li', attrs={'class': 'f_item who'})
            people = who.find('p', attrs={'class': 'txt'}).text.replace('人物', '').replace('/', '').strip()
            how = other_information.find('li', attrs={'class': 'f_item how'})
            play = how.find('p', attrs={'class': 'txt'}).text.replace('玩法', '').replace('/', '').strip()
            Look = soup.find('span', attrs={'class': 'view_count'}).text.strip()
            # Use '-' as a placeholder for any missing field
            Time = time1 if time1 else '-'
            Day = day if day else '-'
            Money = money if money else '-'
            People = people if people else '-'
            Play = play if play else '-'
            last_list.append([location, introduction, Time, Day, Money, People, Play, Look])
            # Pause between requests
            time.sleep(random.randint(2, 4))
            count += 1
        except Exception as e:
            print(e)
    # Write the results to csv
    with open('Travel_first.csv', 'a', encoding='utf-8-sig', newline='') as csvFile:
        writer = csv.writer(csvFile)
        writer.writerow(['地点', '短评', '出发时间', '天数', '人均费用', '人物', '玩法', '浏览量'])
        for rows in last_list:
            writer.writerow(rows)

if __name__ == '__main__':
    spider()

 

 

 

Word segmentation and visualization with wordcloud

 

import jieba
import jieba.analyse
import re

# Punctuation to strip from the comment text before segmentation
punc = '~`!#$%^&*()_+-=|\';":/.,?>'
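A minimal sketch of the segmentation and rendering step, assuming the crawled comments were saved to travel_text.txt (the file mentioned in section IV; the exact file name and the font path are assumptions) and using jieba together with the wordcloud package:

import jieba
from wordcloud import WordCloud

# Read the comments collected by the crawler (file name is an assumption)
with open('travel_text.txt', 'r', encoding='utf-8') as f:
    text = f.read()

# Drop punctuation using the punc filter defined above, then segment with jieba
text = ''.join(ch for ch in text if ch not in punc)
words = ' '.join(jieba.cut(text))

# wordcloud needs a CJK-capable font for Chinese text; the font path below
# is an assumption and depends on the local system
wc = WordCloud(font_path='simhei.ttf', background_color='white',
               width=800, height=500).generate(words)
wc.to_file('wordcloud.png')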
# Data analysis and visualization
import pandas as pd

df = pd.read_csv('Travel_first.csv', encoding='gb18030')
df
df.info()

# Pie chart: how people travel together (pyecharts 0.5 API)
from pyecharts import Pie
m1 = df['人物'].value_counts().index.tolist()
n1 = df['人物'].value_counts().values.tolist()
pie = Pie('出游结伴方式', background_color='white', width=800, height=500, title_text_size=20)
pie.add('', m1, n1, is_label_show=True, is_legend_show=True, radius=[40, 75])
pie.render('出游结伴方式.html')

# Bar chart: top 10 destinations
from pyecharts import Bar
m2 = df['地点'].value_counts().head(10).index.tolist()
n2 = df['地点'].value_counts().head(10).values.tolist()
bar = Bar('目的地Top10', width=800, height=500, title_text_size=20)
bar.add('', m2, n2, is_label_show=True, is_legend_show=True)
bar.render('目的地.html')

# Map the month token of the departure date to a Chinese month name;
# the raw field looks like '/2019/11/30', so index 2 is the month
def Month(e):
    months = {'01': '一月', '02': '二月', '03': '三月', '04': '四月',
              '05': '五月', '06': '六月', '07': '七月', '08': '八月',
              '09': '九月', '10': '十月', '11': '十一月', '12': '十二月'}
    m = str(e).split('/')[2]
    return months.get(m)

df['旅行月份'] = df['出发时间'].apply(Month)
df['出发时间'] = pd.to_datetime(df['出发时间'])
df

import re

# Convert view counts such as '1.2万' to plain numbers
def Look(e):
    if '万' in e:
        num1 = re.findall('(.*?)万', e)
        return float(num1[0]) * 10000
    else:
        return float(e)

df['浏览次数'] = df['浏览量'].apply(Look)
df.drop(['浏览量'], axis=1, inplace=True)
df['浏览次数'] = df['浏览次数'].astype(int)
df.head()

# First 10 rows
df.head(10)
df.info()

data = df
data['地点'].value_counts()
loc = data['地点'].value_counts().head(10).index.tolist()
print(loc)
loc_data = data[df['地点'].isin(loc)]
# Drop the '-' placeholder rows so the average cost can be computed
loc_data = loc_data[loc_data['人均费用'] != '-']
price_mean = round(loc_data['人均费用'].astype(float).groupby(loc_data['地点']).mean(), 1)
print(price_mean)

# Bar chart: average spend in the ten most-visited destinations;
# index price_mean by loc so the bars line up with the destination labels
from pyecharts import Bar
bar = Bar('人均消费前十旅游地', width=800, height=500, title_text_size=20)
bar.add('', loc, price_mean[loc].tolist(), is_label_show=True, is_legend_show=True)
bar.render('人均消费前十旅游地.html')

df['天数'].value_counts()
df['旅游时长'] = df['天数'].apply(lambda x: str(x) + '天')
df['人物'].value_counts()

# Order the rows by view count, highest first
t = df['浏览次数'].sort_values(ascending=False).index[:].tolist()
df = df.loc[t]
df = df.reset_index(drop=True)
df['旅行月份'].value_counts()

# Count how often each play style appears (styles are separated by \xa0)
word_list = []
for i in df['玩法']:
    t = re.split('\xa0', i)
    word_list.append(t)
word_dict = {}
for j in range(len(word_list)):
    for i in word_list[j]:
        if i not in word_dict:
            word_dict[i] = 1
        else:
            word_dict[i] += 1
print(word_dict)

# Sort the (play style, count) pairs by count, descending (simple bubble sort)
word_count = []
for item in word_dict.items():
    word_count.append(item)
for i in range(1, len(word_count)):
    for j in range(0, len(word_count) - 1):
        if word_count[j][1] < word_count[j + 1][1]:
            word_count[j], word_count[j + 1] = word_count[j + 1], word_count[j]
print(word_count)
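The sorted play-style counts can also be rendered as a word-cloud chart with the same pyecharts 0.5 API used for the Pie and Bar charts above; the output file name is an assumption:

# Render the play-style frequencies as a word cloud (pyecharts 0.5 API)
from pyecharts import WordCloud

# Split the sorted (play style, count) pairs into parallel lists
names = [w for w, c in word_count]
values = [c for w, c in word_count]
wc = WordCloud('玩法词云', width=800, height=500)
wc.add('', names, values, word_size_range=[20, 100])
wc.render('玩法词云.html')  # output file name is an assumption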

