Python Data Collection
1. Collecting Douban Movie Top 250 data
1. Open the Douban Movie Top 250 web page.
2. Open the developer tools: right-click the page → Inspect.

3. Install the packages in PyCharm: File → Settings → Python Interpreter → + → add bs4, requests, lxml, etc.
Note: if installing a library fails, the pip version may be too old. Check the version in a cmd window with `pip -V` and upgrade it with `python -m pip install --upgrade pip`.

Counter-anti-crawling: disguise the request as a browser
1. Define a variable headers = {User-Agent string}
2. Carry the agent information when sending the request: response = requests.get(url, headers=h)

6. Using the BeautifulSoup class from the bs4 library
1. Import it: from bs4 import BeautifulSoup
2. Create a soup object: soup = BeautifulSoup(html, 'lxml')
3. Call select(css_selector) on the soup object
4. Extract the data: get_text(), .text, .string

7. Storing to CSV
1. Put the data into a list first
2. Create the CSV file: with open('xxx.csv', 'a', newline='') as f:
3. Get a writer from the csv module: w = csv.writer(f)
4. Write the list into the file: w.writerows(listdata)

```python
import requests, csv
from bs4 import BeautifulSoup

def getHtml(url):
    # Collect the page
    h = {
        'User-Agent': 'Mozilla / 5.0(Windows NT 10.0;WOW64)'
    }
    response = requests.get(url, headers=h)
    html = response.text
    # Parse the data: regex, BeautifulSoup and XPath would all work
    soup = BeautifulSoup(html, 'lxml')
    filmtitle = soup.select('div.hd > a > span:nth-child(1)')
    ct = soup.select('div.bd > p:nth-child(1)')
    score = soup.select('div.bd > div > span.rating_num')
    evalue = soup.select('div.bd > div > span:nth-child(4)')
    filmlist = []
    for t, c, s, e in zip(filmtitle, ct, score, evalue):
        title = t.text
        content = c.text
        filmscore = s.text
        num = e.text.strip('人评价')  # strip the "people rated" suffix
        director = content.strip().split()[1]
        if "主演:" in content:        # "starring:"
            actor = content.strip().split('主演:')[1].split()[0]
        else:
            actor = None
        year = content.strip().split('/')[-3].split()[-1]
        area = content.strip().split('/')[-2].strip()
        filmtype = content.strip().split('/')[-1].strip()
        listdata = [title, director, actor, year, area, filmtype, filmscore, num]
        filmlist.append(listdata)
    # Store the data
    with open('douban250.csv', 'a', encoding='utf-8', newline='') as f:
        w = csv.writer(f)
        w.writerows(filmlist)

# Write the header row once, then crawl all 10 pages
listtitle = ['title', 'director', 'actor', 'year', 'area', 'type', 'score', 'evaluate']
with open('douban250.csv', 'a', encoding='utf-8', newline='') as f:
    w = csv.writer(f)
    w.writerow(listtitle)
for i in range(0, 226, 25):
    getHtml('https://movie.douban.com/top250?start=%s&filter=' % (i))
```

Note: the User-Agent value for h is found in the request details on the site's developer-tools page.

2. Anjuke (安居客) data collection with lxml
2. Import: from lxml import etree
3. Convert the collected string to an HTML document: etree.HTML()
4. Call xpath() on the converted data with a path expression: data.xpath(path_expression)

```python
import requests
from lxml import etree

def getHtml(url):
    h = {
        'User-Agent': 'Mozilla / 5.0(Windows NT 10.0;Win64;x64)'
    }
    response = requests.get(url, headers=h)
    html = response.text
    # Parse the data
    data = etree.HTML(html)
    name = data.xpath('//span[@class="items-name"]/text()')
    print(name)

getHtml("https://bj.fang.anjuke.com/?from=AF_Home_switchcity")
```

3. Lagou (拉勾网) data collection

(1) Collecting with requests
1. Import the library
2. Send a request to the server
3. Download the data

(2) POST requests (the parameters are not in the URL)
1. Find the POST parameters (the request's Form Data) in the page inspector and define them in a dict
2. Send the request with requests.post(url, data=formdata)

(3) Parsing JSON data
1. Import it: import json
2. Convert the collected JSON text to a Python dict: json.loads(json_text)
3. Access values through the dict keys

(4) Storing to MySQL
1. Install and import pymysql: import pymysql
2. Open a connection: conn = pymysql.Connect(host, port, user, password, db, charset='utf8')
3. Write the SQL statement
4. Define a cursor: cursor = conn.cursor()
5. Execute the SQL with the cursor: cursor.execute(sql)
6. Commit the data to the database: conn.commit()
7. Close the cursor: cursor.close()
8. Close the connection: conn.close()

Note: this crawler requires an account, so h uses the full Request Headers data from the developer tools.

360 epidemic data:

```python
import requests, json, csv, pymysql

def getHtml(url):
    h = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0;Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
    }
    response = requests.get(url, headers=h)
    joinhtml = response.text
    # The response is JSONP: strip the "jsonp2(...);" wrapper before parsing
    directro = joinhtml.split('(')[1]
    directro2 = directro[0:-2]
    dictdata = json.loads(directro2)
    ct = dictdata['country']
    positionlist = []
    for i in range(0, 196):
        provinceName = ct[i]['provinceName']
        cured = ct[i]['cured']
        diagnosed = ct[i]['diagnosed']
        died = ct[i]['died']
        diffDiagnosed = ct[i]['diffDiagnosed']
        datalist = [provinceName, cured, diagnosed, died, diffDiagnosed]
        positionlist.append(datalist)
    # Store to MySQL: insert every collected row, not just the last one
    conn = pymysql.Connect(host='localhost', port=3306, user='root',
                           passwd='277877061#xyl', db='lagou', charset='utf8')
    cursor = conn.cursor()
    try:
        for row in positionlist:
            sql = "insert into yq values('%s','%s','%s','%s','%s');" % tuple(row)
            cursor.execute(sql)
        conn.commit()
    except:
        conn.rollback()
    cursor.close()
    conn.close()
    # Store to CSV
    with open('360疫情数据.csv', 'a', encoding="utf-8", newline="") as f:
        w = csv.writer(f)
        w.writerows(positionlist)

# Write the header row first, then crawl
title = ['provinceName', 'cured', 'diagnosed', 'died', 'diffDiagnosed']
with open('360疫情数据.csv', 'a', encoding="utf-8", newline="") as f:
    w = csv.writer(f)
    w.writerow(title)
getHtml('https://m.look.360.cn/events/feiyan?sv=&version=&market=&device=2&net=4&stype=&scene=&sub_scene=&refer_scene=&refer_subscene=&f=jsonp&location=true&sort=2&_=1626165806369&callback=jsonp2')
```

5. Scrapy data collection

(1) Steps to create a crawler
1. Create the project; on the command line: scrapy startproject project_name
2. Create the spider file; cd into the spiders folder, then: scrapy genspider spider_name domain
3. Run the program: scrapy crawl spider_name

(2) Scrapy file structure
1. spiders folder: where the parsing code is written
2. __init__.py: package marker; only a folder containing this file is a package
3. items.py: defines the data structure, i.e. the fields to scrape
4. middlewares.py: middleware
5. pipelines.py: pipeline file, defines where the data is stored
6. settings.py: configuration, typically the user agent and enabling pipelines

(3) Scrapy workflow
1. Configure settings.py: (1) USER_AGENT: "browser string" (2) ROBOTSTXT_OBEY: False (3) ITEM_PIPELINES: enable the pipelines
2. Define the scraped fields in items.py: field_name = scrapy.Field()
3. Write the parsing code in the spider's parse() method and feed the results into an item object
4. Define the storage targets in pipelines.py
5. Paginated crawling: 1) find the URL pattern 2) keep a page counter in the spider and build the next URL with an if check and string formatting 3) yield scrapy.Request(url, callback=self.parse)

Case studies

(1) Anjuke data collection
venv → right-click → New → Directory → "anjuke data collection"
venv → right-click → Open in → Terminal, then on the command line:
scrapy startproject anjuke
(check the automatically generated files, then cd into the spiders directory)
cd anjuke\anjuke\spiders
scrapy genspider ajk bj.fang.anjuke.com/?from=navigation   (creates ajk.py)
scrapy crawl ajk   (run)
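The JSONP handling in the 360 epidemic example above (splitting off the `jsonp2(...)` wrapper before calling `json.loads`) can be isolated into a small helper. A sketch; the sample payload below is made up, only its shape matches the 360 response:

```python
import json

def strip_jsonp(text):
    """Parse a JSONP response such as jsonp2({...}); into a Python dict."""
    start = text.index('(') + 1   # everything after the first '('
    end = text.rindex(')')        # up to the last ')'
    return json.loads(text[start:end])

# Made-up sample in the same shape as the real response
sample = 'jsonp2({"country": [{"provinceName": "Hubei", "cured": 63638, "diagnosed": 68135}]});'
data = strip_jsonp(sample)
print(data['country'][0]['provinceName'])  # Hubei
```

Using `index`/`rindex` instead of `split('(')[1][0:-2]` keeps the helper working even when a string value inside the JSON itself contains parentheses.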
1. In ajk.py:

```python
import scrapy
from anjuke.items import AnjukeItem

class AjkSpider(scrapy.Spider):
    name = 'ajk'
    # allowed_domains = ['bj.fang.anjuke.com/?from=navigation']
    start_urls = ['http://bj.fang.anjuke.com/loupan/all/p1']
    pagenum = 1

    def parse(self, response):
        item = AnjukeItem()
        # Parse the data
        name = response.xpath('//span[@class="items-name"]/text()').extract()
        temp = response.xpath('//span[@class="list-map"]/text()').extract()
        place = []
        district = []
        for i in temp:
            placetemp = "".join(i.split("]")[0].strip("[").strip().split())
            districttemp = i.split("]")[1].strip()
            place.append(placetemp)
            district.append(districttemp)
        # apartment = response.xpath('//a[@class="huxing"]/span[not(@class)]/text()').extract()
        area1 = response.xpath('//span[@class="building-area"]/text()').extract()
        area = []
        for j in area1:
            areatemp = j.strip("建筑面积:").strip('㎡')  # strip the "floor area:" label and the unit
            area.append(areatemp)
        price = response.xpath('//p[@class="price"]/span/text()').extract()
        # Put the processed data into the item
        item['name'] = name
        item['place'] = place
        item['district'] = district
        # item['apartment'] = apartment
        item['area'] = area
        item['price'] = price
        yield item
        for a, b, c, d, e in zip(item['name'], item['place'], item['district'], item['area'], item['price']):
            print(a, b, c, d, e)
        # Paginated crawling
        if self.pagenum < 5:
            self.pagenum += 1
            newurl = "https://bj.fang.anjuke.com/loupan/all/p{}/".format(str(self.pagenum))
            print(newurl)
            yield scrapy.Request(newurl, callback=self.parse)
```

2. In pipelines.py:

```python
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
import csv, pymysql

# Store to CSV
class AnjukePipeline:
    def open_spider(self, spider):
        self.f = open("安居客北京.csv", "w", encoding='utf-8', newline="")
        self.w = csv.writer(self.f)
        titlelist = ['name', 'place', 'district', 'area', 'price']
        self.w.writerow(titlelist)

    def process_item(self, item, spider):
        # writerow() takes one record [1,2,3,4];
        # writerows() takes a list of records [[record1],[record2],[record3]]
        k = list(dict(item).values())
        self.listtemp = []
        for a, b, c, d, e in zip(k[0], k[1], k[2], k[3], k[4]):
            self.temp = [a, b, c, d, e]
            self.listtemp.append(self.temp)
        self.w.writerows(self.listtemp)
        return item

    def close_spider(self, spider):
        self.f.close()

# Store to MySQL
class MySqlPipeline:
    def open_spider(self, spider):
        self.conn = pymysql.Connect(host="localhost", port=3306, user='root',
                                    password='277877061#xyl', db="anjuke", charset='utf8')

    def process_item(self, item, spider):
        self.cursor = self.conn.cursor()
        for a, b, c, d, e in zip(item['name'], item['place'], item['district'], item['area'], item['price']):
            sql = 'insert into ajk values("%s","%s","%s","%s","%s");' % (a, b, c, d, e)
            self.cursor.execute(sql)
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.cursor.close()
        self.conn.close()
```

3. In items.py:

```python
# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html
import scrapy

class AnjukeItem(scrapy.Item):
    # define the fields for your item here like:
    name = scrapy.Field()
    place = scrapy.Field()
    district = scrapy.Field()
    # apartment = scrapy.Field()
    area = scrapy.Field()
    price = scrapy.Field()
```

4. In settings.py (only the lines changed from the generated defaults are shown; everything else stays commented out):

```python
# Scrapy settings for anjuke project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'anjuke'

SPIDER_MODULES = ['anjuke.spiders']
NEWSPIDER_MODULE = 'anjuke.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla / 5.0(Windows NT 10.0;WOW64)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# ... (CONCURRENT_REQUESTS, DOWNLOAD_DELAY, COOKIES_ENABLED, middlewares,
#      extensions, AutoThrottle and HTTP-cache settings remain commented out
#      exactly as generated) ...

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'anjuke.pipelines.AnjukePipeline': 300,
    'anjuke.pipelines.MySqlPipeline': 301,
}
```

5. middlewares.py is left exactly as Scrapy generated it: the default AnjukeSpiderMiddleware and AnjukeDownloaderMiddleware template classes (from_crawler, process_spider_input/output/exception, process_start_requests, process_request/response/exception, spider_opened), unmodified.

6. In MySQL:

1) Create the anjuke database:
```
mysql> create database anjuke;
```
2) Use the database:
```
mysql> use anjuke;
```
3) Create the table:
```
mysql> create table ajk(
    -> name varchar(100),place varchar(100),distarct varchar(100),area varchar(100),price varchar(50));
Query OK, 0 rows affected (0.03 sec)
```
4) Check the table structure:
```
mysql> desc ajk;
+----------+--------------+------+-----+---------+-------+
| Field    | Type         | Null | Key | Default | Extra |
+----------+--------------+------+-----+---------+-------+
| name     | varchar(100) | YES  |     | NULL    |       |
| place    | varchar(100) | YES  |     | NULL    |       |
| distarct | varchar(100) | YES  |     | NULL    |       |
| area     | varchar(100) | YES  |     | NULL    |       |
| price    | varchar(50)  | YES  |     | NULL    |       |
+----------+--------------+------+-----+---------+-------+
```
5) Check the table data:
```
mysql> select * from ajk;
| name                    | place       | distarct                                   | area      | price |
| 和锦华宸品牌馆           | 顺义马坡     | 和安路与大营二街交汇处东北角                  | 95-180    | 400   |
| 保利建工•和悦春风        | 大兴庞各庄   | 永兴河畔                                    | 76-109    | 36000 |
| 华樾国际·领尚           | 朝阳东坝     | 东坝大街和机场二高速交叉路口西方向3...        | 96-196.82 | 680   |
| 金地·璟宸品牌馆         | 房山良乡     | 政通西路                                    | 97-149    | 390   |
| 中海首钢天玺            | 石景山古城   | 北京市石景山区古城南街及莲石路交叉口...       | 122-147   | 900   |
| 傲云品牌馆              | 朝阳孙河     | 顺黄路53号                                  | 87288     | 2800  |
| 首开香溪郡              | 通州宋庄     | 通顺路                                      | 90-220    | 39000 |
| 北京城建·北京合院       | 顺义顺义城   | 燕京街与通顺路交汇口东800米(仁和公...        | 93-330    | 850   |
| 金融街武夷·融御         | 通州政务区   | 北京城市副中心01组团通胡大街与东六环...       | 99-176    | 66000 |
| 北京城建·天坛府品牌馆   | 东城天坛     | 北京市东城区景泰路与安乐林路交汇处            | 60-135    | 800   |
| 亦庄金悦郡              | 大兴亦庄     | 新城环景西一路与景盛南二街交叉口              | 72-118    | 38000 |
| 金茂·北京国际社区       | 顺义顺义城   | 水色时光西路西侧                             | 50-118    | 175   |
```

(2) PCauto (太平洋汽车) data collection

1. In tpy.py:

```python
import scrapy
from taipingyangqiche.items import TaipingyangqicheItem  # the module is items.py ("item" in the original was a typo)

class TpyqcSpider(scrapy.Spider):
    name = 'tpyqc'
    allowed_domains = ['price.pcauto.com.cn/top']
    start_urls = ['http://price.pcauto.com.cn/top/']
    pagenum = 1

    def parse(self, response):
        item = TaipingyangqicheItem()
        name = response.xpath('//p[@class="sname"]/a/text()').extract()
        temperature = response.xpath('//p[@class="col rank"]/span[@class="fl red rd-mark"]/text()').extract()
        price = response.xpath('//p/em[@class="red"]/text()').extract()
        brand2 = response.xpath('//p[@class="col col1"]/text()').extract()
        brand = []
        for j in brand2:
            # strip the "brand:" / "displacement:" labels and line breaks
            areatemp = j.strip('品牌:').strip('排量:').strip('\r\n')
            brand.append(areatemp)
        brand = [i for i in brand if i != '']
        rank2 = response.xpath('//p[@class="col"]/text()').extract()
        rank = []
        for j in rank2:
            # strip the "class:" / "gearbox:" labels
            areatemp = j.strip('级别:').strip('变速箱:').strip('\r\n')
            rank.append(areatemp)
        rank = [i for i in rank if i != '']
        item['name'] = name
        item['temperature'] = temperature
        item['price'] = price
        item['brand'] = brand
        item['rank'] = rank
        yield item
        for a, b, c, d, e in zip(item['name'], item['temperature'], item['price'], item['brand'], item['rank']):
            print(a, b, c, d, e)
        # Paginated crawling
        if self.pagenum < 6:
            self.pagenum += 1
            newurl = "https://price.pcauto.com.cn/top/k0-p{}.html".format(str(self.pagenum))
            print(newurl)
            yield scrapy.Request(newurl, callback=self.parse)
```

2. In pipelines.py:

```python
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
import csv, pymysql

class TaipingyangqichePipeline:
    def open_spider(self, spider):
        self.f = open("太平洋.csv", "w", encoding='utf-8', newline="")
        self.w = csv.writer(self.f)
        titlelist = ['name', 'temperature', 'price', 'brand', 'rank']
        self.w.writerow(titlelist)

    def process_item(self, item, spider):
        k = list(dict(item).values())
        self.listtemp = []
        for a, b, c, d, e in zip(k[0], k[1], k[2], k[3], k[4]):
            self.temp = [a, b, c, d, e]
            self.listtemp.append(self.temp)
        self.w.writerows(self.listtemp)
        return item

    def close_spider(self, spider):
        self.f.close()

class MySqlPipeline:
    def open_spider(self, spider):
        self.conn = pymysql.Connect(host="localhost", port=3306, user='root',
                                    password='277877061#xyl', db="taipy", charset='utf8')

    def process_item(self, item, spider):
        self.cursor = self.conn.cursor()
        for a, b, c, d, e in zip(item['name'], item['temperature'], item['price'], item['brand'], item['rank']):
            sql = 'insert into tpy values("%s","%s","%s","%s","%s");' % (a, b, c, d, e)
            self.cursor.execute(sql)
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.cursor.close()
        self.conn.close()
```

3. In items.py:

```python
import scrapy

class TaipingyangqicheItem(scrapy.Item):
    name = scrapy.Field()
    temperature = scrapy.Field()
    price = scrapy.Field()
    brand = scrapy.Field()
    rank = scrapy.Field()
```

4. settings.py and 5. middlewares.py are identical to the Anjuke project's apart from the project name; the lines that differ are:

```python
BOT_NAME = 'taipingyangqiche'
SPIDER_MODULES = ['taipingyangqiche.spiders']
NEWSPIDER_MODULE = 'taipingyangqiche.spiders'
USER_AGENT = 'Mozilla / 5.0(Windows NT 10.0;WOW64)'
ROBOTSTXT_OBEY = False
ITEM_PIPELINES = {
    'taipingyangqiche.pipelines.TaipingyangqichePipeline': 300,
    'taipingyangqiche.pipelines.MySqlPipeline': 301,
}
```