Python Crawler Series, Part 7


Before getting into this article, here is a quick introduction to the Scrapy framework. Scrapy is an application framework written in pure Python for crawling websites and extracting structured data, and it is used in a very wide range of scenarios. That is the power of a framework: you only need to customize a few modules to get a working crawler that scrapes page content and images with very little effort.

Scrapy architecture diagram

Scrapy Engine: responsible for the communication, signalling, and data transfer between the Spider, Item Pipeline, Downloader, and Scheduler.

Scheduler: receives the Requests sent over by the Engine, organizes and enqueues them according to a given policy, and hands them back to the Engine when it asks for them.

Downloader: downloads all the Requests sent by the Scrapy Engine and returns the Responses it obtains to the Engine, which passes them on to the Spider for processing.

Spider: processes all Responses, parses and extracts the data needed to fill the Item fields, and submits any URLs that need to be followed back to the Engine, where they re-enter the Scheduler (a minimal sketch of this flow follows this list).

Item Pipeline: where the Items produced by the Spider are post-processed (detailed analysis, filtering, storage, and so on).

Downloader Middlewares: think of these as components you can customize to extend the download functionality.

Spider Middlewares: components you can customize to extend and hook into the communication between the Engine and the Spider (for example, the Responses entering the Spider and the Requests leaving it).
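To see how these pieces fit together in code, here is a minimal illustrative spider; the site (quotes.toscrape.com, the Scrapy tutorial sandbox) and the selectors are purely for demonstration and have nothing to do with the project built later in this article:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes_demo"
    # the Engine feeds these URLs to the Scheduler, which queues Requests for the Downloader
    start_urls = ["http://quotes.toscrape.com/"]

    def parse(self, response):
        # the Downloader returned this Response via the Engine; we extract data here
        for quote in response.css("div.quote"):
            # dict items yielded here are routed by the Engine to the Item Pipeline
            yield {"text": quote.css("span.text::text").extract_first()}
        # Requests yielded here go back through the Engine into the Scheduler
        next_page = response.css("li.next a::attr(href)").extract_first()
        if next_page:
            yield response.follow(next_page, callback=self.parse)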

Installing the Scrapy framework

1. python -m pip install --upgrade pip    (upgrade pip to the latest version)

2. pip install wheel    (install wheel)

3. Install lxml, Twisted, and pywin32, in that order.

4. pip install scrapy    (install Scrapy)

I have put the lxml, Twisted, and pywin32 packages together in one place for anyone who wants to download them directly; I am using Python 3.7.
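Once everything is installed, a quick sanity check from a Python shell confirms that Scrapy and its main dependencies are importable (the exact versions printed depend on what pip resolved in your environment):

import scrapy
import twisted
from lxml import etree

print("Scrapy:", scrapy.__version__)
print("Twisted:", twisted.__version__)
print("lxml:", etree.LXML_VERSION)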

Basic Scrapy commands

scrapy startproject ITCast    create a Scrapy project
scrapy    list the available commands
scrapy genspider itcast "http://www.itcast.cn"    create a spider named itcast with the given start domain
scrapy check itcast    run the spider's contract checks
scrapy crawl itcast    start the spider
scrapy shell "http://www.itcast.cn"    inspect responses interactively, similar to IPython

Basics (a runnable example follows the workflow list below)

xpath(): takes an XPath expression and returns a SelectorList of all matching nodes
extract(): serializes each node to a Unicode string and returns them as a list
css(): takes a CSS expression and returns a SelectorList of all matching nodes; the syntax is similar to BeautifulSoup4
re(): extracts data with the given regular expression and returns a list of Unicode strings

Crawler workflow

1. scrapy startproject XXX
2. scrapy genspider xxx "http://www.xxx.com"
3. Edit items.py and define the data you want to extract
4. Edit spiders/xxx.py: write the spider, handle requests and responses, and extract the data (yield item)
5. Edit pipelines.py: write the pipeline that processes the items returned by the spider, e.g. persisting them locally
6. Edit settings.py: enable the pipeline via ITEM_PIPELINES = {} and adjust other settings
7. Run the spider with scrapy crawl xxx
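To make xpath(), css(), extract(), and re() concrete, here is a small self-contained sketch; the HTML snippet and the selectors are made up purely for illustration:

from scrapy.selector import Selector

html = """
<ul>
  <li class="course">Python <span>666 people</span></li>
  <li class="course">C++ <span>233 people</span></li>
</ul>
"""

sel = Selector(text=html)

# xpath(): returns a SelectorList of the nodes matching the expression
titles = sel.xpath('//li[@class="course"]/text()').extract()

# css(): the same idea with CSS syntax
spans = sel.css('li.course span::text').extract()

# re(): regular-expression extraction, returns a plain list of strings
numbers = sel.css('li.course span::text').re(r'\d+')

print(titles)   # ['Python ', 'C++ ']
print(spans)    # ['666 people', '233 people']
print(numbers)  # ['666', '233']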

Hands-on with Scrapy: scraping streamer room data from a live-streaming platform

First, following the steps above, create a Scrapy project: scrapy startproject douyuLOL

The auto-generated project directory looks like this:
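For reference, scrapy startproject lays the project out in the standard way shown below; the spider file douyulol.py that we write later lives under spiders/ and is not part of the initial scaffold:

douyuLOL/
    scrapy.cfg          # project/deploy configuration
    douyuLOL/
        __init__.py
        items.py        # item definitions
        middlewares.py  # spider and downloader middlewares
        pipelines.py    # item pipelines
        settings.py     # project settings
        spiders/
            __init__.py
            douyulol.py # the spider we add below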

Next we pin down the target URL. I have already found it by analysing the site, to save you some time; the details are in the code below.

Data returned by the target URL
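If you want to confirm the structure of that JSON yourself before writing any Scrapy code, a quick stand-alone check works. The field names used here (msg, data, pgcnt, rl, rn, nn, ol) are the ones the spider below relies on; the endpoint belongs to Douyu and may change at any time:

import json
from urllib.request import Request, urlopen

# page 1 of the directory interface used by the spider below
url = "https://www.douyu.com/gapi/rkc/directory/1_1/1"
req = Request(url, headers={"User-Agent": "Mozilla/5.0"})
data = json.loads(urlopen(req).read().decode("utf-8"))

print(data["msg"])                      # 'success' when the call is accepted
print(data["data"]["pgcnt"])            # total number of pages
first = data["data"]["rl"][0]           # first room in the list
print(first["rn"], first["nn"], first["ol"])  # room title, streamer name, heat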

By analysing the returned data we decide which fields to extract, and write the following items.py file:

# -*- coding: utf-8 -*-
import scrapy


class DouyulolItem(scrapy.Item):
    # define the fields for your item here like:
    # cover image URL
    imgurl = scrapy.Field()
    # room title
    roomtitle = scrapy.Field()
    # game name
    gamename = scrapy.Field()
    # streamer name
    zbname = scrapy.Field()
    # room heat (popularity)
    hot = scrapy.Field()
    # streamer introduction
    introduce = scrapy.Field()
    # room URL
    roomurl = scrapy.Field()
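A DouyulolItem behaves like a dictionary that only accepts the fields declared above, which is exactly how the spider fills it in the next step. A tiny check, run from a Python shell opened in the project root so the import resolves:

from douyuLOL.items import DouyulolItem

item = DouyulolItem()
item["zbname"] = "some streamer"   # fine: zbname is a declared Field
print(dict(item))                  # {'zbname': 'some streamer'}
# item["foo"] = 1 would raise KeyError, because 'foo' is not a declared Field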

Next, we write the spider file douyulol.py:

# -*- coding: utf-8 -*-
import json

import scrapy

from ..items import DouyulolItem


class DouyulolSpider(scrapy.Spider):
    name = 'douyulol'
    # current page number on the target site
    offset = 1
    # URL prefix of the target interface
    basturl = 'https://www.douyu.com/gapi/rkc/directory/1_1/'
    # build the first target URL
    start_urls = [basturl + str(offset)]

    def parse(self, response):
        # parse the JSON returned by the interface (response.text is the newer equivalent)
        jsondata = json.loads(response.body_as_unicode())
        if jsondata['msg'] == 'success':
            data = jsondata['data']
            # total number of pages
            pgcnt = data['pgcnt']
            for node in data['rl']:
                # copy the fields we care about into an item
                item = DouyulolItem()
                item["imgurl"] = node['rs1']
                item["roomtitle"] = node['rn']
                item["gamename"] = node['c2name']
                item["zbname"] = node['nn']
                item["hot"] = node['ol']
                if len(node['od']):
                    item["introduce"] = node['od']
                else:
                    item["introduce"] = '无介绍'  # "no introduction"
                item["roomurl"] = node['url']
                # hand the item over to the pipeline for post-processing
                yield item
            if self.offset < pgcnt:
                # move on to the next page
                self.offset += 1
                # schedule the next page and parse it with the same callback
                yield scrapy.Request(self.basturl + str(self.offset), callback=self.parse)

Then we write the pipeline file pipelines.py to process the items handed over by the spider (note: the MysqlHelper database helper class was covered in an earlier article, so it is not repeated here).
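If you do not have that earlier article at hand, a minimal stand-in built on pymysql would satisfy the calls the pipeline makes below. The constructor arguments and the all()/execute() methods are inferred from how the pipeline uses them, so treat this as a sketch rather than the original class; save it as douyuLOL/MysqlHelper.py:

# MysqlHelper.py - minimal stand-in, assuming pymysql is installed
import pymysql


class MysqlHelper(object):
    def __init__(self, host, port, user, password, db, charset):
        self.conn_args = dict(host=host, port=port, user=user,
                              password=password, database=db, charset=charset)

    def _connect(self):
        return pymysql.connect(**self.conn_args)

    def all(self, sql, params=None):
        # run a SELECT and return all rows
        conn = self._connect()
        try:
            with conn.cursor() as cursor:
                cursor.execute(sql, params or [])
                return cursor.fetchall()
        finally:
            conn.close()

    def execute(self, sql, params=None):
        # run an INSERT/UPDATE and commit
        conn = self._connect()
        try:
            with conn.cursor() as cursor:
                cursor.execute(sql, params or [])
            conn.commit()
        finally:
            conn.close()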

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
from datetime import datetime

from .MysqlHelper import MysqlHelper


class DouyulolPipeline(object):

    def __init__(self):
        # connect to the local MySQL service
        self.mysql = MysqlHelper('localhost', 8088, 'root', '123', 'douyu', 'utf8')
        # URL prefix for streamer rooms
        self.baseurl = "https://www.douyu.com"

    def process_item(self, item, spider):
        # cover image URL
        imgurl = item['imgurl']
        # room title
        roomtitle = item['roomtitle']
        # game name
        gamename = item['gamename']
        # streamer name
        zbname = item['zbname']
        # room heat
        hot = item['hot']
        # streamer introduction
        introduce = item['introduce']
        # full room URL
        roomurl = self.baseurl + item['roomurl']
        # update timestamp
        updatetime = datetime.now().strftime('%Y-%m-%d %H:%M:%S')
        # if the streamer is already in the database, update the row; otherwise insert a new one
        sql = "select count(*) from douyulol where zbname = %s"
        point = self.mysql.all(sql, [zbname])
        if point[0][0] == 0:
            sql1 = "insert into douyulol(imgurl,roomtitle,gamename,zbname,hot,introduce,roomurl,updatetime) values (%s,%s,%s,%s,%s,%s,%s,%s)"
            self.mysql.execute(sql1, [imgurl, roomtitle, gamename, zbname, hot, introduce, roomurl, updatetime])
        else:
            print("This streamer already exists; updating the row...")
            sql2 = "update douyulol set imgurl = %s, roomtitle = %s, gamename = %s, hot = %s, introduce = %s, roomurl = %s, updatetime = %s where zbname = %s"
            self.mysql.execute(sql2, [imgurl, roomtitle, gamename, hot, introduce, roomurl, updatetime, zbname])
        return item

    # def close_spider(self):
    #     pass

Create the MySQL table with:

CREATE TABLE `douyulol` (
  `id` int NOT NULL AUTO_INCREMENT,
  `imgurl` varchar(200) DEFAULT NULL,
  `roomtitle` varchar(200) DEFAULT NULL,
  `gamename` varchar(200) DEFAULT NULL,
  `zbname` varchar(200) DEFAULT NULL,
  `hot` varchar(200) DEFAULT NULL,
  `introduce` varchar(200) DEFAULT NULL,
  `roomurl` varchar(200) DEFAULT NULL,
  `updatetime` datetime DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=15724 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci;

Finally, we edit the Scrapy configuration file settings.py:

# -*- coding: utf-8 -*-

# Scrapy settings for douyuLOL project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'douyuLOL'

SPIDER_MODULES = ['douyuLOL.spiders']
NEWSPIDER_MODULE = 'douyuLOL.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'douyuLOL (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'douyuLOL.middlewares.DouyulolSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
    'douyuLOL.middlewares.DouyulolDownloaderMiddleware': 543,
}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'douyuLOL.pipelines.DouyulolPipeline': 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
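Note that settings.py enables douyuLOL.middlewares.DouyulolDownloaderMiddleware, which scrapy startproject generates as a do-nothing template and which this article never edits. If you want it to do something useful, a common choice is attaching a browser-like User-Agent to every outgoing request; a sketch of that idea (not the author's code) would be:

# middlewares.py (excerpt) - one possible use of the enabled downloader middleware
class DouyulolDownloaderMiddleware(object):

    def process_request(self, request, spider):
        # runs for every request before it reaches the Downloader
        request.headers.setdefault(
            b"User-Agent",
            b"Mozilla/5.0 (Windows NT 10.0; Win64; x64)")
        return None  # returning None lets Scrapy continue processing normally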

And that's it. Now we can happily crawl the data we want.

Open the PyCharm terminal, change into the douyuLOL directory, and run scrapy crawl douyulol to start the spider; you can then watch the target data fly into the database.
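If you prefer launching from a script rather than the terminal, Scrapy's CrawlerProcess offers an equivalent entry point; a small sketch, run from the project root so that get_project_settings() picks up settings.py:

# run.py - start the crawler from Python instead of "scrapy crawl douyulol"
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())
process.crawl("douyulol")   # same spider name as on the command line
process.start()             # blocks until the crawl finishes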

Open MySQL and look at the scraped data: almost 12,000 rows already, and the number keeps climbing...

As you can see, the Scrapy framework lets you build a high-performance crawler very quickly. It is well worth taking the time to learn; it really is practical.

Final notes

If there are bugs in the code, feel free to point them out in the comments; we all grow faster by exchanging ideas.

That wraps up this article. Thanks for reading.


