Using scrapy + scrapyd + scrapydweb (a worked example)


Preliminaries: create a Scrapy spider (using the Shanghai Online hot-news channel as the example: https://hot.online.sh.cn/node/node_65634.htm)

1. Install scrapy, scrapyd, scrapyd-client and scrapydweb

pip install scrapy
pip install scrapyd
pip install scrapyd-client
pip install scrapydweb

2. Create the project

scrapy startproject newsspider

3. Create the news spider

scrapy genspider news news.com
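For reference, after scrapy startproject newsspider and scrapy genspider news news.com the standard Scrapy template leaves a layout roughly like this:

newsspider/
    scrapy.cfg                # deploy configuration, used later by scrapyd-deploy
    newsspider/
        __init__.py
        items.py              # item definitions (step 3 below)
        middlewares.py
        pipelines.py          # item pipelines (step 4 below)
        settings.py           # project settings (step 1 below)
        spiders/
            __init__.py
            news.py           # the generated spider (step 2 below)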

4. Develop and refine the project

1) Configure settings.py

Comment out the robots.txt rule (ROBOTSTXT_OBEY).

Register the pipeline (ITEM_PIPELINES).


Configure the log file (LOG_LEVEL and LOG_FILE).


Full code of settings.py:

# Scrapy settings for newsspider project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = "newsspider"

SPIDER_MODULES = ["newsspider.spiders"]
NEWSPIDER_MODULE = "newsspider.spiders"

# logging configuration
LOG_LEVEL = 'WARNING'   # log level
LOG_FILE = './log.log'  # path of the output log file

# Crawl responsibly by identifying yourself (and your website) on the user-agent
# USER_AGENT = "newsspider (+http://www.yourdomain.com)"

# Obey robots.txt rules
# ROBOTSTXT_OBEY = True

# Configure maximum concurrent requests performed by Scrapy (default: 16)
# CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
# DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
# CONCURRENT_REQUESTS_PER_DOMAIN = 16
# CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
# COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
# TELNETCONSOLE_ENABLED = False

# Override the default request headers:
# DEFAULT_REQUEST_HEADERS = {
#     "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
#     "Accept-Language": "en",
# }

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
# SPIDER_MIDDLEWARES = {
#     "newsspider.middlewares.NewsspiderSpiderMiddleware": 543,
# }

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
# DOWNLOADER_MIDDLEWARES = {
#     "newsspider.middlewares.NewsspiderDownloaderMiddleware": 543,
# }

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
# EXTENSIONS = {
#     "scrapy.extensions.telnet.TelnetConsole": None,
# }

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    "newsspider.pipelines.NewsspiderPipeline": 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
# AUTOTHROTTLE_ENABLED = True
# The initial download delay
# AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
# AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
# AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
# AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
# HTTPCACHE_ENABLED = True
# HTTPCACHE_EXPIRATION_SECS = 0
# HTTPCACHE_DIR = "httpcache"
# HTTPCACHE_IGNORE_HTTP_CODES = []
# HTTPCACHE_STORAGE = "scrapy.extensions.httpcache.FilesystemCacheStorage"

# Set settings whose default value is deprecated to a future-proof value
REQUEST_FINGERPRINTER_IMPLEMENTATION = "2.7"
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
FEED_EXPORT_ENCODING = "utf-8"

2) Write news.py: adjust the spider's allowed domain and its start URL.


Import two modules (the log module is used to write log output, and the cmdline module makes it convenient to run and debug the spider locally).


Parse the response.


Build items and pass the data on.


Full code of news.py:

import scrapy
from scrapy.utils import log
from scrapy import cmdline
from urllib.parse import urljoin

# Schedule this spider on Scrapyd with:
# curl http://localhost:6800/schedule.json -d project=newsspider -d spider=news


class NewsSpider(scrapy.Spider):
    name = "news"
    # allowed_domains = ["hot.online.sh.cn"]
    start_urls = ["https://hot.online.sh.cn/node/node_65634.htm"]

    def parse(self, response):
        log.logger.warning(response.url)
        # parse the data out of the response
        news_list = response.css('div.list_thread')
        for news in news_list:
            title = news.xpath('./h2/a/text()').extract_first()
            abstract = news.xpath('./p/text()').extract_first()
            pub_time = news.xpath('./h3/text()').extract_first()
            news_data = [title, abstract, pub_time]
            print(news_data)  # debug output
            # build an item and pass the data on to the pipeline
            item = {'title': title, 'abstract': abstract, 'pub_time': pub_time}
            yield item

        # handle pagination ("下一页" is the "next page" link text)
        next_page = response.xpath("//center/a[text()='下一页']/@href").extract_first()
        base_url = 'https://hot.online.sh.cn/node/'
        if next_page:
            yield scrapy.Request(urljoin(base_url, next_page), callback=self.parse)


if __name__ == '__main__':
    cmdline.execute('scrapy crawl news'.split())

3) Model the data in items.py

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class NewsspiderItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    abstract = scrapy.Field()
    pub_time = scrapy.Field()
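The spider above yields a plain dict, which Scrapy accepts; since items.py already defines NewsspiderItem with the same fields, the yield could just as well use it. A minimal sketch of the replacement lines inside parse(), assuming the import is added at the top of news.py:

from newsspider.items import NewsspiderItem

# instead of item = {'title': ..., 'abstract': ..., 'pub_time': ...}
item = NewsspiderItem()
item['title'] = title
item['abstract'] = abstract
item['pub_time'] = pub_time
yield item  # NewsspiderPipeline receives it just like the dict version

Either way, the item['title'] lookups in the pipeline keep working.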


4) Write pipelines.py

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html

# useful for handling different item types with a single interface
from itemadapter import ItemAdapter
import os, json


class NewsspiderPipeline:

    def open_spider(self, spider):
        # directory in which the news data will be saved
        self.os_path = os.getcwd() + '/上海热线新闻/'
        if not os.path.exists(self.os_path):
            os.mkdir(self.os_path)

    def process_item(self, item, spider):
        """
        Main logic for saving the data.
        :param item: the dict-like data passed on by the spider
        :param spider: the spider instance that produced the item
        :return: the item, so later pipelines could still process it
        """
        title = item['title']
        abstract = item['abstract']
        pub_time = item['pub_time']
        # convert the dict to a JSON string
        dict_data = {'title': title, 'abstract': abstract, 'pub_time': pub_time}
        data = json.dumps(dict_data, ensure_ascii=False)
        # append the record to the data file
        with open(self.os_path + 'data.json', 'a', encoding='utf-8') as f:
            f.write(data)
            f.write('\n')
        return item
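For illustration only, with hypothetical placeholder values rather than real scraped data, each record appended to 上海热线新闻/data.json is one JSON object per line:

{"title": "example headline", "abstract": "example summary", "pub_time": "example publish time"}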

At this point, a simple Scrapy project is complete!

Deploying the Scrapy project with Scrapyd

Reference: https://blog.csdn.net/qq_46092061/article/details/119958992

1) Modify scrapy.cfg in the project


The configuration is as follows:

# Automatically created by: scrapy startproject
#
# For more information about the [deploy] section see:
# https://scrapyd.readthedocs.io/en/latest/deploy.html

[settings]
default = newsspider.settings

# "wwww" is the deploy target name
[deploy:wwww]
url = http://localhost:6800/
# "newsspider" is the project name
project = newsspider

Notes: first uncomment the url line that was originally commented out; this is the address of the Scrapyd server we deploy to. Then change [deploy] to [deploy:wwww]. Here the target is named wwww, but the name is arbitrary; anything that identifies the deployment will do. The project value below it is the project name. That completes the configuration change.
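As a side note, scrapy.cfg can hold several named deploy targets if the same project is later deployed to more than one Scrapyd instance. A sketch, with a made-up remote host:

[deploy:local]
url = http://localhost:6800/
project = newsspider

[deploy:prod]
# hypothetical remote Scrapyd server
url = http://192.168.1.100:6800/
project = newsspider

The target is then chosen by name, for example scrapyd-deploy prod -p newsspider.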

2) Create a scrapyd-deploy.bat file (this seems to be an issue only with older versions and can probably be skipped)

Because the scrapyd-deploy script has no file extension, Windows cannot run it directly. Go into the Scripts folder of your Python environment (the Python installation directory or the virtual environment directory), create a new file named scrapyd-deploy.bat, and put the following in it (adjust the two paths to your own installation):

@echo off
"D:\python\python.exe" "D:\python\Scripts\scrapyd-deploy" %*

After this change, you can list the configured deploy targets with the "scrapyd-deploy -l" command:
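The exact formatting depends on the scrapyd-client version, but with the scrapy.cfg above it should print each deploy target name together with its URL, roughly:

wwww                 http://localhost:6800/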


Seeing the target listed shows that the earlier installation (in particular scrapyd-client) and the scrapy.cfg edits are working.

3) The status-query endpoint

This endpoint reports the current status of the Scrapyd service and its jobs. We can query it with curl as follows:

curl http://127.0.0.1:6800/daemonstatus.json

Running it from the command line shows the following:

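According to the Scrapyd API documentation, daemonstatus.json returns a small JSON document with the job counts; with nothing scheduled yet, the response looks roughly like this (node_name is your machine's name):

{"node_name": "my-pc", "status": "ok", "pending": 0, "running": 0, "finished": 0}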

4) Deploy. From the Scrapy project root, run scrapyd-deploy <deploy name> -p <project name> to upload the project to the Scrapyd server (the deploy name and project name are the ones configured in scrapy.cfg above).


(Demo) From the project root, run the command "scrapyd-deploy wwww -p newsspider".

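On success, scrapyd-deploy packs the project into an egg, uploads it to the server's addversion.json endpoint and prints the JSON reply; per the Scrapyd API it should at least report an "ok" status and the number of spiders found, along the lines of:

{"status": "ok", "spiders": 1, "node_name": "my-pc"}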

After a successful deployment, the project is also visible on the Scrapyd web page at http://localhost:6800/.


To start the spider, run the following command from the project root (any directory works, since it is just an HTTP call to Scrapyd):

curl http://localhost:6800/schedule.json -d project=newsspider -d spider=news

Scrapyd accepts the job and responds as follows:

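Per the Scrapyd API, schedule.json replies with the id of the newly created job (this is the same jobid used in the cancel command below):

{"status": "ok", "jobid": "f0a0bb8efd4a11ed9048d8c0a6a39bc0"}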

To cancel the crawl, use the following command (job is the jobid returned by schedule.json):

curl http://localhost:6800/cancel.json -d project=newsspider -d job=f0a0bb8efd4a11ed9048d8c0a6a39bc0

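The cancel endpoint reports the state the job was in before cancellation; the reply should be along the lines of:

{"status": "ok", "prevstate": "running"}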

To list the projects currently on the Scrapyd server, use:

curl http://localhost:6800/listprojects.json

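The reply lists the names of all projects uploaded to this Scrapyd instance, for example:

{"status": "ok", "projects": ["newsspider"]}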

To see detailed information about the crawl jobs, use:

curl http://localhost:6800/listjobs.json?project=newsspider

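listjobs.json groups the jobs into pending, running and finished; finished entries carry start_time and end_time, which is exactly the information the problem below is about. A response might look roughly like this (the timestamps are illustrative):

{"status": "ok", "pending": [], "running": [],
 "finished": [{"id": "f0a0bb8efd4a11ed9048d8c0a6a39bc0", "spider": "news",
               "start_time": "2023-06-04 15:19:00.000000", "end_time": "2023-06-04 15:20:00.000000"}]}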

Problem: the start time, end time and log of the crawl are not being shown (unresolved).


Deploying the Scrapy project with Scrapydweb

1) Create a folder: mkdir scrapydweb; cd scrapydweb

2) Run the command: scrapydweb (this generates the configuration file scrapydweb_settings_v10.py in the current directory)
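The generated scrapydweb_settings_v10.py is where Scrapydweb is pointed at the Scrapyd instance(s). The excerpt below is only a sketch: the option names come from scrapydweb's settings template and should be double-checked against the file your version generates, and the credentials and path are placeholders.

# scrapydweb_settings_v10.py (excerpt)

# the Scrapyd server(s) Scrapydweb should manage
SCRAPYD_SERVERS = [
    '127.0.0.1:6800',
]

# optional basic auth for the Scrapydweb UI (placeholder credentials)
ENABLE_AUTH = True
USERNAME = 'admin'
PASSWORD = 'change-me'

# directory Scrapydweb scans for Scrapy projects to auto-package (hypothetical path)
SCRAPY_PROJECTS_DIR = 'D:/path/to/projects'

After editing the file, run scrapydweb again from the same directory and open the web UI (by default at http://127.0.0.1:5000).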

An unresolved problem came up at this step; see the reference below.


Reference: https://blog.csdn.net/weixin_42486623/article/details/123235312

Comments and corrections are welcome. Thank you!

