Getting Started with Scrapy: Scraping Web Data with Ease


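This article walks through the basics of Scrapy, a powerful web-crawling framework for Python, so that you can get up and running quickly. Scrapy makes it fast to fetch pages and straightforward to extract and store the data they contain. As a working example, we crawl game information from the 4399 mini-games site to demonstrate Scrapy's basic usage. 🎮🕷️

Introduction to Scrapy 🌐

Scrapy is an application framework written in Python and built for crawling websites and extracting, processing, and storing their data. It is built on Twisted, an asynchronous networking framework, which lets it keep many requests in flight at once and handle large crawls. Its efficiency and ease of use have made it a first choice for many data scientists and crawler developers. 🛠️

Installing Scrapy 🔧

Installing Scrapy is simple once Python is installed on your machine. Scrapy requires Python 3; the minimum supported version depends on the Scrapy release (older releases accepted Python 3.6, newer ones require a later version), so check the release notes for the version you install. The install command is:

pip install scrapy

Creating a Scrapy Project 📦

Creating a Scrapy project is just as direct. Run the following command in a terminal to generate a project named game: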

scrapy startproject game

This command creates a directory containing the basic structure of a Scrapy project, including the spiders directory that holds the spider code.
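As a quick sanity check after running the command, the generated layout can be verified from Python. This is a minimal sketch using only the standard library; the file list reflects what Scrapy's default project template is expected to produce, and the helper name missing_project_files is made up for this illustration.

```python
from pathlib import Path

# Files that `scrapy startproject game` is expected to generate,
# relative to the outer project directory (default template assumed).
EXPECTED_FILES = [
    "scrapy.cfg",
    "game/__init__.py",
    "game/items.py",
    "game/middlewares.py",
    "game/pipelines.py",
    "game/settings.py",
    "game/spiders/__init__.py",
]


def missing_project_files(project_root):
    """Return the expected files that are absent under project_root."""
    root = Path(project_root)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    # Run from the directory in which `scrapy startproject game` was executed.
    missing = missing_project_files("game")
    print("missing files:", missing or "none")
```

An empty result means the skeleton is complete; anything listed suggests the command was run in a different directory or a customized template is in use.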

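Writing the Spider 🕸️

The heart of Scrapy is the spider, which is responsible for extracting the information you need from web pages. Create one with the following commands:

cd game
scrapy genspider xiaoyouxi 4399.com

This creates a spider file named xiaoyouxi.py in the project's game/spiders directory.

Extracting Data 📝

Scrapy uses Selectors to extract data and supports both XPath and CSS expressions. For example, the following lines extract the first matching game name and category from the page:

name = response.xpath("//ul[@class='n-game cf']/li/a/b/text()").extract_first()
category = response.xpath("//ul[@class='n-game cf']/li/em/a/text()").extract_first()

A relative path such as ./em/a/text() only makes sense when it is evaluated against an individual li selector, which is how the full spider at the end of this article uses it.

Storing Data 💾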

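Scrapy offers several ways to store data. The simplest is to export the scraped items directly to a JSON or CSV file:

scrapy crawl xiaoyouxi -o games.json

Running the Spider 🚀

Finally, running the spider takes a single command:

scrapy crawl xiaoyouxi

Results

[Image: screenshot of the crawl output]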
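If the crawl was run with -o games.json as in the storage step, the exported file can be inspected with a few lines of standard-library Python. The sketch below assumes the JSON feed is a single array of objects with the name, category, and date keys used by the spider in this article; the helper names are illustrative.

```python
import json
from collections import Counter


def load_games(path):
    """Load the list of scraped items from a JSON feed export."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def count_by_category(games):
    """Count how many games fall into each category (None if missing)."""
    return Counter(game.get("category") for game in games)
```

Typical usage would be games = load_games("games.json") followed by count_by_category(games).most_common(5). Note that -o appends to an existing file, which leaves a JSON file unparseable after a second run; deleting the old file first, or using the -O option available in recent Scrapy versions to overwrite it, avoids that.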

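Implementation Approach 🧠

The steps above sum up the basic approach to building a Scrapy crawler:

1. Create the project and spider: use Scrapy's command-line tool to generate the project skeleton and the spider file.
2. Write the spider logic: define the URLs to crawl, the parse method, and the extraction logic in the spider file.
3. Extract the data: use XPath or CSS selectors to pull the required data out of the HTML.
4. Store the data: save the extracted data to a file or a database.
5. Run the spider: start the crawl from the command line to begin fetching and storing data.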

Scrapy's strength is that it provides one complete framework covering page crawling, data extraction, and storage, which speeds up development considerably.
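For the storage step, a common alternative to the -o export is an item pipeline: a plain Python class whose process_item method Scrapy calls for every item, with optional open_spider and close_spider hooks. The sketch below is one possible pipeline, not the one used in this article; the class name JsonLinesPipeline and the default file name games.jsonl are hypothetical.

```python
import json


class JsonLinesPipeline:
    """Write every scraped item to a file, one JSON object per line."""

    def __init__(self, path="games.jsonl"):
        self.path = path  # hypothetical output file name
        self.file = None

    def open_spider(self, spider):
        # Called once when the spider starts: open the output file.
        self.file = open(self.path, "w", encoding="utf-8")

    def close_spider(self, spider):
        # Called once when the spider finishes: flush and close the file.
        self.file.close()

    def process_item(self, item, spider):
        # Trim stray whitespace from string fields in the written copy.
        cleaned = {
            key: value.strip() if isinstance(value, str) else value
            for key, value in dict(item).items()
        }
        # ensure_ascii=False keeps Chinese text readable in the output.
        self.file.write(json.dumps(cleaned, ensure_ascii=False) + "\n")
        return item  # pass the original item on to any later pipeline
```

Assuming the class is placed in game/pipelines.py, it would be enabled by adding an entry such as "game.pipelines.JsonLinesPipeline": 400 to ITEM_PIPELINES in settings.py.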

Full Code

xiaoyouxi.py

import scrapy


class XiaoyouxiSpider(scrapy.Spider):
    name = "xiaoyouxi"  # spider name
    allowed_domains = ["4399.com"]  # allowed domains
    start_urls = ["https://www.4399.com/flash/"]  # start URL

    def parse(self, response):  # parse callback
        # Extract all game names from the page; xpath() returns Selector
        # objects and extract() converts them to a list of strings.
        # game_name = response.xpath("//ul[@class='n-game cf']/li/a/b/text()").extract()
        # print(game_name)

        li_list = response.xpath("//ul[@class='n-game cf']/li")
        for item in li_list:
            name = item.xpath("./a/b/text()").extract_first()  # game name
            category = item.xpath("./em/a/text()").extract_first()  # game category
            date = item.xpath("./em/text()").extract_first()  # date
            dic = {
                "name": name,
                "category": category,
                "date": date
            }
            yield dic  # yield makes parse a generator that hands each item back to Scrapy

pipelines.py

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


# useful for handling different item types with a single interface
from itemadapter import ItemAdapter

'''
Pipelines are disabled by default and must be enabled in settings.py.
Purpose: process the data scraped by the spider
1. Data cleaning
2. Data storage
3. Deduplication
4. Validation
5. Filtering
6. Transformation
'''


class GamePipeline:
    def process_item(self, item, spider):  # process an item
        print(item)
        # print(spider.name)
        return item


class NewPipeline:
    def process_item(self, item, spider):  # process an item
        # site name
        item['webname'] = "4399小游戏"
        return item

settings.py

# Scrapy settings for game project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

'''
Crawler configuration file.
Purpose: configure the crawler's parameters
1. Request headers
2. Proxies
3. IP
4. Crawl speed
5. Crawl depth
6. Crawl scope
'''

BOT_NAME = "game"  # bot (project) name

SPIDER_MODULES = ["game.spiders"]  # modules where Scrapy looks for spiders
NEWSPIDER_MODULE = "game.spiders"  # module where genspider creates new spiders

LOG_LEVEL = "WARNING"  # log level: CRITICAL, ERROR, WARNING, INFO, DEBUG

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = "game (+http://www.yourdomain.com)"  # user agent

# Obey robots.txt rules
ROBOTSTXT_OBEY = True  # whether to obey robots.txt

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32  # number of concurrent requests

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3  # download delay in seconds
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16  # concurrent requests per domain
#CONCURRENT_REQUESTS_PER_IP = 16  # concurrent requests per IP

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False  # whether cookies are enabled

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False  # whether the telnet console is enabled

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {  # request headers
#    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",  # accepted content types
#    "Accept-Language": "en",  # accepted language
#}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {  # spider middlewares
#    "game.middlewares.GameSpiderMiddleware": 543,  # uncomment to enable
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {  # downloader middlewares
#    "game.middlewares.GameDownloaderMiddleware": 543,  # uncomment to enable
#}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {  # extensions
#    "scrapy.extensions.telnet.TelnetConsole": None,  # None disables this extension
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {  # item pipelines
    # key: import path of the pipeline class
    # value: priority in the range 0-1000; lower numbers run first
    "game.pipelines.GamePipeline": 300,
    "game.pipelines.NewPipeline": 299,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True  # automatic throttling
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5  # initial download delay
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60  # maximum download delay under high latency
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0  # average number of parallel requests per remote server
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False  # whether to enable AutoThrottle debug output

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True  # whether the cache is enabled
#HTTPCACHE_EXPIRATION_SECS = 0  # cache expiry in seconds (0 means never expire)
#HTTPCACHE_DIR = "httpcache"  # cache directory
#HTTPCACHE_IGNORE_HTTP_CODES = []  # HTTP status codes that are not cached
#HTTPCACHE_STORAGE = "scrapy.extensions.httpcache.FilesystemCacheStorage"

# Set settings whose default value is deprecated to a future-proof value
REQUEST_FINGERPRINTER_IMPLEMENTATION = "2.7"  # request fingerprinting algorithm version
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"  # use the asyncio-based Twisted reactor
FEED_EXPORT_ENCODING = "utf-8"  # export feeds as UTF-8 so Chinese text stays readable

Conclusion 🌟

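Scrapy's learning curve is relatively gentle: beginners can get up to speed quickly through hands-on trial and error, while experienced developers can rely on its extensive customizability and rich feature set to build complex crawling projects.

I hope this article helps you get started with Scrapy and begin your crawling journey! If you found it useful, please like and share it. Questions and ideas are welcome in the comments. 🌈💬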

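Disclaimer ⚠️

The information and code samples in this article are provided for learning and research purposes only. When crawling web pages with Scrapy, always follow the rules in the target site's robots.txt file and respect the site's copyright and data-use policies. Do not use crawlers to collect data for any commercial use or illegal activity without the site's explicit permission.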
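Scrapy applies robots.txt rules automatically when ROBOTSTXT_OBEY = True, as in the settings.py shown earlier. To see how such rules are evaluated, here is a small sketch using the standard library's urllib.robotparser. The rules and URLs are made-up examples, not the actual robots.txt of any site, and the helper name is_allowed is illustrative.

```python
from urllib.robotparser import RobotFileParser

# Made-up rules for illustration; a real crawl would read the site's own file.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in ("https://example.com/flash/", "https://example.com/private/page.html"):
        print(url, "allowed" if is_allowed(SAMPLE_ROBOTS_TXT, "game", url) else "blocked")
```

Passing robots.txt is only a technical check; a site's terms of use and the applicable law still apply.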

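The author accepts no responsibility for any direct or indirect consequences of using the content of this article. You use the code and techniques presented here at your own risk. Before running any crawling task, familiarize yourself with the applicable laws and regulations to avoid breaking the law.

I hope this article and disclaimer help you understand and use Scrapy better, and serve as a reminder to keep the law in mind when applying technology so that what you do stays legal and compliant. Happy exploring in the world of web crawling! 🚀🌟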
