[Web Crawler] 4.3 Crawling and Storing Data with Scrapy


2023-06-13 13:24 | Source: compiled from the web

1. Set up the web site

2. Write the item class

3. Write the spider MySpider

4. Write the item pipeline class

5. Edit the Scrapy settings file

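After crawling data from a web site, you usually need to store it, for example in a database or a file. The Scrapy framework provides a convenient mechanism for this. To illustrate the storage process, we first set up a simple web site, then write a Scrapy spider that crawls its data, and finally store that data.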

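1. Set up the web site

The site has a single page that returns data about a few computer science textbooks. It is implemented as a Flask program.

The server, server.py, is as follows: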

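import flask

app = flask.Flask(__name__)


@app.route("/")
def index():
    # The markup was stripped from the original listing. The <book>, <title>,
    # <author>, and <publisher> tags are restored from the XPath queries used
    # by the spider in section 3; the <books> root element is an assumption.
    html = """
    <books>
    <book><title>Python程序设计</title><author>James</author><publisher>清华大学出版社</publisher></book>
    <book><title>Java程序设计</title><author>Robert</author><publisher>人民邮电出版社</publisher></book>
    <book><title>MySQL数据库</title><author>Steven</author><publisher>高等教育出版社</publisher></book>
    </books>
    """
    return html


if __name__ == "__main__":
    app.run()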

Requesting this site returns XML data containing the title, author, and publisher of each textbook.
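To make the shape of that response concrete, here is a minimal sketch that parses a hard-coded copy of the payload with Python's standard xml.etree.ElementTree rather than with Scrapy. The name SAMPLE_XML and the <books> root element are assumptions made for illustration; only the <book>, <title>, <author>, and <publisher> tags are implied by the spider in section 3.

```python
import xml.etree.ElementTree as ET

# Hard-coded copy of the response body; the <books> root element is an assumption.
SAMPLE_XML = """
<books>
  <book><title>Python程序设计</title><author>James</author><publisher>清华大学出版社</publisher></book>
  <book><title>Java程序设计</title><author>Robert</author><publisher>人民邮电出版社</publisher></book>
  <book><title>MySQL数据库</title><author>Steven</author><publisher>高等教育出版社</publisher></book>
</books>
"""

root = ET.fromstring(SAMPLE_XML.strip())

# Collect one (title, author, publisher) tuple per <book> element.
records = [
    (book.findtext("title"), book.findtext("author"), book.findtext("publisher"))
    for book in root.iter("book")
]

for title, author, publisher in records:
    print(title, author, publisher)
```

Each tuple corresponds to one item that the spider is expected to produce later.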

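2. Write the item class

The program crawls several textbooks, each with a title, an author, and a publisher, so we define a textbook class with the fields title, author, and publisher. In the Scrapy project, the directory .\example\Test\Test contains a file named items.py, which is where item classes are defined. Open this file and change it as follows.

Before: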

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class TestItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    pass

After:

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class BookItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    author = scrapy.Field()
    # Must be named "publisher" to match the key used by the spider and the pipeline
    publisher = scrapy.Field()

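BookItem is the textbook class we designed. It must inherit from scrapy.Item. The class declares the fields of a textbook, and each field is a scrapy.Field object. Three fields are declared here to hold the title, the author, and the publisher.

If item is a BookItem object, the field values can be read and set through item["title"], item["author"], and item["publisher"], for example: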

item = BookItem()
item["title"] = "Python程序设计"
item["author"] = "James"
item["publisher"] = "清华大学出版社"
print(item["title"])
print(item["author"])
print(item["publisher"])

3. Write the spider MySpider

With the item class in place, we can write the spider (.\example\Test\Test\spiders\MySpider.py).

The spider MySpider.py is as follows:

import scrapy
from ..items import BookItem


class MySpider(scrapy.Spider):
    name = "mySpider"
    start_urls = ['http://127.0.0.1:5000']

    # Callback invoked with the response for each URL in start_urls
    def parse(self, response, **kwargs):
        try:
            data = response.body.decode()
            # Extract the data
            selector = scrapy.Selector(text=data)
            books = selector.xpath("//book")
            for book in books:
                item = BookItem()
                item["title"] = book.xpath("./title/text()").extract_first()
                item["author"] = book.xpath("./author/text()").extract_first()
                item["publisher"] = book.xpath("./publisher/text()").extract_first()
                yield item
        except Exception as err:
            print(err)

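This program requests the site at http://127.0.0.1:5000 and receives a page containing the textbook data. It works as follows:

(1)

from ..items import BookItem

Imports the BookItem class definition from items.py in the Test folder.

(2)

data = response.body.decode()
selector = scrapy.Selector(text=data)
books = selector.xpath("//book")

Decodes the response body, builds a Selector object from it, and finds all <book> elements.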

(3)

for book in books:
    item = BookItem()
    item["title"] = book.xpath("./title/text()").extract_first()
    item["author"] = book.xpath("./author/text()").extract_first()
    item["publisher"] = book.xpath("./publisher/text()").extract_first()
    yield item

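For each <book> node, the program searches beneath it for the <title> node and takes its text, which is the textbook title. Note that book.xpath("./title/text()") finds the text of the <title> node under the current <book>; the leading "./" must not be omitted, because it means "search downward from the current node". The text of the <author> and <publisher> nodes is found the same way. Together these values make up a BookItem object, which the statement yield item returns to the caller. Scrapy then passes the object to the item pipeline in pipelines.py, located in the same directory as items.py, where the pipeline class processes the data.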

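4. Write the item pipeline class

In the Scrapy project, the directory .\example\Test\Test also contains a file named pipelines.py, which holds the item pipeline classes. Opening it shows a default pipeline class generated with the project.

Modify pipelines.py and design the pipeline class as follows: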

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html

# useful for handling different item types with a single interface
from itemadapter import ItemAdapter


class BookPipeline(object):
    # Number of times process_item has been called in this run
    count = 0

    def process_item(self, item, spider):
        BookPipeline.count += 1
        try:
            if BookPipeline.count == 1:
                # First item of this run: create or truncate the file
                fobj = open("books.txt", "wt")
            else:
                # Later items: append to the existing file
                fobj = open("books.txt", "at")
            print(item["title"], item["author"], item["publisher"])
            fobj.write(item["title"] + "," + item["author"] + "," + item["publisher"] + "\n")
            fobj.close()
        except Exception as err:
            print(err)
        # Scrapy requires process_item to return the item
        return item

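We name this class BookPipeline. It inherits from object, and its most important method is process_item. When a crawl starts, Scrapy creates one BookPipeline object. Each time the spider yields a BookItem, Scrapy passes that item to the BookPipeline object and calls process_item once. The item parameter of process_item is the data that was passed in, so this method is where the crawled data can be saved. Note that Scrapy requires process_item to return the item at the end.

This program stores the crawled data in a file. BookPipeline first defines a class attribute count = 0, which records how many times process_item has been called. On the first call (count == 1), the statement fobj = open("books.txt", "wt") creates a new books.txt file, and the item's data is written to it. On later calls (count > 1), the statement fobj = open("books.txt", "at") opens the existing books.txt and appends the item's data. As a result, each run of the spider clears the data from the previous run and records only the data crawled in the current run.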
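As a design alternative to counting calls, Scrapy pipelines may also define the optional open_spider and close_spider hooks, which are called once at the start and end of a crawl. The sketch below is a hypothetical variant, not the pipeline used in this article: the class name BookFilePipeline and the path parameter are assumptions, and the hooks are driven by hand with plain dictionaries so the example runs without Scrapy.

```python
import os
import tempfile


class BookFilePipeline:
    """Hypothetical variant of BookPipeline that keeps one file handle open."""

    def __init__(self, path="books.txt"):
        self.path = path
        self.fobj = None

    def open_spider(self, spider):
        # Called once when the crawl starts; "w" discards the previous run's data.
        self.fobj = open(self.path, "w", encoding="utf-8")

    def process_item(self, item, spider):
        self.fobj.write(f'{item["title"]},{item["author"]},{item["publisher"]}\n')
        return item

    def close_spider(self, spider):
        # Called once when the crawl ends.
        self.fobj.close()


# Drive the hooks by hand, in the order Scrapy would call them.
demo_path = os.path.join(tempfile.mkdtemp(), "books.txt")
pipeline = BookFilePipeline(demo_path)
pipeline.open_spider(spider=None)
pipeline.process_item(
    {"title": "Python程序设计", "author": "James", "publisher": "清华大学出版社"}, spider=None
)
pipeline.process_item(
    {"title": "Java程序设计", "author": "Robert", "publisher": "人民邮电出版社"}, spider=None
)
pipeline.close_spider(spider=None)

with open(demo_path, encoding="utf-8") as f:
    lines = f.read().splitlines()
print(lines)
```

This opens the file once per crawl instead of once per item, and it makes the "clear the old data at the start of each run" behaviour explicit.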

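5. Edit the Scrapy settings file

Once the MySpider spider is running, every item it yields is passed to the BookPipeline class, whose process_item method is called. How does Scrapy know to do this? We have to configure the channel ourselves. The Test folder contains a settings file named settings.py. Opening it shows many settings, most of them commented out with #. Find the ITEM_PIPELINES entry and set it as follows: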

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    "Test.pipelines.BookPipeline": 300,
}

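ITEM_PIPELINES is a dictionary. Its key must be "Test.pipelines.BookPipeline", where BookPipeline is the name of the pipeline class defined in pipelines.py. The 300 that follows is the default integer; it does not have to be 300. The value sets the order in which pipelines run: items pass through pipelines with lower values first, and values are customarily chosen in the 0-1000 range.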
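The integer only matters when more than one pipeline is enabled. The following sketch shows how the values determine the order; "Test.pipelines.CleanPipeline" is a hypothetical second pipeline that does not exist in this project, and the sorting line merely imitates the ordering rule rather than calling Scrapy.

```python
# Hypothetical settings: CleanPipeline is not part of this project.
ITEM_PIPELINES = {
    "Test.pipelines.BookPipeline": 300,
    "Test.pipelines.CleanPipeline": 100,
}

# Items pass through the pipelines in ascending order of the integer value.
run_order = sorted(ITEM_PIPELINES, key=ITEM_PIPELINES.get)
print(run_order)
```

Under these assumed settings, an item would reach CleanPipeline before BookPipeline, so any cleaning would happen before the item is written to the file.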

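With this setting in place, the spider MySpider is connected to the pipeline code in pipelines.py. While Scrapy runs, it passes every item that MySpider returns through yield to the BookPipeline class in pipelines.py and calls process_item, which saves the data.

Summary:

Scrapy handles crawling and storage separately, and both run asynchronously. Each time MySpider.py obtains an item, it yields the item, which is passed to pipelines.py for storage. When storage completes, the spider obtains the next item, yields it to pipelines.py, and it is stored in turn. This continues until the crawl ends, at which point books.txt contains all of the crawled data.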
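The alternation between yielding and storing can be imitated with a plain Python generator. This is a deliberately simplified, synchronous sketch and not how Scrapy's engine is implemented; the functions parse and process_item below are stand-ins for MySpider.parse and BookPipeline.process_item, and the events list is an illustrative device for recording the order of operations.

```python
def parse():
    """Stand-in for MySpider.parse: yields one dict per book."""
    for title in ["Python程序设计", "Java程序设计", "MySQL数据库"]:
        events.append(f"yield {title}")
        yield {"title": title}


def process_item(item):
    """Stand-in for BookPipeline.process_item: records the item and returns it."""
    events.append(f"store {item['title']}")
    return item


events = []
# Each item is stored before the generator resumes to produce the next one.
stored = [process_item(item) for item in parse()]
print(events)
```

The recorded events alternate between a yield and a store for each book, mirroring the item-by-item flow described in the summary.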
