Python: Using the Scrapy Framework to Crawl Data and Save It to a CSV File (Python Web Scraping in Practice 4)

1. items.py: declare the fields of the data to be scraped

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy
from scrapy import Field


class JianshuHotTopicItem(scrapy.Item):
    '''
    Inherits from scrapy.Item; declares the fields of the data to be scraped.
    '''
    collection_name = Field()
    collection_description = Field()
    collection_article_count = Field()
    collection_attention_count = Field()
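
Each Field() simply registers a key the item accepts; Scrapy items otherwise behave like dicts, and assigning an undeclared key raises KeyError. A minimal sketch of that behaviour (the value is illustrative):

from jianshu_hot_topic.items import JianshuHotTopicItem

item = JianshuHotTopicItem()
item['collection_name'] = 'Programming'  # only declared Field()s are valid keys
print(dict(item))                        # {'collection_name': 'Programming'}
# item['undeclared'] = 1 would raise KeyError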

 

2. spiders/jianshu_hot_topic_spider.py: the extraction logic, implemented with XPath

[root@HappyLau jianshu_hot_topic]# cat spiders/jianshu_hot_topic_spider.py

#_*_ coding:utf8 _*_

import random
from time import sleep

from scrapy.spiders import CrawlSpider
from scrapy.selector import Selector
from scrapy.http import Request
from jianshu_hot_topic.items import JianshuHotTopicItem


class JianshuHotTopicSpider(CrawlSpider):
    '''
    Crawls Jianshu's hot collections and extracts the target fields from each page.
    (scrapy.Spider would also suffice here, since no crawl rules are defined.)
    '''
    name = "jianshu_hot_topic"
    start_urls = ["https://www.jianshu.com/recommendations/collections?page=2&order_by=hot"]

    def parse(self, response):
        '''
        @params: response -- the page from which the target fields are extracted
        '''
        selector = Selector(response)
        collections = selector.xpath('//div[@class="col-xs-8"]')
        for collection in collections:
            # Build a fresh item per collection so each yielded item is independent
            item = JianshuHotTopicItem()
            item['collection_name'] = collection.xpath('div/a/h4/text()').extract()[0].strip()
            item['collection_description'] = collection.xpath('div/a/p/text()').extract()[0].strip()
            item['collection_article_count'] = collection.xpath('div/div/a/text()').extract()[0].strip().replace('篇文章', '')
            item['collection_attention_count'] = collection.xpath('div/div/text()').extract()[0].strip().replace('人关注', '').replace('· ', '')
            yield item

        # Queue pages 3-10; Scrapy's dupefilter drops the requests that repeat across calls
        urls = ['https://www.jianshu.com/recommendations/collections?page={}&order_by=hot'.format(i) for i in range(3, 11)]
        for url in urls:
            sleep(random.randint(2, 7))  # crude rate limiting; see the DOWNLOAD_DELAY note in step 4
            yield Request(url, callback=self.parse)
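
One fragility worth noting: extract()[0] raises IndexError whenever an XPath expression matches nothing on the page. Scrapy selectors also provide extract_first(), which returns a default value instead of raising; a sketch of the safer variant for one of the fields above:

# Safer variant: returns the default instead of raising when nothing matches
name = collection.xpath('div/a/h4/text()').extract_first(default='').strip()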

 

3. pipelines.py: defines how the scraped data is stored. The logic here could just as well write to MySQL, MongoDB, plain files, or Excel; the example below stores to CSV:

[root@HappyLau jianshu_hot_topic]# cat pipelines.py

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html

import csv


class JianshuHotTopicPipeline(object):
    def process_item(self, item, spider):
        # open() replaces the Python 2-only file() builtin; the with block closes the
        # handle, and newline='' keeps csv from writing blank rows on Windows
        with open('/root/zhuanti.csv', 'a', newline='', encoding='utf-8') as f:
            writer = csv.writer(f)
            writer.writerow((item['collection_name'],
                             item['collection_description'],
                             item['collection_article_count'],
                             item['collection_attention_count']))
        return item
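
Re-opening the file for every item works but is wasteful. Scrapy pipelines also expose open_spider/close_spider hooks, so the file can be opened once per crawl; a sketch using the same path and field names:

import csv


class JianshuHotTopicPipeline(object):
    def open_spider(self, spider):
        # Called once when the crawl starts: open the file and keep the writer around
        self.f = open('/root/zhuanti.csv', 'a', newline='', encoding='utf-8')
        self.writer = csv.writer(self.f)

    def close_spider(self, spider):
        # Called once when the crawl ends
        self.f.close()

    def process_item(self, item, spider):
        self.writer.writerow((item['collection_name'],
                              item['collection_description'],
                              item['collection_article_count'],
                              item['collection_attention_count']))
        return item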

 

4. Update settings.py to enable the pipeline:

ITEM_PIPELINES = {
    'jianshu_hot_topic.pipelines.JianshuHotTopicPipeline': 300,
}
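
Because the sleep() calls in the spider block Scrapy's event loop, an alternative is to let the framework throttle requests via two standard settings:

# Let Scrapy throttle requests instead of calling sleep() in the spider
DOWNLOAD_DELAY = 3                # base delay, in seconds, between requests
RANDOMIZE_DOWNLOAD_DELAY = True   # vary the actual delay between 0.5x and 1.5x of the base

With everything in place, start the crawl with scrapy crawl jianshu_hot_topic; the rows accumulate in /root/zhuanti.csv.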


