Python: Using the Scrapy Framework to Crawl Data and Save It to a CSV File (Python Web Scraping in Practice 4)

1. items.py: declare the fields of the data to be scraped

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy
from scrapy import Field


class JianshuHotTopicItem(scrapy.Item):
    '''
    Inherits from scrapy.Item; declares the fields of the data to be scraped.
    '''
    collection_name = Field()
    collection_description = Field()
    collection_article_count = Field()
    collection_attention_count = Field()
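
Each Field() simply registers a key the item accepts; Scrapy items otherwise behave like dicts, and assigning an undeclared key raises KeyError. A minimal sketch of that behaviour (the value is illustrative):

from jianshu_hot_topic.items import JianshuHotTopicItem

item = JianshuHotTopicItem()
item['collection_name'] = 'Programming'  # only declared Field()s are valid keys
print(dict(item))                        # {'collection_name': 'Programming'}
# item['undeclared'] = 1 would raise KeyError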

 

2. spiders/jianshu_hot_topic_spider.py: the extraction logic, implemented with XPath

[root@HappyLau jianshu_hot_topic]# cat spiders/jianshu_hot_topic_spider.py

#_*_ coding:utf8 _*_

import random
from time import sleep

from scrapy.spiders import CrawlSpider
from scrapy.selector import Selector
from scrapy.http import Request
from jianshu_hot_topic.items import JianshuHotTopicItem


class JianshuHotTopicSpider(CrawlSpider):
    '''
    Crawls Jianshu's hot collections and extracts the target fields from each page.
    (scrapy.Spider would also suffice here, since no crawl rules are defined.)
    '''
    name = "jianshu_hot_topic"
    start_urls = ["https://www.jianshu.com/recommendations/collections?page=2&order_by=hot"]

    def parse(self, response):
        '''
        @params: response -- the page from which the target fields are extracted
        '''
        selector = Selector(response)
        collections = selector.xpath('//div[@class="col-xs-8"]')
        for collection in collections:
            # Build a fresh item per collection so each yielded item is independent
            item = JianshuHotTopicItem()
            item['collection_name'] = collection.xpath('div/a/h4/text()').extract()[0].strip()
            item['collection_description'] = collection.xpath('div/a/p/text()').extract()[0].strip()
            item['collection_article_count'] = collection.xpath('div/div/a/text()').extract()[0].strip().replace('篇文章', '')
            item['collection_attention_count'] = collection.xpath('div/div/text()').extract()[0].strip().replace('人关注', '').replace('· ', '')
            yield item

        # Queue pages 3-10; Scrapy's dupefilter drops the requests that repeat across calls
        urls = ['https://www.jianshu.com/recommendations/collections?page={}&order_by=hot'.format(i) for i in range(3, 11)]
        for url in urls:
            sleep(random.randint(2, 7))  # crude rate limiting; see the DOWNLOAD_DELAY note in step 4
            yield Request(url, callback=self.parse)
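
One fragility worth noting: extract()[0] raises IndexError whenever an XPath expression matches nothing on the page. Scrapy selectors also provide extract_first(), which returns a default value instead of raising; a sketch of the safer variant for one of the fields above:

# Safer variant: returns the default instead of raising when nothing matches
name = collection.xpath('div/a/h4/text()').extract_first(default='').strip()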

 

3. pipelines.py: defines how the scraped data is stored. The logic here could just as well write to MySQL, MongoDB, plain files, or Excel; the example below stores to CSV:

[root@HappyLau jianshu_hot_topic]# cat pipelines.py

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html

import csv


class JianshuHotTopicPipeline(object):
    def process_item(self, item, spider):
        # open() replaces the Python 2-only file() builtin; the with block closes the
        # handle, and newline='' keeps csv from writing blank rows on Windows
        with open('/root/zhuanti.csv', 'a', newline='', encoding='utf-8') as f:
            writer = csv.writer(f)
            writer.writerow((item['collection_name'],
                             item['collection_description'],
                             item['collection_article_count'],
                             item['collection_attention_count']))
        return item
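
Re-opening the file for every item works but is wasteful. Scrapy pipelines also expose open_spider/close_spider hooks, so the file can be opened once per crawl; a sketch using the same path and field names:

import csv


class JianshuHotTopicPipeline(object):
    def open_spider(self, spider):
        # Called once when the crawl starts: open the file and keep the writer around
        self.f = open('/root/zhuanti.csv', 'a', newline='', encoding='utf-8')
        self.writer = csv.writer(self.f)

    def close_spider(self, spider):
        # Called once when the crawl ends
        self.f.close()

    def process_item(self, item, spider):
        self.writer.writerow((item['collection_name'],
                              item['collection_description'],
                              item['collection_article_count'],
                              item['collection_attention_count']))
        return item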

 

4. Update settings.py to enable the pipeline:

ITEM_PIPELINES = {
    'jianshu_hot_topic.pipelines.JianshuHotTopicPipeline': 300,
}
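
Because the sleep() calls in the spider block Scrapy's event loop, an alternative is to let the framework throttle requests via two standard settings:

# Let Scrapy throttle requests instead of calling sleep() in the spider
DOWNLOAD_DELAY = 3                # base delay, in seconds, between requests
RANDOMIZE_DOWNLOAD_DELAY = True   # vary the actual delay between 0.5x and 1.5x of the base

With everything in place, start the crawl with scrapy crawl jianshu_hot_topic; the rows accumulate in /root/zhuanti.csv.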


