Scraping Data into a CSV File with Python's Scrapy Framework (Python Crawler in Practice 4)
1. items.py: define the fields to be scraped

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy
from scrapy import Item
from scrapy import Field


class JianshuHotTopicItem(scrapy.Item):
    '''
    Inherits the attributes and methods of the parent class scrapy.Item;
    this class declares the fields to be scraped.
    '''
    collection_name = Field()
    collection_description = Field()
    collection_article_count = Field()
    collection_attention_count = Field()
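Before wiring up the spider, the item definition can be sanity-checked in a plain Python shell. This is a minimal sketch with made-up placeholder values, showing that an Item only accepts the fields declared above:

>>> from jianshu_hot_topic.items import JianshuHotTopicItem
>>> item = JianshuHotTopicItem()
>>> item['collection_name'] = 'programmer'   # placeholder value
>>> dict(item)
{'collection_name': 'programmer'}
>>> item['no_such_field'] = 1                # undeclared fields raise KeyError
Traceback (most recent call last):
  ...
KeyError: 'JianshuHotTopicItem does not support field: no_such_field'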
2. spiders/jianshu_hot_topic_spider.py: implements the data-extraction logic, with the fields pulled out via XPath

[root@HappyLau jianshu_hot_topic]# cat spiders/jianshu_hot_topic_spider.py
#_*_ coding:utf8 _*_

import random
from time import sleep
from scrapy.spiders import CrawlSpider
from scrapy.selector import Selector
from scrapy.http import Request
from jianshu_hot_topic.items import JianshuHotTopicItem


class jianshu_hot_topic(CrawlSpider):
    '''
    Scrapes Jianshu collection ("专题") data, extracting the target fields
    from each listing page.
    '''
    name = "jianshu_hot_topic"
    start_urls = ["https://www.jianshu.com/recommendations/collections?page=2&order_by=hot"]

    def parse(self, response):
        '''
        @params: response, extract the target fields from the response
        '''
        item = JianshuHotTopicItem()
        selector = Selector(response)
        collections = selector.xpath('//div[@class="col-xs-8"]')
        for collection in collections:
            collection_name = collection.xpath('div/a/h4/text()').extract()[0].strip()
            collection_description = collection.xpath('div/a/p/text()').extract()[0].strip()
            # Strip the "篇文章" (articles) and "人关注" (followers) suffixes, keeping only the numbers
            collection_article_count = collection.xpath('div/div/a/text()').extract()[0].strip().replace('篇文章', '')
            collection_attention_count = collection.xpath('div/div/text()').extract()[0].strip().replace("人关注", '').replace("· ", '')
            item['collection_name'] = collection_name
            item['collection_description'] = collection_description
            item['collection_article_count'] = collection_article_count
            item['collection_attention_count'] = collection_attention_count
            yield item

        # Queue pages 3-10. Note that time.sleep() blocks Scrapy's event loop;
        # the DOWNLOAD_DELAY setting (see step 4) is the non-blocking alternative.
        urls = ['https://www.jianshu.com/recommendations/collections?page={}&order_by=hot'.format(str(i)) for i in range(3, 11)]
        for url in urls:
            sleep(random.randint(2, 7))
            yield Request(url, callback=self.parse)
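When an XPath expression needs debugging, Scrapy's interactive shell is the usual tool. A session along these lines (output abbreviated) lets each expression from parse() be tried against the live page before it goes into the spider:

[root@HappyLau jianshu_hot_topic]# scrapy shell "https://www.jianshu.com/recommendations/collections?page=2&order_by=hot"
...
>>> collections = response.xpath('//div[@class="col-xs-8"]')
>>> collections[0].xpath('div/a/h4/text()').extract()[0].strip()
>>> collections[0].xpath('div/div/a/text()').extract()[0].strip()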
3. pipelines.py: define how scraped data is stored. The storage logic here could write to MySQL, MongoDB, plain files, CSV, Excel, and other backends; the example below uses CSV:

[root@HappyLau jianshu_hot_topic]# cat pipelines.py
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html

import csv


class JianshuHotTopicPipeline(object):
    def process_item(self, item, spider):
        # The original used the Python 2 built-in file(); open() works in both
        # Python 2 and 3, and the with-block closes the handle after each item.
        with open('/root/zhuanti.csv', 'a+') as f:
            writer = csv.writer(f)
            writer.writerow((item['collection_name'], item['collection_description'],
                             item['collection_article_count'], item['collection_attention_count']))
        return item
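The pipeline above reopens the CSV file for every item. A slightly more efficient variant, sketched below under the same assumptions (the output path /root/zhuanti.csv is kept from the original; the class name JianshuHotTopicCSVPipeline is introduced here for illustration), uses the pipeline's open_spider/close_spider hooks so the file is opened once per crawl:

import csv


class JianshuHotTopicCSVPipeline(object):
    '''Illustrative variant: one file handle for the whole crawl.'''

    def open_spider(self, spider):
        # Called once when the spider starts: open the file and keep the writer around.
        self.f = open('/root/zhuanti.csv', 'a+')
        self.writer = csv.writer(self.f)

    def close_spider(self, spider):
        # Called once when the spider finishes: close (and thereby flush) the file.
        self.f.close()

    def process_item(self, item, spider):
        self.writer.writerow((item['collection_name'],
                              item['collection_description'],
                              item['collection_article_count'],
                              item['collection_attention_count']))
        return item

If this variant is used, its dotted path replaces the original one in ITEM_PIPELINES. For a plain CSV dump, Scrapy's built-in feed export can also do the job with no pipeline at all: scrapy crawl jianshu_hot_topic -o zhuanti.csv infers the CSV format from the file extension.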
4. Update settings.py to register the pipeline:

ITEM_PIPELINES = {
    'jianshu_hot_topic.pipelines.JianshuHotTopicPipeline': 300,
}
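With all four files in place, the crawl is launched from the project root. The two throttling settings shown are optional and the values illustrative, but they are the non-blocking replacement for the sleep() calls inside parse(): Scrapy's scheduler applies the delay itself between requests.

# settings.py -- optional throttling (illustrative values)
DOWNLOAD_DELAY = 4               # base delay, in seconds, between requests
RANDOMIZE_DOWNLOAD_DELAY = True  # actual delay varies between 0.5x and 1.5x of the base

[root@HappyLau jianshu_hot_topic]# scrapy crawl jianshu_hot_topic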