Python Scraping for Beginners: Scraping Weibo Hot Search with Scrapy and Sending It by Email


2024-03-16 15:02 | Source: compiled from the web

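Contents: Environment, What to scrape and the approach, Implementation, File structure, Detailed implementation, Afterword, References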

Environment

My environment: Python 3.5 + Scrapy 2.0.0

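What to scrape and the approach

What to scrape: for each Weibo hot search entry, the keyword, the link, and the lead, that is, the short passage that summarizes what the hot search is about.

Approach:

- Hot search link: take the link attribute of the tag that holds the keyword and prepend the site prefix (see Figure 1).
- Keyword: open the keyword's link. The page usually has a spot like the one in Figure 2; parse the content out of that tag. If it is missing, store "none".
- Lead: this is also on the keyword's page (see Figure 3) and is obtained by parsing. If it is missing, take one post from the page and truncate it.
- Recommended entries (see Figure 4): these are usually ads and are out of scope. They can be filtered out while extracting the keyword links by checking whether the marker at the end of the row is "荐" (recommended).
- Saving: write the scraped content to a local txt file in the order keyword, lead, link.
- Email: override the close_spider function in the pipelines file and send the locally saved txt file to whichever mailbox you want.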
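The link and ad-filtering steps above can be sketched without Scrapy. This is only an illustration under stated assumptions: SAMPLE_ROWS is a made-up, simplified stand-in for the hot search table (not real Weibo markup), hot_search_links is a hypothetical helper name, and the standard library's ElementTree is used in place of Scrapy's selectors.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

BASE_URL = 'https://s.weibo.com/'

# Hypothetical, simplified stand-in for the hot search table (not real Weibo markup)
SAMPLE_ROWS = """
<table>
  <tr><td>1</td><td><a href="/weibo?q=topic-a">Topic A</a></td><td><i>热</i></td></tr>
  <tr><td>-</td><td><a href="/weibo?q=promo">Promo</a></td><td><i>荐</i></td></tr>
  <tr><td>2</td><td><a href="/weibo?q=topic-b">Topic B</a></td><td></td></tr>
</table>
"""


def hot_search_links(table_xml, base_url=BASE_URL):
    """Return absolute links for rows that are not marked as recommended."""
    links = []
    for row in ET.fromstring(table_xml).findall('./tr'):
        # The marker sits in the third cell; '荐' flags a recommended (ad) row
        badge = row.find('./td[3]/i')
        if badge is not None and (badge.text or '').strip() == '荐':
            continue
        anchor = row.find('./td[2]/a')
        if anchor is not None and anchor.get('href'):
            # Prepend the site prefix to the relative href
            links.append(urljoin(base_url, anchor.get('href')))
    return links


print(hot_search_links(SAMPLE_ROWS))
```

Using urljoin rather than plain string concatenation avoids a doubled slash when the prefix ends with "/" and the href starts with "/".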

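The last point was the part that gave me the most trouble. This time I sent the mail with Scrapy's built-in mail module and ran into countless pitfalls; debugging this part took up a large share of the time.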

Implementation

File structure

(Figure: screenshot of the project file structure; image not available)

Detailed implementation

Most of the work is in weiboresou.py, items.py, and pipelines.py.
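Besides those three files, the pipeline only runs if it is enabled in settings.py. The excerpt below is a sketch, not the author's actual settings file: the package name weibo is an assumption inferred from the class names and the output path, and 300 is just a conventional priority value. ITEM_PIPELINES and USER_AGENT are standard Scrapy settings.

```python
# settings.py (excerpt)
# Assumption: the Scrapy project package is named 'weibo'; adjust the dotted
# path to match your own project.

# Enable the pipeline that writes the txt file and sends the email.
# The number is the run order (lower runs first) when several pipelines exist.
ITEM_PIPELINES = {
    'weibo.pipelines.WeiboPipeline': 300,
}

# Optional: a browser-like User-Agent applied to every request, including the
# first request to the hot search page.
USER_AGENT = (
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'
)
```

With the pipeline enabled, the spider is started from the project directory with `scrapy crawl weiboresou`, where weiboresou is the spider's name attribute.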

# weiboresou.py
# -*- coding: utf-8 -*-
import scrapy

from ..items import WeiboItem

header = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'


class WeiboresouSpider(scrapy.Spider):
    name = 'weiboresou'
    allowed_domains = ['s.weibo.com']
    start_urls = ['https://s.weibo.com/top/summary?cate=realtimehot']
    # Wait 3 seconds between requests. DOWNLOAD_DELAY is used here because
    # calling time.sleep() inside a callback blocks Scrapy's event loop.
    custom_settings = {'DOWNLOAD_DELAY': 3}

    def parse(self, response):
        rows = response.xpath('//*[@id="pl_top_realtimehot"]/table/tbody/tr')
        for row in rows[:40]:
            # Skip ads: a row counts as an ad when its trailing marker is '荐' (recommended)
            if row.xpath('./td[3]/i/text()').get() == '荐':
                continue
            href = row.xpath('./td[2]/a/@href').get()
            if href:
                # urljoin prepends https://s.weibo.com/ to the relative link
                yield scrapy.Request(url=response.urljoin(href),
                                     callback=self.parse_html,
                                     dont_filter=True,
                                     headers={'User-Agent': header})

    def parse_html(self, response):
        url = str(response.url)
        # Lead: prefer the lead paragraph of the topic page
        leadNews = response.xpath('//*[@id="pl_feedlist_index"]/div[1]/div[1]/div/p/text()').get()
        if leadNews is None:
            leadNews = ''
            # Fallback: take the text of one post on the page and cut off its last 10 characters
            posts = response.xpath('//*[@id="pl_feedlist_index"]//div[2]/div[1]/div[2]/p[1]')
            if posts:
                leadNews = ''.join(posts[0].xpath('.//text()').getall()).strip()[:-10]
        # Title: stored as 'none' when the topic header is missing
        title = response.xpath('//*[@id="pl_topic_header"]/div[1]/div/div[1]/h1/a//text()').get(default='none')

        item = WeiboItem()
        item['url'] = url
        item['leadNews'] = leadNews
        item['title'] = title
        yield item


# items.py
# -*- coding: utf-8 -*-
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html
import scrapy


class WeiboItem(scrapy.Item):
    '''
    url: link
    title: title
    leadNews: lead
    '''
    url = scrapy.Field()
    title = scrapy.Field()
    leadNews = scrapy.Field()


# pipelines.py
# -*- coding: utf-8 -*-
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
from scrapy.mail import MailSender

# Local file that collects the results (the author's path; change it for your machine)
FPATH = 'E:\\Compile Tools\\python\\virEnvProject\\test\\Scripts\\weibo\\weibo.txt'


class WeiboPipeline(object):
    def process_item(self, item, spider):
        # Append one entry per item: title, lead, link, then a blank line
        with open(FPATH, 'a', encoding='utf-8') as f:
            f.write('Title: ' + str(item['title']) + '\n')
            f.write('Lead: ' + str(item['leadNews']) + '\n')
            f.write('Link: ' + str(item['url']) + '\n')
            f.write('\n')
        return item

    def close_spider(self, spider):
        mailer = MailSender(
            smtphost='smtp.qq.com',
            mailfrom='sender address',
            smtpuser='SMTP login account',  # the account that logs in to the SMTP server, normally the sender's mailbox
            smtppass='authorization code',
            smtpport=25,  # port number; 25 is the default
        )
        with open(FPATH, 'r', encoding='utf-8') as f:
            body = f.read()
        subject = 'Weibo hot search'
        # The body contains Chinese text, so set the charset explicitly.
        # Returning the Deferred lets Scrapy wait for the send to finish before shutting down.
        return mailer.send(to=['recipient address'], subject=subject, body=body, charset='utf-8')
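Since the mail step was the hardest part to debug, one option is to check the message contents separately from the crawl. The sketch below is not the scrapy.mail approach used above; it swaps in the standard library's email.message.EmailMessage purely to inspect what would be sent. The helper name build_report_message, the example addresses, and the temporary file are all hypothetical.

```python
import os
import tempfile
from email.message import EmailMessage


def build_report_message(report_path, sender, recipients, subject='Weibo hot search'):
    """Read the saved txt report and wrap it in a plain-text email message."""
    with open(report_path, 'r', encoding='utf-8') as f:
        body = f.read()
    msg = EmailMessage()
    msg['From'] = sender
    msg['To'] = ', '.join(recipients)
    msg['Subject'] = subject
    # set_content picks an encoding that can carry non-ASCII text such as Chinese
    msg.set_content(body)
    return msg


# Demo with a throwaway file, so nothing real is read and nothing is sent
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, 'weibo.txt')
    with open(demo_path, 'w', encoding='utf-8') as f:
        f.write('Title: demo topic\nLead: demo lead\nLink: https://s.weibo.com/weibo?q=demo\n')
    demo_msg = build_report_message(demo_path, 'sender@example.com', ['me@example.com'])
    print(demo_msg['Subject'])
    print(demo_msg.get_content())
```

Building the message as a separate step makes it possible to confirm the subject, recipients, and body before any SMTP settings come into play, which narrows down where a failure comes from.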

Full code: https://github.com/tonggongzhiqiu/weibo-

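Afterword

I have been learning the Scrapy framework recently, and this was a good piece of practice. Some parts, such as middlewares and anti-scraping countermeasures, went unused, so the project still has shortcomings. I will keep updating it later.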

References

Cui Qingcai (崔庆才), 《python3网络爬虫开发实战》 (Python 3 Web Crawler Development in Practice)
