Crawling WeChat official account articles with Python and saving them as PDF (currently the only workable method)


2023-08-25 15:46 | Source: compiled from the web | Views: 265

WeChat official account crawler

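The approach is to search for articles through Sogou Weixin, then open each result in a simulated browser session to obtain the page source (Sogou Weixin has anti-crawling measures, so I believe this is the only workable approach), and save it locally with pdfkit. Each article takes roughly 2-4 minutes, so for an account with many articles it is best to let the script run after work.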
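To make the "run it after work" advice concrete, here is a rough estimate of the total run time. It is only a sketch: the function name `estimate_runtime_hours` is made up for illustration, and the 2-4 minutes per article is the figure quoted above, not a measurement.

```python
def estimate_runtime_hours(n_articles, min_minutes=2.0, max_minutes=4.0):
    """Return the (low, high) total run time in hours for n_articles."""
    low = n_articles * min_minutes / 60
    high = n_articles * max_minutes / 60
    return low, high


# Example: an account with about 300 articles.
low, high = estimate_runtime_hours(300)
print(f"{low:.0f} to {high:.0f} hours")  # 10 to 20 hours
```

At that rate, an account with a few hundred articles needs an overnight run rather than a lunch break.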

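1. Import the basic libraries (mainly selenium)

    import re
    from time import sleep

    from bs4 import BeautifulSoup
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    # XPath prefix of an article title. It may differ between accounts,
    # so copy the XPath of one article title yourself.
    base = r'//*[@id="sogou_vr_11002601_title_'

    def drop_e(s):
        # Post-process the HTML source; otherwise the images do not
        # show up in the saved PDF.
        string_error = r'?wx_fmt=jpeg;tp=webp;wxfrom=5;wx_lazy=1;wx_co=1'
        string_error2 = r'?wx_fmt=jpeg&wxfrom=5&wx_lazy=1&wx_co=1'
        string_error3 = r'?wx_fmt=jpeg'
        string_error4 = r'?wx_fmt=png'
        string_error5 = r'?wx_fmt=png;tp=webp;wxfrom=5;wx_lazy=1;wx_co=1'
        string_error6 = r'?wx_fmt=png&wxfrom=5&wx_lazy=1&wx_co=1'
        # Remove the long suffixes first so the short ones do not
        # leave fragments behind.
        v = s.replace(string_error, '')
        v = v.replace(string_error2, '')
        v = v.replace(string_error5, '')
        v = v.replace(string_error6, '')
        v = v.replace(string_error3, '')
        v = v.replace(string_error4, '')
        return v

2. Converting a page to PDF requires pdfkit (look up how to install and configure wkhtmltopdf)

    import pdfkit

    # Adjust this path to match your own installation.
    path_wkthmltopdf = r'D:\wkhtmltopdf\bin\wkhtmltopdf.exe'
    config = pdfkit.configuration(wkhtmltopdf=path_wkthmltopdf)

3. Drive the browser to the Sogou Weixin home page and click the login button; the QR code has to be scanned by hand

    driver = webdriver.Chrome()
    driver.get('https://weixin.sogou.com/')
    driver.find_element(By.XPATH, r'//*[@id="loginBtn"]').click()

4. After scanning the code, run the following (type the query and search for the account's articles)

    # Any official account will do. This is a work account the author
    # crawled at the time, with a little over three hundred articles.
    driver.find_element(By.XPATH, r'//*[@id="query"]').send_keys('PK霍工作室')
    driver.find_element(By.XPATH, r'//*[@id="scroll-header"]/form/div/input[1]').click()

5. Start crawling (the idea is to open each article in turn, save it, and move to the next page once the 10 articles on the current page are saved)

    page = 10  # number of pages to crawl; each page should list 10 articles
    for j in range(page):
        print('Page', j + 1)
        # Collect the title links on the current results page.
        t = BeautifulSoup(driver.page_source, 'html.parser').find_all(
            'a', id=re.compile(r'sogou_vr_11002601_title'), target='_blank')
        for i in range(len(t)):
            driver.find_element(By.XPATH, base + str(i) + '"]').click()
            sleep(5)
            # The article opens in a new tab; switch to it.
            driver.switch_to.window(driver.window_handles[-1])
            for y in range(100):  # simulate scrolling down
                driver.execute_script('window.scrollBy(0,100)')
                sleep(0.2)
            for y in range(100):  # simulate scrolling back up
                driver.execute_script('window.scrollBy(0,-100)')
                sleep(0.2)
            print('Article', i + 1)
            try:
                pdfkit.from_string(drop_e(driver.page_source),
                                   t[i].text + '.pdf', configuration=config)
            except Exception as e:
                # Skip this article but report why it failed.
                print('Failed to save:', e)
            driver.close()
            driver.switch_to.window(driver.window_handles[0])
        if j < page - 1:
            # Go to the next results page, except after the last one.
            driver.find_element(By.XPATH, r'//*[@id="sogou_next"]').click()

Notes:
1. If an individual article raises an error, turn the page by hand and run the code again.
2. If a CAPTCHA appears (you may be detected after about 20 pages), enter it by hand and resume crawling from that page.

Possible improvement: the script does not match on the account name, so a few of the saved articles may come from other accounts.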
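Two small helpers might make the loop above more robust. This is a sketch under stated assumptions, not part of the original script. `strip_wx_fmt` and `safe_filename` are hypothetical names. The first assumes every image query string starts with `?wx_fmt=` and consists of `key=value` pairs separated by `&`, `&amp;`, or `;`, as in the six suffixes listed in `drop_e`. The second replaces characters that Windows does not allow in file names, since the article title is used directly as the PDF name.

```python
import re

# Assumption: the query string looks like "?wx_fmt=jpeg&tp=webp&wxfrom=5...",
# with pairs separated by "&", "&amp;" or ";" (the forms seen in drop_e).
_WX_FMT_QUERY = re.compile(r"\?wx_fmt=\w+(?:(?:&amp;|&|;)\w+=\w+)*")

# Characters that Windows does not allow in file names, plus line breaks and tabs.
_INVALID_FILENAME_CHARS = re.compile(r'[\\/:*?"<>|\r\n\t]')


def strip_wx_fmt(html: str) -> str:
    """Remove every '?wx_fmt=...' query string from the HTML source."""
    return _WX_FMT_QUERY.sub("", html)


def safe_filename(title: str, fallback: str = "untitled") -> str:
    """Turn an article title into a string that is safe to use as a file name."""
    cleaned = _INVALID_FILENAME_CHARS.sub("_", title).strip(" .")
    return cleaned or fallback


sample = '<img src="https://mmbiz.qpic.cn/a/640?wx_fmt=png&amp;wxfrom=5&amp;wx_lazy=1">'
print(strip_wx_fmt(sample))
print(safe_filename("A/B: what?"))
```

With these in place, the save call could become `pdfkit.from_string(strip_wx_fmt(driver.page_source), safe_filename(t[i].text) + '.pdf', configuration=config)`. Whether removing other `wx_fmt` formats also fixes their images has not been verified here.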

Results
