Doc88 (道客巴巴) Crawler
Original post by 奇点_python_nlp, 2022-04-28 23:12:50
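© Copyright belongs to the author. This is an original work by 奇点_python_nlp on the 51CTO blog; contact the author for permission to reprint, otherwise legal liability will be pursued.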
![Doc88 crawler_list](https://s2.51cto.com/images/blog/202204/28011505_62697a19c35af54803.png?x-oss-process=image/watermark,size_16,text_QDUxQ1RP5Y2a5a6i,color_FFFFFF,t_30,g_se,x_10,y_10,shadow_20,type_ZmFuZ3poZW5naGVpdGk=/resize,m_fixed,w_1184)

Use the XPath Helper plugin to work out the XPath expression.

```python
import time
from urllib.parse import urljoin

from lxml import etree
from selenium import webdriver  # Selenium 2.48.0 supports PhantomJS

# List page:     https://www.doc88.com/list-8308-0-1.html
# Document page: https://www.doc88.com/p-9139147359378.html
BASE_URL = "https://www.doc88.com/"
SAMPLE_DOC_URL = "https://www.doc88.com/p-7367816610215.html"

driver = webdriver.PhantomJS(
    executable_path=r'C:\Users\wang\Desktop\phantomjs-2.1.1-windows (1)\bin\phantomjs.exe'
)

file_urls_list = []
try:
    for page in range(1, 30):
        time.sleep(3)  # pause between page loads
        # List pages follow the pattern list-8308-0-<page>.html
        url = BASE_URL + "list-8308-0-" + str(page) + ".html"
        driver.get(url)
        tree = etree.HTML(driver.page_source)
        hrefs = tree.xpath(".//h3[@class='sd-type-title']/a/@href")
        # urljoin avoids a double slash when the href already starts with "/"
        file_urls = [urljoin(BASE_URL, str(href)) for href in hrefs]
        file_urls_list.extend(file_urls)
        print(file_urls)
finally:
    driver.quit()

# Write the URLs collected from all pages, keeping only those whose
# length matches a document-page URL.
with open("url.txt", "w", encoding="utf-8") as f:
    for file_url in file_urls_list:
        if len(file_url) == len(SAMPLE_DOC_URL):
            f.write(file_url + "\n")
```

![Doc88 crawler_django_02](https://s2.51cto.com/images/blog/202204/28011506_62697a1a108b211019.png?x-oss-process=image/watermark,size_16,text_QDUxQ1RP5Y2a5a6i,color_FFFFFF,t_30,g_se,x_10,y_10,shadow_20,type_ZmFuZ3poZW5naGVpdGk=/resize,m_fixed,w_1184)
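The URL handling in the script above can be tried out without a browser. The following is only a minimal sketch using the standard library. It assumes the hrefs on a list page are site-relative paths such as `/p-7367816610215.html`, which is inferred from the sample URL in the script's length check and not confirmed against the live site. The helper names `build_list_url` and `normalize_doc_urls` are hypothetical and do not appear in the original post. It also swaps the length comparison for a regular expression, which is a different technique from the one the script uses.

```python
import re
from urllib.parse import urljoin

BASE_URL = "https://www.doc88.com/"
# Document pages look like https://www.doc88.com/p-9139147359378.html
DOC_URL_PATTERN = re.compile(r"https://www\.doc88\.com/p-\d+\.html")


def build_list_url(page, category=8308):
    """Return the list-page URL for a page number (hypothetical helper)."""
    return f"{BASE_URL}list-{category}-0-{page}.html"


def normalize_doc_urls(hrefs):
    """Join hrefs onto the site root and keep only document pages.

    Order is preserved and duplicates are dropped (hypothetical helper).
    """
    seen = set()
    result = []
    for href in hrefs:
        url = urljoin(BASE_URL, href)
        if DOC_URL_PATTERN.fullmatch(url) and url not in seen:
            seen.add(url)
            result.append(url)
    return result
```

For example, `normalize_doc_urls(["/p-7367816610215.html", "/list-8308-0-2.html"])` keeps only the first entry, because the second one is a list page and not a document page. Matching on the URL shape is arguably less brittle than comparing lengths, since it still works if the numeric ID has a different number of digits.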