Dynamic web page crawling example (WebCollector + selenium + phantomjs)


2023-10-18 18:39 | Source: compiled from the web

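Goal: crawl dynamic web pages.

Note: "dynamic web page" here covers two cases: 1) pages that require user interaction, such as the usual login step; 2) pages whose content is generated dynamically by JS / AJAX, for example an HTML page containing an element that JS fills in to produce aaa.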
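To make the second case concrete, here is a minimal sketch of why a plain fetch of the page source is not enough. The markup in `PAGE` is a hypothetical reconstruction (an assumption, not the original example), and `StaticHtmlDemo` and `staticContent` are illustrative names. The code only inspects the raw source string, the way a non-rendering crawler would, and finds the target element empty because the script has never run.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical page: this markup is an assumption made for illustration.
class StaticHtmlDemo {
    static final String PAGE =
        "<html><body>"
        + "<div id=\"test\"></div>"
        + "<script>document.getElementById('test').innerHTML = 'aaa';</script>"
        + "</body></html>";

    /** Returns the content of the div with the given id as it appears in the raw source, or null if absent. */
    static String staticContent(String html, String id) {
        Pattern p = Pattern.compile(
            "<div id=\"" + Pattern.quote(id) + "\">(.*?)</div>", Pattern.DOTALL);
        Matcher m = p.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Nothing executes the <script>, so the div is still an empty placeholder.
        String content = staticContent(PAGE, "test");
        System.out.println("static content of #test: \"" + content + "\"");
        System.out.println("is empty: " + content.isEmpty());
    }
}
```

A crawler that wants to see aaa has to run the script first, which is the job a browser engine such as htmlunit or phantomjs does.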

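WebCollector 2 is used as the crawler here. It is convenient, but dynamic-page support depends on a second API: selenium 2 (which integrates htmlunit and phantomjs).

1) Crawling that requires login, e.g. Sina Weibo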

import java.util.Set;

import cn.edu.hfut.dmic.webcollector.crawler.DeepCrawler;
import cn.edu.hfut.dmic.webcollector.model.Links;
import cn.edu.hfut.dmic.webcollector.model.Page;
import cn.edu.hfut.dmic.webcollector.net.HttpRequesterImpl;

import org.openqa.selenium.Cookie;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.htmlunit.HtmlUnitDriver;

import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

/*
 * Crawling after login
 * Refer: http://nutcher.org/topics/33
 *        https://github.com/CrawlScript/WebCollector/blob/master/README.zh-cn.md
 * Lib required: webcollector-2.07-bin, selenium-java-2.44.0 & its lib
 */
public class WebCollector1 extends DeepCrawler {

    public WebCollector1(String crawlPath) {
        super(crawlPath);
        /* Get the Sina Weibo cookie. The account and password are sent in plain text, so use a throwaway account. */
        try {
            String cookie = WebCollector1.WeiboCN.getSinaCookie("yourAccount", "yourPwd");
            HttpRequesterImpl myRequester = (HttpRequesterImpl) this.getHttpRequester();
            myRequester.setCookie(cookie);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    @Override
    public Links visitAndGetNextLinks(Page page) {
        /* Extract the weibo posts */
        Elements weibos = page.getDoc().select("div.c");
        for (Element weibo : weibos) {
            System.out.println(weibo.text());
        }
        /* To crawl comments as well, extract the comment page URLs here and return them */
        return null;
    }

    public static void main(String[] args) {
        WebCollector1 crawler = new WebCollector1("/home/hu/data/weibo");
        crawler.setThreads(3);
        /* Crawl the first 5 pages of one user's weibo */
        for(int i=0;i
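The constructor hands `setCookie` a single cookie string, but the body of `getSinaCookie` is not shown in the listing. As a rough sketch of one plausible final step, the code below joins name/value pairs into the `name=value; name2=value2` form used by a Cookie request header. `CookieHeader`, `toHeader`, and the cookie names and values are all hypothetical, and a plain `Map` stands in for whatever the login step actually returns. This is an assumption about the approach, not the author's code.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper: not part of WebCollector or selenium.
class CookieHeader {
    /** Joins cookies as "name=value; name2=value2", in the map's iteration order. */
    static String toHeader(Map<String, String> cookies) {
        StringJoiner joiner = new StringJoiner("; ");
        for (Map.Entry<String, String> e : cookies.entrySet()) {
            joiner.add(e.getKey() + "=" + e.getValue());
        }
        return joiner.toString();
    }

    public static void main(String[] args) {
        // Made-up cookie names and values, for illustration only.
        Map<String, String> cookies = new LinkedHashMap<>();
        cookies.put("session", "abc123");
        cookies.put("token", "xyz");
        System.out.println(toHeader(cookies)); // prints: session=abc123; token=xyz
    }
}
```

A `LinkedHashMap` is used so the output order is deterministic. In a real login flow the pairs would come from the browser driver after the login form is submitted.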
