Java crawler code (JSON): scraping data with a Java crawler


2023-07-09 07:21 | Source: compiled from the web

I. What is a web crawler?

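A web crawler is a program or script that automatically fetches information from the World Wide Web according to a set of rules. Functionally, a crawler usually has three parts: data collection, processing, and storage.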
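To make the three parts concrete, here is a minimal JDK-only sketch. It is an illustration under stated assumptions, not the code used later in this article: the class name `CrawlerPipelineSketch`, the `FAKE_WEB` map (a stand-in for real HTTP requests), and the example.com URLs are all hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class CrawlerPipelineSketch {

    // Hypothetical stand-in for the network: URL -> HTML.
    static final Map<String, String> FAKE_WEB = new LinkedHashMap<>();
    static {
        FAKE_WEB.put("http://example.com/a",
                "<html><body><p>Hello <b>crawler</b></p></body></html>");
        FAKE_WEB.put("http://example.com/b",
                "<html><body><p>Second page</p></body></html>");
    }

    // 1. Collection: get the raw page (a map lookup instead of an HTTP request).
    static String collect(String url) {
        String html = FAKE_WEB.get(url);
        return html == null ? "" : html;
    }

    // 2. Processing: reduce the HTML to plain text (crude tag stripping, illustration only).
    static String process(String html) {
        return html.replaceAll("<[^>]+>", " ").replaceAll("\\s+", " ").trim();
    }

    // 3. Storage: keep the result (an in-memory map instead of a database).
    static void store(Map<String, String> storage, String url, String text) {
        storage.put(url, text);
    }

    // Runs the three stages for every URL, in order.
    static Map<String, String> crawl(Iterable<String> urls) {
        Map<String, String> storage = new LinkedHashMap<>();
        for (String url : urls) {
            store(storage, url, process(collect(url)));
        }
        return storage;
    }

    public static void main(String[] args) {
        System.out.println(crawl(FAKE_WEB.keySet()));
    }
}
```

In the rest of the article, HttpClient plays the role of `collect`, Jsoup the role of `process`, and a database the role of `store`.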

1. Getting started

Environment setup

(1) JDK 1.8 (2) IntelliJ IDEA (3) Maven

(4) Add the HttpClient dependency. (On the official site, pick one of the most widely used versions rather than the newest one.)

    <dependency>
        <groupId>org.apache.httpcomponents</groupId>
        <artifactId>httpclient</artifactId>
        <version>4.5.2</version>
    </dependency>

2. A first crawler example

Here we write a test class that fetches the complete source of the 传智播客 (itcast) home page.

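    import org.apache.http.HttpEntity;
    import org.apache.http.client.methods.CloseableHttpResponse;
    import org.apache.http.client.methods.HttpGet;
    import org.apache.http.impl.client.CloseableHttpClient;
    import org.apache.http.impl.client.HttpClients;
    import org.apache.http.util.EntityUtils;

    import java.io.IOException;

    public class CrawerFirst {
        public static void main(String[] args) throws IOException {
            // 1. "Open the browser": create the HttpClient object
            CloseableHttpClient httpClient = HttpClients.createDefault();
            // 2. "Type the address": create the HttpGet object for a GET request
            HttpGet httpGet = new HttpGet("http://www.itcast.cn");
            // 3. "Press Enter": send the request with the HttpClient and receive the response
            CloseableHttpResponse response = httpClient.execute(httpGet);
            // 4. Parse the response and read the data,
            //    but only if the status code is 200
            if (response.getStatusLine().getStatusCode() == 200) {
                HttpEntity httpEntity = response.getEntity();
                String content = EntityUtils.toString(httpEntity, "utf-8");
                System.out.println(content);
            }
        }
    }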

The program then prints content, that is, the complete HTML of the home page.

3. HttpClient

Here we use HttpClient, a Java client for the HTTP protocol, to fetch web page data.

3.1 GET request

    public static void main(String[] args) throws IOException {
        // Create the HttpClient object
        CloseableHttpClient httpClient = HttpClients.createDefault();
        // Create the HttpGet object and set the URL to request
        HttpGet httpGet = new HttpGet("http://www.itcast.cn");
        // Send the request with httpClient and get the response
        CloseableHttpResponse response = null;
        try {
            response = httpClient.execute(httpGet);
            // Parse the response
            if (response.getStatusLine().getStatusCode() == 200) {
                // Read the response body and convert it to a string with EntityUtils
                String content = EntityUtils.toString(response.getEntity(), "utf8");
                System.out.println(content.length());
            }
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            // Close the response (it is still null if execute() threw) and the client
            if (response != null) {
                response.close();
            }
            httpClient.close();
        }
    }

3.2 GET request with parameters

Use URIBuilder to set the parameters.
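To show what setting parameters amounts to, the following JDK-only sketch builds the same kind of URL by hand. It is an approximation and not how URIBuilder is implemented: the class `QueryStringSketch` and the method `withParams` are hypothetical names, and `URLEncoder` applies form-style encoding (a space becomes `+`).

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

public class QueryStringSketch {

    // Appends the parameters to baseUrl as ?k1=v1&k2=v2, percent-encoding keys and values.
    static String withParams(String baseUrl, Map<String, String> params) {
        StringBuilder sb = new StringBuilder(baseUrl);
        char separator = '?';
        try {
            for (Map.Entry<String, String> e : params.entrySet()) {
                sb.append(separator)
                  .append(URLEncoder.encode(e.getKey(), "UTF-8"))
                  .append('=')
                  .append(URLEncoder.encode(e.getValue(), "UTF-8"));
                separator = '&';
            }
        } catch (UnsupportedEncodingException ex) {
            // UTF-8 is always available on the JVM, so this should not happen
            throw new IllegalStateException(ex);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("keys", "Java");
        System.out.println(withParams("http://yun.itheima.com/search", params));
    }
}
```

With URIBuilder the encoding and separators are handled for you, which is why the article's code only calls setParameter and build.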

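    import org.apache.http.client.methods.CloseableHttpResponse;
    import org.apache.http.client.methods.HttpGet;
    import org.apache.http.client.utils.URIBuilder;
    import org.apache.http.impl.client.CloseableHttpClient;
    import org.apache.http.impl.client.HttpClients;
    import org.apache.http.util.EntityUtils;

    public class HttpGetTest {
        public static void main(String[] args) throws Exception {
            // Create the HttpClient object
            CloseableHttpClient httpClient = HttpClients.createDefault();
            // The request address is: http://yun.itheima.com/search?keys=Java
            // Create the URIBuilder
            URIBuilder uriBuilder = new URIBuilder("http://yun.itheima.com/search");
            // Set the parameter
            uriBuilder.setParameter("keys", "Java");
            // Create the HttpGet object and set the URL to request
            HttpGet httpGet = new HttpGet(uriBuilder.build());
            System.err.println("Request being sent: " + httpGet);
            // Send the request with httpClient and get the response
            CloseableHttpResponse response = null;
            try {
                response = httpClient.execute(httpGet);
                // Parse the response
                if (response.getStatusLine().getStatusCode() == 200) {
                    // Read the response body and convert it to a string with EntityUtils
                    String content = EntityUtils.toString(response.getEntity(), "utf8");
                    System.out.println(content.length());
                }
            } catch (Exception e) {
                e.printStackTrace();
            } finally {
                // Close the response (it is still null if execute() threw) and the client
                if (response != null) {
                    response.close();
                }
                httpClient.close();
            }
        }
    }

3.3 POST request without parameters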

A POST request without parameters differs from a GET request in only one place: the declaration of the request object.

    // GET request
    HttpGet httpGet = new HttpGet("url路径地址");
    // POST request
    HttpPost httpPost = new HttpPost("url路径地址");

3.4 POST request with parameters

When a POST request carries parameters, the URL itself has none; the parameter keys=Java is submitted in the form body instead.

    public static void main(String[] args) throws Exception {
        // Create the HttpClient object
        CloseableHttpClient httpClient = HttpClients.createDefault();
        // Equivalent of the GET request http://yun.itheima.com/search?keys=Java
        // Create the HttpPost object and set the URL to request
        HttpPost httpPost = new HttpPost("http://yun.itheima.com/search");
        // Declare a list that holds the form parameters
        List<NameValuePair> params = new ArrayList<>();
        // Set the parameter
        params.add(new BasicNameValuePair("keys", "Java"));
        // Create the form entity: the first argument is the form data, the second the encoding
        UrlEncodedFormEntity formEntity = new UrlEncodedFormEntity(params, "utf8");
        // Attach the form entity to the POST request
        httpPost.setEntity(formEntity);
        // Send the request with httpClient and get the response
        CloseableHttpResponse response = null;
        try {
            response = httpClient.execute(httpPost);
            // Parse the response
            if (response.getStatusLine().getStatusCode() == 200) {
                // Read the response body and convert it to a string with EntityUtils
                String content = EntityUtils.toString(response.getEntity(), "utf8");
                System.out.println(content.length());
            }
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            // Close the response (it is still null if execute() threw) and the client
            if (response != null) {
                response.close();
            }
            httpClient.close();
        }
    }

3.5 Connection pool

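If every request creates its own HttpClient, clients and their connections are created and destroyed constantly. A connection pool solves this problem.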
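The idea behind pooling can be sketched without HttpClient at all. The following is a deliberately simplified, JDK-only illustration and not how PoolingHttpClientConnectionManager works internally: `PoolSketch` and its nested `Connection` class are hypothetical, connections are created eagerly, and `lease` fails fast instead of waiting.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;

public class PoolSketch {

    // Hypothetical stand-in for something expensive to create, such as a TCP connection.
    static final class Connection {
        static final AtomicInteger CREATED = new AtomicInteger();
        final int id = CREATED.incrementAndGet();
    }

    private final BlockingQueue<Connection> idle;

    // Creates maxTotal connections up front; they are reused from then on.
    PoolSketch(int maxTotal) {
        idle = new ArrayBlockingQueue<>(maxTotal);
        for (int i = 0; i < maxTotal; i++) {
            idle.add(new Connection());
        }
    }

    // Borrows an idle connection; fails fast when all of them are in use.
    Connection lease() {
        Connection c = idle.poll();
        if (c == null) {
            throw new IllegalStateException("pool exhausted");
        }
        return c;
    }

    // Hands the connection back so the next caller can reuse it.
    void release(Connection c) {
        idle.offer(c);
    }

    public static void main(String[] args) {
        PoolSketch pool = new PoolSketch(2);
        for (int i = 0; i < 10; i++) {
            Connection c = pool.lease();
            // ... a request would be sent over c here ...
            pool.release(c);
        }
        // Ten "requests" were served, yet only the pooled connections were ever created.
        System.out.println("connections created: " + Connection.CREATED.get());
    }
}
```

The limit passed to the constructor corresponds loosely to setMaxTotal in the article's code; the per-host limit (setDefaultMaxPerRoute) is left out of this sketch.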

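    import org.apache.http.client.methods.CloseableHttpResponse;
    import org.apache.http.client.methods.HttpGet;
    import org.apache.http.impl.client.CloseableHttpClient;
    import org.apache.http.impl.client.HttpClients;
    import org.apache.http.impl.conn.PoolingHttpClientConnectionManager;
    import org.apache.http.util.EntityUtils;

    public class HttpClientPoolTest {
        public static void main(String[] args) throws Exception {
            // Create the connection pool manager
            PoolingHttpClientConnectionManager cm = new PoolingHttpClientConnectionManager();
            // Set the maximum total number of connections
            cm.setMaxTotal(100);
            // Set the maximum number of connections per host
            cm.setDefaultMaxPerRoute(10);
            // Send a request through the connection pool manager
            doGet(cm);
        }

        private static void doGet(PoolingHttpClientConnectionManager cm) throws Exception {
            // Rather than creating an independent HttpClient each time,
            // obtain one that is backed by the connection pool
            CloseableHttpClient httpClient = HttpClients.custom().setConnectionManager(cm).build();
            HttpGet httpGet = new HttpGet("http://www.itcast.cn");
            CloseableHttpResponse response = null;
            try {
                response = httpClient.execute(httpGet);
                if (response.getStatusLine().getStatusCode() == 200) {
                    String content = EntityUtils.toString(response.getEntity(), "utf8");
                    System.out.println(content.length());
                }
            } catch (Exception e) {
                // Keep the original exception as the cause
                throw new Exception("Request failed", e);
            } finally {
                if (response != null) {
                    response.close();
                }
                // Do not close the HttpClient: it is managed by the connection pool
                // httpClient.close();
            }
        }
    }

4. Request parameters (configuring the request with RequestConfig)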

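Sometimes, because of the network or the target server, a request needs more time to complete, so we have to set the relevant timeouts ourselves.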

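    import org.apache.http.client.config.RequestConfig;
    import org.apache.http.client.methods.HttpGet;
    import org.apache.http.impl.client.CloseableHttpClient;
    import org.apache.http.impl.client.HttpClients;

    public class HttpConfigTest {
        public static void main(String[] args) {
            // Create the HttpClient object
            CloseableHttpClient httpClient = HttpClients.createDefault();
            // Create the HttpGet object and set the URL to request
            HttpGet httpGet = new HttpGet("http://www.itcast.cn");
            // Configure the request
            RequestConfig config = RequestConfig.custom()
                    .setConnectTimeout(1000)          // maximum time to establish the connection, in milliseconds
                    .setConnectionRequestTimeout(500) // maximum time to obtain a connection from the pool, in milliseconds
                    .setSocketTimeout(10 * 1000)      // maximum time for data transfer, in milliseconds
                    .build();
            // Apply the configuration to the request
            httpGet.setConfig(config);
        }
    }

II. Jsoup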

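Once a page has been fetched it still has to be parsed. String utilities or regular expressions can do this, but both come with a high development cost, so we want a tool built specifically for parsing HTML pages.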
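As a small illustration of that cost, here is a JDK-only sketch that pulls the page title out with a regular expression. The class `RegexTitleSketch` and the sample HTML strings are hypothetical; the point is only to show how easily such a pattern is fooled.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexTitleSketch {

    // Matches the first <title>...</title> pair, ignoring case and spanning line breaks.
    private static final Pattern TITLE = Pattern.compile(
            "<title[^>]*>(.*?)</title>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns the trimmed title text, or null when no title element is found.
    static String extractTitle(String html) {
        Matcher m = TITLE.matcher(html);
        return m.find() ? m.group(1).trim() : null;
    }

    public static void main(String[] args) {
        // Works on a simple page
        System.out.println(extractTitle("<html><head><title> Demo page </title></head></html>"));
        // Fooled by a commented-out title: the regex knows nothing about HTML structure
        System.out.println(extractTitle("<!-- <title>old</title> --><title>new</title>"));
    }
}
```

The second call returns the commented-out title because the pattern matches raw text, not document structure. Handling comments, scripts, entities, and malformed markup by hand is exactly the work an HTML parser takes over.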

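2.1 Introduction to Jsoup

jsoup is a Java HTML parser that can directly parse a URL, HTML text, and similar input. It provides a very convenient API for extracting and manipulating data through DOM methods, CSS selectors, and jQuery-like operations.

The main features of jsoup are:

1. Parse HTML from a URL, a file, or a string;

2. Find and extract data using DOM traversal or CSS selectors.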

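2.2 Dependencies needed to use Jsoup

    <dependency>
        <groupId>org.jsoup</groupId>
        <artifactId>jsoup</artifactId>
        <version>1.10.2</version>
    </dependency>
    <dependency>
        <groupId>junit</groupId>
        <artifactId>junit</artifactId>
        <version>4.12</version>
        <scope>test</scope>
    </dependency>
    <dependency>
        <groupId>commons-io</groupId>
        <artifactId>commons-io</artifactId>
        <version>2.4</version>
    </dependency>
    <dependency>
        <groupId>org.apache.commons</groupId>
        <artifactId>commons-lang3</artifactId>
        <version>3.9</version>
    </dependency>

2.3 Parsing a URL with Jsoup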

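Here is a small example that fetches the content of the title element of the 黑马 (itheima) home page; the code requests http://www.itcast.cn.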

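    @Test
    public void testUrl() throws Exception {
        // Parse the URL: the first argument is the URL to request,
        // the second is the request timeout in milliseconds.
        // The return value is a DOM object, which can be thought of as the fetched HTML page.
        Document doc = Jsoup.parse(new URL("http://www.itcast.cn"), 1000);
        // Use a tag selector to get the content of the title element
        String title = doc.getElementsByTag("title").first().text(); // text of the first match
        System.out.println(title);
    }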



2.4 Parsing a string with Jsoup

    @Test
    public void testString() throws Exception {
        // Read the file into a string with the FileUtils helper
        String content = FileUtils.readFileToString(new File("D:\\IdeaProjects\\党建项目 \\client\\src\\main\\resources\\templates\\web\\demo\\student\\lzjj_test.html"), "utf8");
        // Parse the string
        Document doc = Jsoup.parse(content);
        // Get the content of the title element
        String title = doc.getElementsByTag("title").first().text();
        System.out.println(title);
    }


2.5 Parsing a file with Jsoup

    @Test
    public void testFile() throws Exception {
        // Parse the file
        Document doc = Jsoup.parse(new File("D:\\IdeaProjects\\党建项目\\light-client\\src\\main\\resources\\templates\\web\\demo\\student\\lzjj_test.html"), "utf8");
        String title = doc.getElementsByTag("title").first().text();
        System.out.println(title);
    }


2.6 Getting elements with DOM methods

    @Test
    public void testDom() throws Exception {
        // Parse the file and get the Document object
        Document doc = Jsoup.parse(new File("D:\\IdeaProjects\\党建项目\\light-client\\src\\main\\resources\\templates\\web\\demo\\student\\lzjj_test.html"), "utf8");
        // Get elements
        // By id
        /*Element a = doc.getElementById("a");
        System.out.println(a.text());*/
        // By tag name
        Element element = doc.getElementsByTag("td").last();
        System.out.println(element);
        // By class
        Element test = doc.getElementsByClass("test").first();
        // By attribute name
        Elements abc = doc.getElementsByAttribute("abc");
        // By attribute name and attribute value
        Elements href = doc.getElementsByAttributeValue("href", "www.baidu.com");
    }

2.7 Getting data from an element

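The previous step obtained the elements. How do we read the data they contain?

1. Get the id from an element

2. Get the className from an element

3. Get an attribute value from an element with attr

4. Get all attributes from an element with attributes

5. Get the text content from an element with text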

    @Test
    public void testData() throws Exception {
        // Parse the file and get the Document object
        Document doc = Jsoup.parse(new File("D:\\IdeaProjects\\党建项目\\light-client\\src\\main\\resources\\templates\\web\\demo\\student\\lzjj_test.html"), "utf8");
        Element element = doc.getElementsByTag("td").last();
        // Get the element's id
        String id = element.id();
        // Get the element's class value (className)
        String className = element.className();
        System.out.println(className);
        // If className consists of several classes, get each one separately
        Set<String> strings = element.classNames();
        for (String s : strings) {
            System.out.println(s);
        }
        // Get the value of the class attribute with attr
        String aClass = element.attr("class");
        // Get the element's text content
        String text = element.text();
    }

2.8 Getting elements with combined selectors

    @Test
    public void testSelectors() throws Exception {
        // Parse the file and get the Document object
        Document doc = Jsoup.parse(new File("D:\\IdeaProjects\\党建项目\\light-client\\src\\main\\resources\\templates\\web\\demo\\student\\lzjj_test.html"), "utf8");
        // element + id
        Element element = doc.select("p#lese").first();
        // element + class
        Element ele = doc.select("p.lese").first();
        // element + attribute name
        Elements select = doc.select("p[abc]");
        // Any combination (of element, class, id, and attribute name)
        Element first = doc.select("p[abc].lese").first();
        // Descendants of an element, e.g. .city li
        Element first1 = doc.select(".city li").first();
        // Direct children of an element, e.g. .city>li
        Element first2 = doc.select(".city>ul>li").first();
        // parent > * : all direct children of a parent element
        Element first3 = doc.select(".city>ul>*").first();
        System.out.println(first);
    }

III. Case study: scraping JD.com product information

Here we only scrape part of JD.com's data: product images, prices, colors, and similar information.

3.1 Create the database table first


3.2 Add the dependencies

The project is built with Spring Boot, Spring Data JPA, and a scheduled task.

Create a Maven project and add the following dependencies.

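    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-web</artifactId>
        <version>2.1.3.RELEASE</version>
    </dependency>
    <dependency>
        <groupId>mysql</groupId>
        <artifactId>mysql-connector-java</artifactId>
        <version>8.0.13</version>
    </dependency>
    <dependency>
        <groupId>org.apache.httpcomponents</groupId>
        <artifactId>httpclient</artifactId>
        <version>4.5.2</version>
    </dependency>
    <dependency>
        <groupId>org.jsoup</groupId>
        <artifactId>jsoup</artifactId>
        <version>1.10.2</version>
    </dependency>
    <dependency>
        <groupId>junit</groupId>
        <artifactId>junit</artifactId>
        <version>4.12</version>
        <scope>test</scope>
    </dependency>
    <dependency>
        <groupId>commons-io</groupId>
        <artifactId>commons-io</artifactId>
        <version>2.4</version>
    </dependency>
    <dependency>
        <groupId>org.apache.commons</groupId>
        <artifactId>commons-lang3</artifactId>
        <version>3.9</version>
    </dependency>
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-data-jpa</artifactId>
        <version>2.1.4.RELEASE</version>
    </dependency>

3.3 Add the configuration file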

Add an application.properties configuration file.

    # Database configuration
    spring.datasource.driver-class-name=com.mysql.jdbc.Driver
    spring.datasource.url=jdbc:mysql://127.0.0.1:3306/jsoup
    spring.datasource.username=root
    spring.datasource.password=1234

    # JPA configuration
    spring.jpa.database=mysql
    spring.jpa.show-sql=true

3.4 Implementation

Start with the POJO class.

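    @Entity
    @Table(name = "jd_item")
    public class Item {
        // A JPA entity needs a primary key; the generation strategy here is an assumption
        @Id
        @GeneratedValue(strategy = GenerationType.IDENTITY)
        private Long id;
        private Long spu;
        private Long sku;
        private String title;
        private double price;
        private String pic;
        private String url;
        private Date created;
        private Date updated;

        // Getters and setters omitted (ItemTask calls setSpu, setUrl, and setPic)
    }

3.5 Wrapping HttpClient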

HttpClient is used all the time, so we wrap it in a utility class for convenience.
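One detail of the utility class is how a downloaded image gets its file name: the extension is taken from the URL and the rest is replaced by a random UUID. The sketch below isolates that step using only the JDK. It is a variation under assumptions rather than the article's exact code: `ImageNameSketch`, `extensionOf`, and `randomName` are hypothetical names, and unlike a plain `lastIndexOf(".")` it ignores a query string and returns an empty extension when the last path segment has no dot.

```java
import java.util.UUID;

public class ImageNameSketch {

    // Returns the extension including the dot (".jpg"), or "" if the last path segment has none.
    static String extensionOf(String url) {
        int query = url.indexOf('?');
        String path = query >= 0 ? url.substring(0, query) : url;
        int dot = path.lastIndexOf('.');
        int slash = path.lastIndexOf('/');
        // A dot before the last slash belongs to the host or a directory, not to the file name
        return dot > slash ? path.substring(dot) : "";
    }

    // Builds a collision-resistant file name that keeps the original extension.
    static String randomName(String url) {
        return UUID.randomUUID().toString() + extensionOf(url);
    }

    public static void main(String[] args) {
        System.out.println(randomName("https://img.example.com/n7/abc.jpg"));
    }
}
```

Renaming with a UUID avoids clashes when different products use the same image file name.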

    package com.qianlong.jd.util;

    import org.apache.http.client.config.RequestConfig;
    import org.apache.http.client.methods.CloseableHttpResponse;
    import org.apache.http.client.methods.HttpGet;
    import org.apache.http.impl.client.CloseableHttpClient;
    import org.apache.http.impl.client.HttpClients;
    import org.apache.http.impl.conn.PoolingHttpClientConnectionManager;
    import org.apache.http.util.EntityUtils;
    import org.springframework.stereotype.Component;

    import java.io.File;
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.io.OutputStream;
    import java.util.UUID;

    @Component // register an instance as a Spring bean
    public class HttpUtils {

        // Use a connection pool
        private PoolingHttpClientConnectionManager cm;

        // The constructor takes no arguments because nothing is passed in from outside;
        // it exists to initialise the connection pool declared above
        public HttpUtils() {
            this.cm = new PoolingHttpClientConnectionManager();
            // Set the maximum total number of connections
            this.cm.setMaxTotal(100);
            // Set the maximum number of connections per host
            this.cm.setDefaultMaxPerRoute(10);
        }

        /**
         * Downloads the page at the given address with a GET request.
         * @param url the address to request
         * @return the page content, or an empty string on failure
         */
        public String doGetHTML(String url) {
            // Get an HttpClient backed by the pool
            CloseableHttpClient httpClient = HttpClients.custom().setConnectionManager(cm).build();
            // Create the HttpGet object and set the URL
            HttpGet httpGet = new HttpGet(url);
            // Apply the request configuration
            httpGet.setConfig(this.getConfig());
            CloseableHttpResponse response = null;
            try {
                // Send the request with httpClient and get the response
                response = httpClient.execute(httpGet);
                // Parse the response and return the result
                if (response.getStatusLine().getStatusCode() == 200) {
                    // Only use EntityUtils if the response entity is not null
                    if (response.getEntity() != null) {
                        String content = EntityUtils.toString(response.getEntity(), "utf8");
                        // Return the page (the original listing dropped this value)
                        return content;
                    }
                }
            } catch (IOException e) {
                e.printStackTrace();
            } finally {
                // Close the response
                if (response != null) {
                    try {
                        response.close();
                    } catch (IOException e) {
                        e.printStackTrace();
                    }
                }
            }
            return "";
        }

        // Build the request configuration
        private RequestConfig getConfig() {
            RequestConfig config = RequestConfig.custom()
                    .setConnectTimeout(1000)          // maximum time to establish the connection
                    .setConnectionRequestTimeout(500) // maximum time to obtain a connection from the pool
                    .setSocketTimeout(500)            // maximum time for data transfer
                    .build();
            return config;
        }

        /**
         * Downloads an image.
         * @param url the image address
         * @return the name of the saved image, or an empty string on failure
         */
        public String doGetImage(String url) {
            // Get an HttpClient backed by the pool
            CloseableHttpClient httpClient = HttpClients.custom().setConnectionManager(cm).build();
            // Create the HttpGet object and set the URL
            HttpGet httpGet = new HttpGet(url);
            // Apply the request configuration
            httpGet.setConfig(this.getConfig());
            CloseableHttpResponse response = null;
            try {
                // Send the request with httpClient and get the response
                response = httpClient.execute(httpGet);
                // Parse the response and return the result
                if (response.getStatusLine().getStatusCode() == 200) {
                    // Only write the image if the response entity is not null
                    if (response.getEntity() != null) {
                        // Get the image's extension
                        String extName = url.substring(url.lastIndexOf("."));
                        // Rename the image with a random name
                        String picName = UUID.randomUUID().toString() + extName;
                        // Write the image into the target directory; try-with-resources
                        // closes the stream (the original concatenated the file name
                        // onto the directory name and never closed the stream)
                        try (OutputStream outputStream =
                                     new FileOutputStream(new File("D:\\suibian\\image", picName))) {
                            response.getEntity().writeTo(outputStream);
                        }
                        // Download finished: return the image name
                        return picName;
                    }
                }
            } catch (IOException e) {
                e.printStackTrace();
            } finally {
                // Close the response
                if (response != null) {
                    try {
                        response.close();
                    } catch (IOException e) {
                        e.printStackTrace();
                    }
                }
            }
            return "";
        }
    }

3.6 Fetching the data

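A scheduled task lets us fetch the latest data at regular intervals.

First write the Spring Boot application class (its location is not covered in detail here; it sits at the same level as the packages).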

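    import org.springframework.boot.SpringApplication;
    import org.springframework.boot.autoconfigure.SpringBootApplication;
    import org.springframework.scheduling.annotation.EnableScheduling;

    // Scheduled tasks must be switched on first, which is what this annotation does
    @EnableScheduling
    @SpringBootApplication
    public class Application {
        public static void main(String[] args) {
            SpringApplication.run(Application.class, args);
        }
    }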
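The crawl task in this article is scheduled with `fixedDelay`, which means the waiting period starts when the previous run finishes, not when it starts. The JDK's `scheduleWithFixedDelay` has the same semantics, so the behaviour can be sketched without Spring. Everything here (`FixedDelaySketch`, `startTimes`, and the 50 ms and 100 ms values) is hypothetical and only for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class FixedDelaySketch {

    // Runs a fake task `runs` times and returns when each run started,
    // in milliseconds since this method was called. workMs simulates the crawl itself.
    static List<Long> startTimes(int runs, long workMs, long delayMs) {
        List<Long> starts = Collections.synchronizedList(new ArrayList<Long>());
        CountDownLatch done = new CountDownLatch(runs);
        long origin = System.nanoTime();
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        try {
            scheduler.scheduleWithFixedDelay(() -> {
                if (done.getCount() == 0) {
                    return; // enough runs recorded
                }
                starts.add((System.nanoTime() - origin) / 1_000_000);
                try {
                    Thread.sleep(workMs); // stand-in for downloading and parsing
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
                done.countDown();
            }, 0, delayMs, TimeUnit.MILLISECONDS);
            done.await(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            scheduler.shutdownNow();
        }
        return new ArrayList<>(starts);
    }

    public static void main(String[] args) {
        // Consecutive start times are roughly workMs + delayMs apart, not just delayMs.
        System.out.println(startTimes(3, 50, 100));
    }
}
```

Because the delay is counted from the end of a run, a slow crawl never overlaps with the next one.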

Now for the main part: fetching the data.

    package com.qianlong.jd.task;

    import com.qianlong.jd.pojo.Item;
    import com.qianlong.jd.service.ItemService;
    import com.qianlong.jd.service.ItemServiceImpl;
    import com.qianlong.jd.util.HttpUtils;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;
    import org.springframework.beans.factory.annotation.Autowired;
    import org.springframework.scheduling.annotation.Scheduled;
    import org.springframework.stereotype.Component;

    import java.util.List;

    @Component
    public class ItemTask {

        @Autowired
        private HttpUtils httpUtils;
        @Autowired
        private ItemService itemService;

        // After a download run finishes, wait 100 seconds before the next run
        @Scheduled(fixedDelay = 100 * 1000)
        public void itemTask() throws Exception {
            // The initial address to parse
            String url = "https://search.jd.com/Search?keyword=iphone&enc=utf-8&qrst=1&rt=1&stop=1&vt=2&wq=iphone&page=1&s=1&click=";
            // Download the first five pages (page values 1, 3, 5, 7, ... in order)
            // Walk through the phone search results page by page
            for (int i = 1; i < ...
            // (The listing is cut off in the source between "i <" and "> 0"; it resumes below.)
                    ... > 0) {
                        continue;
                    }
                    // Set the product's spu
                    item.setSpu(spu);
                    // Build the URL of the product detail page
                    String itemUrl = "https://item.jd.com/" + sku + ".html";
                    item.setUrl(itemUrl);
                    // Download the product image
                    String picUrl = "https:" + skuEle.select("img[data-sku]").first().attr("data-lazy-img");
                    String picName = httpUtils.doGetImage(picUrl);
                    item.setPic(picName);
                    // Save the data to the database
                    itemService.save(item);
                }
            }
        }
    }
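The comment in the task says the first five result pages correspond to the page values 1, 3, 5, 7, and so on. As a hedged illustration of that numbering only, the sketch below generates such URLs. The names `PageUrlSketch` and `pageUrls` are hypothetical, and so is the assumption that the base URL ends in `page=` so the number can simply be appended.

```java
import java.util.ArrayList;
import java.util.List;

public class PageUrlSketch {

    // Builds URLs for the first `pages` result pages, using the odd page values 1, 3, 5, ...
    static List<String> pageUrls(String baseUrl, int pages) {
        List<String> urls = new ArrayList<>();
        for (int n = 0; n < pages; n++) {
            int page = 2 * n + 1; // 1, 3, 5, 7, 9 for the first five pages
            urls.add(baseUrl + page);
        }
        return urls;
    }

    public static void main(String[] args) {
        // Assumed base URL: a shortened search address whose last parameter is the page number
        for (String u : pageUrls("https://search.jd.com/Search?keyword=iphone&enc=utf-8&page=", 5)) {
            System.out.println(u);
        }
    }
}
```

Each generated URL would then be passed to doGetHTML and the returned page handed to Jsoup for parsing.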

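At this point the case study is essentially complete. What remains is the DAO layer that inserts the data into the database, which is omitted here.

That concludes the crawler. The material above covers the basics of crawling in Java and is enough for small demos, such as scraping part of one website's data. Real crawler projects, however, generally use a crawler framework such as WebMagic, which uses HttpClient and Jsoup under the hood, makes crawler development more convenient, and ships with commonly used built-in components. To learn crawling in more depth, study those more capable frameworks; what is shown here is only the foundation.

If you want to read the source code, download it yourself; if you found this useful, leave a comment!

Project link: https://pan.baidu.com/s/1ArXk_QlmtbhzW_wfMrerFw Extraction code: sqms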
