[Python Scraping] A Xianyu Scraper for Collecting Product Listings


2024-07-10 06:01 | Source: compiled from the web | Views: 265

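Preface

I. Introduction

II. Scraper Workflow

1. Choose a Keyword and Build the URL

2. Send the Network Request

3. Parse the HTML and Extract Data

4. Save the Data

III. Using Proxy IPs

IV. Complete Code

V. Summary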

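Preface

Xianyu is a very popular second-hand trading platform, but it has no open API, so we need a scraper to get its data. This article shows how to scrape product information from Xianyu with Python: building the search URL, sending the network request, parsing the HTML and extracting data, and using proxy IPs while scraping. If you need to scrape other data from Xianyu, this article should also serve as a reference.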

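I. Introduction

With the rise of e-commerce, second-hand trading platforms have become increasingly popular. Xianyu, the second-hand marketplace under Taobao, reportedly has more than 100 million daily active users, which makes it an attractive platform for both sellers and buyers.

As developers, we sometimes need to pull data from Xianyu, such as price trends, popular items, or keyword rankings. Xianyu does not provide an open API, so we have to collect this data with a scraper.

This article explains in detail how to scrape product information from Xianyu with Python. We mainly use the requests and BeautifulSoup libraries. To avoid having our IP address banned by Xianyu, we also route requests through proxy IPs.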

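II. Scraper Workflow

Building our Xianyu scraper involves the following steps:

1. Choose a Keyword and Build the URL

Before scraping, we first decide which keyword to search for. It can be anything you like, such as "二手手机" (second-hand phone) or "二手电脑" (second-hand computer).

From the chosen keyword we build the URL of Xianyu's product search page, as follows: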

url = "https://2.taobao.com/search/index.htm?q={}&search_type=item&app=shopsearch".format(keyword)

Here, keyword is the search keyword we chose.
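Keywords such as "二手手机" contain non-ASCII characters, and a keyword may also contain spaces or an ampersand, so it is safer to percent-encode the query string than to paste the keyword into the URL directly. Here is a minimal sketch using urlencode from the standard library's urllib.parse module. The helper name build_search_url is hypothetical and not part of the original code; the base URL and parameters are the ones shown above.

```python
from urllib.parse import urlencode

BASE_URL = "https://2.taobao.com/search/index.htm"


def build_search_url(keyword):
    """Return the search URL with the keyword percent-encoded (hypothetical helper)."""
    params = {
        "q": keyword,
        "search_type": "item",
        "app": "shopsearch",
    }
    # urlencode encodes non-ASCII text as UTF-8 and escapes reserved characters
    return BASE_URL + "?" + urlencode(params)
```

Calling build_search_url("二手手机") gives a URL that is safe to pass to any HTTP client, whether or not that client encodes non-ASCII characters on its own.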

2. Send the Network Request

We use the requests library to send the network request:

import requests

# Browser-like User-Agent so the request looks like it comes from Chrome
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/58.0.3029.96 Safari/537.36'
}
response = requests.get(url, headers=headers)

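Before sending the request, we set the request headers. The User-Agent header carries our browser's information, which makes it harder for the server to immediately identify the request as coming from a scraper.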
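The call above has no timeout and does not check the HTTP status, so a hung connection or an error page could go unnoticed. The sketch below wraps the request in a small helper. The names fetch_html and http_get are hypothetical and introduced only for illustration, and the 10-second timeout is an arbitrary assumption. The timeout argument and the raise_for_status method are standard parts of the requests API.

```python
def fetch_html(url, headers, http_get=None, timeout=10):
    """Fetch a page and return its HTML text (hypothetical helper).

    http_get defaults to requests.get; passing another callable makes the
    function easy to try out without touching the network.
    """
    if http_get is None:
        import requests  # imported lazily so the helper can be exercised offline
        http_get = requests.get
    response = http_get(url, headers=headers, timeout=timeout)
    # Raise an error for 4xx/5xx responses instead of parsing an error page
    response.raise_for_status()
    return response.text
```

In the scraper this would be called as fetch_html(url, headers), and the returned text can be passed straight to the parser.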

3. Parse the HTML and Extract Data

We use the BeautifulSoup library to parse the HTML and extract data:

from bs4 import BeautifulSoup

soup = BeautifulSoup(response.text, 'html.parser')
# Each product card is a div with class "J_MouserOnverReq"
goods_list = soup.find_all('div', {'class': 'J_MouserOnverReq'})

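After parsing the HTML, we need to find the tags that contain the product information. By inspecting the source of the Xianyu page, we can see that every product's information sits inside a div whose class is "J_MouserOnverReq".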

4. Save the Data

In the last step, we save the scraped data. Here we use the csv module to write it to a CSV file.

import csv

# utf-8-sig keeps the Chinese text readable when the file is opened in Excel
with open('goods_info.csv', 'w', newline='', encoding='utf-8-sig') as f:
    writer = csv.writer(f)
    # Header row: product name, product price, product link
    writer.writerow(['商品名称', '商品价格', '商品链接'])
    for goods in goods_list:
        title = goods.find('p', {'class': 'item-title'}).text.strip()
        price = goods.find('p', {'class': 'price'}).text.strip()
        link = goods.find('a', {'class': 'item-link'}).get('href')
        writer.writerow([title, price, link])

With these four steps, we have a working scraper for Xianyu product information.
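One weak spot in the save step is that find returns None when a tag is missing, for example after a layout change, and reading .text from None raises AttributeError. The sketch below is one defensive variant. It assumes each goods object is a BeautifulSoup tag from goods_list and reuses the class names item-title, price, and item-link from the code above. The helper names text_or_empty, extract_item, and write_items are hypothetical.

```python
import csv
import io


def text_or_empty(node):
    """Return the stripped text of a tag, or '' if the tag was not found."""
    return node.get_text(strip=True) if node is not None else ""


def extract_item(goods):
    """Pull title, price and link out of one product tag (hypothetical helper)."""
    link_tag = goods.find("a", {"class": "item-link"})
    return [
        text_or_empty(goods.find("p", {"class": "item-title"})),
        text_or_empty(goods.find("p", {"class": "price"})),
        link_tag.get("href", "") if link_tag is not None else "",
    ]


def write_items(rows, fileobj):
    """Write the header plus the given rows to an already opened text file."""
    writer = csv.writer(fileobj)
    writer.writerow(["商品名称", "商品价格", "商品链接"])
    writer.writerows(rows)


# Usage without touching the disk: write one row into an in-memory buffer
buffer = io.StringIO()
write_items([["iPhone", "1000", "https://example.com/item"]], buffer)
# buffer.getvalue() now holds the CSV text
```

For a real file, open it with newline='' and an explicit encoding, then pass the file object to write_items together with [extract_item(goods) for goods in goods_list].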

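III. Using Proxy IPs

Sending requests too frequently can make the server suspect that we are a scraper and ban our IP address, so we use proxy IPs to hide our real IP address.

We can get proxy IPs from a proxy listing site. Here we use Zdaye (站大爷); the following code fetches proxies from its listing page: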

def get_proxies():
    response = requests.get("http://ip.zdaye.com/dayProxy.html")
    soup = BeautifulSoup(response.text, 'html.parser')
    trs = soup.find_all('tr')
    proxies = []
    # Skip the table header row
    for tr in trs[1:]:
        tds = tr.find_all('td')
        if len(tds) < 4:
            continue  # not a complete proxy row
        ip = tds[0].text.strip()
        port = tds[1].text.strip()
        protocol = tds[3].text.strip().lower()
        proxies.append("{}://{}:{}".format(protocol, ip, port))
    return proxies

This function returns a pool of proxy addresses.
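Free proxy lists often contain malformed rows, so it can help to validate each entry before adding it to the pool. Here is a small sketch using only the standard library. The helper format_proxy is hypothetical, and limiting the pool to IPv4 addresses with the http and https schemes is an assumption made for this example.

```python
import ipaddress


def format_proxy(protocol, ip, port):
    """Return 'protocol://ip:port', or None if any field looks invalid."""
    protocol = protocol.strip().lower()
    ip = ip.strip()
    if protocol not in ("http", "https"):
        return None
    try:
        ipaddress.IPv4Address(ip)  # raises ValueError for a malformed address
        port_number = int(port)
    except ValueError:
        return None
    if not 1 <= port_number <= 65535:
        return None
    return "{}://{}:{}".format(protocol, ip, port_number)
```

Inside get_proxies, the result could be appended to the pool only when it is not None.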

We can then use a proxy when sending the network request, as follows:

import random

# Fetch the pool once, then pick one proxy per scheme
proxy_pool = get_proxies()
proxies = {
    "http": random.choice(proxy_pool),
    "https": random.choice(proxy_pool)
}
response = requests.get(url, headers=headers, proxies=proxies)

When calling requests.get, we pass the proxies argument, which tells requests to send the request through a proxy IP.
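Free proxies fail often, so a single random choice may pick a dead one. One possible approach is to shuffle the pool and try proxies in turn until a request succeeds. The sketch below assumes the pool is a list of strings like the one returned by get_proxies. The names fetch_via_proxies, http_get, and max_tries are hypothetical. Using the same proxy for both the http and https keys is a design choice of this example, not something the original code does.

```python
import random


def fetch_via_proxies(url, headers, proxy_pool, http_get, max_tries=3, timeout=10):
    """Try up to max_tries proxies from the pool; return the first good response.

    http_get is expected to behave like requests.get (hypothetical parameter).
    """
    candidates = list(proxy_pool)
    random.shuffle(candidates)
    last_error = None
    for proxy in candidates[:max_tries]:
        proxies = {"http": proxy, "https": proxy}
        try:
            response = http_get(url, headers=headers, proxies=proxies, timeout=timeout)
            response.raise_for_status()
            return response
        except Exception as error:  # a dead proxy should not stop the loop
            last_error = error
    raise RuntimeError("no working proxy found") from last_error
```

With requests installed, this could be called as fetch_via_proxies(url, headers, get_proxies(), requests.get).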

IV. Complete Code

import csv
import random

import requests
from bs4 import BeautifulSoup


def get_proxies():
    """Fetch a list of proxy addresses."""
    response = requests.get("http://ip.zdaye.com/dayProxy.html")
    soup = BeautifulSoup(response.text, 'html.parser')
    trs = soup.find_all('tr')
    proxies = []
    # Skip the table header row
    for tr in trs[1:]:
        tds = tr.find_all('td')
        if len(tds) < 4:
            continue  # not a complete proxy row
        ip = tds[0].text.strip()
        port = tds[1].text.strip()
        protocol = tds[3].text.strip().lower()
        proxies.append("{}://{}:{}".format(protocol, ip, port))
    return proxies


def get_goods_info(keyword):
    """Scrape product information for the given keyword."""
    url = "https://2.taobao.com/search/index.htm?q={}&search_type=item&app=shopsearch".format(keyword)
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) '
                      'Chrome/58.0.3029.96 Safari/537.36'}
    # Fetch the proxy pool once, then pick one proxy per scheme
    proxy_pool = get_proxies()
    proxies = {
        "http": random.choice(proxy_pool),
        "https": random.choice(proxy_pool)
    }
    response = requests.get(url, headers=headers, proxies=proxies)
    soup = BeautifulSoup(response.text, 'html.parser')
    goods_list = soup.find_all('div', {'class': 'J_MouserOnverReq'})
    # utf-8-sig keeps the Chinese text readable when the file is opened in Excel
    with open('goods_info.csv', 'w', newline='', encoding='utf-8-sig') as f:
        writer = csv.writer(f)
        # Header row: product name, product price, product link
        writer.writerow(['商品名称', '商品价格', '商品链接'])
        for goods in goods_list:
            title = goods.find('p', {'class': 'item-title'}).text.strip()
            price = goods.find('p', {'class': 'price'}).text.strip()
            link = goods.find('a', {'class': 'item-link'}).get('href')
            writer.writerow([title, price, link])


if __name__ == '__main__':
    get_goods_info('二手手机')

V. Summary

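This article showed how to scrape product information from Xianyu with Python and how to use proxy IPs to avoid having our IP address banned. If you need to scrape other data, such as comments or shop information, you can try the same approach described here.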
