Using the requests library: scraping a JD.com product page


2023-10-21 20:54 | Source: compiled from the web

The following code tries to fetch the contents of a JD.com product page:

import requests

url = "https://item.jd.com/10021098268175.html?extension_id=eyJhZCI6IiIsImNoIjoiIiwic2hvcCI6IiIsInNrdSI6IiIsInRzIjoiIiwidW5pcWlkIjoie1wiY2xpY2tfaWRcIjpcImQ2MTQ0MzcwLWM5MzktNDYwMC04NDdkLTQ1OTJmOTg3NmRjZlwiLFwicG9zX2lkXCI6XCIxNTBcIixcInNpZFwiOlwiNzk1YjFkZDAtMGE4MC00MWQ3LWE3MjgtZTdhNDczNGY4ZGI0XCIsXCJza3VfaWRcIjpcIjEwMDIxMDk4MjY4MTc1XCJ9In0=&jd_pop=d6144370-c939-4600-847d-4592f9876dcf&abt=3"

try:
    r = requests.get(url)
    r.raise_for_status()              # raise an exception on HTTP 4xx/5xx
    r.encoding = r.apparent_encoding  # use the encoding guessed from the body
    print(r.text[:1000])              # first 1000 characters only
except requests.RequestException:
    print("Failed")

In practice it does not go that smoothly. Instead of the product page, the response is a short snippet of code:

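The response says that a login is required. Inspecting the headers of the request that was sent, via response.request.headers (r.request.headers in the code above), shows: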

{'User-Agent': 'python-requests/2.21.0', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}

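These headers show that the program honestly told JD the request came from a crawler. Because no headers were set, the User-Agent is the requests default, so JD recognized that the request did not come from a browser and rejected it.

For a fix, see https://www.v2ex.com/t/540449. The core of that post is: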

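If you do not know how to write the headers, press F12 first, find the first request, right-click it and copy it as cURL, then convert it to Python or another language at curl.trillworks.com. As I recall, JD's search page is fairly lenient: headers alone are enough, no cookie is needed, and the params below can be reduced to just the keyword.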

Below is an example copied from a search for "1000x"; I have already removed the cookie.

import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:65.0) Gecko/20100101 Firefox/65.0',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.8,zh-CN;q=0.5,zh;q=0.3',
    'Referer': 'https://www.jd.com/',
    'DNT': '1',
    'Connection': 'keep-alive',
    'Upgrade-Insecure-Requests': '1',
    'TE': 'Trailers',
}

params = (
    ('keyword', '1000x'),
    ('enc', 'utf-8'),
    ('wq', '1000x'),
    ('pvid', '70b2126fcf3246ce9f32710d41799ede'),
)

response = requests.get('https://search.jd.com/Search', headers=headers, params=params)
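To make the point about headers and trimmed params concrete, the request can be built and inspected without sending anything. This is a minimal sketch, not the original poster's code: it assumes that `keyword` and `enc` are the only parameters worth keeping, and it reuses the Firefox User-Agent from the example above.

```python
import requests

# What requests would send as User-Agent if no headers were supplied.
default_ua = requests.utils.default_headers()["User-Agent"]
print(default_ua)  # starts with "python-requests/"

# Browser-like headers taken from the Firefox example above.
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:65.0) Gecko/20100101 Firefox/65.0",
    "Referer": "https://www.jd.com/",
}

# Assumption: only the keyword and the encoding hint are needed.
params = {"keyword": "1000x", "enc": "utf-8"}

# Build the request inside a Session without sending it,
# so the final URL and headers can be inspected offline.
session = requests.Session()
prepared = session.prepare_request(
    requests.Request("GET", "https://search.jd.com/Search", headers=headers, params=params)
)

print(prepared.url)
print(prepared.headers["User-Agent"])
```

Sending it for real would be `session.send(prepared)`. Whether JD still accepts such a request is not something this sketch can guarantee.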

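Take scraping a Xiaomi phone page as an example. After pressing F12, click the browser's refresh button. Then right-click the URL highlighted in blue, choose Copy as cURL (cmd), and convert the result at curl.trillworks.com to get the Python code: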

import requests

# Note: the '^' characters in the values below are cmd.exe escape residue
# from "Copy as cURL (cmd)"; they are reproduced here as captured.
headers = {
    'authority': 'item.jd.com',
    'cache-control': 'max-age=0',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.125 Safari/537.36',
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'sec-fetch-site': 'same-origin',
    'sec-fetch-mode': 'navigate',
    'sec-fetch-user': '?1',
    'sec-fetch-dest': 'document',
    'referer': 'https://item.jd.com/100006583459.html?extension_id=eyJhZCI6IiIsImNoIjoiIiwic2hvcCI6IiIsInNrdSI6IiIsInRzIjoiIiwidW5pcWlkIjoie1wiY2xpY2tfaWRcIjpcIjlmODJmNmZiLWI4MDgtNGFmYy1iOTZmLTNlYzBlNzk4MTM5ZVwiLFwicG9zX2lkXCI6XCIxNTBcIixcInNpZFwiOlwiOWZmOGYyNmMtZTRhNC00OWIyLWEwNDMtMDQ3ZjcwZDRjYWRlXCIsXCJza3VfaWRcIjpcIjEwMDAxMTgzMzU0MlwifSJ9^&jd_pop=9f82f6fb-b808-4afc-b96f-3ec0e798139e^&abt=3',
    'accept-language': 'zh-CN,zh;q=0.9',
    'cookie': 'shshshfpa=1699fdb8-c825-dbe8-175f-b6070a2ef456-1589040291; shshshfpb=f7tRmJKoKH9YNyftnNTVWSQ^%^3D^%^3D; pinId=KuVVRsT25E7U0bvxcy7HBg; ceshi3.com=103; __jdu=1589040288896797007706; user-key=12ce421c-faa7-4887-949c-9473927d4f85; cn=1; PCSYCityID=CN_440000_440300_440305; pin=^%^E6^%^B1^%^87^%^E9^%^A3^%^9E^%^E7^%^9A^%^84^%^E5^%^BF^%^83; unick=^%^E6^%^B1^%^87^%^E9^%^A3^%^9E^%^E7^%^9A^%^84^%^E5^%^BF^%^83; _tp=tk6dcGT^%^2Fmm9jMbSV9dw5ecCQA3^%^2FCtpAq9iyr8bENLiRqS94pda8u4u2j3yy^%^2BsCa9; _pst=^%^E6^%^B1^%^87^%^E9^%^A3^%^9E^%^E7^%^9A^%^84^%^E5^%^BF^%^83; TrackID=1o28ACN7o_LveAsydgMr6E3_wCej7_VXM-H0bK-VgnYsQTUtpW2EtOn7rYXYzCxD-SggiTLfWDSbPTqliaRnq-FQTRB5WEn0pJuK1l6h126U; areaId=19; ipLoc-djd=19-1607-4773-0; shshshfp=ad674eab8ec195f957ddba196d6bb5c6; __jda=122270672.1589040288896797007706.1589040289.1597658070.1597843575.35; __jdc=122270672; CCC_SE=ADC_0lOVRU14V^%^2foeoNXS0dQ^%^2b8lAn6OCnhUlxlJMjgSIFpXZt0u1UCg^%^2fk7X9x11AndBf6tuiUmqYa5Z^%^2bYEtt3sdnMvkeQoc^%^2bjp^%^2bZq^%^2bFcoeQOejOokc^%^2fFgjRWCdj4R4axuBpOpUrPYOm^%^2bqTJFm2gh2KjpWdr5b5Sw1HVEEEqBPKVqwd6UkACgUmezTiZ4JaRYDSuk9dTJzCrwR1sV0Ia^%^2fwqd70ZnjupwudCoTchD8753MQ3ziChiZeNE^%^2bwBffMC0KY2RM7czc5Fm4ANuCG9XYBIiVuwv4gAXG^%^2fy^%^2b1Jo8DFcKihsK9Evz^%^2fo0qmjY33vkts7YtWhmsEggx9hKESHMPgWPd^%^2bUBxvVyXGrf8xq76wqL4hvJ8AN30xjCUpeyW89VwdzaVKrin9UkLSx5dFx7Lgxaid8jkT0iSDCxpqRme9J51JEckBAhuXKNEMkB2ODC5koPAZjZSeOxABXzswrimoSkVgjHbF12b5VJkO0eqlKVIyr2KPDEwdEQeJ8T6QEwCt^%^2b5E9wcEuE9iXTBLg1Y9tHHfAV6bgV2n3d^%^2bo9HQXko2Z7JkS^%^2bU8MjEMe^%^2b0Pt9TMeuxy7kINS032Go5NpklaeP1aUdyt^%^2f1xGqMzEzdq^%^2f8PjdL5^%^2baBwXI5OIt51sEBJUeIjaggTGL6HV08oCSuxt8cWI^%^2fqprHW^%^2f^%^2biUfz^%^2fmI1hrbJ^%^2fQP1QKgHQcvYxQE3mghProY3SVqnJdn9S09Sw4wTOoQSJ^%^2bGTO4XjcQwm3J9e688Sf8jFjg1IAlKgm9ghxrjum1C8HkVa; unpl=V2_ZzNtbUFSSxxxXBFRfB1VDWIDElVKVUpAfFxFAHIcXlU1BBVbclRCFnQUR11nGVQUZAMZX0pcQhBFCEdkexhdBGYGEF9KU3MJdThPXHsQWgJgACJeQmdCJXUMT1x6HF0GbgobVUVRSxN0CUBSeh5sNWcLFm1CVkIUdQhPVnkdWQZXc1sZAQM^%^2fRTdNEhdLHFwCZwsbX0VfcxRFCXYCFRldBGYCE1hCVEFYdQxPXHocXQZuChtVRVFLE3QJQFJ6HmwEVwA^%^3d; __jdv=122270672^|www.fanqianbb.com^|t_1000043395_-1^|tuiguang^|24985ef465894119938d8e2e843ac667^|1597844345722; mt_xid=V2_52007VwMWW1VZV1scQBBVDWAFGlRcWVRcHk4pVFJuAEVVCgtOCRZIEEAAN1RATg5QVQ0DS00OBDMFGltdWloOL0oYXwR7AhtOXVBDWR9CG1QOZgUiUG1bYl4cTxlZAlcDFlM^%^3D; 3AB9D23F7A4B3C9B=TQNGWUBNJTSKZXWFOUIZAJZD7F6J77476K4B6LSKGNVWB5GFENS7LKQKNUJCX3WZALQR66KKDBTH3D7QWVLT2RUBDE; __jdb=122270672.11.1589040288896797007706^|35.1597843575; shshshsID=dd39a5df93860d64365f7d08e2a729b9_9_1597844655096',
    'if-modified-since': 'Wed, 19 Aug 2020 13:44:10 GMT',
}

params = (
    ('extension_id', 'eyJhZCI6IiIsImNoIjoiIiwic2hvcCI6IiIsInNrdSI6IiIsInRzIjoiIiwidW5pcWlkIjoie1wiY2xpY2tfaWRcIjpcIjlmODJmNmZiLWI4MDgtNGFmYy1iOTZmLTNlYzBlNzk4MTM5ZVwiLFwicG9zX2lkXCI6XCIxNTBcIixcInNpZFwiOlwiOWZmOGYyNmMtZTRhNC00OWIyLWEwNDMtMDQ3ZjcwZDRjYWRlXCIsXCJza3VfaWRcIjpcIjEwMDAxMTgzMzU0MlwifSJ9^'),
    ('jd_pop', '9f82f6fb-b808-4afc-b96f-3ec0e798139e^'),
    ('abt', '3'),
)

response = requests.get('https://item.jd.com/100006583455.html', headers=headers, params=params)

# NB. Original query string below. It seems impossible to parse and
# reproduce query strings 100% accurately so the one below is given
# in case the reproduced version is not "correct".
# response = requests.get('https://item.jd.com/100006583455.html?extension_id=eyJhZCI6IiIsImNoIjoiIiwic2hvcCI6IiIsInNrdSI6IiIsInRzIjoiIiwidW5pcWlkIjoie1wiY2xpY2tfaWRcIjpcIjlmODJmNmZiLWI4MDgtNGFmYy1iOTZmLTNlYzBlNzk4MTM5ZVwiLFwicG9zX2lkXCI6XCIxNTBcIixcInNpZFwiOlwiOWZmOGYyNmMtZTRhNC00OWIyLWEwNDMtMDQ3ZjcwZDRjYWRlXCIsXCJza3VfaWRcIjpcIjEwMDAxMTgzMzU0MlwifSJ9^&jd_pop=9f82f6fb-b808-4afc-b96f-3ec0e798139e^&abt=3', headers=headers)
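The `^` characters scattered through the converted values (`^%^3D`, `^&`, `^|`, and the trailing `^` on the params) look like cmd.exe escape characters left over from the "cURL (cmd)" copy format rather than part of the real header values. A small helper along these lines could strip them before the values are used. `strip_cmd_escapes` is a hypothetical name for illustration, not part of requests or of the converter.

```python
import re


def strip_cmd_escapes(value: str) -> str:
    """Undo cmd.exe caret escaping: '^x' becomes 'x'; a lone trailing '^' is dropped."""
    # '\^(.?)' matches a caret plus the (optional) character it escapes,
    # and the replacement keeps only that escaped character.
    return re.sub(r"\^(.?)", r"\1", value)


# Fragments copied from the converted code above.
cookie_fragment = "shshshfpb=f7tRmJKoKH9YNyftnNTVWSQ^%^3D^%^3D"
jd_pop = "9f82f6fb-b808-4afc-b96f-3ec0e798139e^"

print(strip_cmd_escapes(cookie_fragment))
print(strip_cmd_escapes(jd_pop))
```

Copying the request in the bash flavor of cURL, where the browser offers it, would probably avoid the carets in the first place.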

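Here response plays the same role as r in the earlier r.text, so print(response.text) prints the result. Because the output is long, print only the first 1000 characters with print(response.text[:1000]). The page content now comes back normally.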
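Note that slicing a string counts characters, not lines. The sketch below uses a made-up stand-in for `response.text` (the variable `sample_text` is purely illustrative) to show the difference between taking the first 1000 characters and taking the first few lines.

```python
# Stand-in for response.text; a real product page would be far longer.
sample_text = "\n".join(f"<p>line {i}</p>" for i in range(500))

first_chars = sample_text[:1000]                        # first 1000 characters
first_lines = "\n".join(sample_text.splitlines()[:10])  # first 10 lines

print(len(first_chars))
print(len(first_lines.splitlines()))
```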


