Web Scraper Case Study

2023-07-15 08:04 | Source: compiled from the web

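1. Data Acquisition

The scraper is written in PyCharm and uses the requests, lxml, json, time, openpyxl and pymysql libraries to collect product data from JD.com search pages (brand, title, price, shop, and so on).

Sample of the scraped data (excerpt): [screenshot not preserved]

JD.com has anti-scraping measures. Log in through the browser first, copy the cookie from the logged-in session, and add it to the request headers. If necessary, use the time library to sleep between requests, which lowers the chance of the account being blocked.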
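The cookie and sleep steps above can be sketched roughly as follows. This is a minimal illustration, not the original script: `build_headers`, `fetch_all`, the default user agent string and the delay range are all assumptions, and the actual HTTP call is passed in as a function so the throttling logic stays separate from the request itself.

```python
import random
import time


def build_headers(cookie, user_agent="Mozilla/5.0"):
    """Return request headers carrying the cookie copied from a logged-in browser session."""
    return {"User-Agent": user_agent, "Cookie": cookie}


def fetch_all(urls, fetch, delay_range=(1.0, 3.0), sleep=time.sleep):
    """Call fetch(url) for each URL, pausing a random interval between calls."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            # Randomised pause so requests do not arrive at a fixed rate.
            sleep(random.uniform(*delay_range))
        results.append(fetch(url))
    return results


# Hypothetical usage with requests (not executed here):
#   headers = build_headers("<cookie copied from the browser>")
#   pages = fetch_all(urls, lambda u: requests.get(u, headers=headers, timeout=10))
```

Passing `fetch` and `sleep` in as arguments is only a design convenience: it lets the pacing be tried out without touching the network.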

Scraping code (excerpt):

    import requests
    import pymysql
    from lxml import etree

    # headers, count and outws are defined elsewhere in the full script (not shown)

    ### Fetch the product data on one results page
    def getlist(url, brand):
        global count  # global row counter: tracks which worksheet row to write next
        # url="https://search.jd.com/search?keyword=笔记本&wq=笔记本&ev=exbrand_联想%5E&page=9&s=241&click=1"
        res = requests.get(url, headers=headers)
        res.encoding = 'utf-8'
        # text = (res.text).replace("")
        text = res.text
        selector = etree.HTML(text)
        items = selector.xpath('//*[@id="J_goodsList"]/ul/li')  # the list items that hold the data
        for i in items:
            title = i.xpath('.//div[@class="p-name p-name-type-2"]/a/em/text()')[0].strip()  # product name
            price = i.xpath('.//div[@class="p-price"]/strong/i/text()')[0]  # product price
            shop = i.xpath('.//div[@class="p-shop"]/span/a/text()')[0]  # shop name
            # Get the id used to look up the comment count
            # product_id = i.xpath('.//*[@class="p-commit"]/strong/a/@id')[0].replace("J_comment_","")
            # comment_count = commentcount(product_id)
            # print("current row count=" + str(count))

After scraping, each record is written straight to the database (this fragment continues inside the for loop above):

            # Write the record to the database. Create the database and table beforehand; the table structure is:
            """CREATE TABLE jd.shuju(
                id INT PRIMARY KEY AUTO_INCREMENT,
                brand VARCHAR(100) CHARACTER SET utf8,
                title VARCHAR(100) CHARACTER SET utf8,
                price VARCHAR(100) CHARACTER SET utf8,
                shop VARCHAR(100) CHARACTER SET utf8,
                comment_count VARCHAR(100) CHARACTER SET utf8);
            """
            conn = pymysql.connect(
                host='127.0.0.1',
                user='root',
                passwd='',
                port=3306,
                db='jd',
                charset='utf8',
                use_unicode=True
            )
            # print("connected")
            cursor = conn.cursor()  # result sets are returned as tuples by default
            # Insert the record
            try:
                sql = "insert ignore into shuju(brand,title,price,shop) values(%s,%s,%s,%s)"
                cursor.execute(sql, (brand, title, price, shop))  # values are passed as parameters
                # print("inserted one row")
                conn.commit()  # commit
            except pymysql.MySQLError:
                print("skipped one insert")
                continue
            finally:
                cursor.close()  # close the cursor
                conn.close()  # close the database connection
            # Write the same record to the worksheet
            outws.cell(row=count, column=1, value=str(count - 1))  # starting from the first row
            outws.cell(row=count, column=2, value=str(brand))
            outws.cell(row=count, column=3, value=str(title))
            outws.cell(row=count, column=4, value=str(price))
            outws.cell(row=count, column=5, value=str(shop))
            # outws.cell(row=count, column=6, value=str(CommentCount))
            count = count + 1  # move on to the next row

2. Data Processing
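One cleaning concern shows up as soon as rows are written to the database: scraped titles can contain quotes, and SQL assembled by string formatting then fails or is silently skipped. Binding the values as query parameters avoids having to escape them by hand. The sketch below illustrates the idea with the standard-library sqlite3 module standing in for MySQL so that it runs without a server. The `save_product` helper, the UNIQUE constraint and the sample row are assumptions made for illustration.

```python
import sqlite3


def save_product(conn, brand, title, price, shop):
    """Insert one product row; values are bound as parameters, not formatted into the SQL."""
    conn.execute(
        "INSERT OR IGNORE INTO shuju (brand, title, price, shop) VALUES (?, ?, ?, ?)",
        (brand, title, price, shop),
    )
    conn.commit()


# In-memory SQLite database standing in for the MySQL table (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE shuju ("
    "id INTEGER PRIMARY KEY AUTOINCREMENT, "
    "brand TEXT, title TEXT, price TEXT, shop TEXT, "
    "UNIQUE (brand, title, shop))"  # assumed constraint so duplicates can be ignored
)

# Made-up sample row: the single quote in the title would break formatted SQL.
save_product(conn, "Apple", "MacBook Air 13' M2", "7999.00", "Apple store")
save_product(conn, "Apple", "MacBook Air 13' M2", "7999.00", "Apple store")  # duplicate, ignored
rows = conn.execute("SELECT brand, title, price, shop FROM shuju").fetchall()
```

With pymysql the same pattern uses `%s` placeholders instead of `?`, for example `cursor.execute(sql, (brand, title, price, shop))`, and MySQL spells the statement `INSERT IGNORE`.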

Clean the irrelevant parts out of the scraped data (mainly by replacing symbols).
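As a standalone illustration of this kind of cleaning, the sketch below strips a JSONP wrapper from a response body and converts count strings such as "2000+" or "1.5万+" into integers. The helper names `strip_jsonp` and `normalize_count` are hypothetical, and the sample payload in the usage lines is made up, modelled only on the field names that appear in this article.

```python
import json
import re


def strip_jsonp(text):
    """Remove a JSONP wrapper such as callbackName({...}); and parse the JSON inside."""
    match = re.search(r"^[^(]*\((.*)\)\s*;?\s*$", text, re.S)
    payload = match.group(1) if match else text
    return json.loads(payload)


def normalize_count(count_str):
    """Turn strings like '2000+', '1.5万+' or '20万' into a plain integer."""
    cleaned = count_str.replace("+", "").strip()
    if "万" in cleaned:
        # 万 means ten thousand; float() handles values such as '1.5万'.
        return int(round(float(cleaned.replace("万", "")) * 10000))
    return int(cleaned)


# Made-up sample response for illustration only.
sample = 'jQuery8827474({"CommentsCount":[{"CommentCountStr":"1.5万+"}]});'
data = strip_jsonp(sample)
count = normalize_count(data["CommentsCount"][0]["CommentCountStr"])
```

Going through `float()` before multiplying matters because a count given as a decimal number of 万 cannot be parsed by `int()` directly.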

Code:

    # Send a request to fetch the comment count for one product
    def commentcount(product_id):
        url = "https://club.jd.com/comment/productCommentSummaries.action?referenceIds="+str(product_id)+"&callback=jQuery8827474&_=1615298058081"
        res = requests.get(url, headers=headers)
        res.encoding = 'utf-8'  # set the response encoding
        text = (res.text).replace("jQuery8827474(","").replace(");","")  # strip the JSONP wrapper jQuery8827474(...);
        text = json.loads(text)  # parse the JSON string
        comment_count = text['CommentsCount'][0]['CommentCountStr']
        comment_count = comment_count.replace("+", "")
        # Data cleaning: convert values expressed in 万 (ten thousand)
        if "万" in comment_count:
            comment_count = comment_count.replace("万","")
            comment_count = str(round(float(comment_count)*10000))  # float() also accepts values like "1.5"
        return comment_count

3. Data Visualization

Read the data back from the database and plot it with matplotlib.pyplot. The charts are mainly based on brand, shop and average price.
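Before plotting, the rows read from the database have to be aggregated. The article does not show how its counts are built, so the following is only a sketch under assumptions: rows are taken to be `(brand, shop, price)` tuples, the helper names are hypothetical, and the sample rows are made up in place of a real `cursor.fetchall()` result.

```python
from collections import Counter, defaultdict


def brand_counts(rows):
    """Count how many products each brand has. rows: (brand, shop, price) tuples."""
    return Counter(brand for brand, _shop, _price in rows)


def average_price_by_brand(rows):
    """Average price per brand; prices arrive as strings because the table stores VARCHAR."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for brand, _shop, price in rows:
        try:
            value = float(price)
        except ValueError:
            continue  # skip rows whose price is not numeric
        totals[brand] += value
        counts[brand] += 1
    return {brand: totals[brand] / counts[brand] for brand in totals}


# Made-up sample rows standing in for cursor.fetchall() results.
sample_rows = [
    ("Apple", "shop-a", "8000.00"),
    ("Apple", "shop-b", "10000.00"),
    ("ThinkPad", "shop-c", "6000.00"),
    ("ThinkPad", "shop-c", "n/a"),
]
count_pp = brand_counts(sample_rows)
avg_price = average_price_by_brand(sample_rows)
```

A dictionary like `count_pp` can then be handed to matplotlib, for example `plt.bar(list(count_pp), list(count_pp.values()))`.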

Brand vs. count chart: [image not preserved]

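Code:

    import matplotlib.pyplot as plt

    # Bar chart of brand vs. number of products
    # count_pp maps brand name to product count; it is built elsewhere in the script (not shown)
    plt.title('品牌-数量')   # "Brand vs. count"
    plt.xlabel('品牌')       # "Brand"
    plt.ylabel('品牌数量')   # "Number of products per brand"
    x = ['联想（lenovo）', 'Apple', '宏碁（acer）', '华为（HUAWEI）', 'ThinkPad', '戴尔（DELL）', '小米（MI）']
    y = [count_pp[item] for item in x]
    plt.bar(x, y)
    plt.show()

Shop vs. count chart: [image not preserved]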

Code:

    # Shops and their product counts
    count_dp = {'联想京东自营旗舰店':0,'联想京东自营官方旗舰店':0,'联想商用丽邦专卖店':0,
                '联想商用融合汇通专卖店':0,'联想扬天京东自营授权旗舰店':0}
    sql1 = "select * from shuju where brand = '联想（lenovo）' and id
