Welfare Lottery (福利彩票)



Ⅰ. Why scrape this data: it was a spur-of-the-moment idea, purely for fun. The hands-on run doubled as an exploration, so some redundancy is unavoidable; please bear with me.

Note:

双色球 (the "double color ball") is one of the games of the China Welfare Lottery. It is a lotto-style welfare lottery issued under the unified organization of the China Welfare Lottery Issuance and Administration Center and sold jointly nationwide. Networked sales across the country began on February 16, 2003; tickets are issued and sold through a computer network system, with draws broadcast on television at regular intervals. The provincial welfare lottery centers that take part in 双色球 sales organize and manage sales in their own regions under the direct leadership of the China Welfare Lottery Center. Purchase of 双色球 tickets is voluntary, and every purchaser is deemed to agree to and abide by these rules. -- Source: Baidu Wenku

In other words, the data only goes back to 2003!

Ⅱ. Tools

Language: Python | Python packages: urllib | pymysql | re | zlib | Database: MySQL

Ⅲ. Analyzing the site's pages

A Baidu search turns up the data at http://kaijiang.500.com/ssq.shtml. Looking at the page, the data to extract are the ball numbers and the draw period. The per-draw pages follow a clear pattern: http://kaijiang.500.com/shtml/ssq/(period).shtml. How do we get every period? Opening the F12 developer tools reveals a dynamically loaded div block that lists them. That makes things easy: copy that entire div block and save it locally as caipiao.html.

To sum up: we now have the URL pattern and the data to extract; next comes crawling and analyzing the data.
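As a side note, the URL pattern is simple enough to express as a tiny helper. This is only a sketch; the five-character period format (two-digit year plus draw number) is inferred from the URLs that appear later in this post:

def period_url(period):
    # period looks like "19055": two-digit year plus the draw number
    return "http://kaijiang.500.com/shtml/ssq/" + period + ".shtml"

print(period_url("19055"))  # http://kaijiang.500.com/shtml/ssq/19055.shtml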

Ⅳ. Creating the MySQL table (plus a couple of tricks)

CREATE TABLE `tb_dou_color` (
  `id` bigint(20) unsigned NOT NULL AUTO_INCREMENT COMMENT 'primary key',
  `red_one` int(2) DEFAULT NULL,
  `red_two` int(2) DEFAULT NULL,
  `red_three` int(2) DEFAULT NULL,
  `red_four` int(2) DEFAULT NULL,
  `red_five` int(2) DEFAULT NULL,
  `red_six` int(2) DEFAULT NULL,
  `blue` int(2) DEFAULT NULL,
  `period` varchar(100) DEFAULT NULL,
  PRIMARY KEY (`id`),
  UNIQUE KEY `peroid_uniq_index` (`period`) USING HASH COMMENT 'unique index on the period'
) ENGINE=InnoDB AUTO_INCREMENT=4613 DEFAULT CHARSET=utf8mb4 COMMENT='双色球 data';

A few notes:
1. The primary key auto-increments.
2. The six red balls and the one blue ball are typed as int rather than varchar, so that if the page structure varies and garbage gets scraped, the database itself rejects the bad rows.
3. The period column gets a unique index, so that re-crawling to recover lost draws simply deduplicates on insert.

Ⅴ. On to the crawler code:

# read the div block saved locally earlier and extract all the per-period page URLs
fh=open("c:/Users/nec/Desktop/caipiao.html","r")
data=fh.read()
# regex for the links
pat='href="(.*?)"'
import re
# the list of page URLs
res=re.compile(pat).findall(data)
len(res)
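A quick look at what got extracted can catch problems early. This is a sketch; the exact count depends on the div block that was saved:

print(len(res))   # number of per-draw links found in caipiao.html
print(res[:3])    # expect URLs shaped like http://kaijiang.500.com/shtml/ssq/19055.shtml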

While crawling, my IP got blocked, so the requests go through proxy IPs.

import urllib.request
# define a request function that goes through a proxy
def use_proxy(url,proxy_addr):
    proxy=urllib.request.ProxyHandler({"http":proxy_addr})
    print(proxy_addr)
    opener=urllib.request.build_opener(proxy,urllib.request.HTTPHandler)
    urllib.request.install_opener(opener)
    data=urllib.request.urlopen(url,timeout=8).read()
    return data

Free proxy IPs found on Baidu are not very stable, but they are usable. Test whether a proxy IP works:

proxy_addr="112.85.128.47:9999" html=use_proxy("http://www.baidu.com",proxy_addr) 112.85.128.47:9999

Test the database connection:

import pymysql
conn=pymysql.connect(host="192.168.1.188",user="root",passwd="roota123",db="douball")
sql="show tables"
conn.query(sql)
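conn.query only sends the statement; to actually see the result, a cursor can be used. A quick sketch:

cur = conn.cursor()
cur.execute("show tables")
print(cur.fetchall())   # should list tb_dou_color among the tables
cur.close()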

Fetching a page shows that the site serves gzip-compressed responses, so each response has to be decompressed after it is fetched. First crawling pass, with exception handling so the loop keeps going:

import zlib
for url in res:
    try:
        print(url)
        # fetch through the proxy
        html=use_proxy(url,proxy_addr)
        # gunzip the response body
        decompressed_data = zlib.decompress(html ,16+zlib.MAX_WBITS)
        # regex patterns for the red balls, the blue ball and the draw period
        pattern1='(.*?)'
        pattern2='(.*?)'
        pattern3='(.*?)'
        pattern4='(.*?)'
        reds=re.compile(pattern1).findall(str(decompressed_data))
        blues=re.compile(pattern2).findall(str(decompressed_data))
        periods=re.compile(pattern3).findall(str(decompressed_data))
        print(reds[0])
        print(blues[0])
        print(periods[0])
        sql="insert into tb_dou_color (red_one,red_two,red_three,red_four,red_five,red_six,blue,period) values ("+reds[0]+","+reds[1]+","+reds[2]+","+reds[3]+","+reds[4]+","+reds[5]+","+blues[0]+","+periods[0]+")"
        conn.query(sql)
        # commit in batches of 100
        if (int(periods[0])-1)%100==0:
            conn.commit()
    except Exception as err:
        print (err)
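As an aside, the concatenated SQL above only works because the scraped values are plain digits. A parameterized insert through a pymysql cursor would be a safer variant; this is a sketch for the same loop body, not what the original run used:

cur = conn.cursor()
cur.execute(
    "insert into tb_dou_color "
    "(red_one,red_two,red_three,red_four,red_five,red_six,blue,period) "
    "values (%s,%s,%s,%s,%s,%s,%s,%s)",
    (reds[0], reds[1], reds[2], reds[3], reds[4], reds[5], blues[0], periods[0]))
conn.commit()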

Some exceptions came up along the way, and part of the data was lost:

The proxy service being unavailable:
HTTP Error 503: Service Unavailable
http://kaijiang.500.com/shtml/ssq/19055.shtml
112.80.41.86:8888

The IP being banned:
http://kaijiang.500.com/shtml/ssq/19055.shtml
112.85.128.209:9999
[WinError 10054] An existing connection was forcibly closed by the remote host.
http://kaijiang.500.com/shtml/ssq/18127.shtml

Problems with the browser request headers, and so on:
Error -3 while decompressing data: incorrect header check
http://kaijiang.500.com/shtml/ssq/19054.shtml

So the proxy IP has to be switched from time to time and the data re-crawled, with the unique index in the database handling deduplication:

(1062, "Duplicate entry '19052' for key 'peroid_uniq_index'")
http://kaijiang.500.com/shtml/ssq/19051.shtml
112.80.41.86:8888
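These 1062 errors are harmless thanks to the unique index; if the noise bothers you, MySQL's INSERT IGNORE skips duplicate periods silently. Again a sketch, not what the original run used:

sql = ("insert ignore into tb_dou_color "
       "(red_one,red_two,red_three,red_four,red_five,red_six,blue,period) values ("
       + reds[0]+","+reds[1]+","+reds[2]+","+reds[3]+","+reds[4]+","+reds[5]
       + ","+blues[0]+","+periods[0]+")")
conn.query(sql)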

Commit the data manually every so often:

conn.commit()

Add browser spoofing to the proxy request function and declare the accepted compression types in the headers:

def use_proxy(url,proxy_addr):
    proxy=urllib.request.ProxyHandler({"http":proxy_addr})
    print(proxy_addr)
    opener=urllib.request.build_opener(proxy,urllib.request.HTTPHandler)
    urllib.request.install_opener(opener)
    headers={"User-Agent":"Mozilla/5.0 (Windows NT 10.0; WOW64; rv:67.0) Gecko/20100101 Firefox/67.0",'Accept-Encoding':'gzip, deflate'}
    #opener=urllib.request.build_opener()
    #opener.addheaders=[headers]
    #data=urllib.request.urlopen(url,timeout=8).read()
    req=urllib.request.Request(url,headers=headers)
    data=urllib.request.urlopen(req).read()
    return data

With those tweaks in place, continue crawling:

for url in res:
    try:
        print(url)
        html=use_proxy(url,proxy_addr)
        # try the possible compression wrappers one after another
        try:
            decompressed_data = zlib.decompress(html ,zlib.MAX_WBITS|32)
        except Exception as err1:
            print (err1)
            try:
                decompressed_data = zlib.decompress(html ,zlib.MAX_WBITS|16)
            except Exception as err2:
                print (err2)
                try:
                    decompressed_data = zlib.decompress(html)
                except Exception as err3:
                    print(err3)
                    decompressed_data = zlib.decompress(html,-zlib.MAX_WBITS)
        pattern1='(.*?)'
        pattern2='(.*?)'
        pattern3='(.*?)'
        pattern4='(.*?)'
        reds=re.compile(pattern1).findall(str(decompressed_data))
        blues=re.compile(pattern2).findall(str(decompressed_data))
        periods=re.compile(pattern3).findall(str(decompressed_data))
        print(reds[0])
        print(blues[0])
        print(periods[0])
        sql="insert into tb_dou_color (red_one,red_two,red_three,red_four,red_five,red_six,blue,period) values ("+reds[0]+","+reds[1]+","+reds[2]+","+reds[3]+","+reds[4]+","+reds[5]+","+blues[0]+","+periods[0]+")"
        conn.query(sql)
        if (int(periods[0])-1)%100==0:
            conn.commit()
    except Exception as err:
        print (err)
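The different wbits values cover gzip-wrapped, zlib-wrapped and raw deflate bodies. Here is a tiny self-contained check of that idea using only the standard library, independent of the crawled pages:

import gzip, zlib

sample = b"double color ball test payload"
gz = gzip.compress(sample)                       # gzip-wrapped, like the site's responses
print(zlib.decompress(gz, zlib.MAX_WBITS | 32))  # |32: auto-detect gzip or zlib header
print(zlib.decompress(gz, zlib.MAX_WBITS | 16))  # |16: gzip header only
zl = zlib.compress(sample)                       # plain zlib-wrapped data
print(zlib.decompress(zl))                       # default wbits: zlib header
co = zlib.compressobj(wbits=-zlib.MAX_WBITS)
raw = co.compress(sample) + co.flush()           # raw deflate, no header
print(zlib.decompress(raw, -zlib.MAX_WBITS))     # negative wbits: raw deflate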

More than halfway through, the IP got banned again!

http://kaijiang.500.com/shtml/ssq/07098.shtml
112.85.171.116:9999
http://kaijiang.500.com/shtml/ssq/07097.shtml

Switch to a new IP, filter out the part that has already been crawled, and keep going:

for url in res:
    patt="http://kaijiang.500.com/shtml/ssq/(.*?).shtml"
    index=re.compile(patt).findall(url)
    # skip draws that were already crawled (period numbers above 7098)
    if int(index[0])>7098:continue
    try:
        print(url)
        html=use_proxy(url,proxy_addr)
        try:
            decompressed_data = zlib.decompress(html ,zlib.MAX_WBITS|32)
        except Exception as err1:
            print (err1)
            try:
                decompressed_data = zlib.decompress(html ,zlib.MAX_WBITS|16)
            except Exception as err2:
                print (err2)
                try:
                    decompressed_data = zlib.decompress(html)
                except Exception as err3:
                    print(err3)
                    decompressed_data = zlib.decompress(html,-zlib.MAX_WBITS)
        pattern1='(.*?)'
        pattern2='(.*?)'
        pattern3='(.*?)'
        pattern4='(.*?)'
        reds=re.compile(pattern1).findall(str(decompressed_data))
        blues=re.compile(pattern2).findall(str(decompressed_data))
        periods=re.compile(pattern3).findall(str(decompressed_data))
        print(reds[0])
        print(blues[0])
        print(periods[0])
        sql="insert into tb_dou_color (red_one,red_two,red_three,red_four,red_five,red_six,blue,period) values ("+reds[0]+","+reds[1]+","+reds[2]+","+reds[3]+","+reds[4]+","+reds[5]+","+blues[0]+","+periods[0]+")"
        conn.query(sql)
        if (int(periods[0])-1)%100==0:
            conn.commit()
    except Exception as err:
        print (err)
conn.commit()

Check how many rows are in the database:

select count(*) from tb_dou_color

It turns out 5 rows are missing. To pin down which ones, first query the periods already stored in the database:

sql1="select period from tb_dou_color" #获取游标 cur=conn.cursor() #执行查询 cur.execute(sql1) #提取数据 rows=cur.fetchall()

Put the periods fetched from the database into the list arrs:

arrs=[]
for i in rows:
    print(i[0])
    arrs.append(i[0])

Then extract every period that was supposed to be crawled from the URL list:

arrays=[]
for i in res:
    patt="http://kaijiang.500.com/shtml/ssq/(.*?).shtml"
    print(i)
    ind=re.compile(patt).findall(i)
    b=int(ind[0])
    print(str(b))
    arrays.append(str(b))

Take the difference of the two sets to find the missing draws:

div=list(set(arrays).difference(set(arrs)))
#['17001', '15081', '13091', '10119', '3069']

Crawl just those missing draws one more time:

for url in res:
    patt="http://kaijiang.500.com/shtml/ssq/(.*?).shtml"
    index=re.compile(patt).findall(url)
    # only re-crawl the draws that are still missing
    if str(int(index[0])) not in div:continue
    try:
        print(url)
        html=use_proxy(url,proxy_addr)
        try:
            decompressed_data = zlib.decompress(html ,zlib.MAX_WBITS|32)
        except Exception as err1:
            print (err1)
            try:
                decompressed_data = zlib.decompress(html ,zlib.MAX_WBITS|16)
            except Exception as err2:
                print (err2)
                try:
                    decompressed_data = zlib.decompress(html)
                except Exception as err3:
                    print(err3)
                    decompressed_data = zlib.decompress(html,-zlib.MAX_WBITS)
        pattern1='(.*?)'
        pattern2='(.*?)'
        pattern3='(.*?)'
        pattern4='(.*?)'
        reds=re.compile(pattern1).findall(str(decompressed_data))
        blues=re.compile(pattern2).findall(str(decompressed_data))
        periods=re.compile(pattern3).findall(str(decompressed_data))
        print(reds[0])
        print(blues[0])
        print(periods[0])
        sql="insert into tb_dou_color (red_one,red_two,red_three,red_four,red_five,red_six,blue,period) values ("+reds[0]+","+reds[1]+","+reds[2]+","+reds[3]+","+reds[4]+","+reds[5]+","+blues[0]+","+periods[0]+")"
        conn.query(sql)
        if (int(periods[0])-1)%100==0:
            conn.commit()
    except Exception as err:
        print (err)
conn.commit()

That leaves one draw whose data is missing from the site itself; its numbers were looked up on another site and added to the database by hand.

(1064, "You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near '{Result2},{Result3},{Result4},{Result5},{Result6},,17001)' at line 1")
http://kaijiang.500.com/shtml/ssq/15081.shtml
182.108.44.47:808
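Judging by the 1064 error, the page for that draw only contains template placeholders, so the row was typed in by hand. A sketch of that insert; the ball numbers below are placeholders, not the actual winning numbers:

cur = conn.cursor()
cur.execute(
    "insert into tb_dou_color "
    "(red_one,red_two,red_three,red_four,red_five,red_six,blue,period) "
    "values (%s,%s,%s,%s,%s,%s,%s,%s)",
    (1, 2, 3, 4, 5, 6, 7, "15081"))  # placeholder numbers; fill in the real draw results
conn.commit()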

With that, all the data has been scraped.

Ⅵ. Processing and analyzing the data

import pandas as pd
k=pd.read_sql("select * from tb_dou_color",conn)
df=k.describe()

                id      red_one  ...      red_six         blue
count  2415.000000  2415.000000  ...  2415.000000  2415.000000
mean   1594.991304     4.751553  ...    28.925052     8.649689
std    1219.295142     3.701862  ...     3.777546     4.617859
min       1.000000     1.000000  ...    11.000000     1.000000
25%     606.500000     2.000000  ...    27.000000     5.000000
50%    1210.000000     4.000000  ...    30.000000     9.000000
75%    2749.500000     7.000000  ...    32.000000    13.000000
max    4610.000000    24.000000  ...    33.000000    16.000000

df.to_csv('e:/python/code/res1.csv',index=False,header=False)

The CSV file is then processed further in Excel for frequency analysis, using RANK and other Excel functions, to produce the final result.
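For what it's worth, the same kind of frequency ranking could also be sketched in pandas instead of Excel, reusing the DataFrame k loaded above (column names as defined in the table):

red_cols = ["red_one", "red_two", "red_three", "red_four", "red_five", "red_six"]
# how many times each red number (1-33) has been drawn, in number order
red_freq = k[red_cols].stack().value_counts().sort_index()
# how many times each blue number (1-16) has been drawn
blue_freq = k["blue"].value_counts().sort_index()
# rank the numbers by how often they appear, like Excel's RANK function
print(red_freq.rank(ascending=False))
print(blue_freq.rank(ascending=False))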

Please do not redistribute without permission. Thanks♪(・ω・)ノ


