[Big Data in Practice] Job Posting Analysis for a Recruitment Website


2024-05-21 15:57 | Source: compiled from the web | Views: 265

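This project collects big-data job postings from a recruitment website and takes them through data cleaning, data analysis, jieba word segmentation, and data mining. The tasks are: crawling big-data job postings from the site; cleaning the job pages with BeautifulSoup; analyzing the Zhilian data with PySpark; exploratory analysis of the postings; segmenting the job descriptions with jieba and counting keywords; visualizing the results with ECharts; and building a job model that computes how similar an applicant is to a posting.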

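Contents

1 Crawling big-data job postings from a recruitment website
1.1 Background
1.2 Code walkthrough
1.3 Complete code
2 Exploratory analysis of the job postings
2.1 Background
2.2 Code walkthrough
2.3 Results
2.4 Analysis of the results
3 Building a job model to compute applicant similarity
3.1 Background
3.2 Code walkthrough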

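1 Crawling big-data job postings from a recruitment website

This section crawls pages from Zhilian Zhaopin (智联招聘).

1.1 Background

1. A web crawler is a core part of a search engine's fetching system. Its main job is to download pages from the internet to local storage, building a mirror copy of web content.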


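The basic workflow of a web crawler is:

(1) Choose the target URLs.

(2) Put the target URLs into the to-crawl queue.

(3) Take a URL from the to-crawl queue, resolve its host name through DNS to get the host's IP address, download the page behind the URL, and store it in the downloaded-page store. Then add the URL to the crawled queue.

(4) Parse the pages behind the URLs in the crawled queue, extract the other URLs they contain, and put those into the to-crawl queue, which starts the next cycle.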
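The four steps above can be sketched as a small loop. This is only an illustration under stated assumptions: `FAKE_WEB`, `download`, and `crawl` are hypothetical names, and an in-memory dictionary stands in for DNS resolution and HTTP downloads so that the sketch runs without network access.

```python
from collections import deque

# Hypothetical in-memory "web": each URL maps to the links found on that page.
FAKE_WEB = {
    "http://example.com/": ["http://example.com/a", "http://example.com/b"],
    "http://example.com/a": ["http://example.com/b", "http://example.com/c"],
    "http://example.com/b": ["http://example.com/"],
    "http://example.com/c": [],
}


def download(url):
    """Stand-in for DNS resolution plus an HTTP fetch; returns the page's links."""
    return FAKE_WEB.get(url, [])


def crawl(seed_url):
    to_crawl = deque([seed_url])  # steps 1-2: the seed goes into the to-crawl queue
    crawled = []                  # URLs already fetched, in fetch order
    seen = {seed_url}             # prevents queueing the same URL twice
    page_store = {}               # downloaded-page store

    while to_crawl:
        url = to_crawl.popleft()  # step 3: take the next URL and download it
        links = download(url)
        page_store[url] = links
        crawled.append(url)
        for link in links:        # step 4: queue the newly discovered URLs
            if link not in seen:
                seen.add(link)
                to_crawl.append(link)
    return crawled, page_store


crawled, pages = crawl("http://example.com/")
print(crawled)
```

Because new links are appended to the end of the queue, this particular sketch fetches pages in breadth-first order; changing how the queue is ordered is exactly what a crawl strategy does.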

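2. In a crawler, the to-crawl URL queue is a key component. The order of the URLs in that queue matters as well, because it decides which page is fetched first and which later. The method that determines this order is called the crawl strategy.

Common crawl strategies:

(1) Depth-first traversal

With depth-first traversal the crawler starts at the start page and follows links one after another down a single path. Once that path is exhausted it moves on to the next start page and continues following links. The traversal path is A-F-G, E-H-I, B, C, D, as shown in the figure below: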

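[Figure: example link graph of pages A to I]

(2) Breadth-first traversal

The basic idea of breadth-first traversal is to append the links found in a newly downloaded page to the end of the to-crawl queue. In other words, the crawler first fetches every page linked from the start page, then picks one of those pages and fetches every page linked from it. The traversal path is A-B-C-D-E-F, G, H, I.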
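The two traversal orders can be compared on a small graph. The figure is not available here, so `GRAPH` below is an assumption reconstructed from the paths quoted in the text (A links to F, E, B, C, D; F links to G; E links to H; H links to I). The function names are hypothetical. The order of pages at the same depth depends on the order in which the links are listed, so siblings may come out in a different order than in the breadth-first path quoted above.

```python
from collections import deque

# Assumed link graph, reconstructed from the traversal paths in the text.
GRAPH = {
    "A": ["F", "E", "B", "C", "D"],
    "F": ["G"],
    "E": ["H"],
    "H": ["I"],
}


def depth_first(page, visited=None):
    """Follow each link as far as it goes before returning to the next sibling."""
    if visited is None:
        visited = []
    visited.append(page)
    for link in GRAPH.get(page, []):
        if link not in visited:
            depth_first(link, visited)
    return visited


def breadth_first(start):
    """Visit every page linked from the current level before going deeper."""
    visited = [start]
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for link in GRAPH.get(page, []):
            if link not in visited:
                visited.append(link)
                queue.append(link)  # new links go to the end of the queue
    return visited


dfs_order = depth_first("A")
bfs_order = breadth_first("A")
print("-".join(dfs_order))  # A-F-G-E-H-I-B-C-D
print("-".join(bfs_order))  # A-F-E-B-C-D-G-H-I
```

The only structural difference is where new links go: depth-first follows them immediately, while breadth-first appends them to the end of a queue.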

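(3) Backlink count

The backlink count of a page is the number of other pages that link to it. It indicates how strongly the page's content is recommended by others. Search-engine crawlers therefore often use this metric to rate the importance of a page and to decide the order in which pages are fetched.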
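A minimal sketch of this idea, assuming the crawler already knows which pages link to which. The `links` data and the page names are made up for illustration.

```python
from collections import Counter

# Hypothetical link data: page -> pages it links to.
links = {
    "p1": ["p2", "p3"],
    "p2": ["p3"],
    "p4": ["p3", "p2"],
}

# Count how many times each page is the target of a link.
backlinks = Counter(target for targets in links.values() for target in targets)

# Fetch the pages with the most backlinks first.
pending = ["p1", "p2", "p3"]
pending.sort(key=lambda url: backlinks[url], reverse=True)
print(pending)  # ['p3', 'p2', 'p1']
```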

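(4) Partial PageRank

Partial PageRank borrows the idea of the PageRank algorithm. The pages already downloaded, together with the URLs in the to-crawl queue, form a page set. A PageRank value is computed for every page in the set, the URLs in the to-crawl queue are sorted by PageRank value from highest to lowest, and the pages are fetched in that order.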
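The ordering step can be sketched with a plain power-iteration PageRank. Everything here is an assumption made for illustration: the graph, the damping factor of 0.85, the fixed 50 iterations, and the choice to spread the rank of pages without known outgoing links evenly over all pages. Real crawlers typically recompute the values only every so often rather than after each download.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict of page -> list of linked pages."""
    pages = set(graph) | {t for targets in graph.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = graph.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Page with no known out-links (e.g. not downloaded yet):
                # spread its rank evenly over all pages.
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank


# Hypothetical data: a, b, c are downloaded; x and y are still in the queue.
downloaded = {"a": ["b", "x"], "b": ["x"], "c": ["x", "y"]}
pending = ["y", "x"]

rank = pagerank(downloaded)
pending.sort(key=lambda url: rank[url], reverse=True)
print(pending)  # ['x', 'y']
```

Page x is linked from three downloaded pages and y from only one, so x ends up first in the queue.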

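(5) OPIC (Online Page Importance Computation)

This algorithm also assigns each page an importance score. Before it starts, every page receives the same initial amount of cash. After a page P has been downloaded, P's cash is split among all the links extracted from P, and P's own cash is reset to zero. The pages in the to-crawl queue are then sorted by the amount of cash they hold.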
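A simplified sketch of the cash mechanism, under the assumption that each page is downloaded once and that ties are broken by page name. The graph, the seed page, and the function names are hypothetical. A page without out-links simply loses its cash here, which a full OPIC implementation would handle differently.

```python
# Hypothetical link graph: page -> pages it links to.
LINK_GRAPH = {
    "P": ["S"],
    "S": ["R"],
    "R": [],
    "Q": [],
}


def download(page, cash):
    """Split the page's cash among its out-links, then clear it."""
    targets = LINK_GRAPH.get(page, [])
    if targets:
        share = cash[page] / len(targets)
        for target in targets:
            cash[target] += share
    cash[page] = 0.0


def opic_crawl(seed):
    cash = {page: 1.0 for page in LINK_GRAPH}  # same initial cash for every page
    pending = set(LINK_GRAPH) - {seed}
    order = [seed]
    download(seed, cash)
    while pending:
        # Next page: the pending page holding the most cash (name breaks ties).
        page = min(pending, key=lambda p: (-cash[p], p))
        pending.remove(page)
        order.append(page)
        download(page, cash)
    return order


order = opic_crawl("P")
print(order)  # ['P', 'S', 'R', 'Q']
```

Downloading P passes its cash to S, which then outranks Q and R; downloading S in turn pushes R ahead of Q.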

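(6) Big-site-first

All pages in the to-crawl queue are grouped by the site they belong to. Sites with the most pages waiting to be downloaded are downloaded first, which is where the strategy gets its name.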
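As a rough sketch, the grouping can be done on the host part of each URL. The URLs below are made up, and using `netloc` as the definition of a "site" is an assumption.

```python
from collections import Counter
from urllib.parse import urlparse

# Hypothetical to-crawl queue.
pending = [
    "http://small.example.org/1",
    "http://big.example.com/a",
    "http://big.example.com/b",
    "http://mid.example.net/x",
    "http://big.example.com/c",
    "http://mid.example.net/y",
]

# Number of pending pages per site.
per_site = Counter(urlparse(url).netloc for url in pending)

# Sites with more pending pages come first; the sort is stable, so the
# original order is kept within a site.
ordered = sorted(pending, key=lambda url: per_site[urlparse(url).netloc], reverse=True)
for url in ordered:
    print(url)
```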

1.2 Code walkthrough

Import the external packages the program uses:

    import urllib.request
    from urllib.parse import quote
    from bs4 import BeautifulSoup
    import string
    import random
    import os

Disguise the crawler as a browser so that the site's anti-crawler restrictions are less likely to apply. Each User-Agent string must be followed by a comma; without the commas Python joins the adjacent strings into a single one and the random choice has nothing to choose from.

    headers = [
        "Mozilla/5.0 (Windows NT 6.1; Win64; rv:27.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36",
        "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:27.0) Gecko/20100101 Firefox/27.0",
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36",
        "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:10.0) Gecko/20100101 Firefox/10.0",
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/21.0.1180.110 Safari/537.36",
        "Mozilla/5.0 (X11; Ubuntu; Linux i686 rv:10.0) Gecko/20100101 Firefox/27.0",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/34.0.1838.2 Safari/537.36",
        "Mozilla/5.0 (X11; Ubuntu; Linux i686 rv:27.0) Gecko/20100101 Firefox/27.0",
        "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36",
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36",
    ]

Fetch a page while posing as a browser:

    def get_content(url, headers, subdomain):
        """
        Fetch a page while posing as a browser.

        url: address of the page to fetch
        headers: list of User-Agent strings to choose from
        subdomain: zhaopin.com subdomain used in the Host and Referer headers
        """
        random_header = random.choice(headers)
        req = urllib.request.Request(url)
        req.add_header("User-Agent", random_header)
        req.add_header("Host", "{0}.zhaopin.com".format(subdomain))
        req.add_header("Referer", "http://{0}.zhaopin.com/".format(subdomain))
        try:
            html = urllib.request.urlopen(req)
            contents = html.read()
            # read() returns bytes, so decode them into a UTF-8 string
            if isinstance(contents, bytes):
                contents = contents.decode('utf-8')
            return contents
        except Exception as e:
            # On any failure print the error; the function then returns None
            print(e)

Get the URLs of all sub-pages:

    def get_links_from(job, city, page):
        """
        Return the URLs of all job detail pages (the sub-pages).

        job: job keyword
        city: city name as used in the URL
        page: number of result pages to walk through
        """
        urls = []
        for i in range(page):
            url = 'http://sou.zhaopin.com/jobs/searchresult.ashx?jl={0}&kw={1}&p={2}&isadv=0'.format(str(city), str(job), i)
            # Percent-encode only the non-ASCII characters in the URL
            url = quote(url, safe=string.printable)
            info = get_content(url, headers, 'sou')
            if info is None:
                continue  # the request failed, so skip this result page
            soup = BeautifulSoup(info, "lxml")  # use the lxml parser
            link_urls = soup.select('td.zwmc a')
            for link in link_urls:
                urls.append(link.get('href'))
        return urls
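The call `quote(url, safe=string.printable)` in the function above is worth a closer look. The following sketch, with a shortened query string chosen for illustration, shows its effect: every printable ASCII character is treated as safe, so only the Chinese characters are percent-encoded and an existing escape such as `%2b` is not encoded a second time.

```python
import string
from urllib.parse import quote

url = "http://sou.zhaopin.com/jobs/searchresult.ashx?jl=北京%2b上海&kw=大数据&p=0"

# All printable ASCII characters (including ':', '/', '?', '&', '=' and '%')
# are left untouched; non-ASCII characters become UTF-8 percent escapes.
encoded = quote(url, safe=string.printable)
print(encoded)
```

One caveat of this shortcut is that `string.printable` also contains the space and other whitespace characters, so a keyword containing spaces would not be encoded.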

Fetch the job pages and save them:

    def get_recuite_info(job, city, page):
        """Fetch each job page and save its HTML to disk."""
        urls = get_links_from(job, city, page)
        path = '/data/zhilian/'
        os.makedirs(path, exist_ok=True)
        for url in urls:
            print(url)
            file = url.split('/')[-1]
            print(file)
            subdomain = url.split('/')[2].split('.')[0]
            html = get_content(url, headers, subdomain)
            if html is not None and file != '':
                with open(path + file, 'w', encoding='utf-8') as f:
                    f.write(html)

1.3 Complete code

    import urllib.request
    from urllib.parse import quote
    from bs4 import BeautifulSoup
    import string
    import random
    import os

    headers = [
        "Mozilla/5.0 (Windows NT 6.1; Win64; rv:27.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36",
        "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:27.0) Gecko/20100101 Firefox/27.0",
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36",
        "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:10.0) Gecko/20100101 Firefox/10.0",
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/21.0.1180.110 Safari/537.36",
        "Mozilla/5.0 (X11; Ubuntu; Linux i686 rv:10.0) Gecko/20100101 Firefox/27.0",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/34.0.1838.2 Safari/537.36",
        "Mozilla/5.0 (X11; Ubuntu; Linux i686 rv:27.0) Gecko/20100101 Firefox/27.0",
        "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36",
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36",
    ]


    def get_content(url, headers, subdomain):
        """
        Fetch a page while posing as a browser.

        url: address of the page to fetch
        headers: list of User-Agent strings to choose from
        subdomain: zhaopin.com subdomain used in the Host and Referer headers
        """
        random_header = random.choice(headers)
        req = urllib.request.Request(url)
        req.add_header("User-Agent", random_header)
        req.add_header("Host", "{0}.zhaopin.com".format(subdomain))
        req.add_header("Referer", "http://{0}.zhaopin.com/".format(subdomain))
        try:
            html = urllib.request.urlopen(req)
            contents = html.read()
            # read() returns bytes, so decode them into a UTF-8 string
            if isinstance(contents, bytes):
                contents = contents.decode('utf-8')
            return contents
        except Exception as e:
            # On any failure print the error; the function then returns None
            print(e)


    def get_links_from(job, city, page):
        """
        Return the URLs of all job detail pages (the sub-pages).

        job: job keyword
        city: city name as used in the URL
        page: number of result pages to walk through
        """
        urls = []
        for i in range(page):
            url = 'http://sou.zhaopin.com/jobs/searchresult.ashx?jl={0}&kw={1}&p={2}&isadv=0'.format(str(city), str(job), i)
            # Percent-encode only the non-ASCII characters in the URL
            url = quote(url, safe=string.printable)
            info = get_content(url, headers, 'sou')
            if info is None:
                continue  # the request failed, so skip this result page
            soup = BeautifulSoup(info, 'lxml')  # use the lxml parser
            link_urls = soup.select('td.zwmc a')
            for link in link_urls:
                urls.append(link.get('href'))
        return urls


    def get_recuite_info(job, city, page):
        """Fetch each job page and save its HTML to disk."""
        urls = get_links_from(job, city, page)
        path = '/data/zhilian/'
        os.makedirs(path, exist_ok=True)
        for url in urls:
            print(url)
            file = url.split('/')[-1]
            print(file)
            subdomain = url.split('/')[2].split('.')[0]
            html = get_content(url, headers, subdomain)
            if html is not None and file != '':
                with open(path + file, 'w', encoding='utf-8') as f:
                    f.write(html)


    # Fetch the job postings
    if __name__ == '__main__':
        # '%2b' is an encoded '+', joining Beijing, Shanghai, Guangzhou and Shenzhen
        city = '北京%2b上海%2b广州%2b深圳'
        get_recuite_info('大数据', city, 100)

2 Exploratory analysis of the job postings

2.1 Background

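1. matplotlib is an open-source project that provides a plotting package for Python. This article introduces the core objects of the matplotlib API and shows how to use them to draw plots. matplotlib's object model is rigorous and interesting, and it leaves users a lot of room: once familiar with the core objects, they can customize figures easily. The object model is also a good example of computer graphics design. Even if you are not a Python programmer, you can pick up some general principles of drawing graphics from this article.

matplotlib uses numpy for array operations and calls a number of other Python libraries to interact with the hardware. At its core is a plotting API made up of objects.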

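2. Setting matplotlib parameters for a project

There are two ways to change parameters while the code runs: through the parameter dictionary (rcParams), or by calling matplotlib.rc() with keyword arguments. If you do not want to configure matplotlib in code every time you use it, you can change the parameters in matplotlib's configuration file instead. matplotlib.get_configdir() returns the configuration directory of the current user.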
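A short sketch of both runtime approaches follows. The specific values (line width, line style, font size) are arbitrary choices for illustration, and the Agg backend is selected only so that the snippet also runs on a machine without a display.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so no display is needed

# Option 1: assign to the rcParams dictionary.
matplotlib.rcParams["lines.linewidth"] = 2
matplotlib.rcParams["lines.color"] = "r"

# Option 2: call matplotlib.rc() with a group name and keyword arguments.
matplotlib.rc("lines", linewidth=3, linestyle="--")
matplotlib.rc("font", size=12)

# Where the per-user configuration lives, and which matplotlibrc file is active.
print(matplotlib.get_configdir())
print(matplotlib.matplotlib_fname())
```

Both approaches write to the same settings, so the later `rc("lines", linewidth=3, ...)` call overrides the line width set through the dictionary. `matplotlib.rcdefaults()` restores the built-in defaults.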

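The configuration file contains the following groups of settings:

axes: colors of the axes border and face, size of the tick values, and display of the grid.
backend: the output target, for example TkAgg or GTKAgg.
figure: dpi, edge color, figure size, and subplot settings.
font: font family, font size, and style.
grid: grid color and line style.
legend: how the legend and the text inside it are displayed.
lines: lines (color, style, width, and so on) and markers.
patch: graphical objects that fill 2D space, such as polygons and circles; controls line width, color, antialiasing, and so on.
savefig: separate settings for saved figures, for example a white background for the rendered file.
verbose: how much information matplotlib prints while running, such as silent, helpful, debug, and debug-annoying (a setting of older matplotlib releases).
xtick and ytick: color, size, and direction of the major and minor ticks on the x and y axes, plus the label size.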

2.2 Code walkthrough

Example 1

    import pandas as pd
    import matplotlib.pyplot as plt
    import matplotlib.font_manager as fm

    # Font with Chinese glyphs, needed because the titles and labels are Chinese
    fontPath = "/usr/share/fonts/truetype/wqy/wqy-zenhei.ttc"
    font = fm.FontProperties(fname=fontPath, size=10)

    data = pd.read_csv('/data/python_pj3/bigdata')
    print(data.shape, data.columns)
    # Merge the '3年以上' (more than 3 years) bucket into '3-5年' (3-5 years)
    data.loc[(data.经验 == '3年以上'), '经验'] = '3-5年'


    def apply_font(ax):
        """Apply the Chinese font to the axis labels, legend, title and tick labels."""
        ax.xaxis.get_label().set_fontproperties(font)
        ax.yaxis.get_label().set_fontproperties(font)
        ax.legend(loc='best', prop=font)
        for label in ([ax.title] + ax.get_xticklabels() + ax.get_yticklabels()):
            label.set_fontproperties(font)


    plt.figure(figsize=(12, 10))

    # Distribution of company size
    plt.subplot2grid((2, 3), (0, 0))
    a = data['公司规模'].value_counts().plot(kind='barh', title='公司规模分布情况', color='pink')
    apply_font(a)

    # Distribution of company type
    plt.subplot2grid((2, 3), (0, 1))
    b = data['公司性质'].value_counts().plot(kind='barh', title='公司性质分布情况', color='red')
    apply_font(b)

    # Distribution of required experience
    plt.subplot2grid((2, 3), (0, 2))
    c = data['经验'].value_counts().plot(kind='barh', title='经验分布情况', color='lightskyblue')
    apply_font(c)

    # Distribution of company industry (top 10)
    plt.subplot2grid((2, 3), (1, 0))
    d = data['公司行业'].value_counts().sort_values(ascending=False).head(10).plot(kind='barh', title='公司行业分布情况', color='yellowgreen')
    apply_font(d)

    # Distribution of job category (top 10)
    plt.subplot2grid((2, 3), (1, 1))
    d = data['职位类别'].value_counts().sort_values(ascending=False).head(10).plot(kind='barh', title='职位类别分布情况', color='green')
    apply_font(d)

    # Distribution of work location (the part before '-' is the city)
    plt.subplot2grid((2, 3), (1, 2))
    d = data['工作地点'].str.split('-', expand=True)[0].value_counts().plot(kind='bar', title='工作地点分布情况', color='yellow', label='工作地点')
    apply_font(d)

    plt.show()

Example 2

    import pandas as pd
    import seaborn as sns
    import matplotlib.pyplot as plt
    import numpy as np
    import matplotlib.font_manager as fm

    fontPath = "/usr/share/fonts/truetype/wqy/wqy-zenhei.ttc"
    font = fm.FontProperties(fname=fontPath, size=10)
    data = pd.read_csv('/data/python_pj3/bigdata')
    print(data.shape)
    print(data.columns)
    # print([data.职位类别.value_counts().index if data.职位类别.value_counts()
    [...]
    =6000) & (data.月工资_max
    [...]
    0.99:
            print(data[index:index+1][['工作名称','公司名称','岗位描述']])
            # print(index,'---',simi)

    test_data = pd.read_csv('/data/python_pj6/test_data')
    main(test_data)
