This article is reposted from 青灯编程, author: 清风.
Preface
The text and images in this article come from the internet and are for learning and exchange purposes only, with no commercial use. If there is any problem, please contact us promptly so it can be dealt with.
Basic development environment
- Python 3.6
- PyCharm
Modules used
- Crawler modules:
```python
import requests
import re
import parsel
import csv
```
- Word cloud modules:
```python
import jieba
import wordcloud
```
Target page analysis
Using the browser's developer tools you can see that, in the returned page, the search results sit inside window.__SEARCH_RESULT__, so the data can be pulled out with a regular expression.
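As a rough illustration of that idea (the search URL and headers below are placeholders I'm assuming, not values from the article), the whole window.__SEARCH_RESULT__ object could be matched and parsed as JSON; the article itself goes on to match only the jobid field, as shown a bit further down:

```python
import json
import re

import requests

# assumed placeholders: the actual 51job search URL and request headers are not shown in the article
search_url = "https://search.51job.com/list/..."   # fill in the real search-results URL
headers = {"User-Agent": "Mozilla/5.0"}             # minimal headers, assumed

response = requests.get(search_url, headers=headers)
response.encoding = response.apparent_encoding

# grab everything assigned to window.__SEARCH_RESULT__ up to the closing </script>
match = re.search(r"window\.__SEARCH_RESULT__\s*=\s*(\{.*?\})</script>", response.text, re.S)
if match:
    data = json.loads(match.group(1))   # the search results as a Python dict
    print(data.keys())
```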
A detail-page URL, for example, looks like this:
<span>"</span><span>https://jobs.51job.com/beijing-ftq/127676506.html?s=01&t=0</span><span>"</span>
Each job posting's detail page has its own ID. You only need to extract that ID with a regex, splice it into the detail-page URL, and then scrape the posting data from that page.
```python
response = requests.get(url=url, headers=headers)
lis = re.findall(r'"jobid":"(\d+)"', response.text)
for li in lis:
    page_url = "https://jobs.51job.com/beijing-hdq/{}.html?s=01&t=0".format(li)
```
Although the site serves static pages, the response encoding comes back garbled, so the text needs to be re-encoded while scraping.
f = open(<span>"</span><span>招聘.csv</span><span>"</span>, mode=<span>"</span><span>a</span><span>"</span>, encoding=<span>"</span><span>utf-8</span><span>"</span>, newline=<span>""</span><span>) csv_writer </span>= csv.DictWriter(f, fieldnames=[<span>"</span><span>标题</span><span>"</span>, <span>"</span><span>地区</span><span>"</span>, <span>"</span><span>工作经验</span><span>"</span>, <span>"</span><span>学历</span><span>"</span>, <span>"</span><span>薪资</span><span>"</span>, <span>"</span><span>福利</span><span>"</span>, <span>"</span><span>招聘人数</span><span>"</span>, <span>"</span><span>发布日期</span><span>"</span><span>]) csv_writer.writeheader() response </span>= requests.get(url=page_url, headers=<span>headers) response.encoding </span>=<span> response.apparent_encoding selector </span>=<span> parsel.Selector(response.text) title </span>= selector.css(<span>"</span><span>.cn h1::text</span><span>"</span>).get() <span>#</span><span> 标题</span> salary = selector.css(<span>"</span><span>div.cn strong::text</span><span>"</span>).get() <span>#</span><span> 薪资</span> welfare = selector.css(<span>"</span><span>.jtag div.t1 span::text</span><span>"</span>).getall() <span>#</span><span> 福利</span> welfare_info = <span>"</span><span>|</span><span>"</span><span>.join(welfare) data_info </span>= selector.css(<span>"</span><span>.cn p.msg.ltype::attr(title)</span><span>"</span>).get().split(<span>"</span><span> | </span><span>"</span><span>) area </span>= data_info[0] <span>#</span><span> 地区</span> work_experience = data_info[1] <span>#</span><span> 工作经验</span> educational_background = data_info[2] <span>#</span><span> 学历</span> number_of_people = data_info[3] <span>#</span><span> 招聘人数</span> release_date = data_info[-1].replace(<span>"</span><span>发布</span><span>"</span>, <span>""</span>) <span>#</span><span> 发布日期</span> all_info_list = selector.css(<span>"</span><span>div.tCompany_main > div:nth-child(1) > div p span::text</span><span>"</span><span>).getall() all_info </span>= <span>"</span><span> </span><span>"</span><span>.join(all_info_list) dit </span>=<span> { </span><span>"</span><span>标题</span><span>"</span><span>: title, </span><span>"</span><span>地区</span><span>"</span><span>: area, </span><span>"</span><span>工作经验</span><span>"</span><span>: work_experience, </span><span>"</span><span>学历</span><span>"</span><span>: educational_background, </span><span>"</span><span>薪资</span><span>"</span><span>: salary, </span><span>"</span><span>福利</span><span>"</span><span>: welfare_info, </span><span>"</span><span>招聘人数</span><span>"</span><span>: number_of_people, </span><span>"</span><span>发布日期</span><span>"</span><span>: release_date, } csv_writer.writerow(dit) with open(</span><span>"</span><span>招聘信息.txt</span><span>"</span>, mode=<span>"</span><span>a</span><span>"</span>, encoding=<span>"</span><span>utf-8</span><span>"</span><span>) as f: f.write(all_info)</span>
With the steps above, the scraping of the job-posting data is complete.
Quick and rough data cleaning
- Salary
```python
import pandas as pd

content = pd.read_csv(r"D:\python\demo\数据分析\招聘\招聘.csv", encoding="utf-8")
salary = content["薪资"]
salary_1 = salary[salary.notnull()]
salary_count = pd.value_counts(salary_1)
```
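The article only computes the counts for the salary column. If you want to chart them the same way as the education and experience sections below, a minimal sketch reusing that pyecharts Bar pattern (output filename assumed) would be:

```python
from pyecharts.charts import Bar

bar = Bar()
bar.add_xaxis(salary_count.index.tolist())
bar.add_yaxis("薪资", salary_count.values.tolist())
bar.render("salary.html")  # assumed filename
```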
- Education requirements
```python
from pyecharts.charts import Bar

content = pd.read_csv(r"D:\python\demo\数据分析\招聘\招聘.csv", encoding="utf-8")
educational_background = content["学历"]
educational_background_1 = educational_background[educational_background.notnull()]
educational_background_count = pd.value_counts(educational_background_1).head()
print(educational_background_count)

bar = Bar()
bar.add_xaxis(educational_background_count.index.tolist())
bar.add_yaxis("学历", educational_background_count.values.tolist())
bar.render("bar.html")
```
The chart shows the number of postings whose requirement is 无要求 (no requirement).
- Work experience
```python
content = pd.read_csv(r"D:\python\demo\数据分析\招聘\招聘.csv", encoding="utf-8")
work_experience = content["工作经验"]
work_experience_count = pd.value_counts(work_experience)
print(work_experience_count)

bar = Bar()
bar.add_xaxis(work_experience_count.index.tolist())
bar.add_yaxis("经验要求", work_experience_count.values.tolist())
bar.render("bar.html")
```
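One small thing to watch: both Bar snippets render to bar.html, so the second chart overwrites the first. Giving each chart its own output file (names assumed here) avoids that:

```python
# in the education block
bar.render("educational_background.html")

# in the work-experience block
bar.render("work_experience.html")
```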
Word cloud analysis: required technical skills
```python
import imageio

py = imageio.imread("python.png")
f = open("python招聘信息.txt", encoding="utf-8")
re_txt = f.read()
result = re.findall(r"[a-zA-Z]+", re_txt)
txt = " ".join(result)
# split the text into words with jieba
txt_list = jieba.lcut(txt)
string = " ".join(txt_list)
# word cloud settings
wc = wordcloud.WordCloud(
    width=1000,                 # image width
    height=700,                 # image height
    background_color="white",   # background colour
    font_path="msyh.ttc",       # font for the word cloud
    mask=py,                    # mask image that shapes the cloud
    scale=15,
    stopwords={" "},
    # contour_width=5,
    # contour_color="red"       # outline colour
)
# feed the text into the word cloud
wc.generate(string)
# save the word cloud image
wc.to_file(r"python招聘信息.png")
```
Summary:
The data analysis here is honestly rough, truly hard on the eyes~