A programming newbie scrapes Zhaopin job listings with Python
Abstract: As a Python beginner, I got the sudden urge to scrape job postings from Zhaopin (智联招聘). The original plan was to scrape the job descriptions and extract keywords for data analysis, but Zhaopin's HTML structure turned out to be too messy, so for now I settled for scraping the basic job information and storing it in a local MySQL database. I'm a complete newbie and the code is pretty rough, so bear with me. The code is as follows:
import requests
import urllib.parse
import re
from lxml import etree
import threading

unity_url = r'http://sou.zhaopin.com/jobs/searchresult.ashx?jl={location}&kw={job}&sm=0&p={page}&source=0'
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36"}

def get_info_url():
    info_urls = []
    location = input("City to search: ")
    loc = urllib.parse.quote(location.encode('utf-8'))
    job = input("Job title to search: ")
    job = urllib.parse.quote(job.encode('utf-8'))
    num = input("Number of pages to fetch: ")
    for i in range(1, int(num) + 1):
        url = unity_url.format(location=loc, job=job, page=str(i))
        page = requests.get(url, headers=headers)
        page.encoding = 'utf-8'
        html = page.text
        # NOTE: the original regex was lost when this post was extracted; the
        # pattern below, matching links to job-detail pages, is a best-guess
        # reconstruction consistent with the n_url[1] group indexing used below.
        r = re.compile(r'(href="(http://jobs\.zhaopin\.com/\d+\.htm)")')
        need_urls = r.findall(html)
        for n_url in need_urls:
            info_urls.append(n_url[1])  # group 1 is the bare detail-page URL
    return info_urls
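Two details of get_info_url are worth calling out: urllib.parse.quote percent-encodes the Chinese search terms so they are safe inside the query string, and when a regex contains several capture groups, re.findall returns one tuple per match, which is why the loop indexes n_url[1]. A small stand-alone sketch (the HTML snippet and example.com URL are made up for illustration):

```python
import re
import urllib.parse

# Percent-encode a Chinese keyword for use in a URL query string.
kw = urllib.parse.quote('数据分析'.encode('utf-8'))
print(kw)  # %E6%95%B0%E6%8D%AE%E5%88%86%E6%9E%90

# With two groups in the pattern, findall yields (group0, group1) tuples,
# so match[1] picks out the bare URL, just like n_url[1] above.
html = '<a href="http://jobs.example.com/123.htm">job</a>'
pattern = re.compile(r'(href="(http://jobs\.example\.com/\d+\.htm)")')
matches = pattern.findall(html)
print(matches[0][1])  # http://jobs.example.com/123.htm
```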
def get_infos():
    info_urls = get_info_url()
    jobs = []
    companies = []
    work_years = []
    degrees = []
    salarys = []
    places = []
    for info_url in info_urls:
        print(info_url)
        info_page = requests.get(info_url, headers=headers)
        info_page.encoding = 'utf-8'
        info_html = info_page.text
        e_html = etree.HTML(info_html)
        # These absolute XPaths are tied to Zhaopin's page layout at the time
        # of writing and will break whenever the site's markup changes.
        try:
            job = e_html.xpath('/html/body/p[5]/p[1]/p[1]/h1')[0].text
            jobs.append(job)
            company = e_html.xpath('/html/body/p[5]/p[1]/p[1]/h2/a')[0].text
            companies.append(company)
            work_year = e_html.xpath('/html/body/p[6]/p[1]/ul/li[5]/strong')[0].text
            work_years.append(work_year)
            degree = e_html.xpath('/html/body/p[6]/p[1]/ul/li[6]/strong')[0].text
            degrees.append(degree)
            salary = e_html.xpath('/html/body/p[6]/p[1]/ul/li[1]/strong')[0].text
            salarys.append(salary.split('元')[0])  # keep only the number before '元' (yuan)
            place = e_html.xpath('/html/body/p[6]/p[1]/ul/li[2]/strong/a')[0].text
            places.append(place)
        except IndexError:  # an XPath matched nothing; skip this listing
            pass
    return jobs, companies, work_years, degrees, salarys, places
if __name__ == '__main__':
    t = threading.Thread(target=get_infos)
    t.start()
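A caveat on the __main__ block: starting a single thread is no faster than calling get_infos() directly. To actually benefit from threading, the URL list would have to be split across several workers, with a lock guarding the shared result lists. A minimal sketch of that pattern, using dummy stand-in URLs instead of real HTTP requests:

```python
import threading

urls = ['http://jobs.example.com/%d.htm' % i for i in range(10)]  # stand-in URLs
results = []
lock = threading.Lock()

def worker(chunk):
    # In the real crawler this would fetch and parse each page;
    # here we just record the URL to show the threading pattern.
    for url in chunk:
        with lock:  # protect the shared list from concurrent appends
            results.append(url)

# Give each of three threads every third URL (urls[0::3], urls[1::3], urls[2::3]).
threads = [threading.Thread(target=worker, args=(urls[i::3],)) for i in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(len(results))  # 10
```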
This code only collects the job-listing URLs and fields into lists. Follow-up posts will cover writing the data to the database and then extracting it for data visualization.
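As a preview of the "write to database" step, the six parallel lists returned by get_infos map naturally onto one table row per job. The sketch below uses the standard-library sqlite3 module as a stand-in (the post stores to local MySQL; with a MySQL driver the SQL is essentially the same, only the connection call and the %s placeholder style differ). The sample rows are dummy data in the same shape get_infos returns:

```python
import sqlite3

# Dummy data in the same shape as get_infos() returns.
jobs, companies, work_years, degrees, salarys, places = (
    ['Python开发'], ['某公司'], ['1-3年'], ['本科'], ['8001-10000'], ['北京'])

conn = sqlite3.connect(':memory:')  # use a file path for a persistent database
conn.execute('''CREATE TABLE zhaopin (
    job TEXT, company TEXT, work_year TEXT,
    degree TEXT, salary TEXT, place TEXT)''')

# Zip the parallel lists back into rows and bulk-insert them.
rows = list(zip(jobs, companies, work_years, degrees, salarys, places))
conn.executemany('INSERT INTO zhaopin VALUES (?, ?, ?, ?, ?, ?)', rows)
conn.commit()

print(conn.execute('SELECT COUNT(*) FROM zhaopin').fetchone()[0])  # 1
```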