如何用Python获取成都租房信息(python,编程语言)

如何用Python获取成都租房信息

导读：本文共5032字符，通常情况下阅读需要17分钟。同时您也可以点击右侧朗读，来听本文内容。按键盘←（左） →（右）方向键可以翻页。

摘要：信息数据的获取，这里首先收集赶集网和自如网的信息。1. 赶集网信息获取I. 获取当页内容这里的规则比较明显，获取网页内容用xpath解析即可，各个板块的信息都很容易获取，最后用列表保存并返回即可，首先循环出每个divs块，对里面的每个版块内容逐个获取defget_this_page_gj(url,tmp):html=etree.HTML(requests.ge... ...

音频解说

defget_this_page_gj(url,tmp):html=etree.HTML(requests.get(url).text)divs=html.xpath('//div[@class="f-list-itemershoufang-list"]')fordivindivs:title=div.xpath('./dl/dd[@class="dd-itemtitle"]/a/text()')[0]house_url=div.xpath('./dl/dd[@class="dd-itemtitle"]/a/@href')[0]size="、".join(div.xpath('./dl/dd[@class="dd-itemsize"]/span/text()'))address='-'.join([data.strip()fordataindivs[0].xpath('./dl/dd[@class="dd-itemaddress"][1]//a//text()')ifdata.strip()!=''])agent_string=div.xpath('./dl/dd[@class="dd-itemaddress"][2]/span/span/text()')[0]agent=re.sub('','',agent_string)price=div.xpath('./dl/dd[@class="dd-iteminfo"]/div[@class="price"]/span[@class="num"]/text()')[0]tmp.append([title,size,price,address,agent,house_url])returntmp

II. URL构造

访问首页链接，获取总页数，按照url的访问规则构造url，调用获取当页数据的方法即可，这里的url都是以http://cd.ganji.com/zufang/pn开头的，后面跟上网页的页码

defhouse_gj(headers):index_url='http://cd.ganji.com/zufang/'html=etree.HTML(get_html(index_url,headers))total=html.xpath('//div[@class="pageBox"]/a[position()=last()-1]/span/text()')[0]result=[]fornuminrange(1,int(total)+1):result+=get_this_page_gj('http://cd.ganji.com/zufang/pn{}'.format(num),[])print('完成读取第{}页/赶集网'.format(num))returnresult

2 .

这里和赶集网类似，结构也相似，同样的获取方式，我们也抓取基础信息加url链接，区别在于这里的价格可能不太好获取，并不是直接显示，而是以图片+偏移量的形式展示

如何用Python获取成都租房信息

1. 价格获取

每个数字对应一张图片，图片中的数字会根据style中设置的偏移去原图中获取，每页的原图也不尽相同，所以处理起来比较麻烦

如何用Python获取成都租房信息

这里我们仔细留心的会发现其实每个数字间的间距是一样的，可以自己在页面上更改数值查看规律，每个数字间的距离是21.4px，从原图的左边开始做偏移，根据偏移确定对应的数字，返回的数字下标 = |偏移量/21.4|,当然这里根据页面图片、内容等元素会有微小的误差，但都是极小的误差了，最后取个整去原图的数字列表中取得对应下标的值即可，这里我们用到tesseract来对图片进行解析

............price_strings=div.xpath('./div[@class="info-box"]/div[@class="price"]/span[@class="num"]/@style')offset_list=[]fordatainprice_strings:offset_list.append(re.findall('position:(.*?)px',data)[0])style_string=html.xpath('//div[@class="info-box"]/div[@class="price"]/span[@class="num"]/@style')[0]pic="http:"+re.findall(r'background-image:url\((.*?)\);.*?',style_string)[0]price=get_price_zr(pic,offset_list)defget_price_zr(pic_url,offset_list):'''这里的index保存所有数字的下标值，等待图片解析完成获取对应下标的价格数字'''index,price=[],[]withopen('pic.png','wb')asf:f.write(requests.get(pic_url).content)code_list=list(pytesseract.image_to_string(Image.open('pic.png')))fordatainoffset_list:index.append(int(math.fabs(eval(data)/21.4)))fordatainindex:price.append(code_list[data])return"".join(price)

pic_url是每页的原图地址，将之下载下来后用pytesseract解析，最后返回每个下标对应的数字所组成的新的数字字符串(价格),offset_list是获取的每个数字的偏移值组成的列表

2. 自如网数据获取

这里和赶集网类似，结构也相似，同样的获取方式，我们也抓取基础信息加url链接，区别在于这里的价格可能不太好获取，并不是直接显示，而是以图片+偏移量的形式展示

如何用Python获取成都租房信息

I. 价格获取

每个数字对应一张图片，图片中的数字会根据style中设置的偏移去原图中获取，每页的原图也不尽相同，所以处理起来比较麻烦

如何用Python获取成都租房信息

这里我们仔细留心的会发现其实每个数字间的间距是一样的，可以自己在页面上更改数值查看规律，每个数字间的距离是21.4px，从原图的左边开始做偏移，根据偏移确定对应的数字，返回的数字下标 = |偏移量/21.4|,当然这里根据页面图片、内容等元素会有微小的误差，但都是极小的误差了，最后取个整去原图的数字列表中取得对应下标的值即可，这里我们用到tesseract来对图片进行解析

............price_strings=div.xpath('./div[@class="info-box"]/div[@class="price"]/span[@class="num"]/@style')offset_list=[]fordatainprice_strings:offset_list.append(re.findall('position:(.*?)px',data)[0])style_string=html.xpath('//div[@class="info-box"]/div[@class="price"]/span[@class="num"]/@style')[0]pic="http:"+re.findall(r'background-image:url\((.*?)\);.*?',style_string)[0]price=get_price_zr(pic,offset_list)defget_price_zr(pic_url,offset_list):'''这里的index保存所有数字的下标值，等待图片解析完成获取对应下标的价格数字'''index,price=[],[]withopen('pic.png','wb')asf:f.write(requests.get(pic_url).content)code_list=list(pytesseract.image_to_string(Image.open('pic.png')))fordatainoffset_list:index.append(int(math.fabs(eval(data)/21.4)))fordatainindex:price.append(code_list[data])return"".join(price)

pic_url是每页的原图地址，将之下载下来后用pytesseract解析，最后返回每个下标对应的数字所组成的新的数字字符串(价格),offset_list是获取的每个数字的偏移值组成的列表

II. 获取当页数据

这里和赶集网类似，我们构造获取每页数据的函数，之后调用函数传入每页的url即可，这里可以关注一下xpath的扩展用法(contains函数)和正则获取原图链接

defget_this_page_zr(url,tmp):html=etree.HTML(requests.get(url).text)divs=html.xpath('//div[@class="item"]')fordivindivs:ifdiv.xpath('./div[@class="info-box"]/h6/a/text()'):title=div.xpath('./div[@class="info-box"]/h6/a/text()')[0]else:continuelink='http:'+div.xpath('./div[@class="info-box"]/h6/a/@href')[0]location=div.xpath('./div[@class="info-box"]/div[@class="desc"]/div[@class="location"]/text()')[0]area=div.xpath('./div[@class="info-box"]/div[@class="desc"]/div[contains(text(),"㎡")]/text()')[0]price_strings=div.xpath('./div[@class="info-box"]/div[@class="price"]/span[@class="num"]/@style')offset_list=[]fordatainprice_strings:offset_list.append(re.findall('position:(.*?)px',data)[0])style_string=html.xpath('//div[@class="info-box"]/div[@class="price"]/span[@class="num"]/@style')[0]pic="http:"+re.findall(r'background-image:url\((.*?)\);.*?',style_string)[0]price=get_price_zr(pic,offset_list)tag='、'.join(div.xpath('./div[@class="info-box"]//div[@class="tag"]/span/text()'))tmp.append([title,tag,price,area,location,link])returntmp

III. url构造

原理同赶集网的一样，主要关注一下xpath的扩展用法position()=last()

defhouse_zr(headers):index_url='http://cd.ziroom.com/z/'html=etree.HTML(get_html(index_url,headers))total=html.xpath('//div[@class="Z_pages"]/a[position()=last()-1]/text()')[0]result=[]fornuminrange(1,int(total)+1):result+=get_this_page_zr('http://cd.ziroom.com/z/p{}/'.format(num),[])print('完成读取第{}页/自如网'.format(num))returnresult

 </div> <div class="zixun-tj-product adv-bottom"></div> </div> </div> <div class="prve-next-news">

本文：如何用Python获取成都租房信息的详细内容，希望对您有所帮助，信息来源于网络。

如何用Python获取成都租房信息(python,编程语言)

目录

9 人围观 / 0 条评论 ↓快速评论↓

搜索

最新文章

猜你喜欢

特价优惠

标签

流量统计