How to Scrape Douban Images with Scrapy in Python (python, scrapy, development)

Date: 2024-05-09 17:41:53 Author: 石家庄SEO Category: Development

1. On the command line, change to the directory where the project should live and run scrapy startproject banciyuan to create the Scrapy project.

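The generated project has the standard Scrapy layout:

    banciyuan/
        scrapy.cfg
        banciyuan/
            __init__.py
            items.py
            middlewares.py
            pipelines.py
            settings.py
            spiders/
                __init__.py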

2. To make it easy to run the Scrapy project from PyCharm, create main.py in the project root:

    from scrapy import cmdline

    # Equivalent to running "scrapy crawl banciyuan" in a terminal
    cmdline.execute("scrapy crawl banciyuan".split())
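cmdline.execute expects the command as a list of arguments, which is why the string is split. As a minimal sketch, assuming you later want to pass an argument that contains spaces, shlex.split from the standard library keeps quoted parts together where str.split would break them apart. The helper name build_argv and the output file name are hypothetical.

```python
import shlex


def build_argv(command):
    """Split a command line into an argument list, honoring quotes."""
    return shlex.split(command)


# Same result as "scrapy crawl banciyuan".split()
simple = build_argv("scrapy crawl banciyuan")

# A quoted argument stays in one piece (hypothetical output file name)
quoted = build_argv('scrapy crawl banciyuan -o "douban photos.json"')

print(simple)
print(quoted)
```

Either list could then be handed to cmdline.execute in main.py.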

Next, open Edit Configurations in PyCharm and set main.py as the script to run. With that in place, running main.py starts the Scrapy project.

3. Analyze the HTML of the target page and create the corresponding spider.
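Before writing the spider, it can help to try an XPath expression on a small hand-written fragment. The sketch below uses the standard library's ElementTree, which supports only a subset of XPath, rather than Scrapy's own selectors. The markup and the example.com URLs are hypothetical stand-ins, not real Douban HTML.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified listing-page fragment (not real Douban markup)
SAMPLE = """
<div class="article">
  <ul>
    <li><div class="cover"><a href="https://example.com/photo/1/"><img src="t1.jpg"/></a></div></li>
    <li><div class="cover"><a href="https://example.com/photo/2/"><img src="t2.jpg"/></a></div></li>
  </ul>
</div>
"""


def extract_photo_links(markup):
    """Return the href of every <a> directly inside a div with class 'cover'."""
    root = ET.fromstring(markup.strip())
    # './/' searches all descendants of the root element
    return [a.get("href") for a in root.findall(".//div[@class='cover']/a")]


links = extract_photo_links(SAMPLE)
print(links)
```

The spider applies the same idea with response.xpath, which accepts full XPath and tolerates real-world HTML that is not well-formed XML.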

    from scrapy import Spider
    import scrapy

    from banciyuan.items import BanciyuanItem


    class BanciyuanSpider(Spider):
        name = 'banciyuan'
        allowed_domains = ['movie.douban.com']
        start_urls = ["https://movie.douban.com/celebrity/1025156/photos/"]
        url = "https://movie.douban.com/celebrity/1025156/photos/"

        def parse(self, response):
            # The last link in the paginator holds the total number of pages
            num = response.xpath('//div[@class="paginator"]/a[last()]/text()').extract_first('')
            print(num)
            # Fall back to a single page if no page count was found
            pages = int(num) if num.strip().isdigit() else 1
            for i in range(pages):
                # Each listing page shows 30 photos
                suffix = '?type=C&start=' + str(i * 30) + '&sortby=like&size=a&subtype=a'
                yield scrapy.Request(url=self.url + suffix, callback=self.get_page)

        def get_page(self, response):
            # Collect the link to each photo's detail page
            href_list = response.xpath('//div[@class="article"]//div[@class="cover"]/a/@href').extract()
            # print(href_list)
            for href in href_list:
                yield scrapy.Request(url=href, callback=self.get_info)

        def get_info(self, response):
            # Extract the image URL and the page title from the detail page
            src = response.xpath('//div[@class="article"]//div[@class="photo-show"]//div[@class="photo-wp"]/a[1]/img/@src').extract_first('')
            title = response.xpath('//div[@id="content"]/h2/text()').extract_first('')
            # print(response.body)
            item = BanciyuanItem()
            item['title'] = title
            item['src'] = [src]
            yield item
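The pagination logic in parse can be checked without any network access. The sketch below rebuilds the listing-page URLs with urllib.parse.urlencode instead of string concatenation. The function name listing_urls and the constant PAGE_SIZE are illustrative names, and the value 30 is taken from the i * 30 offset in the spider.

```python
from urllib.parse import urlencode

BASE_URL = "https://movie.douban.com/celebrity/1025156/photos/"
PAGE_SIZE = 30  # photos per listing page, as implied by i * 30 in the spider


def listing_urls(page_count):
    """Build one listing-page URL per page, offset by PAGE_SIZE photos each."""
    urls = []
    for page in range(page_count):
        query = urlencode({
            "type": "C",
            "start": page * PAGE_SIZE,
            "sortby": "like",
            "size": "a",
            "subtype": "a",
        })
        urls.append(BASE_URL + "?" + query)
    return urls


urls = listing_urls(3)
for u in urls:
    print(u)
```

urlencode keeps the parameters in insertion order, so the generated URLs match the ones the spider requests.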

4. items.py

    # Define here the models for your scraped items
    #
    # See documentation in:
    # https://docs.scrapy.org/en/latest/topics/items.html

    import scrapy


    class BanciyuanItem(scrapy.Item):
        # define the fields for your item here like:
        src = scrapy.Field()    # list containing the image URL
        title = scrapy.Field()  # title of the photo page

pipelines.py

    # Define your item pipelines here
    #
    # Don't forget to add your pipeline to the ITEM_PIPELINES setting
    # See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html

    # useful for handling different item types with a single interface
    from itemadapter import ItemAdapter
    from scrapy.pipelines.images import ImagesPipeline
    import scrapy


    class BanciyuanPipeline(ImagesPipeline):

        def get_media_requests(self, item, info):
            # Request the image and pass the item along for file_path
            yield scrapy.Request(url=item['src'][0], meta={'item': item})

        def file_path(self, request, response=None, info=None, *, item=None):
            item = request.meta['item']
            # Use the last URL segment as the file name
            image_name = item['src'][0].split('/')[-1]
            # image_name.replace('.webp', '.jpg')
            # Store each image in a folder named after the first word of the title
            path = '%s/%s' % (item['title'].split(' ')[0], image_name)
            return path
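The path logic in file_path is plain string handling, so it can be tried outside Scrapy. The sketch below is a standalone variant under two assumptions that are not in the original: it splits the title on any whitespace, so leading spaces or newlines do not produce an empty folder name, and it falls back to a hypothetical "untitled" folder when the title is empty. The sample URLs are made up.

```python
def image_path(title, src):
    """Return '<first word of title>/<file name from URL>'."""
    image_name = src.split("/")[-1]
    words = title.split()  # splits on any run of whitespace
    folder = words[0] if words else "untitled"  # assumed fallback name
    return "%s/%s" % (folder, image_name)


print(image_path("  Name 的图片", "https://img.example.com/view/photo/l/public/p123.webp"))
print(image_path("", "https://img.example.com/view/photo/l/public/p456.jpg"))
```

The returned path is relative; Scrapy joins it with the IMAGES_STORE directory from settings.py.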

settings.py

    # Scrapy settings for banciyuan project
    #
    # For simplicity, this file contains only settings considered important or
    # commonly used. You can find more settings consulting the documentation:
    #
    #     https://docs.scrapy.org/en/latest/topics/settings.html
    #     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
    #     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

    BOT_NAME = 'banciyuan'

    SPIDER_MODULES = ['banciyuan.spiders']
    NEWSPIDER_MODULE = 'banciyuan.spiders'

    # Crawl responsibly by identifying yourself (and your website) on the user-agent
    # USER_AGENT must be a plain string, not a dict of headers
    USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.80 Safari/537.36'

    # Obey robots.txt rules
    ROBOTSTXT_OBEY = False

    # Configure maximum concurrent requests performed by Scrapy (default: 16)
    #CONCURRENT_REQUESTS = 32

    # Configure a delay for requests for the same website (default: 0)
    # See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
    # See also autothrottle settings and docs
    #DOWNLOAD_DELAY = 3
    # The download delay setting will honor only one of:
    #CONCURRENT_REQUESTS_PER_DOMAIN = 16
    #CONCURRENT_REQUESTS_PER_IP = 16

    # Disable cookies (enabled by default)
    #COOKIES_ENABLED = False

    # Disable Telnet Console (enabled by default)
    #TELNETCONSOLE_ENABLED = False

    # Override the default request headers:
    #DEFAULT_REQUEST_HEADERS = {
    #    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    #    'Accept-Language': 'en',
    #}

    # Enable or disable spider middlewares
    # See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
    #SPIDER_MIDDLEWARES = {
    #    'banciyuan.middlewares.BanciyuanSpiderMiddleware': 543,
    #}

    # Enable or disable downloader middlewares
    # See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
    #DOWNLOADER_MIDDLEWARES = {
    #    'banciyuan.middlewares.BanciyuanDownloaderMiddleware': 543,
    #}

    # Enable or disable extensions
    # See https://docs.scrapy.org/en/latest/topics/extensions.html
    #EXTENSIONS = {
    #    'scrapy.extensions.telnet.TelnetConsole': None,
    #}

    # Configure item pipelines
    # See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
    ITEM_PIPELINES = {
        'banciyuan.pipelines.BanciyuanPipeline': 1,
    }
    # Directory where the images pipeline stores downloaded files
    IMAGES_STORE = './images'

    # Enable and configure the AutoThrottle extension (disabled by default)
    # See https://docs.scrapy.org/en/latest/topics/autothrottle.html
    #AUTOTHROTTLE_ENABLED = True
    # The initial download delay
    #AUTOTHROTTLE_START_DELAY = 5
    # The maximum download delay to be set in case of high latencies
    #AUTOTHROTTLE_MAX_DELAY = 60
    # The average number of requests Scrapy should be sending in parallel to
    # each remote server
    #AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
    # Enable showing throttling stats for every response received:
    #AUTOTHROTTLE_DEBUG = False

    # Enable and configure HTTP caching (disabled by default)
    # See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
    #HTTPCACHE_ENABLED = True
    #HTTPCACHE_EXPIRATION_SECS = 0
    #HTTPCACHE_DIR = 'httpcache'
    #HTTPCACHE_IGNORE_HTTP_CODES = []
    #HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
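Scrapy's ImagesPipeline depends on the Pillow library, which is imported under the name PIL. If Pillow is missing, the pipeline may be disabled and no images are saved. The sketch below is one way to check for it before starting a crawl, using only the standard library. The helper name has_module is illustrative.

```python
import importlib.util


def has_module(name):
    """Return True if a top-level module with this name can be imported."""
    return importlib.util.find_spec(name) is not None


if not has_module("PIL"):
    print("Pillow is not installed; install it with: pip install Pillow")
else:
    print("Pillow is available")
```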

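5. Crawl results

With the settings above, the downloaded images are saved under ./images, each in a subfolder named after the first word of its photo page title.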
