Advanced use of meta in Scrapy

Published: 2023-09-06 02:30 | Editor: 郭大石 | Keywords: meta

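The example below crawls the books on a website. Each record has to include the top-level category, the sub-category, and the other details of the book.

The site has four kinds of pages: top-level category pages, sub-category pages, book list pages, and book detail pages.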

Each record must carry all of this information, so meta is used to pass the data gathered from the current response on to the next parse callback.

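Why is a deep copy needed? Scrapy schedules requests asynchronously, so a callback runs after the loop that created the request has already moved on. Without a copy, a record that is still being assembled would have its earlier fields overwritten when a later iteration reassigns values on the same item.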
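The effect can be reproduced without Scrapy. The following is a minimal sketch, not the original author's code: the `collect` function and the category names are hypothetical, and a plain list stands in for the scheduler's request queue. It reuses one `item` dict across loop iterations, as the spider does, and compares passing the dict itself with passing a deep copy.

```python
from copy import deepcopy


def collect(copy_item):
    """Queue one meta dict per sub-category, then read them back afterwards."""
    queued = []  # stand-in for the scheduler queue (assumption for illustration)
    item = {"b_cate": "Fiction"}
    for s_cate in ["Sci-Fi", "Mystery"]:
        item["s_cate"] = s_cate
        # Either hand over the shared dict or an independent snapshot of it.
        meta = {"item": deepcopy(item) if copy_item else item}
        queued.append(meta)
    # The "callbacks" only see the data after the loop has finished.
    return [meta["item"]["s_cate"] for meta in queued]


shared = collect(copy_item=False)
copied = collect(copy_item=True)
print(shared)  # ['Mystery', 'Mystery']
print(copied)  # ['Sci-Fi', 'Mystery']
```

Without the copy, every queued entry points at the same dict, so both show the last sub-category. With `deepcopy`, each entry keeps the values it had when the request was created.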

import re
from copy import deepcopy

import scrapy

# The methods below belong to the spider class, whose definition the original does not show.

    def parse(self, response):
        # 1. Group by top-level category
        li_list = response.xpath("//ul[@class='ulwrap']/li")
        for li in li_list:
            item = {}
            item["b_cate"] = li.xpath("./div[1]/a/text()").extract_first()
            # 2. Group by sub-category
            a_list = li.xpath("./div[2]/a")
            for a in a_list:
                item["s_href"] = a.xpath("./@href").extract_first()
                item["s_cate"] = a.xpath("./text()").extract_first()
                if item["s_href"] is not None:
                    item["s_href"] = "http://snbook.suning.com/" + item["s_href"]
                    yield scrapy.Request(
                        item["s_href"],
                        callback=self.parse_book_list,
                        # Snapshot the item so later loop iterations cannot overwrite it
                        meta={"item": deepcopy(item)}
                    )

    def parse_book_list(self, response):
        item = deepcopy(response.meta["item"])
        # Group the entries on the book list page
        li_list = response.xpath("//div[@class='filtrate-books list-filtrate-books']/ul/li")
        for li in li_list:
            item["book_name"] = li.xpath(".//div[@class='book-title']/a/@title").extract_first()
            item["book_img"] = li.xpath(".//div[@class='book-img']//img/@src").extract_first()
            if item["book_img"] is None:
                # Some images keep their URL in the src2 attribute instead
                item["book_img"] = li.xpath(".//div[@class='book-img']//img/@src2").extract_first()
            item["book_author"] = li.xpath(".//div[@class='book-author']/a/text()").extract_first()
            item["book_press"] = li.xpath(".//div[@class='book-publish']/a/text()").extract_first()
            item["book_desc"] = li.xpath(".//div[@class='book-descrip c6']/text()").extract_first()
            item["book_href"] = li.xpath(".//div[@class='book-title']/a/@href").extract_first()
            yield scrapy.Request(
                item["book_href"],
                callback=self.parse_book_detail,
                # Pass the item on to the next parse callback
                meta={"item": deepcopy(item)}
            )

        # Pagination
        page_count = int(re.findall("var pagecount=(.*?);", response.body.decode())[0])
        current_page = int(re.findall("var currentPage=(.*?);", response.body.decode())[0])
        if current_page < page_count:
            next_url = item["s_href"] + "?pageNumber={}&sort=0".format(current_page + 1)
            yield scrapy.Request(
                next_url,
                callback=self.parse_book_list,
                meta={"item": response.meta["item"]}
            )

    def parse_book_detail(self, response):
        item = response.meta["item"]
        # The price is embedded in the page source rather than in the HTML markup
        item["book_price"] = re.findall("\"bp\":'(.*?)',", response.body.decode())
        item["book_price"] = item["book_price"][0] if len(item["book_price"]) > 0 else None
        print(item)
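The pagination step can be looked at on its own. The sketch below is an illustration under assumptions, not the author's implementation. The helper name `next_page_url`, the sample HTML, and the `example.com` URL are hypothetical. The `pagecount` and `currentPage` variable names and the `?pageNumber={}&sort=0` query string come from the code above. Unlike the original, it returns `None` instead of raising an `IndexError` when either variable is missing from the page.

```python
import re


def next_page_url(html, base_url):
    """Return the URL of the next list page, or None if there is none."""
    page_count = re.search(r"var pagecount=(\d+);", html)
    current_page = re.search(r"var currentPage=(\d+);", html)
    if page_count is None or current_page is None:
        # The page does not expose the paging variables; stop paginating.
        return None
    current = int(current_page.group(1))
    if current < int(page_count.group(1)):
        return "{}?pageNumber={}&sort=0".format(base_url, current + 1)
    return None


# Hypothetical page sources, for illustration only.
first_page = "<script>var pagecount=3; var currentPage=1;</script>"
last_page = "<script>var pagecount=3; var currentPage=3;</script>"

next_from_first = next_page_url(first_page, "http://example.com/list")
next_from_last = next_page_url(last_page, "http://example.com/list")
print(next_from_first)  # http://example.com/list?pageNumber=2&sort=0
print(next_from_last)   # None
```

In a spider, the returned URL would be yielded in a new `scrapy.Request` with the same `meta`, so that the category fields travel along to the next list page.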

Original article: https://www.cnblogs.com/tangkaishou/p/10268388.html
