
Crawling Two-Level Web Pages


1. Create a new sun0769 project with Scrapy

scrapy startproject sun0769
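
For reference, the command generates the standard Scrapy project skeleton:

sun0769/
    scrapy.cfg            # deploy configuration
    sun0769/              # the project package imported by the spider below
        __init__.py
        items.py          # field definitions (step 2)
        pipelines.py      # item pipelines (step 5)
        settings.py       # project settings (step 6)
        spiders/          # spider modules (steps 3 and 4)
            __init__.py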

2. Define the fields to scrape in items.py

import scrapy

class Sun0769Item(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    problem_type = scrapy.Field()
    title = scrapy.Field()
    number = scrapy.Field()
    content = scrapy.Field()
    Processing_status = scrapy.Field()
    url = scrapy.Field()
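
A Scrapy Item supports dict-style access, which parse_item below relies on. A quick shell check with hypothetical values:

item = Sun0769Item(title="test", number="123")  # hypothetical values
print(item['title'])  # -> test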

3. Quickly generate a CrawlSpider template

scrapy genspider -t crawl dongguan wz.sun0769.com

Note: the spider name here must not be the same as the project name; Scrapy refuses to generate it, and the spider module would otherwise shadow the sun0769 package and break imports like from sun0769.items import Sun0769Item.

4. Open dongguan.py and write the code

# -*- coding: utf-8 -*-
# import the scrapy module
import scrapy
# import the link-extractor class, used to pull out links matching a rule
from scrapy.linkextractors import LinkExtractor
# import the CrawlSpider class and Rule
from scrapy.spiders import CrawlSpider, Rule
# import the item class defined in items.py
from sun0769.items import Sun0769Item

class DongguanSpider(CrawlSpider):
    name = 'dongguan'
    allowed_domains = ['wz.sun0769.com']
    start_urls = ['http://d.wz.sun0769.com/index.php/question/huiyin?page=30']
    # first level: pagination links on the list pages
    pagelink = LinkExtractor(allow=r"page=\d+")
    # second level: links to the individual question pages
    pagelink2 = LinkExtractor(allow=r"/question/\d+/\d+\.shtml")

    rules = (
        # follow pagination without a callback
        Rule(pagelink, follow=True),
        # parse each detail page with parse_item
        Rule(pagelink2, callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        # print response.url
        item = Sun0769Item()
        # xpath() returns a list, hence the [0]
        # item['problem_type'] = response.xpath('//a[@class="red14"]').extract()
        item['title'] = response.xpath('//div[@class="pagecenter p3"]//strong[@class="tgray14"]/text()').extract()[0].split(" ")[-1].split(":")[-1]
        item['number'] = response.xpath('//div[@class="pagecenter p3"]//strong[@class="tgray14"]/text()').extract()[0].split(":")[1].split(" ")[0]
        # item['content'] = response.xpath().extract()
        # item['Processing_status'] = response.xpath('//div/span[@class="qgrn"]/text()').extract()[0]
        # hand the item off to the pipeline
        yield item
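
The post never shows the run step; from the project root the crawl is started with the standard command below. The two rules then divide the two-level work: the first follows pagination links without a callback, the second sends every detail page to parse_item.

scrapy crawl dongguan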

5. Write the code in pipelines.py

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html

import json

class TencentPipeline(object):
    def open_spider(self, spider):
        # open the output file once, when the spider starts
        self.filename = open("dongguan.json", "w")

    def process_item(self, item, spider):
        # serialize each item as one JSON line
        text = json.dumps(dict(item), ensure_ascii=False) + "\n"
        self.filename.write(text.encode("utf-8"))
        return item

    def close_spider(self, spider):
        self.filename.close()
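
The encode("utf-8") call assumes Python 2 (consistent with the coding cookie). A sketch of the Python 3 equivalent, where the file is opened in text mode and strings are written directly:

import json

class TencentPipeline(object):
    def open_spider(self, spider):
        # an explicit encoding on open() replaces the manual encode()
        self.filename = open("dongguan.json", "w", encoding="utf-8")

    def process_item(self, item, spider):
        # one JSON object per line
        self.filename.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item

    def close_spider(self, spider):
        self.filename.close()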

6. Configure the relevant settings in settings.py
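
The post leaves this step blank. At minimum the pipeline above must be registered, as its own docstring notes; a minimal sketch (names taken from the files above):

# settings.py
# Register the pipeline so Scrapy runs it; the number (1-1000) orders
# pipelines when several are enabled.
ITEM_PIPELINES = {
    "sun0769.pipelines.TencentPipeline": 1,
}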


Open questions:

1. How to merge content from different pages into a single item (see the sketch after this list)

2. Matching the content is still somewhat tricky (XPath, re)
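
For question 1, the usual Scrapy pattern is to carry a half-filled item to the next request through meta and finish it in that request's callback. A sketch under that assumption (the callbacks and XPath selectors here are placeholders, not taken from the original):

def parse_list(self, response):
    # hypothetical list-page callback: fill the fields this page knows
    for link in response.xpath('//a[@class="news14"]'):  # placeholder XPath
        item = Sun0769Item()
        item['title'] = link.xpath('./text()').extract_first()
        url = response.urljoin(link.xpath('./@href').extract_first())
        # attach the partial item to the follow-up request
        yield scrapy.Request(url, callback=self.parse_detail, meta={'item': item})

def parse_detail(self, response):
    # complete the item with detail-page fields, then yield it once
    item = response.meta['item']
    item['content'] = response.xpath('//div[@class="c1"]//text()').extract_first()  # placeholder XPath
    yield item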


Original article: http://www.cnblogs.com/cuzz/p/7630314.html
