
Crawling Two-Level Web Pages


1. Create a new sun0769 project with Scrapy

scrapy startproject sun0769
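
For reference, the command generates the standard Scrapy project skeleton:

sun0769/
    scrapy.cfg            # deploy configuration
    sun0769/              # the project package imported by the spider below
        __init__.py
        items.py          # field definitions (step 2)
        pipelines.py      # item pipelines (step 5)
        settings.py       # project settings (step 6)
        spiders/          # spider modules (steps 3 and 4)
            __init__.py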

2. Define the fields to scrape in items.py

import scrapy

class Sun0769Item(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    problem_type = scrapy.Field()
    title = scrapy.Field()
    number = scrapy.Field()
    content = scrapy.Field()
    Processing_status = scrapy.Field()
    url = scrapy.Field()
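
A Scrapy Item supports dict-style access, which parse_item below relies on. A quick shell check with hypothetical values:

item = Sun0769Item(title="test", number="123")  # hypothetical values
print(item['title'])  # -> test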

3. Quickly generate a CrawlSpider template

scrapy genspider -t crawl dongguan wz.sun0769.com

Note: the spider name here must not be the same as the project name; Scrapy refuses to generate it, and the spider module would otherwise shadow the sun0769 package and break imports like from sun0769.items import Sun0769Item.

4. Open dongguan.py and write the code

# -*- coding: utf-8 -*-
# import the scrapy module
import scrapy
# import the link-extractor class, used to pull out links matching a rule
from scrapy.linkextractors import LinkExtractor
# import the CrawlSpider class and Rule
from scrapy.spiders import CrawlSpider, Rule
# import the item class defined in items.py
from sun0769.items import Sun0769Item

class DongguanSpider(CrawlSpider):
    name = 'dongguan'
    allowed_domains = ['wz.sun0769.com']
    start_urls = ['http://d.wz.sun0769.com/index.php/question/huiyin?page=30']
    # first level: pagination links on the list pages
    pagelink = LinkExtractor(allow=r"page=\d+")
    # second level: links to the individual question pages
    pagelink2 = LinkExtractor(allow=r"/question/\d+/\d+\.shtml")

    rules = (
        # follow pagination without a callback
        Rule(pagelink, follow=True),
        # parse each detail page with parse_item
        Rule(pagelink2, callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        # print response.url
        item = Sun0769Item()
        # xpath() returns a list, hence the [0]
        # item['problem_type'] = response.xpath('//a[@class="red14"]').extract()
        item['title'] = response.xpath('//div[@class="pagecenter p3"]//strong[@class="tgray14"]/text()').extract()[0].split(" ")[-1].split(":")[-1]
        item['number'] = response.xpath('//div[@class="pagecenter p3"]//strong[@class="tgray14"]/text()').extract()[0].split(":")[1].split(" ")[0]
        # item['content'] = response.xpath().extract()
        # item['Processing_status'] = response.xpath('//div/span[@class="qgrn"]/text()').extract()[0]
        # hand the item off to the pipeline
        yield item
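
The post never shows the run step; from the project root the crawl is started with the standard command below. The two rules then divide the two-level work: the first follows pagination links without a callback, the second sends every detail page to parse_item.

scrapy crawl dongguan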

5. Write the code in pipelines.py

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html

import json

class TencentPipeline(object):
    def open_spider(self, spider):
        # open the output file once, when the spider starts
        self.filename = open("dongguan.json", "w")

    def process_item(self, item, spider):
        # serialize each item as one JSON line
        text = json.dumps(dict(item), ensure_ascii=False) + "\n"
        self.filename.write(text.encode("utf-8"))
        return item

    def close_spider(self, spider):
        self.filename.close()
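
The encode("utf-8") call assumes Python 2 (consistent with the coding cookie). A sketch of the Python 3 equivalent, where the file is opened in text mode and strings are written directly:

import json

class TencentPipeline(object):
    def open_spider(self, spider):
        # an explicit encoding on open() replaces the manual encode()
        self.filename = open("dongguan.json", "w", encoding="utf-8")

    def process_item(self, item, spider):
        # one JSON object per line
        self.filename.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item

    def close_spider(self, spider):
        self.filename.close()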

6. Configure the relevant settings in settings.py
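
The post leaves this step blank. At minimum the pipeline above must be registered, as its own docstring notes; a minimal sketch (names taken from the files above):

# settings.py
# Register the pipeline so Scrapy runs it; the number (1-1000) orders
# pipelines when several are enabled.
ITEM_PIPELINES = {
    "sun0769.pipelines.TencentPipeline": 1,
}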


Open questions:

1. How to merge content from different pages into a single item (see the sketch after this list)

2. Matching the content is still somewhat tricky (XPath, re)
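
For question 1, the usual Scrapy pattern is to carry a half-filled item to the next request through meta and finish it in that request's callback. A sketch under that assumption (the callbacks and XPath selectors here are placeholders, not taken from the original):

def parse_list(self, response):
    # hypothetical list-page callback: fill the fields this page knows
    for link in response.xpath('//a[@class="news14"]'):  # placeholder XPath
        item = Sun0769Item()
        item['title'] = link.xpath('./text()').extract_first()
        url = response.urljoin(link.xpath('./@href').extract_first())
        # attach the partial item to the follow-up request
        yield scrapy.Request(url, callback=self.parse_detail, meta={'item': item})

def parse_detail(self, response):
    # complete the item with detail-page fields, then yield it once
    item = response.meta['item']
    item['content'] = response.xpath('//div[@class="c1"]//text()').extract_first()  # placeholder XPath
    yield item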


Original article: http://www.cnblogs.com/cuzz/p/7630314.html
