Below is the first WebMagic framework crawler example that the blogger has collected.
Recommended reference: http://webmagic.io/docs/zh/posts/ch4-basic-page-processor/pageprocessor.html
import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.processor.PageProcessor;

public class GithubRepoPageProcessor implements PageProcessor {

    // Part 1: configuration for crawling the site, including encoding, crawl interval, retry count, etc.
    private Site site = Site.me().setRetryTimes(3).setSleepTime(1000);

    @Override
    // process() is the core interface for customizing the crawler; write the extraction logic here
    public void process(Page page) {
        // Part 2: define how to extract page information and save it
        page.putField("author", page.getUrl().regex("https://github\\.com/(\\w+)/.*").toString());
        page.putField("name", page.getHtml().xpath("//h1[@class='entry-title public']/strong/a/text()").toString());
        if (page.getResultItems().get("name") == null) {
            // skip this page
            page.setSkip(true);
        }
        page.putField("readme", page.getHtml().xpath("//div[@id='readme']/tidyText()"));

        // Part 3: discover follow-up URLs on the page to crawl
        page.addTargetRequests(page.getHtml().links().regex("(https://github\\.com/[\\w\\-]+/[\\w\\-]+)").all());
    }

    @Override
    public Site getSite() {
        return site;
    }

    public static void main(String[] args) {
        Spider.create(new GithubRepoPageProcessor())
                // start crawling from "https://github.com/code4craft"
                .addUrl("https://github.com/code4craft")
                // crawl with 5 threads
                .thread(5)
                // start the crawler
                .run();
    }
}
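By default, the fields collected via page.putField() are only printed to the console by the built-in console pipeline. As a minimal sketch (not part of the original post), the main() method could also attach WebMagic's JsonFilePipeline to persist each page's result as a JSON file; the output directory "/data/webmagic" below is an assumed example path.

    import us.codecraft.webmagic.Spider;
    import us.codecraft.webmagic.pipeline.JsonFilePipeline;

    public class GithubRepoSpiderRunner {
        public static void main(String[] args) {
            Spider.create(new GithubRepoPageProcessor())
                    .addUrl("https://github.com/code4craft")
                    // assumption: write each page's ResultItems as a JSON file under this directory
                    .addPipeline(new JsonFilePipeline("/data/webmagic"))
                    .thread(5)
                    .run();
        }
    }

A pipeline receives the ResultItems built by page.putField(), so persistence stays separate from the extraction logic in process().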
WebMagic Framework Summary
Original article: http://www.cnblogs.com/mageblog/p/7494063.html