Below is the first WebMagic framework crawler example that the blogger has collected.
Recommended reference: http://webmagic.io/docs/zh/posts/ch4-basic-page-processor/pageprocessor.html
import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.processor.PageProcessor;

public class GithubRepoPageProcessor implements PageProcessor {

    // Part 1: configuration for crawling the site, including encoding, crawl interval, retry count, etc.
    private Site site = Site.me().setRetryTimes(3).setSleepTime(1000);

    @Override
    // process() is the core interface for customizing the crawler; write the extraction logic here
    public void process(Page page) {
        // Part 2: define how to extract page information and save it
        page.putField("author", page.getUrl().regex("https://github\\.com/(\\w+)/.*").toString());
        page.putField("name", page.getHtml().xpath("//h1[@class='entry-title public']/strong/a/text()").toString());
        if (page.getResultItems().get("name") == null) {
            // skip this page
            page.setSkip(true);
        }
        page.putField("readme", page.getHtml().xpath("//div[@id='readme']/tidyText()"));

        // Part 3: discover follow-up URLs on the page to crawl
        page.addTargetRequests(page.getHtml().links().regex("(https://github\\.com/[\\w\\-]+/[\\w\\-]+)").all());
    }

    @Override
    public Site getSite() {
        return site;
    }

    public static void main(String[] args) {
        Spider.create(new GithubRepoPageProcessor())
                // start crawling from "https://github.com/code4craft"
                .addUrl("https://github.com/code4craft")
                // crawl with 5 threads
                .thread(5)
                // start the crawler
                .run();
    }
}
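By default, the fields collected via page.putField() are only printed to the console by the built-in console pipeline. As a minimal sketch (not part of the original post), the main() method could also attach WebMagic's JsonFilePipeline to persist each page's result as a JSON file; the output directory "/data/webmagic" below is an assumed example path.

    import us.codecraft.webmagic.Spider;
    import us.codecraft.webmagic.pipeline.JsonFilePipeline;

    public class GithubRepoSpiderRunner {
        public static void main(String[] args) {
            Spider.create(new GithubRepoPageProcessor())
                    .addUrl("https://github.com/code4craft")
                    // assumption: write each page's ResultItems as a JSON file under this directory
                    .addPipeline(new JsonFilePipeline("/data/webmagic"))
                    .thread(5)
                    .run();
        }
    }

A pipeline receives the ResultItems built by page.putField(), so persistence stays separate from the extraction logic in process().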
WebMagic Framework Summary
Original article: http://www.cnblogs.com/mageblog/p/7494063.html