WebMagic is built with Maven, and Maven is the recommended way to install it. Add the following coordinates to your own project (an existing one or a new one):
<dependency>
    <groupId>us.codecraft</groupId>
    <artifactId>webmagic-core</artifactId>
    <version>0.7.3</version>
</dependency>
<dependency>
    <groupId>us.codecraft</groupId>
    <artifactId>webmagic-extension</artifactId>
    <version>0.7.3</version>
</dependency>
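WebMagic uses slf4j-log4j12 as its slf4j implementation. If you have set up your own slf4j implementation, remove this dependency from your project.
The following snippet excludes the dependency: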
<dependency>
    <groupId>us.codecraft</groupId>
    <artifactId>webmagic-extension</artifactId>
    <version>0.7.3</version>
    <exclusions>
        <exclusion>
            <groupId>org.slf4j</groupId>
            <artifactId>slf4j-log4j12</artifactId>
        </exclusion>
    </exclusions>
</dependency>
Example (tested and working):
import java.util.concurrent.atomic.AtomicInteger;

import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.processor.PageProcessor;

public class MyProcessor implements PageProcessor {
    // Site-level crawl settings, such as encoding, crawl interval, and retry count
    private Site site = Site.me().setRetryTimes(3).setSleepTime(100);
    // Number of pages handled; atomic because the spider below runs with 5 threads
    private static final AtomicInteger count = new AtomicInteger();

    @Override
    public Site getSite() {
        return site;
    }

    @Override
    public void process(Page page) {
        // Test the URL against the form
        // http://www.cnblogs.com/<letters, digits, hyphens>/p/<7 digits>.html;
        // the block below runs only when the URL does NOT match
        if (!page.getUrl().regex("http://www.cnblogs.com/[a-z 0-9 -]+/p/[0-9]{7}.html").match()) {
            // Queue the links that meet the condition
            page.addTargetRequests(
                    page.getHtml().xpath("//*[@id=\"post_list\"]/div/div[@class='post_item_body']/h3/a/@href").all());
            // Extract the content we need from the page
            System.out.println("Fetched content: " +
                    page.getHtml().xpath("//*[@id=\"Header1_HeaderTitle\"]/text()").get()
            );
            count.incrementAndGet();
        }
    }

    public static void main(String[] args) {
        long startTime, endTime;
        System.out.println("Crawl started...");
        startTime = System.currentTimeMillis();
        Spider.create(new MyProcessor()).addUrl("https://www.cnblogs.com/").thread(5).run();
        endTime = System.currentTimeMillis();
        System.out.println("Crawl finished in about " + ((endTime - startTime) / 1000)
                + " seconds, " + count.get() + " records fetched");
    }
}
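The URL check in process() can be tried on its own with plain java.util.regex, which makes it easier to see which URLs the pattern accepts. The sketch below is not part of the original example. It assumes the check behaves like a substring search (Matcher.find), which is an assumption about how regex(...).match() works and not something stated above. The class name UrlPatternDemo, the helper matches, and the STRICT pattern are hypothetical names introduced only for illustration.

```java
import java.util.regex.Pattern;

class UrlPatternDemo {
    // Pattern copied verbatim from the example above.
    static final Pattern ORIGINAL =
            Pattern.compile("http://www.cnblogs.com/[a-z 0-9 -]+/p/[0-9]{7}.html");

    // Hypothetical stricter variant: accepts http or https and escapes the dots.
    static final Pattern STRICT =
            Pattern.compile("https?://www\\.cnblogs\\.com/[a-z0-9-]+/p/[0-9]{7}\\.html");

    // Assumption: the check is a substring search, so find() is used here.
    static boolean matches(Pattern pattern, String url) {
        return pattern.matcher(url).find();
    }

    public static void main(String[] args) {
        String httpPost = "http://www.cnblogs.com/junlei0829/p/9409148.html";
        String httpsPost = "https://www.cnblogs.com/junlei0829/p/9409148.html";
        String home = "https://www.cnblogs.com/";

        System.out.println(matches(ORIGINAL, httpPost));  // true
        System.out.println(matches(ORIGINAL, httpsPost)); // false: "http://" does not occur in "https://"
        System.out.println(matches(STRICT, httpsPost));   // true
        System.out.println(matches(STRICT, home));        // false
    }
}
```

Under that assumption, an https post URL does not match the original pattern, so with the https start URL used above such pages would take the same branch as the home page. Whether to switch to a pattern like STRICT depends on which pages the content extraction is meant to run on.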
A simple WebMagic example
Original source: https://www.cnblogs.com/junlei0829/p/9409148.html