Golang Series: Fetching Web Pages Concurrently

Published: 2023-09-06 02:13 | Editor: 赖小花 | Keywords: none

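In the previous article, we fetched the web page for a URL given on the command line and saved it to the local disk. This time we look at how to use concurrency to fetch pages from several sites.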

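First, we make a small change to the previous code so that it can fetch several sites. In the code below, we define three URLs, send the requests one at a time, save each response, and finally report the total time spent: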

// fetch.go
package main

import (
    "fmt"
    "io"
    "net/http"
    "os"
    "regexp"
    "time"
)

// RE matches the trailing "name.tld" part of a URL, e.g. "qq.com".
var RE = regexp.MustCompile(`\w+\.\w+$`)

func main() {
    urls := []string{
        "http://www.qq.com",
        "http://www.163.com",
        "http://www.sina.com",
    }

    // Start time for the whole run.
    totalStart := time.Now()

    for _, url := range urls {
        // Start time for this URL.
        start := time.Now()

        // Send the HTTP request.
        res, err := http.Get(url)
        if err != nil {
            fmt.Fprintf(os.Stderr, "fetch: %v\n", err)
            os.Exit(1)
        }

        // Read the response body, then close it.
        // io.ReadAll and os.WriteFile replace the deprecated ioutil functions (Go 1.16+).
        body, err := io.ReadAll(res.Body)
        res.Body.Close()
        if err != nil {
            fmt.Fprintf(os.Stderr, "fetch: reading %s: %v\n", url, err)
            os.Exit(1)
        }

        // Write the body to a file named after the site.
        fileName := getFileName(url)
        if err := os.WriteFile(fileName, body, 0644); err != nil {
            fmt.Fprintf(os.Stderr, "fetch: writing %s: %v\n", fileName, err)
            os.Exit(1)
        }

        // Time spent on this URL.
        elapsed := time.Since(start).Seconds()
        fmt.Printf("%.2fs %s\n", elapsed, fileName)
    }

    // Total time spent.
    elapsed := time.Since(totalStart).Seconds()
    fmt.Printf("%.2fs elapsed\n", elapsed)
}

// getFileName builds the output file name from the trailing part of the URL.
func getFileName(url string) string {
    return RE.FindString(url) + ".txt"
}

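In the code above, we use a regular expression to match the trailing part of the domain in the URL (for example, qq.com) and use it as the file name. Regular expressions will be covered in a later article.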
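To make the file-name rule concrete, here is a minimal sketch that applies the same pattern to a few inputs. The names hostTail and fileNameFor are hypothetical stand-ins for RE and getFileName, and the last two URLs are made-up inputs added only to show where the pattern stops working as intended.

```go
package main

import (
	"fmt"
	"regexp"
)

// hostTail uses the same pattern as RE in the article: two word-character
// runs separated by a dot, anchored at the end of the string.
var hostTail = regexp.MustCompile(`\w+\.\w+$`)

// fileNameFor is a hypothetical helper mirroring getFileName.
func fileNameFor(url string) string {
	return hostTail.FindString(url) + ".txt"
}

func main() {
	for _, u := range []string{
		"http://www.qq.com",            // from the article
		"http://www.qq.com/",           // assumption: trailing slash, nothing matches
		"http://www.qq.com/index.html", // assumption: the path matches, not the domain
	} {
		fmt.Printf("%q -> %q\n", u, fileNameFor(u))
	}
}
```

Because the pattern is anchored at the end of the string, it only yields a domain-based name for bare URLs like the three in the article. For URLs with a path, parsing the host with net/url would likely be a more robust choice.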

Here is the console output after running the program:

$ ./fetch
0.12s qq.com.txt
0.20s 163.com.txt
0.27s sina.com.txt
0.59s elapsed

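The output shows that the total time is the sum of the three individual fetches. This approach is inefficient, because the program sits idle while it waits for each response. Next we rework the program so that the three fetches run concurrently: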

// fetch.go
package main

import (
    "fmt"
    "io"
    "net/http"
    "os"
    "regexp"
    "time"
)

// RE matches the trailing "name.tld" part of a URL, e.g. "qq.com".
var RE = regexp.MustCompile(`\w+\.\w+$`)

func main() {
    urls := []string{
        "http://www.qq.com",
        "http://www.163.com",
        "http://www.sina.com",
    }

    // Create the channel the goroutines report on.
    ch := make(chan string)

    // Start time for the whole run.
    start := time.Now()

    for _, url := range urls {
        // Start one goroutine per URL.
        go fetch(url, ch)
    }

    for range urls {
        // Print one message from the channel per URL.
        fmt.Println(<-ch)
    }

    // Total time spent.
    elapsed := time.Since(start).Seconds()
    fmt.Printf("%.2fs elapsed\n", elapsed)
}

// fetch downloads url, saves it to a file, and sends exactly one message on ch.
func fetch(url string, ch chan<- string) {
    start := time.Now()

    // Send the HTTP request.
    res, err := http.Get(url)
    if err != nil {
        // Report the error and return. Calling os.Exit here would
        // terminate the whole program, including the other fetches.
        ch <- fmt.Sprint(err)
        return
    }

    // Read the response body, then close it.
    // io.ReadAll and os.WriteFile replace the deprecated ioutil functions (Go 1.16+).
    body, err := io.ReadAll(res.Body)
    res.Body.Close()
    if err != nil {
        ch <- fmt.Sprintf("while reading %s: %v", url, err)
        return
    }

    // Write the body to a file named after the site.
    if err := os.WriteFile(getFileName(url), body, 0644); err != nil {
        ch <- fmt.Sprintf("while writing %s: %v", url, err)
        return
    }

    // Report the time spent on this URL.
    elapsed := time.Since(start).Seconds()
    ch <- fmt.Sprintf("%.2fs %s", elapsed, url)
}

// getFileName builds the output file name from the trailing part of the URL.
func getFileName(url string) string {
    return RE.FindString(url) + ".txt"
}

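In the code above, we first create a channel and then start one goroutine per fetch. When a fetch finishes, it sends a message on the channel to notify the main goroutine, which then handles it (here, by printing it). The details of how this works will be covered in a later article.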
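The goroutine-plus-channel pattern described above can be tried without any network access. The following is a minimal sketch in which time.Sleep stands in for the HTTP request; the task type, the demoTasks values, and the work and runAll functions are all hypothetical names introduced for illustration.

```go
package main

import (
	"fmt"
	"time"
)

// task is a hypothetical stand-in for one URL to fetch.
type task struct {
	name  string
	delay time.Duration // simulated network latency (made-up value)
}

var demoTasks = []task{
	{"a", 10 * time.Millisecond},
	{"b", 30 * time.Millisecond},
	{"c", 50 * time.Millisecond},
}

// work simulates a fetch and sends exactly one message on ch when done.
func work(t task, ch chan<- string) {
	time.Sleep(t.delay)
	ch <- fmt.Sprintf("%s done", t.name)
}

// runAll starts one goroutine per task, then receives one message per task.
func runAll(tasks []task) []string {
	ch := make(chan string) // unbuffered: each send waits for a receive
	for _, t := range tasks {
		go work(t, ch)
	}
	results := make([]string, 0, len(tasks))
	for range tasks {
		results = append(results, <-ch)
	}
	return results
}

func main() {
	for _, msg := range runAll(demoTasks) {
		fmt.Println(msg)
	}
}
```

The receive loop runs once per task, so every goroutine must send exactly one message, on success or on failure. If a goroutine returned without sending, the main goroutine would block forever on the channel.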

Running the program gives the following result:

$ ./fetch
0.10s http://www.qq.com
0.19s http://www.163.com
0.29s http://www.sina.com
0.29s elapsed

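The result shows that the total time is now about the same as the slowest single fetch (0.29s) rather than the sum of all three, so the gain from concurrency is substantial.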
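The "sum versus slowest" effect can be reproduced locally with simulated delays. This is only a sketch: demoDelays holds made-up latencies, sequential and concurrent are hypothetical helpers, and the measured timings will vary from run to run.

```go
package main

import (
	"fmt"
	"time"
)

// demoDelays are made-up latencies standing in for three network fetches.
var demoDelays = []time.Duration{
	20 * time.Millisecond,
	30 * time.Millisecond,
	40 * time.Millisecond,
}

// sequential runs the simulated fetches one after another
// and returns the total time taken.
func sequential(delays []time.Duration) time.Duration {
	start := time.Now()
	for _, d := range delays {
		time.Sleep(d)
	}
	return time.Since(start)
}

// concurrent runs each simulated fetch in its own goroutine and returns
// the time until all of them have reported back on the channel.
func concurrent(delays []time.Duration) time.Duration {
	start := time.Now()
	done := make(chan struct{})
	for _, d := range delays {
		go func(d time.Duration) {
			time.Sleep(d)
			done <- struct{}{}
		}(d)
	}
	for range delays {
		<-done
	}
	return time.Since(start)
}

func main() {
	fmt.Printf("sequential: %v\n", sequential(demoDelays))
	fmt.Printf("concurrent: %v\n", concurrent(demoDelays))
}
```

With these delays, the sequential run should take at least 90ms (the sum), while the concurrent run should finish shortly after the longest delay of 40ms.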


Original article: https://www.cnblogs.com/liuhe688/p/9597763.html
