汽车之家汽车品牌Logo信息抓取 DotnetSpider实战[三]

发布时间：2023-09-06 01:56责任编辑：董明明关键词：暂无标签

一、正题前的唠叨

第一篇实战博客，阅读量1000+，第二篇，阅读量200+，两篇文章相差近5倍，这个差异真的令我很费劲，截止今天，我一直在思考为什么会有这么大的差距，是因为干货变少了，还是什么原因，一直没想清楚，如果有读者发现问题，可以评论写下大家的观点，当出现这样的差距会是什么原因，谢谢大家。

二、分析汽车之家品牌Logo页面

2.1分析页面结构

首先我们打开汽车之家品牌Logo选择页 https://car.m.autohome.com.cn/，我们以华颂为例，实际上我们就是需要将class是item的里面的img的src（图片路径），和strong里面的text（品牌）获取就行了，大家可以看到，这个其实很简单，相比上次我们获取页面，获取接口数据简单多了，为什么要单独拿一个作为一篇文章呢，就是因为这个地方还涉及到一个文件下载，这一块之前都没有提到过。

2.2页面中的坑

最开始抓取的时候，我发现很多地方src都是空，我就很纳闷为什么会这样，后来断点调试后才发现，汽车之家Logo图片在页面还未划到此处的时候，img是不会加载的，只是占一个位置在那，等到滚动条滚到哪，哪的图片就会加载，所以此处抓取img的路径时需要判断一下

三、动手开发

3.1准备Processor

private class GetLogoInfoProcessor : BasePageProcessor //获取Logo信息 ???????{ ???????????public GetLogoInfoProcessor() ???????????{ ???????????} ???????????protected override void Handle(Page page) ???????????{ ???????????????List<LogoInfoModel> logoInfoList = new List<LogoInfoModel>(); ???????????????var logoInfoNodes = page.Selectable.XPath(".//div[@id=‘div_ListBrand‘]//div[@class=‘item‘]").Nodes(); ???????????????foreach (var logoInfo in logoInfoNodes) ???????????????{ ???????????????????LogoInfoModel model = new LogoInfoModel(); ???????????????????model.BrandName = logoInfo.XPath("./strong").GetValue(); ???????????????????model.ImgPath = logoInfo.XPath("./img/@src").GetValue(); ???????????????????if (model.ImgPath == null) ???????????????????{ ???????????????????????model.ImgPath = logoInfo.XPath("./img/@data-src").GetValue(); ???????????????????} ???????????????????if (model.ImgPath.IndexOf("https") == -1) ???????????????????{ ???????????????????????model.ImgPath = "https:" + model.ImgPath; ???????????????????} ???????????????????logoInfoList.Add(model); ???????????????????//page.AddTargetRequest(model.ImgPath); //Site设置DownloadFiles为TRUE就可以自动下载文件 ???????????????} ???????????????page.AddResultItem("LogoInfoList", logoInfoList); ???????????} ???????}

3.2准备Pipeline

这个地方我没用他原用的下载方法，自己写了一个简单的下载方法，因为我感觉他的下载方式直接down下来，不是很符合我的业务逻辑

private class PrintLogInfoPipe : BasePipeline ???????{ ???????????public override void Process(IEnumerable<ResultItems> resultItems, ISpider spider) ???????????{ ???????????????foreach (var resultItem in resultItems) ???????????????{ ???????????????????var logoInfoList = resultItem.GetResultItem("LogoInfoList") as List<LogoInfoModel>; ???????????????????foreach (var logoInfo in logoInfoList) ???????????????????{ ???????????????????????Console.WriteLine($"brand:{logoInfo.BrandName} path:{logoInfo.ImgPath}"); ???????????????????????SaveFile(logoInfo.ImgPath, logoInfo.BrandName); ???????????????????} ???????????????} ???????????} ???????????private void SaveFile(string url, string filename) ???????????{ ???????????????HttpRequestMessage httpRequestMessage = new HttpRequestMessage(); ???????????????httpRequestMessage.RequestUri = new Uri(url); ???????????????httpRequestMessage.Method = HttpMethod.Get; ???????????????HttpClient httpClient = new HttpClient(); ???????????????var httpResponse = httpClient.SendAsync(httpRequestMessage); ???????????????var intervalPath = new Uri(url); ???????????????string filePath = Environment.CurrentDirectory + "/img/"; ???????????????if (!File.Exists(filePath)) ???????????????{ ???????????????????try ???????????????????{ ???????????????????????string folder = Path.GetDirectoryName(filePath); ???????????????????????if (!string.IsNullOrWhiteSpace(folder)) ???????????????????????{ ???????????????????????????if (!Directory.Exists(folder)) ???????????????????????????{ ???????????????????????????????Directory.CreateDirectory(folder); ???????????????????????????} ???????????????????????} ???????????????????????File.WriteAllBytes(filePath + filename + ".jpg", httpResponse.Result.Content.ReadAsByteArrayAsync().Result); ???????????????????} ???????????????????catch ???????????????????{ ???????????????????} ???????????????} ???????????????httpClient.Dispose(); ???????????} ???????}

存储实体类

private class LogoInfoModel{ ????public string BrandName { get; set; } ????public string ImgPath { get; set; }}

3.3构造爬虫

static void Main(string[] args) ???????{ ???????????var site = new Site ???????????{ ???????????????CycleRetryTimes = 1, ???????????????SleepTime = 200, ???????????????//DownloadFiles = true, ????DotNetSpider中设置是否下载文件 ???????????????Headers = new Dictionary<string, string>() ???????????????{ ???????????????????{ "Accept","text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8" }, ???????????????????{ "Cache-Control","no-cache" }, ???????????????????{ "Connection","keep-alive" }, ???????????????????{ "Content-Type","application/x-www-form-urlencoded; charset=UTF-8" }, ???????????????????{ "User-Agent","Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.139 Safari/537.36"} ???????????????} ???????????}; ???????????List<Request> resList = new List<Request>(); ???????????Request res = new Request(); ???????????res.Url = "https://car.m.autohome.com.cn/"; ???????????res.Method = System.Net.Http.HttpMethod.Get; ???????????resList.Add(res); ???????????var spider = Spider.Create(site, new QueueDuplicateRemovedScheduler(), new GetLogoInfoProcessor()) ????????????????.AddStartRequests(resList.ToArray()) ???????????????.AddPipeline(new PrintLogInfoPipe()); ???????????spider.ThreadNum = 1; ???????????spider.Run(); ???????????Console.Read(); ???????}

3.4 Site中DownloadFiles 源码分析

源代码中HttpClientDownloader中源代码会自动去判断Site中的DownloadFiles是否允许下载文件，默认是false，如果不将DownloadFiles的值设置为true，那么对于非字符串格式的接口数据，直接会被忽略，如果大家感兴趣，可以将我代码中的两行注释取消，那么就可以看到DotnetSpider中的下载方式

四、执行结果

本次执行的结果，已经上传到bilibili中，大家有兴趣可以打开围观一下

https://www.bilibili.com/video/av24022630/

五、总结

这次我们将数据的抓取以及文件的下载进行了一个小综合，也介绍了DotnetSpider原生的下载方式，以及我自己写的一个下载方法，大家如果遇到类似的需求可以自己选择符合自己业务逻辑的方法，希望这篇文章能够帮助到大家，如果觉得哪里写的不好，欢迎拍大板砖

三次博文源代码我已经上传Github，感兴趣可以直接下载下来 https://github.com/FunnyBoyDeng/SpiderAutoHome

六、下期没有预告

至于下期我还没想好爬什么，欢迎大家留言说自己想要爬的东西

2018-05-27

汽车之家汽车品牌Logo信息抓取 DotnetSpider实战[三]

原文地址：https://www.cnblogs.com/FunnyBoy/p/9097377.html

汽车之家汽车品牌Logo信息抓取 DotnetSpider实战[三]

一、正题前的唠叨

二、分析汽车之家品牌Logo页面

2.1分析页面结构

2.2页面中的坑

三、动手开发

3.1准备Processor

3.2准备Pipeline

3.3构造爬虫

3.4 Site中DownloadFiles 源码分析

四、执行结果

五、总结

六、下期没有预告

知识推荐