Ajax Analysis: Scraping Street-Snap Photos from Toutiao (今日头条)

Published: 2023-09-06 02:14 | Editor: 沈小雨 | Keywords: none

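　　The raw data returned by a plain requests call is sometimes useless: the page source fetched this way may be only a few lines long, which is clearly not what we want. When that happens, we can check whether the page relies on Ajax requests. With Ajax, the page first loads a bare shell, then requests an interface on the server to fetch the data, and only then renders that data on the page.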
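One rough way to tell whether a page is filled in by a later request is to check whether text you can see in the browser is present in the raw HTML at all. The sketch below is only a heuristic; the helper name `appears_in_raw_html`, the sample HTML, and the sample text are all made up for illustration, and the HTML string stands in for what `requests.get(url).text` might return.

```python
def appears_in_raw_html(html: str, expected_text: str) -> bool:
    """Return True if text seen in the browser is already in the raw HTML."""
    return expected_text in html


# Stand-in for the body of a plain HTTP response from an Ajax-driven page:
# an almost empty shell plus a script tag (hypothetical content).
raw_html = """<html><head><title>Feed</title></head>
<body><div id="app"></div><script src="/static/app.js"></script></body></html>"""

# Text that is visible in the rendered page in the browser (hypothetical).
visible_text = "Street style photos"

if not appears_in_raw_html(raw_html, visible_text):
    print("Not in the raw HTML: the data is probably loaded by a later request.")
```

If the text is missing, the next step is to look for the data request in the browser's developer tools, as the rest of this post does.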

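　　Ajax is not a programming language but a technique. It lets JavaScript exchange data with the server and re-render part of the page without refreshing the whole page. When we browse Weibo on a phone, for example, scrolling down for more posts shows a small loading spinner; that is an Ajax request at work.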
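The "load more as you scroll" behaviour can be pictured as the page calling the same interface again with a different query parameter each time. The sketch below only builds such URLs and sends nothing; the endpoint `https://example.com/api/feed`, the page size, and the parameter names are assumptions chosen for illustration.

```python
from urllib.parse import urlencode

# Hypothetical endpoint and page size, used only for illustration.
BASE_URL = "https://example.com/api/feed?"
PAGE_SIZE = 20


def page_url(page_index: int) -> str:
    """Build the URL for the n-th batch of results (0-based)."""
    params = {
        "offset": page_index * PAGE_SIZE,  # where this batch starts
        "count": PAGE_SIZE,                # how many items to return
        "format": "json",
    }
    return BASE_URL + urlencode(params)


# The first three batches differ only in their offset value.
for i in range(3):
    print(page_url(i))
```

A scraper can reproduce the scrolling by requesting these URLs directly, one per batch.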

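　　So to crawl pages like this, we need to understand how Ajax works. Before starting, make sure the required libraries are installed; the script uses the following imports.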

import requests
from urllib.parse import urlencode
from requests import codes
import os
from hashlib import md5
from multiprocessing.pool import Pool


# After opening the Toutiao search page and the browser's developer tools, we
# locate the Ajax request and look at the parameters in its URL. Scrolling down
# to load more results shows that only offset changes; the other parameters stay
# the same. offset is the position of the first result in each batch, and it
# grows by the number of items per request (count = 20). So offset is the
# parameter we pass in.
def get_page(offset):
    params = {
        'offset': offset,
        'format': 'json',
        'keyword': '街拍',
        'autoload': 'true',
        'count': '20',
        'cur_tab': '1',
        'from': 'search_tab'
    }
    base_url = 'https://www.toutiao.com/search_content/?'
    # Build the full request URL from the base URL and the encoded parameters.
    url = base_url + urlencode(params)
    try:
        resp = requests.get(url)
        if codes.ok == resp.status_code:
            return resp.json()
    except requests.ConnectionError:
        return None


# A generator that extracts the title and the image links from each item
# and yields them together.
def get_images(json):
    # get_page returns None when the request fails, so check for that first.
    if json and json.get('data'):
        data = json.get('data')
        for item in data:
            title = item.get('title')
            # Some items have no image_list; treat that as an empty list.
            images = item.get('image_list') or []
            for image in images:
                yield {
                    'image': 'https:' + image.get('url'),
                    'title': title
                }


# Save one image. A folder named after the item's title is created with the os
# module, the image link is requested, and the binary response is written to
# disk. The file name is the MD5 hash of the image content, so identical images
# get the same name and are not saved twice.
def save_image(item):
    if not os.path.exists(item.get('title')):
        os.makedirs(item.get('title'))
    try:
        resp = requests.get(item.get('image'))
        if codes.ok == resp.status_code:
            file_path = '{0}/{1}.{2}'.format(item.get('title'), md5(resp.content).hexdigest(), 'jpg')
            if not os.path.exists(file_path):
                with open(file_path, 'wb') as f:
                    f.write(resp.content)
                print('Downloaded image path is %s' % file_path)
            else:
                print('Already Downloaded', file_path)
    except requests.ConnectionError:
        print('Failed to Save Image, item %s' % item)


# The main function: fetch the results for one offset, then save every image
# found in them. The list of offsets is built below.
def main(offset):
    json = get_page(offset)
    for item in get_images(json):
        print(item)
        save_image(item)


GROUP_START = 0
GROUP_END = 20

# A process pool runs main over the list of offsets with map. pool.close()
# means no more tasks will be submitted, and pool.join() waits for all worker
# processes to finish before continuing, which is the end of the crawler.
if __name__ == '__main__':
    pool = Pool()
    groups = ([x * 20 for x in range(GROUP_START, GROUP_END + 1)])
    pool.map(main, groups)
    pool.close()
    pool.join()
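The file-naming idea in save_image can be tried without any network access. The sketch below is a minimal illustration and not the script's own code: the helper name `save_once` and the byte strings are made up, and a temporary directory stands in for the title folder. It shows that naming a file after the MD5 digest of its content makes a second save of the same bytes a no-op.

```python
import os
import tempfile
from hashlib import md5


def save_once(folder: str, content: bytes, ext: str = "jpg") -> bool:
    """Write content into folder, named by its MD5 hex digest.

    Returns True if a new file was written, False if it already existed.
    """
    os.makedirs(folder, exist_ok=True)
    file_name = "{0}.{1}".format(md5(content).hexdigest(), ext)
    file_path = os.path.join(folder, file_name)
    if os.path.exists(file_path):
        return False
    with open(file_path, "wb") as f:
        f.write(content)
    return True


with tempfile.TemporaryDirectory() as tmp:
    first = save_once(tmp, b"fake image bytes")
    second = save_once(tmp, b"fake image bytes")   # same bytes, same file name
    third = save_once(tmp, b"other image bytes")   # different bytes, new file
    saved_files = sorted(os.listdir(tmp))
    print(first, second, third, len(saved_files))  # True False True 2
```

Here MD5 serves only as a content fingerprint for deduplication; it does not encrypt anything.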

Original article: https://www.cnblogs.com/houziaipangqi/p/9649131.html
