Images on the page are loaded through a JS onload event:
<p><img src="//img.jandan.net/img/blank.gif" onload="jandan_load_img(this)" /><span class="img-hash">Ly93eDEuc2luYWltZy5jbi9tdzYwMC8wMDd1ejNLN2x5MWZ6NmVub3ExdHhqMzB1MDB1MGFkMC5qcGc=</span></p>
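As a minimal sketch of how the hash can be pulled out of markup like the snippet above, the following uses only the standard library's `html.parser` (the scraper in this article uses lxml with an XPath expression instead). The class name `ImgHashParser` and the variable names are hypothetical, chosen for illustration.

```python
from html.parser import HTMLParser


class ImgHashParser(HTMLParser):
    """Collect the text of every <span class="img-hash"> element."""

    def __init__(self):
        super().__init__()
        self.in_hash_span = False
        self.hashes = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) tuples
        if tag == "span" and ("class", "img-hash") in attrs:
            self.in_hash_span = True

    def handle_endtag(self, tag):
        if tag == "span":
            self.in_hash_span = False

    def handle_data(self, data):
        if self.in_hash_span:
            self.hashes.append(data.strip())


# The markup from the page, reproduced as a string
snippet = (
    '<p><img src="//img.jandan.net/img/blank.gif" '
    'onload="jandan_load_img(this)" />'
    '<span class="img-hash">Ly93eDEuc2luYWltZy5jbi9tdzYwMC8wMDd1ejNLN2x5'
    'MWZ6NmVub3ExdHhqMzB1MDB1MGFkMC5qcGc=</span></p>'
)

parser = ImgHashParser()
parser.feed(snippet)
img_hashes = parser.hashes
print(img_hashes)
```

The `src` attribute only points at a blank placeholder GIF, so the real image location has to come from the `img-hash` text.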
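Find the corresponding JS function jandan_load_img in the Sources files.
By debugging the JS, you can see that Ly93eDEuc2luYWltZy5jbi9tdzYwMC8wMDd1ejNLN2x5MWZ6NmVub3ExdHhqMzB1MDB1MGFkMC5qcGc= is passed into the function jdugRtgCtw78dflFjGXBvN6TBHAoKvZ7xu, where base64_decode yields the img path.
A regular expression then replaces the size segment of the img path (written as (/W+) in the original notes; the scraper below matches it with mw\d+) with large.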
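The two steps above can be tried by hand on the hash from the page. This is a small sketch rather than the site's actual JS logic: the pattern `mw\d+` and the `http:` prefix are taken from the scraper code in this article, and the variable names are illustrative.

```python
import base64
import re

img_hash = (
    "Ly93eDEuc2luYWltZy5jbi9tdzYwMC8wMDd1ejNLN2x5"
    "MWZ6NmVub3ExdHhqMzB1MDB1MGFkMC5qcGc="
)

# Step 1: Base64-decode the hash into a protocol-relative image path
img_path = base64.b64decode(img_hash).decode("utf-8")

# Step 2: swap the size segment (here "mw600") for "large"
large_path = re.sub(r"mw\d+", "large", img_path)

# Prepend a scheme so the path becomes a complete URL
full_url = "http:" + large_path

print(img_path)
print(full_url)
```

For this hash the decoded path is //wx1.sinaimg.cn/mw600/007uz3K7ly1fz6enoq1txj30u00u0ad0.jpg, and the final URL is http://wx1.sinaimg.cn/large/007uz3K7ly1fz6enoq1txj30u00u0ad0.jpg.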
The scraping code is as follows:
import base64
import re
import requests
from concurrent.futures import ThreadPoolExecutor
from random import choice
from lxml import etree
from user_agent_list import USER_AGENTS

headers = {'user-agent': choice(USER_AGENTS)}


def fetch_url(url):
    '''
    :param url: page URL
    :return: html text, or None if the request failed
    '''
    try:
        r = requests.get(url, headers=headers)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        if r.status_code in [200, 201]:
            return r.text
    except Exception as e:
        print(e)


def downloadone(url):
    html = fetch_url(url)
    if html is None:  # skip pages that failed to load
        return
    data = etree.HTML(html)
    img_hash_list = data.xpath('//*[@class="img-hash"]/text()')
    for img_hash in img_hash_list:
        # Base64-decode the hash to get the protocol-relative image path
        img_path = 'http:' + bytes.decode(base64.b64decode(img_hash))
        # Swap the size segment (e.g. mw600) for the full-size "large"
        img_path = re.sub(r'mw\d+', 'large', img_path)
        img_name = img_path.rsplit('/', 1)[1]
        # The jiandan/ directory must already exist
        with open('jiandan/' + img_name, 'wb') as f:
            r = requests.get(img_path)
            f.write(r.content)


def main():
    url_list = []
    for page in range(1, 44):
        url = 'http://jandan.net/ooxx/page-{}'.format(page)
        url_list.append(url)
    # Download pages concurrently with 4 worker threads
    with ThreadPoolExecutor(4) as executor:
        executor.map(downloadone, url_list)


if __name__ == '__main__':
    main()
Jandan.net scraper: reverse-engineering the JS to resolve img paths
Original article: https://www.cnblogs.com/frank-shen/p/10269363.html