Tool: python3
Goal: organize the code into functions, each responsible for one task, and crawl any number of HTML pages.
New syntax learned: with open ... as
Besides being more elegant, with also cleanly handles exceptions raised inside the context it manages.
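A minimal sketch of what `with open(...) as` buys you (the file name `demo.txt` is just an illustration): the file is guaranteed to be closed when the block exits, even if the body raises, with no explicit try/finally.

```python
# The with statement closes the file automatically on exiting the block,
# whether the block finished normally or raised an exception.
with open("demo.txt", "w") as f:
    f.write("hello")
print(f.closed)  # True: closed automatically

# Equivalent to the classic pattern the with statement replaces:
f = open("demo.txt", "w")
try:
    f.write("hello")
finally:
    f.close()
```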
# coding:utf-8

import urllib.request
import urllib.parse


def loadPage(fullurl, filename):
    """Send a request to the url and return the server's response."""
    ua_headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36"}
    print("Downloading " + filename)

    request = urllib.request.Request(fullurl, headers=ua_headers)
    response = urllib.request.urlopen(request)
    return response.read()


def writePage(html, filename):
    """
    Write the html content to a local file.
    html: the response body returned by the server (bytes)
    """
    print("Writing " + filename)

    # Create the file and store the html; decode the bytes before writing
    # so the file holds the page text rather than a bytes repr
    with open(filename, "w", encoding="utf-8") as f:
        f.write(html.decode("utf-8"))


def tiebaSpider(url, beginpage, endpage):
    """
    Tieba crawler scheduler: builds and dispatches the url for each page.
    url: the fixed prefix of the tieba url
    beginpage: first page to fetch
    endpage: last page to fetch
    """
    # Build the url and output filename for each page
    for page in range(beginpage, endpage + 1):
        pn = (page - 1) * 50
        fullurl = url + "&pn=" + str(pn)
        filename = "page" + str(page) + ".html"

        html = loadPage(fullurl, filename)
        writePage(html, filename)
    print("Done!")


if __name__ == "__main__":
    kw = input("Enter the tieba name to crawl: ")
    beginPage = int(input("Enter the first page: "))
    endPage = int(input("Enter the last page: "))

    url = "http://tieba.baidu.com/f?"
    kw = urllib.parse.urlencode({"kw": kw})

    url = url + kw

    tiebaSpider(url, beginPage, endPage)
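The scheduler maps each page number to tieba's `pn` offset (50 posts per page: page 1 is pn=0, page 2 is pn=50, and so on). The url construction can be checked without touching the network; the keyword "python" here is just a sample value standing in for the interactive input.

```python
import urllib.parse

# Hypothetical sample keyword; the real script reads it from input()
kw = urllib.parse.urlencode({"kw": "python"})
url = "http://tieba.baidu.com/f?" + kw

# Tieba paginates by post offset: page 1 -> pn=0, page 2 -> pn=50, ...
page_urls = [url + "&pn=" + str((page - 1) * 50) for page in range(1, 4)]
for u in page_urls:
    print(u)
# http://tieba.baidu.com/f?kw=python&pn=0
# http://tieba.baidu.com/f?kw=python&pn=50
# http://tieba.baidu.com/f?kw=python&pn=100
```

Using urlencode for the keyword matters because tieba names are usually Chinese, and urlencode percent-escapes them into a form that is safe to put in a query string.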
Crawler (GET): crawling multiple pages of HTML
Original post: https://www.cnblogs.com/gaoquanquan/p/9089738.html