
Parsing Web Pages with BeautifulSoup

Published: 2023-09-06 02:09 | Editor: 沈小雨 | Keywords: none
from bs4 import BeautifulSoup
import requests

# Listing page that links to the individual articles
url = 'http://dangjian.gmw.cn/node_11940.htm'
html = requests.get(url).content
soup = BeautifulSoup(html, 'lxml')
# prettify() pretty-prints the parsed document
# print(soup.prettify())
# print(soup.find_all('span', class_="channel-newsTime"))

# Collect every article link inside the news-group lists
resultSet = soup.find_all('ul', class_="channel-newsGroup")
urls = set()
for rs in resultSet:
    # url = rs.a['href']
    hrefs = rs.find_all('a')
    for href in hrefs:
        url = href['href']
        if url.startswith("http"):
            urls.add(url)
        else:
            # Relative link: prefix it with the site root
            urls.add("http://dangjian.gmw.cn/" + url)
print(urls)

# Fetch each article and print its title and body text
for url in urls:
    html = requests.get(url).content
    soup = BeautifulSoup(html, 'lxml')
    title = soup.find(id="articleTitle").string
    # parts = soup.find(id="contentMain")
    parts = soup.select("div #contentMain > p")
    content = ""
    for part in parts:
        # get_text() also covers paragraphs with nested tags, where
        # .string is None and str() would append the text "None"
        content = content + part.get_text()
    print(title)
    print(content)
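
The link-collection step in the script above turns relative links into absolute ones by prefixing the site root, which would produce a double slash for a root-relative href such as "/node_11941.htm". A minimal sketch of an alternative is to resolve each href against the listing page with the standard library's urljoin. The helper name normalize_links and the sample href values are hypothetical, chosen only for illustration, and skipping "#" and "javascript:" links is an assumption about what the page may contain.

```python
from urllib.parse import urljoin

# Assumed base: the listing page URL used in the script above.
BASE_URL = "http://dangjian.gmw.cn/node_11940.htm"


def normalize_links(hrefs, base=BASE_URL):
    """Return the set of absolute URLs for the given href values."""
    absolute = set()
    for href in hrefs:
        href = href.strip()
        # Skip empty values and links that have no fetchable target.
        if not href or href.startswith(("#", "javascript:")):
            continue
        # urljoin resolves relative and root-relative paths against
        # the base and leaves absolute URLs unchanged.
        absolute.add(urljoin(base, href))
    return absolute


# Hypothetical href values, for illustration only.
sample = [
    "2018-08/09/content_1.htm",
    "/node_11941.htm",
    "http://example.com/a.htm",
    "#top",
]
links = normalize_links(sample)
```

In the original loop, this would amount to calling urljoin(base, href['href']) in place of the startswith("http") check, with the set still removing duplicates.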

Original article: https://www.cnblogs.com/cord/p/9452950.html
