
A Complete Term Project: Scraping Data from the "数据观" Official Website

Published: 2023-09-06 01:21 | Editor: 白小东 | Keywords: none

1. Choose a topic that interests you.

Scraping data from the "数据观" official website; the target page is http://www.cbdio.com/node_2568.htm

2. Scrape the relevant data from the web.

import requests
from bs4 import BeautifulSoup

url = 'http://www.cbdio.com/node_2568.htm'
res = requests.get(url)
res.encoding = 'utf-8'
soup = BeautifulSoup(res.text, 'html.parser')
for items in soup.select('li'):
    if len(items.select('.cb-media-title')) > 0:
        title = items.select('.cb-media-title')[0].text  # headline
        url1 = items.select('a')[0]['href']
        url2 = 'http://www.cbdio.com/{}'.format(url1)  # article link
        resd = requests.get(url2)
        resd.encoding = 'utf-8'
        soupd = BeautifulSoup(resd.text, 'html.parser')
        source = soupd.select('.cb-article-info')[0].text.strip()  # source/byline
        content = soupd.select('.cb-article')[0].text  # article body
        print("################################################################################")
        print('Title:', title, '\tLink:', url2, source)
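The listing parse above can be checked without network access. The sketch below runs the same selection logic against a hypothetical HTML fragment: the markup and the `newsinfo/12345.htm` / `newsinfo/67890.htm` paths are invented for illustration; only the `.cb-media-title` class and the relative-href pattern come from the original code.

```python
from bs4 import BeautifulSoup

# Hypothetical listing fragment mimicking the cbdio.com page structure.
html = """
<ul>
  <li><a href="newsinfo/12345.htm"><span class="cb-media-title">Sample headline</span></a></li>
  <li><a href="newsinfo/67890.htm"><span class="cb-media-title">Another headline</span></a></li>
  <li><a href="node_2568.htm">pager link without a title span</a></li>
</ul>
"""

soup = BeautifulSoup(html, 'html.parser')
results = []
for li in soup.select('li'):
    titles = li.select('.cb-media-title')
    if titles:  # skip <li> items that are not article entries
        title = titles[0].text
        link = 'http://www.cbdio.com/{}'.format(li.select('a')[0]['href'])
        results.append((title, link))

print(results)
```

Note how the third `<li>` (a pager link) is filtered out exactly as in the crawl loop, because it has no `.cb-media-title` child.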

3. Analyze the text and generate a word cloud.

import requests
import jieba
from bs4 import BeautifulSoup
from wordcloud import WordCloud
import matplotlib.pyplot as plt

url = 'http://www.cbdio.com/node_2568.htm'
res = requests.get(url)
res.encoding = 'utf-8'
soup = BeautifulSoup(res.text, 'html.parser')
contentls = []
for item in soup.select('li'):
    if len(item.select('.cb-media-title')) > 0:
        url1 = item.select('a')[0]['href']
        url2 = 'http://www.cbdio.com/{}'.format(url1)
        resd = requests.get(url2)
        resd.encoding = 'utf-8'
        soupd = BeautifulSoup(resd.text, 'html.parser')
        cont = soupd.select('.cb-article')[0].text  # article body
        contentls.append(cont)
print(contentls)

content = ''.join(contentls)  # merge all article bodies into one string for analysis
words = jieba.lcut(content)
counts = {}
for word in words:
    if len(word) == 1:  # skip single-character tokens (mostly particles and punctuation)
        continue
    counts[word] = counts.get(word, 0) + 1
items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)
for i in range(10):
    word, count = items[i]
    print("{:<5}{:>2}".format(word, count))

# word-cloud generation
cy = WordCloud(font_path='msyh.ttc').generate(content)
plt.imshow(cy, interpolation='bilinear')
plt.axis("off")
plt.show()
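The hand-rolled frequency dictionary above can also be written with the standard library's `collections.Counter`. A minimal sketch, with an invented token list standing in for the output of `jieba.lcut(content)`:

```python
from collections import Counter

# Hypothetical tokens standing in for jieba.lcut(content);
# single-character tokens are dropped, as in the loop above.
words = ['大数据', '的', '数据', '交易', '数据', '大数据', '了', '政府']
counts = Counter(w for w in words if len(w) > 1)

# most_common() replaces the manual sort-by-value step.
for word, count in counts.most_common(3):
    print("{:<5}{:>2}".format(word, count))
```

`Counter.most_common(n)` returns the n highest-count pairs already sorted, so the explicit `list(...)`/`sort(...)` dance is unnecessary.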

4. Interpret the results of the text analysis.

The output above shows that the main topics on this Chinese big-data portal are data and data trading, along with government, enterprises, and experts.

5. Write a complete blog post with the source code, the scraped data, and the analysis results, forming a presentable deliverable.

import requests
import jieba
import pandas
import sqlite3
from bs4 import BeautifulSoup
from wordcloud import WordCloud
import matplotlib.pyplot as plt

def getTheContent(url1):
    """Fetch one article page and return its fields as a dict."""
    res = requests.get(url1)
    res.encoding = 'utf-8'
    soup = BeautifulSoup(res.text, 'html.parser')
    item = {}
    item['title'] = soup.select('.cb-article-title')[0].text  # headline
    item['url'] = url1  # link
    item['source'] = soup.select('.cb-article-info')[0].text.strip()  # source/byline
    item['content'] = soup.select('.cb-article')[0].text  # article body
    return item

def getOnePage(pageurl):
    """Fetch one listing page and return a list of article dicts."""
    res = requests.get(pageurl)
    res.encoding = 'utf-8'
    soup = BeautifulSoup(res.text, 'html.parser')
    itemls = []
    for item in soup.select('li'):
        if len(item.select('.cb-media-title')) > 0:
            url1 = item.select('a')[0]['href']
            url2 = 'http://www.cbdio.com/{}'.format(url1)
            itemls.append(getTheContent(url2))
    return itemls

# jieba word-frequency statistics
url = 'http://www.cbdio.com/node_2568.htm'
res = requests.get(url)
res.encoding = 'utf-8'
soup = BeautifulSoup(res.text, 'html.parser')
contentls = []
for item in soup.select('li'):
    if len(item.select('.cb-media-title')) > 0:
        url1 = item.select('a')[0]['href']
        url2 = 'http://www.cbdio.com/{}'.format(url1)
        resd = requests.get(url2)
        resd.encoding = 'utf-8'
        soupd = BeautifulSoup(resd.text, 'html.parser')
        cont = soupd.select('.cb-article')[0].text  # article body
        contentls.append(cont)
print(contentls)

content = ''.join(contentls)  # merge all article bodies for analysis
words = jieba.lcut(content)
counts = {}
for word in words:
    if len(word) == 1:  # skip single-character tokens
        continue
    counts[word] = counts.get(word, 0) + 1
items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)
for i in range(10):
    word, count = items[i]
    print("{:<5}{:>2}".format(word, count))

# word-cloud generation
cy = WordCloud(font_path='msyh.ttc').generate(content)
plt.imshow(cy, interpolation='bilinear')
plt.axis("off")
plt.show()

# Excel export and database storage
itemtotal = []
for i in range(2, 3):  # only the first listing page is crawled here
    listurl = 'http://www.cbdio.com/node_2568.htm'
    itemtotal.extend(getOnePage(listurl))
df = pandas.DataFrame(itemtotal)
df.to_excel('BigDataItems.xlsx')
with sqlite3.connect('BigDataItems.sqlite') as db:
    df.to_sql('BigDataItems', con=db)
    print('Export complete!')
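The database step at the end can be verified with a quick standard-library round trip. This sketch uses an in-memory database and invented rows shaped like the item dicts built by `getTheContent()` (the titles, URLs, and table schema here are illustrative, not the real scraped data):

```python
import sqlite3

# Hypothetical rows matching the fields of the item dicts (title, url, source, content).
rows = [
    ('Title A', 'http://www.cbdio.com/a.htm', 'source A', 'body A'),
    ('Title B', 'http://www.cbdio.com/b.htm', 'source B', 'body B'),
]

with sqlite3.connect(':memory:') as db:
    db.execute('CREATE TABLE BigDataItems (title TEXT, url TEXT, source TEXT, content TEXT)')
    db.executemany('INSERT INTO BigDataItems VALUES (?, ?, ?, ?)', rows)
    # Read back to confirm the insert, mirroring what df.to_sql() writes on disk.
    n = db.execute('SELECT COUNT(*) FROM BigDataItems').fetchone()[0]
    print(n)
```

Using `:memory:` keeps the check self-contained; pointing `connect()` at `BigDataItems.sqlite` instead would query the file the main script produces.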


Original post: http://www.cnblogs.com/huanglinxin/p/7732885.html
