Crawling JSON content to generate a word cloud (stepping in every pit along the way)

Published: 2023-09-06 01:51 | Editor: 傅花花 | Keywords: js, json

This post crawls the titles from the first n pages of "前端" (front-end) search results on Juejin. Analyzing the titles shows what people currently care about in front-end development and what the recent hot topics are.

  1. Import the libraries
    import requests
    import re
    from bs4 import BeautifulSoup
    import json
    import urllib
    import jieba
    from wordcloud import WordCloud
    import matplotlib.pyplot as plt
    import numpy as np
    import xlwt
    import jieba.analyse
    from PIL import Image, ImageSequence
  2. Fetch the JSON
    # fetch the dynamic page's JSON
    response = urllib.request.urlopen(ajaxUrl)
    ajaxres = response.read().decode('utf-8')
    data = json.loads(ajaxres)  # parse the JSON string straight into a dict
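As an aside, since requests is already imported, the fetch-and-parse step can be collapsed to `requests.get(ajaxUrl).json()`; under the hood that is just `json.loads` on the decoded body. A minimal sketch, using a made-up payload with the same `d`/`title` shape as the real response:

```python
import json

# requests shortcut: data = requests.get(ajaxUrl).json()
# which is equivalent to json.loads on the decoded body:
raw_body = '{"d": [{"title": "前端性能优化"}], "m": "ok"}'  # stand-in payload (assumed shape)
data = json.loads(raw_body)
print(data["d"][0]["title"])  # -> 前端性能优化
```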
  3. Loop over the titles and write them to a file
    for page in range(0, 25):
        ajaxUrl = ajaxUrlBegin + str(page) + ajaxUrlLast
        response = urllib.request.urlopen(ajaxUrl)
        data = json.loads(response.read().decode('utf-8'))
        with open('finally.txt', 'a', encoding='utf-8') as f:
            for item in range(0, 19):
                result = data['d'][item]['title']
                print(result)
                f.write(result + '\n')
  4. Generate the word cloud
    # word-frequency counting
    with open('finally.txt', 'r', encoding='utf-8') as f:
        text = f.read()  # avoid shadowing the built-in str
    stringList = list(jieba.cut(text))
    symbol = {"/", "(", ")", " ", ";", "!", "、", ":", "+", "?", " ", ")", "(", "?", ",", "之", "你", "了", "吗", "】", "【"}
    stringSet = set(stringList) - symbol
    title_dict = {}
    for word in stringSet:
        title_dict[word] = stringList.count(word)
    print(title_dict)
    # export to Excel
    wbk = xlwt.Workbook(encoding='utf-8')
    sheet = wbk.add_sheet("wordCount")  # sheet name
    k = 0
    for word, count in title_dict.items():
        sheet.write(k, 0, label=word)
        sheet.write(k, 1, label=count)
        k += 1
    wbk.save('前端数据.xls')  # save the word counts as an .xls file
    # generate the word cloud from an image mask
    font = r'C:\Windows\Fonts\simhei.ttf'
    content = ' '.join(title_dict.keys())
    image = np.array(Image.open('cool.jpg'))
    wordcloud = WordCloud(background_color='white', font_path=font, mask=image,
                          width=1000, height=860, margin=2).generate(content)
    # show and save the generated word cloud
    plt.imshow(wordcloud)
    plt.axis("off")
    plt.show()
    wordcloud.to_file('c-cool.jpg')
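A side note on the frequency count: calling `stringList.count(word)` once per distinct token rescans the whole list each time, which is O(n²). `collections.Counter` from the standard library builds the same dict in one pass; a sketch with stand-in tokens (the real input would be `list(jieba.cut(text))`):

```python
from collections import Counter

tokens = ["前端", "性能", "前端", "Vue", "性能", "前端"]  # stand-in for jieba's output
symbol = {"/", "(", ")", " "}  # abbreviated stop-symbol set
title_dict = Counter(t for t in tokens if t not in symbol)
print(title_dict["前端"])  # -> 3
```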
  5. One project, n pits, and every pit takes forever to climb out of
  • Getting the actual content of a dynamic page

   When crawling a dynamic page, the titles cannot be found directly in the HTML; you have to dig through the Network tab of the browser's developer tools. What you find there is the JSON data returned by the page's ajax requests.
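Once the XHR request is spotted in the Network tab, its URL typically carries an obvious `page` parameter. The paged URLs used later in this post can then be built like this:

```python
# build one URL per result page from the endpoint found in the Network tab
ajaxUrlBegin = 'https://search-merger-ms.juejin.im/v1/search?query=%E5%89%8D%E7%AB%AF&page='
ajaxUrlLast = '&raw_result=false&src=web'
urls = [ajaxUrlBegin + str(i) + ajaxUrlLast for i in range(3)]
print(urls[1])  # page 1's URL
```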

  • Getting a specific value out of the JSON

    After fetching the JSON data (via the URL), we take a look at it and...

(wtf, what even is this???)

At this point the Chrome extension JSONView comes to the rescue; install it and the response finally speaks human.
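If you'd rather stay out of the browser entirely, `json.dumps` can pretty-print the parsed response in the terminal; `ensure_ascii=False` keeps the Chinese titles readable instead of escaping them. A minimal sketch with a stand-in dict:

```python
import json

data = {"d": [{"title": "前端工程化"}]}  # stand-in for the parsed ajax response
print(json.dumps(data, indent=2, ensure_ascii=False))
```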

  • Next up: installing wordCloud

   I won't go into this one (writing it up would just add one more useless answer to the pile already online). If you want to know how I solved it, head next door, to the neighbor of the neighbor of the neighbor: Lao Huang's blog. (Fanta, you're the best.)

  6. Full code
    import requests
    import re
    from bs4 import BeautifulSoup
    import json
    import urllib
    import jieba
    from wordcloud import WordCloud
    import matplotlib.pyplot as plt
    import numpy as np
    import xlwt
    import jieba.analyse
    from PIL import Image, ImageSequence

    url = 'https://juejin.im/search?query=前端'
    res = requests.get(url)
    res.encoding = "utf-8"
    soup = BeautifulSoup(res.text, "html.parser")

    # loop over the n result pages
    ajaxUrlBegin = 'https://search-merger-ms.juejin.im/v1/search?query=%E5%89%8D%E7%AB%AF&page='
    ajaxUrlLast = '&raw_result=false&src=web'
    for page in range(0, 25):
        ajaxUrl = ajaxUrlBegin + str(page) + ajaxUrlLast
        # fetch the dynamic page's JSON and parse it into a dict
        response = urllib.request.urlopen(ajaxUrl)
        data = json.loads(response.read().decode('utf-8'))
        # write each title to a file
        with open('finally.txt', 'a', encoding='utf-8') as f:
            for item in range(0, 19):
                result = data['d'][item]['title']
                print(result)
                f.write(result + '\n')

    # word-frequency counting
    with open('finally.txt', 'r', encoding='utf-8') as f:
        text = f.read()
    stringList = list(jieba.cut(text))
    symbol = {"/", "(", ")", " ", ";", "!", "、", ":", "+", "?", " ", ")", "(", "?", ",", "之", "你", "了", "吗", "】", "【"}
    stringSet = set(stringList) - symbol
    title_dict = {}
    for word in stringSet:
        title_dict[word] = stringList.count(word)
    print(title_dict)

    # export to Excel
    wbk = xlwt.Workbook(encoding='utf-8')
    sheet = wbk.add_sheet("wordCount")  # sheet name
    k = 0
    for word, count in title_dict.items():
        sheet.write(k, 0, label=word)
        sheet.write(k, 1, label=count)
        k += 1
    wbk.save('前端数据.xls')  # save the word counts as an .xls file

    # generate the word cloud from an image mask
    font = r'C:\Windows\Fonts\simhei.ttf'
    content = ' '.join(title_dict.keys())
    image = np.array(Image.open('cool.jpg'))
    wordcloud = WordCloud(background_color='white', font_path=font, mask=image,
                          width=1000, height=860, margin=2).generate(content)
    # show and save the generated word cloud
    plt.imshow(wordcloud)
    plt.axis("off")
    plt.show()
    wordcloud.to_file('c-cool.jpg')

(word cloud image)

       


Original article: https://www.cnblogs.com/polvem/p/8973449.html
