
requests: the header 'Content-Type': 'text/html' causes the wrong encoding 'ISO-8859-1' for Chinese text

Published: 2023-09-06 01:20 | Editor: 苏小强 | Keywords: none

0.

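1. References

Code analysis: Chinese encoding problems in the Python requests library

What is ISO-8859? ISO-8859-1 is also called Latin-1, or "Western European".

Patch: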

import requests

def monkey_patch():
    prop = requests.models.Response.content
    def content(self):
        _content = prop.fget(self)
        if self.encoding == 'ISO-8859-1':
            encodings = requests.utils.get_encodings_from_content(_content)
            if encodings:
                self.encoding = encodings[0]
            else:
                self.encoding = self.apparent_encoding
            _content = _content.decode(self.encoding, 'replace').encode('utf8', 'replace')
            self._content = _content
        return _content
    requests.models.Response.content = property(content)

monkey_patch()

2. Cause

In [291]: r = requests.get('http://cn.python-requests.org/en/latest/')
In [292]: r.headers.get('content-type')
Out[292]: 'text/html; charset=utf-8'
In [293]: r.encoding
Out[293]: 'utf-8'
In [294]: rc = requests.get('http://python3-cookbook.readthedocs.io/zh_CN/latest/index.html')
In [296]: rc.headers.get('content-type')
Out[296]: 'text/html'
In [298]: rc.encoding
Out[298]: 'ISO-8859-1'

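The response text is garbled. Because rc.encoding is 'ISO-8859-1', rc.text turns every UTF-8 byte into a separate character, so the Chinese text appears as \xe6\x96\x87\xe6\xa1\xa3 instead of 文档: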

In [312]: rc.text
Out[312]: u'\n\n<!DOCTYPE html>\n<!--[if IE 8]><html class="no-js lt-ie9" lang="en" > <![endif]-->\n<!--[if gt IE 8]><!--> <html class="no-js" lang="en" > <!--<![endif]-->\n<head>\n  <meta charset="utf-8">\n  \n  <meta name="viewport" content="width=device-width, initial-scale=1.0">\n  \n  <title>Python Cookbook 3rd Edition Documentation &mdash; python3-cookbook 2.0.0 \xe6\x96\x87\xe6\xa1\xa3</title>\n  \n\n  \n  \n  \n  \n\n  \n\n  \n  \n    \n\n  \n\n  \n  \n\n  \n    <link rel="stylesheet" href="https://media.readthedocs.org/css/sphinx_rtd_theme.css" type="text/css" />\n  \n\n  \n        <link rel="index" title="\xe7\xb4\xa2\xe5\xbc\x95"\n              href="genindex.html"/>\n        <link rel="search" title="\xe6\x90\x9c\xe7\xb4\xa2" href="search.html"/>\n        <link rel="copyright" title="\xe7\x89\x88\xe6\x9d\x83\xe6\x89\x80\xe6\x9c\x89" href="copyright.html"/>\n    <link rel="top" title="python3-cookbook 2.0.0 \xe6\x96\x87\xe6\xa1\xa3" href="#"/>\n        <link rel="next" title

In [313]: rc.content
Out[313]: '\n\n<!DOCTYPE html>\n<!--[if IE 8]><html class="no-js lt-ie9" lang="en" > <![endif]-->\n<!--[if gt IE 8]><!--> <html class="no-js" lang="en" > <!--<![endif]-->\n<head>\n  <meta charset="utf-8">\n  \n  <meta name="viewport" content="width=device-width, initial-scale=1.0">\n  \n  <title>Python Cookbook 3rd Edition Documentation &mdash; python3-cookbook 2.0.0 \xe6\x96\x87\xe6\xa1\xa3</title>\n  \n\n  \n  \n  \n  \n\n  \n\n  \n  \n    \n\n  \n\n  \n  \n\n  \n    <link rel="stylesheet" href="https://media.readthedocs.org/css/sphinx_rtd_theme.css" type="text/css" />\n  \n\n  \n        <link rel="index" title="\xe7\xb4\xa2\xe5\xbc\x95"\n              href="genindex.html"/>\n        <link rel="search" title="\xe6\x90\x9c\xe7\xb4\xa2" href="search.html"/>\n        <link rel="copyright" title="\xe7\x89\x88\xe6\x9d\x83\xe6\x89\x80\xe6\x9c\x89" href="copyright.html"/>\n    <link rel="top" title="python3-cookbook 2.0.0 \xe6\x96\x87\xe6\xa1\xa3" href="#"/>\n        <link rel="next" title=
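The garbling in the output above can be reproduced without any network access. The following is a minimal Python 3 sketch, assuming the page title ends in the two characters 文档 (which is what the bytes \xe6\x96\x87\xe6\xa1\xa3 decode to as UTF-8); the variable names are only illustrative.

```python
# The server sends UTF-8 bytes for the Chinese text.
original = "文档"
raw = original.encode("utf-8")          # b'\xe6\x96\x87\xe6\xa1\xa3'

# Decoding those bytes as ISO-8859-1 never fails, because every byte
# value maps to some Latin-1 character, so the result is silently wrong.
garbled = raw.decode("iso-8859-1")

# The mistake is reversible: re-encode as ISO-8859-1 to get the original
# bytes back, then decode them with the correct charset.
repaired = garbled.encode("iso-8859-1").decode("utf-8")

print(len(original), len(garbled))      # 2 characters became 6
print(repaired == original)
```

Since the round trip loses nothing, text that was already decoded with the wrong charset can still be recovered, although setting the right encoding before reading the text is the cleaner route.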

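When the response headers contain 'content-type', that header has no charset parameter, and the content type contains 'text', all three conditions hold at once and requests settles on 'ISO-8859-1'.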
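The three conditions can be tried out in isolation. The function below is a standard-library sketch for Python 3, not the requests implementation: the name `sketch_encoding_from_content_type` is hypothetical, and it parses the header with `email.message.Message` rather than whatever parser requests uses internally.

```python
from email.message import Message


def sketch_encoding_from_content_type(content_type):
    """Hypothetical helper mirroring the three checks described above."""
    # Condition 1: there must be a Content-Type header at all.
    if not content_type:
        return None

    msg = Message()
    msg["Content-Type"] = content_type

    # Condition 2: an explicit charset parameter always wins.
    charset = msg.get_param("charset")
    if charset:
        return charset.strip("'\"")

    # Condition 3: a text/* type without charset falls back to ISO-8859-1.
    if "text" in msg.get_content_type():
        return "ISO-8859-1"

    return None


print(sketch_encoding_from_content_type("text/html; charset=utf-8"))
print(sketch_encoding_from_content_type("text/html"))
print(sketch_encoding_from_content_type("application/octet-stream"))
```

Only the second call hits the fallback, which matches the two responses compared in the session above: the same kind of page is decoded correctly or incorrectly depending solely on whether the server adds a charset parameter.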

The referenced article says Python 3 is not affected, but in my tests it is.

C:\Program Files\Anaconda2\Lib\site-packages\requests\utils.py

def get_encoding_from_headers(headers):
    """Returns encodings from given HTTP Header Dict.

    :param headers: dictionary to extract encoding from.
    :rtype: str
    """

    content_type = headers.get('content-type')

    if not content_type:
        return None

    content_type, params = cgi.parse_header(content_type)

    if 'charset' in params:
        return params['charset'].strip("'\"")

    if 'text' in content_type:
        return 'ISO-8859-1'

C:\Program Files\Anaconda2\Lib\site-packages\requests\adapters.py

class HTTPAdapter(BaseAdapter):
    def build_response(self, req, resp):
        # Set encoding.
        response.encoding = get_encoding_from_headers(response.headers)

3. Solution

Apply the patch from the referenced article, or:

if resp.encoding == 'ISO-8859-1':
    # requests ships this helper (built on patterns such as
    # re.compile(r'<meta.*?charset ...) but its own code never calls it.
    # Pass resp.text rather than resp.content: on Python 3 the helper
    # needs str, and decoding as ISO-8859-1 never raises.
    encodings = requests.utils.get_encodings_from_content(resp.text)
    if encodings:
        resp.encoding = encodings[0]
    else:
        # models.py: chardet.detect(self.content)['encoding'], which is
        # computationally expensive. resp.text uses the same fallback:
        # if self.encoding is None: encoding = self.apparent_encoding
        resp.encoding = resp.apparent_encoding
    print('ISO-8859-1 changed to %s' % resp.encoding)
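The same idea can be wrapped in a reusable function and checked offline. This is a sketch under stated assumptions: `fix_encoding` and `META_CHARSET_RE` are hypothetical names, the regular expression is a simplified stand-in for the patterns requests uses, and a `SimpleNamespace` plays the role of a `requests.Response` so the example runs without a network connection.

```python
import re
from types import SimpleNamespace

# Simplified pattern (an assumption for this sketch): finds charset=...
# inside a <meta> tag, working directly on the raw bytes.
META_CHARSET_RE = re.compile(rb'<meta[^>]*?charset=["\']?([\w-]+)', re.I)


def fix_encoding(resp):
    """Replace the ISO-8859-1 fallback with a charset found in the body."""
    if resp.encoding and resp.encoding.upper() == "ISO-8859-1":
        match = META_CHARSET_RE.search(resp.content)
        if match:
            # Prefer the charset the page declares about itself.
            resp.encoding = match.group(1).decode("ascii")
        else:
            # Otherwise fall back to detection (expensive in requests).
            resp.encoding = resp.apparent_encoding
    return resp


# Stand-in for a response whose header was just 'text/html'.
fake = SimpleNamespace(
    encoding="ISO-8859-1",
    content='<head><meta charset="utf-8"><title>文档</title></head>'.encode("utf-8"),
    apparent_encoding="utf-8",
)

fix_encoding(fake)
print(fake.encoding)
print(fake.content.decode(fake.encoding))
```

With a real response, the function would be called right after `requests.get(...)` and before the first access to `resp.text`. A response whose header already names a charset is left untouched, because the check only fires on the 'ISO-8859-1' fallback value.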


Original article: http://www.cnblogs.com/my8100/p/requests_encoding_bug.html
