1. References
Code analysis: the Chinese encoding problem in the Python requests library
What is ISO-8859-1? It is also known as Latin-1 or "Western European"
Patch:
import requests

def monkey_patch():
    prop = requests.models.Response.content
    def content(self):
        _content = prop.fget(self)
        if self.encoding == 'ISO-8859-1':
            encodings = requests.utils.get_encodings_from_content(_content)
            if encodings:
                self.encoding = encodings[0]
            else:
                self.encoding = self.apparent_encoding
            _content = _content.decode(self.encoding, 'replace').encode('utf8', 'replace')
            self._content = _content
        return _content
    requests.models.Response.content = property(content)

monkey_patch()
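The patch works by swapping a class property for a wrapper that calls the original getter through fget. A minimal sketch of that pattern, on a hypothetical Page class (a stand-in, not part of requests; the strip() step is only a placeholder for the re-encoding logic):

```python
class Page(object):
    """Hypothetical stand-in for requests.models.Response."""

    def __init__(self, raw):
        self._raw = raw

    @property
    def content(self):
        return self._raw


def patch_content():
    # Accessing the attribute on the class returns the property object itself.
    original = Page.content

    def content(self):
        value = original.fget(self)  # call the original getter
        return value.strip()         # placeholder post-processing step

    # Replace the property on the class; every instance sees the new getter.
    Page.content = property(content)


patch_content()
page = Page(b'  hello  ')
```

After patch_content() runs, page.content goes through the wrapper, while the stored page._raw value is left unchanged.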
2. Cause
In [291]: r = requests.get('http://cn.python-requests.org/en/latest/')
In [292]: r.headers.get('content-type')
Out[292]: 'text/html; charset=utf-8'
In [293]: r.encoding
Out[293]: 'utf-8'
In [294]: rc = requests.get('http://python3-cookbook.readthedocs.io/zh_CN/latest/index.html')
In [296]: rc.headers.get('content-type')
Out[296]: 'text/html'
In [298]: rc.encoding
Out[298]: 'ISO-8859-1'
The response text comes out garbled:
In [312]: rc.text
Out[312]: u'\n\n<!DOCTYPE html>\n<!--[if IE 8]><html class="no-js lt-ie9" lang="en" > <![endif]-->\n<!--[if gt IE 8]><!--> <html class="no-js" lang="en" > <!--<![endif]-->\n<head>\n  <meta charset="utf-8">\n  \n  <meta name="viewport" content="width=device-width, initial-scale=1.0">\n  \n  <title>Python Cookbook 3rd Edition Documentation &mdash; python3-cookbook 2.0.0 \xe6\x96\x87\xe6\xa1\xa3</title>\n  \n\n  \n  \n  \n  \n\n  \n\n  \n  \n    \n\n  \n\n  \n  \n\n  \n    <link rel="stylesheet" href="https://media.readthedocs.org/css/sphinx_rtd_theme.css" type="text/css" />\n  \n\n  \n        <link rel="index" title="\xe7\xb4\xa2\xe5\xbc\x95"\n              href="genindex.html"/>\n        <link rel="search" title="\xe6\x90\x9c\xe7\xb4\xa2" href="search.html"/>\n        <link rel="copyright" title="\xe7\x89\x88\xe6\x9d\x83\xe6\x89\x80\xe6\x9c\x89" href="copyright.html"/>\n    <link rel="top" title="python3-cookbook 2.0.0 \xe6\x96\x87\xe6\xa1\xa3" href="#"/>\n        <link rel="next" title
In [313]: rc.content
Out[313]: '\n\n<!DOCTYPE html>\n<!--[if IE 8]><html class="no-js lt-ie9" lang="en" > <![endif]-->\n<!--[if gt IE 8]><!--> <html class="no-js" lang="en" > <!--<![endif]-->\n<head>\n  <meta charset="utf-8">\n  \n  <meta name="viewport" content="width=device-width, initial-scale=1.0">\n  \n  <title>Python Cookbook 3rd Edition Documentation &mdash; python3-cookbook 2.0.0 \xe6\x96\x87\xe6\xa1\xa3</title>\n  \n\n  \n  \n  \n  \n\n  \n\n  \n  \n    \n\n  \n\n  \n  \n\n  \n    <link rel="stylesheet" href="https://media.readthedocs.org/css/sphinx_rtd_theme.css" type="text/css" />\n  \n\n  \n        <link rel="index" title="\xe7\xb4\xa2\xe5\xbc\x95"\n              href="genindex.html"/>\n        <link rel="search" title="\xe6\x90\x9c\xe7\xb4\xa2" href="search.html"/>\n        <link rel="copyright" title="\xe7\x89\x88\xe6\x9d\x83\xe6\x89\x80\xe6\x9c\x89" href="copyright.html"/>\n    <link rel="top" title="python3-cookbook 2.0.0 \xe6\x96\x87\xe6\xa1\xa3" href="#"/>\n        <link rel="next" title=
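requests falls back to 'ISO-8859-1' when three conditions hold at once: the response headers include 'content-type', its value carries no charset parameter, and its value contains 'text'.
The reference article says Python 3 is not affected, but in my tests it is.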
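The \xe6\x96\x87\xe6\xa1\xa3 sequences in the output above are the UTF-8 bytes of 文档, each byte decoded as a separate ISO-8859-1 character. A small Python 3 sketch, standard library only, that reproduces the effect and reverses it (the variable names are illustrative):

```python
original = '文档'

# The server sends UTF-8 bytes.
raw = original.encode('utf-8')

# Decoding with the wrong guess never fails, because ISO-8859-1 maps
# every byte to a code point; the result is just the wrong characters.
garbled = raw.decode('iso-8859-1')

# The mapping is lossless, so the damage can be undone by round-tripping.
recovered = garbled.encode('iso-8859-1').decode('utf-8')

print(repr(raw))
print(repr(garbled))
print(recovered)
```

Because the wrong decode is lossless, text that was already garbled this way can still be repaired after the fact, as long as nothing replaced or dropped characters in between.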
C:\Program Files\Anaconda2\Lib\site-packages\requests\utils.py
def get_encoding_from_headers(headers):
    """Returns encodings from given HTTP Header Dict.

    :param headers: dictionary to extract encoding from.
    :rtype: str
    """

    content_type = headers.get('content-type')

    if not content_type:
        return None

    content_type, params = cgi.parse_header(content_type)

    if 'charset' in params:
        return params['charset'].strip("'\"")

    if 'text' in content_type:
        return 'ISO-8859-1'
C:\Program Files\Anaconda2\Lib\site-packages\requests\adapters.py
class HTTPAdapter(BaseAdapter):
    def build_response(self, req, resp):
        # Set encoding.
        response.encoding = get_encoding_from_headers(response.headers)
3. Solution
Apply the patch from the reference article, or:
if resp.encoding == 'ISO-8859-1':
    # get_encodings_from_content() searches the body with
    # re.compile(r'<meta.*?charset
    # The requests source itself never makes use of this method.
    encodings = requests.utils.get_encodings_from_content(resp.content)
    if encodings:
        resp.encoding = encodings[0]
    else:
        # models.py: apparent_encoding runs chardet.detect(self.content)['encoding'],
        # which is computationally expensive.
        # resp.text only falls back to it by itself in this case:
        #   if self.encoding is None: encoding = self.apparent_encoding
        resp.encoding = resp.apparent_encoding
    print('ISO-8859-1 changed to %s' % resp.encoding)
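The same idea can be sketched on raw bytes with only the standard library, which may help when reasoning about it outside requests. Everything here is an assumption for illustration: the names sniff_meta_charset and decode_body are hypothetical, the regex is a simplified meta-tag matcher rather than the one requests uses, and the fallback is plain UTF-8 instead of chardet detection.

```python
import codecs
import re

# Simplified pattern (an assumption): matches <meta charset="..."> as well as
# <meta http-equiv="Content-Type" content="text/html; charset=...">.
META_CHARSET_RE = re.compile(rb'<meta[^>]*?charset=["\']?([\w.:-]+)', re.IGNORECASE)


def sniff_meta_charset(body):
    """Return the first charset declared in a <meta> tag, or None."""
    match = META_CHARSET_RE.search(body)
    return match.group(1).decode('ascii') if match else None


def decode_body(body, header_encoding, fallback='utf-8'):
    """Decode body, distrusting a missing or ISO-8859-1 header encoding."""
    encoding = header_encoding
    if encoding is None or encoding.upper() == 'ISO-8859-1':
        # The header gave no usable charset, so look inside the document.
        encoding = sniff_meta_charset(body) or fallback
    try:
        codecs.lookup(encoding)  # guard against unknown charset names
    except LookupError:
        encoding = fallback
    return body.decode(encoding, errors='replace')


page = '<html><head><meta charset="utf-8"><title>文档</title></head></html>'.encode('utf-8')
text = decode_body(page, 'ISO-8859-1')
print(text)
```

With requests, the hypothetical call would be decode_body(resp.content, resp.encoding). Note that this sketch also overrides a charset of ISO-8859-1 that the server declared explicitly, which is the same trade-off the snippet above makes.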
requests: headers 'Content-Type': 'text/html' leads to the wrong encoding 'ISO-8859-1' for Chinese text
Original post: http://www.cnblogs.com/my8100/p/requests_encoding_bug.html