Parsing HTML and XHTML

HTMLParser module
    The usual way to parse an HTML file with the HTMLParser module is to create a subclass
    of HTMLParser and implement the tag-handling methods in it by overriding the parent
    class's handle_starttag(), handle_data(), handle_endtag() and related methods.

    Example,
        Extract the text inside the <head> tag of htmlsample.html,
            <-- htmlsample.html -->  -> file content,
                '
                <html>
                <head><title>404 Not Found</title></head>
                <body bgcolor="white">
                <center><h1>404 Not Found</h1></center>
                <hr><center>nginx/1.12.2</center>
                </body>
                </html>
                '
        from html.parser import HTMLParser

        class ParsingHeadT(HTMLParser):
            def __init__(self):
                super().__init__()
                self.headtag = ''
                self.parsesemaphore = False   # True while the parser is inside <head>

            def handle_starttag(self, tag, attrs):   # enable semaphore
                if tag == 'head':
                    self.parsesemaphore = True

            def handle_data(self, data):             # process the text as required
                if self.parsesemaphore:
                    self.headtag = data

            def handle_endtag(self, tag):            # disable semaphore
                if tag == 'head':
                    self.parsesemaphore = False

            def getheadtag(self):
                return self.headtag

        if __name__ == "__main__":
            with open('htmlsample.html') as FH:
                pht = ParsingHeadT()
                pht.feed(FH.read())    # HTMLParser invokes the overridden methods
                                       # handle_starttag, handle_data and handle_endtag
                print("Head Tag : %s" % pht.getheadtag())

        output,
           Head Tag : 404 Not Found

    The example above uses a simple, complete HTML document. Real-world input raises further
    cases that have to be considered and handled, for example
    special characters in HTML such as &copy; (the copyright sign) and &amp; (the ampersand, &).
        Previously, handling these required overriding the parent class's handle_entityref(),
            HTMLParser.handle_entityref(name)
                This method is called to process a named character reference of the form
                &name; (e.g. &gt;), where name is a general entity reference (e.g. 'gt').
                This method is never called if convert_charrefs is True.

    Numeric character references, in decimal or hexadecimal form, also need attention.
        HTMLParser.handle_charref(name)
            This method is called to process decimal and hexadecimal numeric character
            references of the form &#NNN; and &#xNNN;. For example, the decimal equivalent
            for &gt; is &#62;, whereas the hexadecimal is &#x3E;; in this case the method
            will receive '62' or 'x3E'. This method is never called if convert_charrefs is True.

    Note,
        Fortunately, Python 3 already handles these cases for us: since Python 3.5,
        convert_charrefs defaults to True, so character references are converted to the
        corresponding characters before handle_data() is called. Using the same example, add
        some special characters to the <head> tag of htmlsample.html and see what happens.
            <-- htmlsample.html -->
            <html>
            <head><title>&gt; &#62; 404 &copy; Not &#x3E; Found &amp; </title></head>
            <body bgcolor="white">
            <center><h1>404 Not Found</h1></center>
            <hr><center>nginx/1.12.2</center>
            </body>
            </html>

        Output of the example above,
                Head Tag : > > 404 © Not > Found &
                The result shows that in Python 3 the example handles special characters correctly.

    However, HTML also contains 'asymmetric' tags such as <p> and <li>, whose end tags may be
    omitted. The example above cannot parse such tags correctly, so the code has to be
    extended to handle them.
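To make the problem with 'asymmetric' tags concrete, here is a minimal sketch that logs the tag events HTMLParser reports for a list whose <li> elements are never closed. The class name EventLogger and the input string are illustrative assumptions, not part of the original example.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Hypothetical helper: records every start/end tag event it receives."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


logger = EventLogger()
# Neither <li> has a closing </li> in this input.
logger.feed("<ul><li> First Reason<li> Second Reason</ul>")
logger.close()
print(logger.events)
```

HTMLParser only reports the tags that are literally present in the input and does not synthesize the missing </li> events, so a parser that relies on handle_endtag() to switch off its semaphore never sees the end of a list item.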
        First extend htmlsample.html, taking the <li> tag as an example,
        <-- htmlsample.html -->
        <html>
        <head><title>&gt; &#62; 404 &copy; Not &#x3E; Found &amp;</title>
        <body bgcolor="white">
        <center><h1>404 Not Found</h1></center>
        <hr><center>nginx/1.12.2</center>
        <ul>
            <li> First Reason
            <li> Second Reason
        </body>
        </html>

        A browser can still render htmlsample.html, but its <head> and <ul> tags have no
        matching end tags, and <li> is an asymmetric tag. Now add some logic to the previous
        example to deal with these problems.

        Example,
            from html.parser import HTMLParser

            class Parser(HTMLParser):
                def __init__(self):
                    super().__init__()
                    self.taglevels = []                 # stack of start tags seen so far
                    self.tags = ['head', 'ul', 'li']    # tags whose text is collected
                    self.parsesemaphore = False
                    self.data = ''

                def handle_starttag(self, tag, attrs):  # enable semaphore
                    # The same tag opening twice in a row means the previous one was
                    # never closed, so close it explicitly first.
                    if len(self.taglevels) and self.taglevels[-1] == tag:
                        self.handle_endtag(tag)
                    self.taglevels.append(tag)

                    if tag in self.tags:
                        self.parsesemaphore = True

                def handle_data(self, data):            # process the text as required
                    if self.parsesemaphore:
                        self.data += data

                def handle_endtag(self, tag):           # any end tag disables the semaphore
                    self.parsesemaphore = False

                def gettag(self):
                    return self.data

            if __name__ == "__main__":
                with open('htmlsample.html') as FH:
                    pht = Parser()
                    pht.feed(FH.read())    # HTMLParser invokes the overridden methods
                                           # handle_starttag, handle_data and handle_endtag
                    print("Head Tag : %s" % pht.gettag())

            Output,
                 Head Tag : > > 404 © Not > Found &
                 First Reason
                 Second Reason

Reference,
    https://docs.python.org/3.6/library/html.parser.html?highlight=htmlparse#html.parser.HTMLParser.handle_entityref

Appendix,
    The example given by the Python documentation,
        from html.parser import HTMLParser
        from html.entities import name2codepoint

        class MyHTMLParser(HTMLParser):
            def handle_starttag(self, tag, attrs):
                print("Start tag:", tag)
                for attr in attrs:
                    print("     attr:", attr)

            def handle_endtag(self, tag):
                print("End tag  :", tag)

            def handle_data(self, data):
                print("Data     :", data)

            def handle_comment(self, data):
                print("Comment  :", data)

            def handle_entityref(self, name):
                c = chr(name2codepoint[name])
                print("Named ent:", c)

            def handle_charref(self, name):
                if name.startswith('x'):
                    c = chr(int(name[1:], 16))
                else:
                    c = chr(int(name))
                print("Num ent  :", c)

            def handle_decl(self, data):
                print("Decl     :", data)

        # convert_charrefs=False is needed on Python 3.5+ (where the default is True)
        # for handle_entityref() and handle_charref() to be called as shown below.
        parser = MyHTMLParser(convert_charrefs=False)

    Output,
        Parsing a doctype:

    >>> parser.feed('<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" '
    ...             '"http://www.w3.org/TR/html4/strict.dtd">')
        Decl     : DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd"

        Parsing an element with a few attributes and a title:

    >>> parser.feed('<img src="python-logo.png" alt="The Python logo">')
        Start tag: img
             attr: ('src', 'python-logo.png')
             attr: ('alt', 'The Python logo')

    >>> parser.feed('<h1>Python</h1>')
        Start tag: h1
        Data     : Python
        End tag  : h1

        The content of script and style elements is returned as is, without further parsing:

    >>> parser.feed('<style type="text/css">#python { color: green }</style>')
        Start tag: style
             attr: ('type', 'text/css')
        Data     : #python { color: green }
        End tag  : style

    >>> parser.feed('<script type="text/javascript">'
    ...             'alert("<strong>hello!</strong>");</script>')
        Start tag: script
             attr: ('type', 'text/javascript')
        Data     : alert("<strong>hello!</strong>");
        End tag  : script

        Parsing comments:

    >>> parser.feed('<!-- a comment -->'
    ...             '<!--[if IE 9]>IE-specific content<![endif]-->')
        Comment  :  a comment
        Comment  : [if IE 9]>IE-specific content<![endif]

        Parsing named and numeric character references and converting them to the correct
        char (note: these 3 references are all equivalent to '>'):

    >>> parser.feed('&gt;&#62;&#x3E;')
        Named ent: >
        Num ent  : >
        Num ent  : >

        Feeding incomplete chunks to feed() works, but handle_data() might be called more
        than once (unless convert_charrefs is set to True):

    >>> for chunk in ['<sp', 'an>buff', 'ered ', 'text</s', 'pan>']:
    ...     parser.feed(chunk)
        Start tag: span
        Data     : buff
        Data     : ered
        Data     : text
        End tag  : span

        Parsing invalid HTML (e.g. unquoted attributes) also works:

    >>> parser.feed('<p><a class=link href=#main>tag soup</p ></a>')
        Start tag: p
        Start tag: a
             attr: ('class', 'link')
             attr: ('href', '#main')
        Data     : tag soup
        End tag  : p
        End tag  : a
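The handle_entityref() and handle_charref() output shown in the appendix depends on the convert_charrefs setting. The following sketch feeds the same three references to a parser under both settings and records which handler receives them. The names RefLogger and log_events are hypothetical helpers introduced only for this illustration.

```python
from html.parser import HTMLParser
from html.entities import name2codepoint


class RefLogger(HTMLParser):
    """Hypothetical helper: logs which handler receives each piece of input."""

    def __init__(self, convert_charrefs):
        super().__init__(convert_charrefs=convert_charrefs)
        self.events = []

    def handle_data(self, data):
        self.events.append(("data", data))

    def handle_entityref(self, name):
        # Named reference such as &gt; arrives here as 'gt'.
        self.events.append(("entityref", name, chr(name2codepoint[name])))

    def handle_charref(self, name):
        # Numeric reference arrives as '62' (decimal) or 'x3E' (hexadecimal).
        if name.startswith(("x", "X")):
            code = int(name[1:], 16)
        else:
            code = int(name)
        self.events.append(("charref", name, chr(code)))


def log_events(convert_charrefs):
    """Parse the three equivalent references and return the recorded events."""
    parser = RefLogger(convert_charrefs)
    parser.feed("&gt;&#62;&#x3E;")
    parser.close()
    return parser.events


print(log_events(True))
print(log_events(False))
```

With convert_charrefs=True the references reach handle_data() already converted to text, and with convert_charrefs=False each one is passed to its dedicated handler, which matches the statement in the method descriptions that those handlers are never called when convert_charrefs is True.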
Original article: http://www.cnblogs.com/zzyzz/p/8037020.html