html.parser: Simple HTML and XHTML parser
Source code: Lib/html/parser.py
This module defines a class HTMLParser which serves as the basis for parsing text files formatted in HTML (HyperText Mark-up Language) and XHTML.
-
class html.parser.HTMLParser(*, convert_charrefs=True)
Create a parser instance able to parse invalid markup.
If convert_charrefs is True (the default), all character references (except the ones in script/style elements) are automatically converted to the corresponding Unicode characters.
An HTMLParser instance is fed HTML data and calls handler methods when start tags, end tags, text, comments, and other markup elements are encountered. The user should subclass HTMLParser and override its methods to implement the desired behavior.
This parser does not check that end tags match start tags, nor does it call the end-tag handler for elements which are closed implicitly by closing an outer element.
Changed in version 3.4: convert_charrefs keyword argument added.
Changed in version 3.5: The default value for argument convert_charrefs is now True.
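The effect of convert_charrefs can be seen by recording handler calls for the same input under both settings. The following is a minimal sketch; EventCollector and collect() are hypothetical helper names, not part of the module.

```python
from html.parser import HTMLParser


class EventCollector(HTMLParser):
    """Hypothetical helper: records handler calls instead of printing them."""

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.events = []

    def handle_data(self, data):
        self.events.append(("data", data))

    def handle_entityref(self, name):
        self.events.append(("entityref", name))

    def handle_charref(self, name):
        self.events.append(("charref", name))


def collect(markup, **kwargs):
    """Parse markup with the given constructor options and return the events."""
    parser = EventCollector(**kwargs)
    parser.feed(markup)
    parser.close()  # flush anything still buffered
    return parser.events


# Default: the reference is converted and delivered as part of the text.
converted = collect("<p>1 &lt; 2</p>")

# convert_charrefs=False: the reference is reported to handle_entityref().
raw = collect("<p>1 &lt; 2</p>", convert_charrefs=False)
```

With the default setting the text arrives as a single string containing "<", whereas with convert_charrefs=False the surrounding text and the reference are reported separately.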
Example HTML Parser Application
As a basic example, below is a simple HTML parser that uses the HTMLParser class to print out start tags, end tags, and data as they are encountered:
from html.parser import HTMLParser

class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print("Encountered a start tag:", tag)

    def handle_endtag(self, tag):
        print("Encountered an end tag :", tag)

    def handle_data(self, data):
        print("Encountered some data :", data)

parser = MyHTMLParser()
parser.feed('<html><head><title>Test</title></head>'
            '<body><h1>Parse me!</h1></body></html>')
The output will then be:
Encountered a start tag: html
Encountered a start tag: head
Encountered a start tag: title
Encountered some data : Test
Encountered an end tag : title
Encountered an end tag : head
Encountered a start tag: body
Encountered a start tag: h1
Encountered some data : Parse me!
Encountered an end tag : h1
Encountered an end tag : body
Encountered an end tag : html
HTMLParser Methods
HTMLParser instances have the following methods:
-
HTMLParser.feed(data)
Feed some text to the parser. It is processed insofar as it consists of complete elements; incomplete data is buffered until more data is fed or close() is called. data must be str.
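To illustrate the buffering described for feed(), here is a small sketch in which a start tag is split across two calls. TagCollector is a hypothetical subclass used only for this illustration.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Hypothetical helper: remembers the names of start tags it has seen."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


parser = TagCollector()

parser.feed("<di")  # incomplete start tag: buffered, no handler call yet
tags_after_first_chunk = list(parser.tags)

parser.feed("v><p>")  # completes "<div>" and adds a whole "<p>"
tags_after_second_chunk = list(parser.tags)

parser.close()
```

No handler runs for the first chunk because it does not yet contain a complete element; both tags are reported once the second chunk arrives.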
-
HTMLParser.close()
Force processing of all buffered data as if it were followed by an end-of-file mark. This method may be redefined by a derived class to define additional processing at the end of the input, but the redefined version should always call the HTMLParser base class method close().
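A derived class that redefines close() might look like the sketch below. TextCollector and its finished attribute are hypothetical names chosen for this example; the only point is that the override calls the base class method first.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Hypothetical helper: gathers text and notes when parsing has ended."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self.finished = False

    def handle_data(self, data):
        self.chunks.append(data)

    def close(self):
        # Always let the base class process whatever is still buffered.
        super().close()
        # Additional end-of-input processing goes after that call.
        self.finished = True


parser = TextCollector()
parser.feed("AT&T")  # may be held back until the input is known to be complete
parser.close()
text = "".join(parser.chunks)
```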
-
HTMLParser.reset()
Reset the instance. Loses all unprocessed data. This is called implicitly at instantiation time.
-
HTMLParser.getpos()
Return current line number and offset.
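As a sketch of how getpos() can be used, the hypothetical PositionRecorder below stores the position reported while each start tag is being handled. Line numbers start at 1 and offsets at 0.

```python
from html.parser import HTMLParser


class PositionRecorder(HTMLParser):
    """Hypothetical helper: records where each start tag begins."""

    def __init__(self):
        super().__init__()
        self.positions = []

    def handle_starttag(self, tag, attrs):
        # getpos() returns a (line number, offset) pair.
        self.positions.append((tag, self.getpos()))


parser = PositionRecorder()
parser.feed("<html>\n  <body>\n")
parser.close()
positions = parser.positions
```

Here the html tag is reported at line 1, offset 0, and the indented body tag at line 2, offset 2.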
-
HTMLParser.get_starttag_text()
Return the text of the most recently opened start tag. This should not normally be needed for structured processing, but may be useful in dealing with HTML "as deployed" or for re-generating input with minimal changes (whitespace between attributes can be preserved, etc.).
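The difference between the normalized tag name and the original source text can be sketched as follows. RawTagRecorder is a hypothetical name used only here.

```python
from html.parser import HTMLParser


class RawTagRecorder(HTMLParser):
    """Hypothetical helper: keeps each start tag exactly as it was written."""

    def __init__(self):
        super().__init__()
        self.raw_tags = []

    def handle_starttag(self, tag, attrs):
        # tag is lower-cased; get_starttag_text() preserves case and spacing.
        self.raw_tags.append((tag, self.get_starttag_text()))


parser = RawTagRecorder()
parser.feed('<A  HREF="x.html"   Title=Home>')
parser.close()
raw_tags = parser.raw_tags
```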
The following methods are called when data or markup elements are encountered and they are meant to be overridden in a subclass. The base class implementations do nothing (except for handle_startendtag()):
-
HTMLParser.handle_starttag(tag, attrs)
This method is called to handle the start tag of an element (e.g. <div id="main">).
The tag argument is the name of the tag converted to lower case. The attrs argument is a list of (name, value) pairs containing the attributes found inside the tag's <> brackets. The name is translated to lower case, quotes in the value are removed, and character and entity references are replaced.
For instance, for the tag <A HREF="https://www.cwi.nl/">, this method would be called as handle_starttag('a', [('href', 'https://www.cwi.nl/')]).
All entity references from html.entities are replaced in the attribute values.
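A common use of handle_starttag() is to pick out one attribute from one kind of tag. The sketch below collects link targets; LinkCollector is a hypothetical class, and the second URL is invented purely to show entity replacement in attribute values.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical helper: gathers the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)  # attrs is a list of (name, value) pairs
        if "href" in attributes:
            self.links.append(attributes["href"])


parser = LinkCollector()
parser.feed('<A HREF="https://www.cwi.nl/">CWI</A>'
            '<a href="/search?q=1&amp;page=2">next</a>')
parser.close()
links = parser.links
```

The upper-case tag and attribute names are matched because both are lower-cased before the handler is called, and "&amp;" in the second value arrives as a plain "&".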
-
HTMLParser.handle_endtag(tag)
This method is called to handle the end tag of an element (e.g. </div>).
The tag argument is the name of the tag converted to lower case.
-
HTMLParser.handle_startendtag(tag, attrs)
Similar to handle_starttag(), but called when the parser encounters an XHTML-style empty tag (<img ... />). This method may be overridden by subclasses which require this particular lexical information; the default implementation simply calls handle_starttag() and handle_endtag().
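The default behavior and an override can be compared side by side. In this sketch, Recorder and SelfClosingRecorder are hypothetical classes; the first relies on the default handle_startendtag(), the second replaces it.

```python
from html.parser import HTMLParser


class Recorder(HTMLParser):
    """Hypothetical helper: records start and end tag events."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


class SelfClosingRecorder(Recorder):
    """Same as Recorder, but reports XHTML-style empty tags separately."""

    def handle_startendtag(self, tag, attrs):
        self.events.append(("startend", tag))


markup = "<p>line<br/></p>"

default_parser = Recorder()
default_parser.feed(markup)
default_parser.close()

custom_parser = SelfClosingRecorder()
custom_parser.feed(markup)
custom_parser.close()
```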
-
HTMLParser.handle_data(data)
This method is called to process arbitrary data (e.g. text nodes and the content of <script>...</script> and <style>...</style>).
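Since handle_data() also receives the content of script and style elements, a text extractor usually has to track whether it is inside one of them. The following is one possible sketch; VisibleText and its _skip_depth counter are hypothetical names.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Hypothetical helper: collects text outside script and style elements."""

    SKIPPED = ("script", "style")

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Ignore data delivered while inside a script or style element.
        if not self._skip_depth:
            self.parts.append(data)


parser = VisibleText()
parser.feed('<p>Hello</p><script>var x = "<b>no</b>";</script><p>world</p>')
parser.close()
text = " ".join(parser.parts)
```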
-
HTMLParser.handle_entityref(name)
This method is called to process a named character reference of the form &name; (e.g. &gt;), where name is a general entity reference (e.g. 'gt'). This method is never called if convert_charrefs is True.
-
HTMLParser.handle_charref(name)
This method is called to process decimal and hexadecimal numeric character references of the form &#NNN; and &#xNNN;. For example, the decimal equivalent for &gt; is &#62;, whereas the hexadecimal is &#x3E;; in this case the method will receive '62' or 'x3E'. This method is never called if convert_charrefs is True.
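The arguments passed to handle_entityref() and handle_charref() can be turned back into characters as sketched below. CharrefDecoder is a hypothetical class; it is created with convert_charrefs=False because otherwise neither handler is called. A name missing from name2codepoint would raise KeyError in this simplified version.

```python
from html.entities import name2codepoint
from html.parser import HTMLParser


class CharrefDecoder(HTMLParser):
    """Hypothetical helper: decodes each reference and keeps (name, char)."""

    def __init__(self):
        # The two handlers below are only called with convert_charrefs=False.
        super().__init__(convert_charrefs=False)
        self.decoded = []

    def handle_entityref(self, name):
        self.decoded.append((name, chr(name2codepoint[name])))

    def handle_charref(self, name):
        if name.startswith(("x", "X")):
            codepoint = int(name[1:], 16)  # hexadecimal form, e.g. 'x3E'
        else:
            codepoint = int(name)  # decimal form, e.g. '62'
        self.decoded.append((name, chr(codepoint)))


parser = CharrefDecoder()
parser.feed("&gt;&#62;&#x3E;")
parser.close()
decoded = parser.decoded
```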
-
HTMLParser.handle_comment(data)
This method is called when a comment is encountered (e.g. <!--comment-->).
For example, the comment <!-- comment --> will cause this method to be called with the argument ' comment '.
The content of Internet Explorer conditional comments (condcoms) will also be sent to this method, so, for <!--[if IE 9]>IE9-specific content<![endif]-->, this method will receive '[if IE 9]>IE9-specific content<![endif]'.
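Both cases described above can be reproduced with a small recorder. CommentRecorder is a hypothetical class name.

```python
from html.parser import HTMLParser


class CommentRecorder(HTMLParser):
    """Hypothetical helper: stores the text of every comment."""

    def __init__(self):
        super().__init__()
        self.comments = []

    def handle_comment(self, data):
        self.comments.append(data)


parser = CommentRecorder()
parser.feed("<!-- comment -->"
            "<!--[if IE 9]>IE9-specific content<![endif]-->")
parser.close()
comments = parser.comments
```

The surrounding whitespace of the first comment is preserved, and the conditional comment is delivered without any special treatment of its bracketed markers.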
-
HTMLParser.handle_decl(decl)
This method is called to handle an HTML doctype declaration (e.g. <!DOCTYPE html>).
The decl parameter will be the entire contents of the declaration inside the <!...> markup (e.g. 'DOCTYPE html').
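A subclass can use handle_decl() to remember which doctype a document declares. This is a minimal sketch; DoctypeRecorder and its doctype attribute are hypothetical names.

```python
from html.parser import HTMLParser


class DoctypeRecorder(HTMLParser):
    """Hypothetical helper: remembers the last doctype declaration seen."""

    def __init__(self):
        super().__init__()
        self.doctype = None

    def handle_decl(self, decl):
        # decl holds everything between "<!" and ">".
        self.doctype = decl


parser = DoctypeRecorder()
parser.feed("<!DOCTYPE html><html></html>")
parser.close()
doctype = parser.doctype
```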
-
HTMLParser.handle_pi(data)
Method called when a processing instruction is encountered. The data parameter will contain the entire processing instruction. For example, for the processing instruction <?proc color='red'>, this method would be called as handle_pi("proc color='red'"). It is intended to be overridden by a derived class; the base class implementation does nothing.
Note
The HTMLParser class uses the SGML syntactic rules for processing instructions. An XHTML processing instruction using the trailing '?' will cause the '?' to be included in data.
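The note about the trailing '?' can be illustrated with one SGML-style and one XHTML-style processing instruction. PIRecorder is a hypothetical class, and stripping the '?' afterwards is just one way a caller might normalize the two forms.

```python
from html.parser import HTMLParser


class PIRecorder(HTMLParser):
    """Hypothetical helper: stores each processing instruction as received."""

    def __init__(self):
        super().__init__()
        self.instructions = []

    def handle_pi(self, data):
        self.instructions.append(data)


parser = PIRecorder()
parser.feed("<?proc color='red'>"        # SGML style: closed by '>'
            '<?xml version="1.0"?>')     # XHTML style: closed by '?>'
parser.close()
instructions = parser.instructions

# Optional normalization: drop the trailing '?' kept from the XHTML form.
normalized = [pi.rstrip("?") for pi in instructions]
```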
-
HTMLParser.unknown_decl(data)
This method is called when an unrecognized declaration is read by the parser.
The data parameter will be the entire contents of the declaration inside the <![...]> markup. It is sometimes useful to override this method in a derived class. The base class implementation does nothing.
Examples
The following class implements a parser that will be used to illustrate more examples:
from html.parser import HTMLParser
from html.entities import name2codepoint

class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print("Start tag:", tag)
        for attr in attrs:
            print(" attr:", attr)

    def handle_endtag(self, tag):
        print("End tag :", tag)

    def handle_data(self, data):
        print("Data :", data)

    def handle_comment(self, data):
        print("Comment :", data)

    def handle_entityref(self, name):
        c = chr(name2codepoint[name])
        print("Named ent:", c)

    def handle_charref(self, name):
        if name.startswith('x'):
            c = chr(int(name[1:], 16))
        else:
            c = chr(int(name))
        print("Num ent :", c)

    def handle_decl(self, data):
        print("Decl :", data)

parser = MyHTMLParser()
Parsing a doctype:
>>> parser.feed('<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" '
... '"http://www.w3.org/TR/html4/strict.dtd">')
Decl : DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd"
Parsing an element with a few attributes and a title:
>>> parser.feed('<img src="python-logo.png" alt="The Python logo">')
Start tag: img
attr: ('src', 'python-logo.png')
attr: ('alt', 'The Python logo')
>>>
>>> parser.feed('<h1>Python</h1>')
Start tag: h1
Data : Python
End tag : h1
The content of script and style elements is returned as is, without further parsing:
>>> parser.feed('<style type="text/css">#python { color: green }</style>')
Start tag: style
attr: ('type', 'text/css')
Data : #python { color: green }
End tag : style
>>> parser.feed('<script type="text/javascript">'
... 'alert("<strong>hello!</strong>");</script>')
Start tag: script
attr: ('type', 'text/javascript')
Data : alert("<strong>hello!</strong>");
End tag : script
Parsing comments:
>>> parser.feed('<!-- a comment -->'
... '<!--[if IE 9]>IE-specific content<![endif]-->')
Comment : a comment
Comment : [if IE 9]>IE-specific content<![endif]
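Parsing named and numeric character references and converting them to the correct char (note: these 3 references are all equivalent to '>'). Because handle_entityref() and handle_charref() are never called when convert_charrefs is True, the parser is recreated here with convert_charrefs=False:
>>> parser = MyHTMLParser(convert_charrefs=False)
>>> parser.feed('&gt;&#62;&#x3E;')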
Named ent: >
Num ent : >
Num ent : >
Feeding incomplete chunks to feed() works, but handle_data() might be called more than once (unless convert_charrefs is set to True):
>>> for chunk in ['<sp', 'an>buff', 'ered ', 'text</s', 'pan>']:
...     parser.feed(chunk)
...
Start tag: span
Data : buff
Data : ered
Data : text
End tag : span
Parsing invalid HTML (e.g. unquoted attributes) also works:
>>> parser.feed('<p><a class=link href=#main>tag soup</p ></a>')
Start tag: p
Start tag: a
attr: ('class', 'link')
attr: ('href', '#main')
Data : tag soup
End tag : p
End tag : a