html.parser - Simple HTML and XHTML parser

Source code: Lib/html/parser.py


This module defines a class HTMLParser which serves as the basis for parsing text files formatted in HTML (HyperText Mark-up Language) and XHTML.

class html.parser.HTMLParser(*, convert_charrefs=True)

Create a parser instance able to parse invalid markup.

If convert_charrefs is True (the default), all character references (except the ones in script/style elements) are automatically converted to the corresponding Unicode characters.

An HTMLParser instance is fed HTML data and calls handler methods when start tags, end tags, text, comments, and other markup elements are encountered. The user should subclass HTMLParser and override its methods to implement the desired behavior.

This parser does not check that end tags match start tags, nor does it call the end-tag handler for elements that are closed implicitly by closing an outer element.
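Because the parser does no tag matching of its own, a subclass that cares about nesting has to track it. The following is a minimal sketch, not part of the module: NestingChecker, open_tags, and mismatches are hypothetical names introduced for illustration.

```python
from html.parser import HTMLParser


class NestingChecker(HTMLParser):
    """Hypothetical helper: tracks open tags and records unmatched end tags."""

    def __init__(self):
        super().__init__()
        self.open_tags = []    # start tags seen and not yet closed
        self.mismatches = []   # end tags that did not match the innermost open tag

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()
        else:
            self.mismatches.append(tag)


checker = NestingChecker()
checker.feed("<p><a>tag soup</p></a>")
checker.close()
print(checker.mismatches)  # ['p']
print(checker.open_tags)   # ['p']
```

This sketch ignores void elements such as br or img, which never receive an end tag; a practical checker would probably need to skip them.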

Changed in version 3.4: The convert_charrefs keyword argument was added.

Changed in version 3.5: The default value for argument convert_charrefs is now True.
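The effect of convert_charrefs can be seen by recording which handlers fire for the same markup under both settings. This is an illustrative sketch; CharrefDemo and collect are hypothetical names, not part of html.parser.

```python
from html.parser import HTMLParser


class CharrefDemo(HTMLParser):
    """Hypothetical subclass that records which text handlers are called."""

    def __init__(self, *, convert_charrefs=True):
        super().__init__(convert_charrefs=convert_charrefs)
        self.events = []

    def handle_data(self, data):
        self.events.append(("data", data))

    def handle_entityref(self, name):
        self.events.append(("entityref", name))

    def handle_charref(self, name):
        self.events.append(("charref", name))


def collect(markup, convert_charrefs):
    """Parse markup once and return the recorded handler calls."""
    parser = CharrefDemo(convert_charrefs=convert_charrefs)
    parser.feed(markup)
    parser.close()
    return parser.events


converted = collect("<p>a &gt; b</p>", True)
raw = collect("<p>a &gt; b</p>", False)
print(converted)  # [('data', 'a > b')]
print(raw)        # [('data', 'a '), ('entityref', 'gt'), ('data', ' b')]
```

With the default setting the reference arrives already decoded inside the text; with convert_charrefs=False the subclass receives the reference name and has to decode it itself.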

Example HTML Parser Application

As a basic example, below is a simple HTML parser that uses the HTMLParser class to print out start tags, end tags, and data as they are encountered:

from html.parser import HTMLParser

class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print("Encountered a start tag:", tag)

    def handle_endtag(self, tag):
        print("Encountered an end tag :", tag)

    def handle_data(self, data):
        print("Encountered some data :", data)

parser = MyHTMLParser()
parser.feed('<html><head><title>Test</title></head>'
            '<body><h1>Parse me!</h1></body></html>')

The output will then be:

Encountered a start tag: html
Encountered a start tag: head
Encountered a start tag: title
Encountered some data : Test
Encountered an end tag : title
Encountered an end tag : head
Encountered a start tag: body
Encountered a start tag: h1
Encountered some data : Parse me!
Encountered an end tag : h1
Encountered an end tag : body
Encountered an end tag : html

HTMLParser Methods

HTMLParser instances have the following methods:

HTMLParser.feed(data)

Feed some text to the parser. It is processed insofar as it consists of complete elements; incomplete data is buffered until more data is fed or close() is called. data must be str.

HTMLParser.close()

Force processing of all buffered data as if it were followed by an end-of-file mark. This method may be redefined by a derived class to define additional processing at the end of the input, but the redefined version should always call the HTMLParser base class method close().
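As a sketch of how feed() and close() work together, the hypothetical TextCollector below is fed markup in arbitrary chunks, one of which splits a character reference, and overrides close() to finalize its result. Calling the base class close() first lets any buffered input be processed before the text is joined. The class and its attributes are illustrative assumptions, not part of the module.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Hypothetical subclass that gathers all text and joins it on close()."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self.text = None  # set once close() has run

    def handle_data(self, data):
        self.chunks.append(data)

    def close(self):
        # Let the base class process whatever is still buffered ...
        super().close()
        # ... then do the extra end-of-input work.
        self.text = "".join(self.chunks)


collector = TextCollector()
# The '&amp;' reference is deliberately split across two chunks.
for chunk in ["<p>Fish &a", "mp; chips</p>"]:
    collector.feed(chunk)
collector.close()
print(collector.text)  # Fish & chips
```

The number of handle_data() calls may vary with how the input is chunked, which is why the sketch joins the pieces rather than relying on a single call.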

HTMLParser.reset()

Reset the instance. Loses all unprocessed data. This is called implicitly at instantiation time.

HTMLParser.getpos()

Return current line number and offset.
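A handler can call getpos() to learn where the construct it is handling begins. In the sketch below, TagLocator and positions are hypothetical names; the values shown assume the usual convention that line numbers start at 1 and offsets at 0.

```python
from html.parser import HTMLParser


class TagLocator(HTMLParser):
    """Hypothetical subclass that records where each start tag begins."""

    def __init__(self):
        super().__init__()
        self.positions = []

    def handle_starttag(self, tag, attrs):
        line, offset = self.getpos()
        self.positions.append((tag, line, offset))


locator = TagLocator()
locator.feed("<html>\n  <body>\n    <img src='logo.png'>\n")
locator.close()
print(locator.positions)
# [('html', 1, 0), ('body', 2, 2), ('img', 3, 4)]
```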

HTMLParser.get_starttag_text()

Return the text of the most recently opened start tag. This should not normally be needed for structured processing, but may be useful in dealing with HTML "as deployed" or for re-generating input with minimal changes (whitespace between attributes can be preserved, etc.).
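The difference between the normalized arguments and the raw tag text can be sketched as follows. RawTagEcho and seen are hypothetical names used only for this illustration.

```python
from html.parser import HTMLParser


class RawTagEcho(HTMLParser):
    """Hypothetical subclass that keeps both the parsed and the raw start tag."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        # tag and attrs are normalized; get_starttag_text() is the source text.
        self.seen.append((tag, attrs, self.get_starttag_text()))


echo = RawTagEcho()
echo.feed('<A  HREF="x.html"   Class=link>text</A>')
echo.close()
tag, attrs, raw_text = echo.seen[0]
print(tag)       # a
print(attrs)     # [('href', 'x.html'), ('class', 'link')]
print(raw_text)  # <A  HREF="x.html"   Class=link>
```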

The following methods are called when data or markup elements are encountered, and they are meant to be overridden in a subclass. The base class implementations do nothing (except for handle_startendtag()):

HTMLParser.handle_starttag(tag, attrs)

This method is called to handle the start tag of an element (e.g. <div id="main">).

The tag argument is the name of the tag converted to lower case. The attrs argument is a list of (name, value) pairs containing the attributes found inside the tag's <> brackets. Each name is converted to lower case; in each value, the surrounding quotes are removed and character and entity references are replaced.

For instance, for the tag <A HREF="https://www.cwi.nl/">, this method would be called as handle_starttag('a', [('href', 'https://www.cwi.nl/')]).

All entity references from html.entities are replaced in the attribute values.
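A common use of handle_starttag() is pulling attribute values out of specific tags. The sketch below collects href values from anchors; LinkCollector and links are hypothetical names. It also guards against attributes written without a value, which the parser reports with None as the value.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical subclass that collects the href of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            # A bare attribute such as <a href> has value None.
            if name == "href" and value is not None:
                self.links.append(value)


collector = LinkCollector()
collector.feed(
    '<A HREF="https://www.cwi.nl/">CWI</A> '
    '<a href="/search?q=tea&amp;page=2">search</a> '
    '<a name=top>anchor</a>'
)
collector.close()
print(collector.links)
# ['https://www.cwi.nl/', '/search?q=tea&page=2']
```

Note that the &amp; in the second link reaches the handler already decoded to &.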

HTMLParser.handle_endtag(tag)

This method is called to handle the end tag of an element (e.g. </div>).

The tag argument is the name of the tag converted to lower case.

HTMLParser.handle_startendtag(tag, attrs)

Similar to handle_starttag(), but called when the parser encounters an XHTML-style empty tag (<img ... />). This method may be overridden by subclasses which require this particular lexical information; the default implementation simply calls handle_starttag() and handle_endtag().
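The default dispatch and an override can be compared side by side. In this sketch, EventRecorder and SelfClosingAware are hypothetical classes introduced for illustration.

```python
from html.parser import HTMLParser


class EventRecorder(HTMLParser):
    """Hypothetical subclass relying on the default handle_startendtag()."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


class SelfClosingAware(EventRecorder):
    """Same recorder, but it distinguishes XHTML-style empty tags."""

    def handle_startendtag(self, tag, attrs):
        self.events.append(("startend", tag))


markup = '<br/><img src="a.png" />'

default = EventRecorder()
default.feed(markup)
default.close()
print(default.events)
# [('start', 'br'), ('end', 'br'), ('start', 'img'), ('end', 'img')]

aware = SelfClosingAware()
aware.feed(markup)
aware.close()
print(aware.events)
# [('startend', 'br'), ('startend', 'img')]
```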

HTMLParser.handle_data(data)

This method is called to process arbitrary data (e.g. text nodes and the content of <script>...</script> and <style>...</style>).
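Since script and style content also arrives through handle_data(), a subclass that only wants visible text has to filter it out itself. One possible approach, sketched here with the hypothetical names VisibleText, SKIPPED, and parts, is to track whether the parser is currently inside such an element.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Hypothetical subclass that ignores data inside script and style."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # greater than zero while inside a skipped element
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


extractor = VisibleText()
extractor.feed(
    '<p>Hello</p>'
    '<script>var s = "<b>not text</b>";</script>'
    '<p>world</p>'
)
extractor.close()
print(extractor.parts)  # ['Hello', 'world']
```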

HTMLParser.handle_entityref(name)

This method is called to process a named character reference of the form &name; (e.g. &gt;), where name is a general entity reference (e.g. 'gt'). This method is never called if convert_charrefs is True.

HTMLParser.handle_charref(name)

This method is called to process decimal and hexadecimal numeric character references of the form &#NNN; and &#xNNN;. For example, the decimal equivalent for &gt; is &#62;, whereas the hexadecimal is &#x3E;; in this case the method will receive '62' or 'x3E'. This method is never called if convert_charrefs is True.

HTMLParser.handle_comment(data)

This method is called when a comment is encountered (e.g. <!--comment-->).

For example, the comment <!-- comment --> will cause this method to be called with the argument ' comment '.

The content of Internet Explorer conditional comments (condcoms) will also be sent to this method, so, for <!--[if IE 9]>IE9-specific content<![endif]-->, this method will receive '[if IE 9]>IE9-specific content<![endif]'.

HTMLParser.handle_decl(decl)

This method is called to handle an HTML doctype declaration (e.g. <!DOCTYPE html>).

The decl parameter will be the entire contents of the declaration inside the <!...> markup (e.g. 'DOCTYPE html').

HTMLParser.handle_pi(data)

This method is called when a processing instruction is encountered. The data parameter will contain the entire processing instruction. For example, for the processing instruction <?proc color='red'>, this method would be called as handle_pi("proc color='red'"). It is intended to be overridden by a derived class; the base class implementation does nothing.

Note

The HTMLParser class uses the SGML syntactic rules for processing instructions. An XHTML processing instruction using the trailing '?' will cause the '?' to be included in data.
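The behavior described in the note can be observed with a small subclass. PICollector and instructions are hypothetical names used only for this sketch.

```python
from html.parser import HTMLParser


class PICollector(HTMLParser):
    """Hypothetical subclass that records processing instructions."""

    def __init__(self):
        super().__init__()
        self.instructions = []

    def handle_pi(self, data):
        self.instructions.append(data)


pi_parser = PICollector()
pi_parser.feed("<?proc color='red'>")
pi_parser.feed('<?xml version="1.0"?>')
pi_parser.close()
print(pi_parser.instructions)
# ["proc color='red'", 'xml version="1.0"?']
```

A subclass that handles XHTML input may want to strip the trailing '?' from data itself.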

HTMLParser.unknown_decl(data)

This method is called when an unrecognized declaration is read by the parser.

The data parameter will be the entire contents of the declaration inside the <![...]> markup. Overriding this method in a derived class is sometimes useful. The base class implementation does nothing.

Examples

The following class implements a parser that will be used to illustrate more examples:

from html.parser import HTMLParser
from html.entities import name2codepoint

class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print("Start tag:", tag)
        for attr in attrs:
            print(" attr:", attr)

    def handle_endtag(self, tag):
        print("End tag :", tag)

    def handle_data(self, data):
        print("Data :", data)

    def handle_comment(self, data):
        print("Comment :", data)

    def handle_entityref(self, name):
        c = chr(name2codepoint[name])
        print("Named ent:", c)

    def handle_charref(self, name):
        if name.startswith('x'):
            c = chr(int(name[1:], 16))
        else:
            c = chr(int(name))
        print("Num ent :", c)

    def handle_decl(self, data):
        print("Decl :", data)

parser = MyHTMLParser()

Parsing a doctype:

>>> parser.feed('<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" '
... '"http://www.w3.org/TR/html4/strict.dtd">')
Decl : DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd"

Parsing an element with a few attributes and a title:

>>> parser.feed('<img src="python-logo.png" alt="The Python logo">')
Start tag: img
attr: ('src', 'python-logo.png')
attr: ('alt', 'The Python logo')
>>>
>>> parser.feed('<h1>Python</h1>')
Start tag: h1
Data : Python
End tag : h1

The content of script and style elements is returned as is, without further parsing:

>>> parser.feed('<style type="text/css">#python { color: green }</style>')
Start tag: style
attr: ('type', 'text/css')
Data : #python { color: green }
End tag : style
>>> parser.feed('<script type="text/javascript">'
... 'alert("<strong>hello!</strong>");</script>')
Start tag: script
attr: ('type', 'text/javascript')
Data : alert("<strong>hello!</strong>");
End tag : script

Parsing comments:

>>> parser.feed('<!-- a comment -->'
... '<!--[if IE 9]>IE-specific content<![endif]-->')
Comment : a comment
Comment : [if IE 9]>IE-specific content<![endif]

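Parsing named and numeric character references and converting them to the correct character (note: these 3 references are all equivalent to '>'). Because handle_entityref() and handle_charref() are never called when convert_charrefs is True, the parser is recreated here with convert_charrefs=False:

>>> parser = MyHTMLParser(convert_charrefs=False)
>>> parser.feed('&gt;&#62;&#x3E;')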
Named ent: >
Num ent : >
Num ent : >

Feeding incomplete chunks to feed() works, but handle_data() might be called more than once (unless convert_charrefs is set to True):

>>> for chunk in ['<sp', 'an>buff', 'ered ', 'text</s', 'pan>']:
...     parser.feed(chunk)
...
Start tag: span
Data : buff
Data : ered
Data : text
End tag : span

Parsing invalid HTML (e.g. unquoted attributes) also works:

>>> parser.feed('<p><a class=link href=#main>tag soup</p ></a>')
Start tag: p
Start tag: a
attr: ('class', 'link')
attr: ('href', '#main')
Data : tag soup
End tag : p
End tag : a