Regular Expression HOWTO
Author
A.M. Kuchling <amk@amk.ca>
Abstract
This document is an introductory tutorial to using regular expressions in Python with the re module. It provides a gentler introduction than the corresponding section in the Library Reference.
Introduction
Regular expressions (called REs, or regexes, or regex patterns) are essentially a tiny, highly specialized programming language embedded inside Python and made available through the re module. Using this little language, you specify the rules for the set of possible strings that you want to match; this set might contain English sentences, or e-mail addresses, or TeX commands, or anything you like. You can then ask questions such as “Does this string match the pattern?”, or “Is there a match for the pattern anywhere in this string?”. You can also use REs to modify a string or to split it apart in various ways.
Regular expression patterns are compiled into a series of bytecodes which are then executed by a matching engine written in C. For advanced use, it may be necessary to pay careful attention to how the engine will execute a given RE, and write the RE in a certain way in order to produce bytecode that runs faster. Optimization isn’t covered in this document, because it requires that you have a good understanding of the matching engine’s internals.
The regular expression language is relatively small and restricted, so not all possible string processing tasks can be done using regular expressions. There are also tasks that can be done with regular expressions, but the expressions turn out to be very complicated. In these cases, you may be better off writing Python code to do the processing; while Python code will be slower than an elaborate regular expression, it will also probably be more understandable.
Simple Patterns
We’ll start by learning about the simplest possible regular expressions. Since regular expressions are used to operate on strings, we’ll begin with the most common task: matching characters.
For a detailed explanation of the computer science underlying regular expressions (deterministic and non-deterministic finite automata), you can refer to almost any textbook on writing compilers.
Matching Characters
Most letters and characters will simply match themselves. For example, the regular expression test will match the string test exactly. (You can enable a case-insensitive mode that would let this RE match Test or TEST as well; more about this later.)
There are exceptions to this rule; some characters are special metacharacters, and don’t match themselves. Instead, they signal that some out-of-the-ordinary thing should be matched, or they affect other portions of the RE by repeating them or changing their meaning. Much of this document is devoted to discussing various metacharacters and what they do.
Here’s a complete list of the metacharacters; their meanings will be discussed in the rest of this HOWTO.
. ^ $ * + ? { } [ ] \ | ( )
The first metacharacters we’ll look at are [ and ]. They’re used for specifying a character class, which is a set of characters that you wish to match. Characters can be listed individually, or a range of characters can be indicated by giving two characters and separating them by a '-'. For example, [abc] will match any of the characters a, b, or c; this is the same as [a-c], which uses a range to express the same set of characters. If you wanted to match only lowercase letters, your RE would be [a-z].
Metacharacters (except \) are not active inside classes. For example, [akm$] will match any of the characters 'a', 'k', 'm', or '$'; '$' is usually a metacharacter, but inside a character class it’s stripped of its special nature.
You can match the characters not listed within the class by complementing the set. This is indicated by including a '^' as the first character of the class. For example, [^5] will match any character except '5'. If the caret appears elsewhere in a character class, it does not have special meaning. For example: [5^] will match either a '5' or a '^'.
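The character-class rules above can be tried out with a short sketch. The sample string and variable names are illustrative assumptions, not part of the original HOWTO; only re.findall() from the standard library is used.

```python
import re

# Hypothetical sample text containing letters, '$', '5' and '^'.
sample = "cab$5^k"

# [abc] and [a-c] describe the same set of characters.
listed = re.findall(r"[abc]", sample)
ranged = re.findall(r"[a-c]", sample)

# Inside a class, '$' is an ordinary character.
with_dollar = re.findall(r"[akm$]", sample)

# A leading '^' complements the set; anywhere else it is a literal caret.
not_five = re.findall(r"[^5]", sample)
five_or_caret = re.findall(r"[5^]", sample)

print(listed, ranged, with_dollar, not_five, five_or_caret)
```

findall() is used here only because it makes the matched characters easy to inspect; it is introduced properly later in this HOWTO.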
Perhaps the most important metacharacter is the backslash, \. As in Python string literals, the backslash can be followed by various characters to signal various special sequences. It’s also used to escape all the metacharacters so you can still match them in patterns; for example, if you need to match a [ or \, you can precede them with a backslash to remove their special meaning: \[ or \\.
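A minimal sketch of escaping metacharacters follows. The sample text is an assumption made up for illustration, and the patterns are written as raw strings (explained later in The Backslash Plague) so that Python itself leaves the backslashes alone.

```python
import re

# Hypothetical text containing a literal '[' and a literal backslash.
text = r"path[0] = C:\temp"

# \[ matches a literal '[' and \\ matches a literal backslash.
bracket = re.search(r"\[", text)
backslash = re.search(r"\\", text)

print(bracket.group(), bracket.span())
print(backslash.group(), backslash.span())
```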
Some of the special sequences beginning with '\' represent predefined sets of characters that are often useful, such as the set of digits, the set of letters, or the set of anything that isn’t whitespace.
Let’s take an example: \w matches any alphanumeric character. If the regex pattern is expressed in bytes, this is equivalent to the class [a-zA-Z0-9_]. If the regex pattern is a string, \w will match all the characters marked as letters in the Unicode database provided by the unicodedata module. You can use the more restricted definition of \w in a string pattern by supplying the re.ASCII flag when compiling the regular expression.
The following list of special sequences isn’t complete. For a complete list of sequences and expanded class definitions for Unicode string patterns, see the last part of Regular Expression Syntax in the Standard Library reference. In general, the Unicode versions match any character that’s in the appropriate category in the Unicode database.
\d
Matches any decimal digit; this is equivalent to the class [0-9].
\D
Matches any non-digit character; this is equivalent to the class [^0-9].
\s
Matches any whitespace character; this is equivalent to the class [ \t\n\r\f\v].
\S
Matches any non-whitespace character; this is equivalent to the class [^ \t\n\r\f\v].
\w
Matches any alphanumeric character; this is equivalent to the class [a-zA-Z0-9_].
\W
Matches any non-alphanumeric character; this is equivalent to the class [^a-zA-Z0-9_].
These sequences can be included inside a character class. For example, [\s,.] is a character class that will match any whitespace character, or ',' or '.'.
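The following sketch illustrates a few of the special sequences, including the difference between Unicode and ASCII-only \w. The sample strings are assumptions chosen for illustration; the functions and the re.ASCII flag are standard parts of the re module.

```python
import re

# Hypothetical sample text; "caf\u00e9" ends in a non-ASCII letter.
text = "tel: 555-0199, caf\u00e9\t9"

# \d+ picks out runs of decimal digits.
digits = re.findall(r"\d+", text)

# For str patterns, \w is Unicode-aware by default ...
words_unicode = re.findall(r"\w+", text)
# ... while re.ASCII restricts it to [a-zA-Z0-9_].
words_ascii = re.findall(r"\w+", text, re.ASCII)

# A special sequence can sit inside a character class: split on
# whitespace, commas and periods.
pieces = re.split(r"[\s,.]+", "a, b.c  d")

print(digits, words_unicode, words_ascii, pieces)
```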
The final metacharacter in this section is '.'. It matches anything except a newline character, and there’s an alternate mode (re.DOTALL) where it will match even a newline. '.' is often used where you want to match “any character”.
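As a small illustration of the dot and of re.DOTALL, the sketch below matches the same pattern against a two-line string with and without the flag. The sample string is an assumption for illustration.

```python
import re

# Hypothetical two-line input.
text = "first\nsecond"

# By default '.' stops at the newline.
default_match = re.match(r".+", text).group()

# With re.DOTALL, '.' also matches the newline.
dotall_match = re.match(r".+", text, re.DOTALL).group()

print(repr(default_match), repr(dotall_match))
```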
Repeating Things
Being able to match varying sets of characters is the first thing regular expressions can do that isn’t already possible with the methods available on strings. However, if that was the only additional capability of regexes, they wouldn’t be much of an advance. Another capability is that you can specify that portions of the RE must be repeated a certain number of times.
The first metacharacter for repeating things that we’ll look at is *. * doesn’t match the literal character '*'; instead, it specifies that the previous character can be matched zero or more times, instead of exactly once.
For example, ca*t will match 'ct' (0 'a' characters), 'cat' (1 'a'), 'caaat' (3 'a' characters), and so forth.
Repetitions such as * are greedy; when repeating a RE, the matching engine will try to repeat it as many times as possible. If later portions of the pattern don’t match, the matching engine will then back up and try again with fewer repetitions.
A step-by-step example will make this more obvious. Let’s consider the expression a[bcd]*b. This matches the letter 'a', zero or more letters from the class [bcd], and finally ends with a 'b'. Now imagine matching this RE against the string 'abcbd'.
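Step | Matched | Explanation |
---|---|---|
1 | a | The a in the RE matches. |
2 | abcbd | The engine matches [bcd]*, going as far as it can, which is to the end of the string. |
3 | Failure | The engine tries to match b, but the current position is at the end of the string, so it fails. |
4 | abcb | Back up, so that [bcd]* matches one less character. |
5 | Failure | Try b again, but the current position is at the last character, which is a 'd'. |
6 | abc | Back up again, so that [bcd]* is only matching bc. |
7 | abcb | Try b again. This time the character at the current position is 'b', so it succeeds. |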
The end of the RE has now been reached, and it has matched 'abcb'. This demonstrates how the matching engine goes as far as it can at first, and if no match is found it will then progressively back up and retry the rest of the RE again and again. It will back up until it has tried zero matches for [bcd]*, and if that subsequently fails, the engine will conclude that the string doesn’t match the RE at all.
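The outcome of the backtracking walk-through can be checked directly. This is a minimal sketch: it confirms only the final result, not the engine's internal steps, and the second input string is an assumption added to show the failing case.

```python
import re

p = re.compile(r"a[bcd]*b")

# Greedy [bcd]* first grabs 'bcbd', then backs up until a 'b' can follow.
matched = p.match("abcbd").group()

# Hypothetical failing input: even zero repetitions of [bcd] leave no 'b'
# after the 'a', so the engine gives up.
failed = p.match("acd")

print(matched, failed)
```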
Another repeating metacharacter is +, which matches one or more times. Pay careful attention to the difference between * and +; * matches zero or more times, so whatever’s being repeated may not be present at all, while + requires at least one occurrence. To use a similar example, ca+t will match 'cat' (1 'a'), 'caaat' (3 'a's), but won’t match 'ct'.
There are two more repeating qualifiers. The question mark character, ?, matches either once or zero times; you can think of it as marking something as being optional. For example, home-?brew matches either 'homebrew' or 'home-brew'.
The most complicated repeated qualifier is {m,n}, where m and n are decimal integers. This qualifier means there must be at least m repetitions, and at most n. For example, a/{1,3}b will match 'a/b', 'a//b', and 'a///b'. It won’t match 'ab', which has no slashes, or 'a////b', which has four.
You can omit either m or n; in that case, a reasonable value is assumed for the missing value. Omitting m is interpreted as a lower limit of 0, while omitting n results in an upper bound of infinity.
Readers of a reductionist bent may notice that the three other qualifiers can all be expressed using this notation. {0,} is the same as *, {1,} is equivalent to +, and {0,1} is the same as ?. It’s better to use *, +, or ? when you can, simply because they’re shorter and easier to read.
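The repetition qualifiers can be compared side by side with a small helper. The helper name full_matches and the candidate lists are assumptions for illustration; re.fullmatch() is the standard function that requires the whole string to match.

```python
import re

def full_matches(pattern, candidates):
    """Return the candidates that the pattern matches in their entirety."""
    return [s for s in candidates if re.fullmatch(pattern, s)]

cats = ["ct", "cat", "caaat"]
star = full_matches(r"ca*t", cats)      # zero or more 'a'
plus = full_matches(r"ca+t", cats)      # one or more 'a'

brews = ["homebrew", "home-brew", "home--brew"]
optional = full_matches(r"home-?brew", brews)   # optional hyphen

slashes = ["ab", "a/b", "a//b", "a///b", "a////b"]
bounded = full_matches(r"a/{1,3}b", slashes)    # one to three slashes

# The {m,n} spellings behave like the shorter qualifiers.
star_braces = full_matches(r"ca{0,}t", cats)
plus_braces = full_matches(r"ca{1,}t", cats)

print(star, plus, optional, bounded)
```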
Using Regular Expressions
Now that we’ve looked at some simple regular expressions, how do we actually use them in Python? The re module provides an interface to the regular expression engine, allowing you to compile REs into objects and then perform matches with them.
Compiling Regular Expressions
Regular expressions are compiled into pattern objects, which have methods for various operations such as searching for pattern matches or performing string substitutions.
>>> import re
>>> p = re.compile('ab*')
>>> p
re.compile('ab*')
re.compile() also accepts an optional flags argument, used to enable various special features and syntax variations. We’ll go over the available settings later, but for now a single example will do:
>>> p = re.compile('ab*', re.IGNORECASE)
The RE is passed to re.compile() as a string. REs are handled as strings because regular expressions aren’t part of the core Python language, and no special syntax was created for expressing them. (There are applications that don’t need REs at all, so there’s no need to bloat the language specification by including them.) Instead, the re module is simply a C extension module included with Python, just like the socket or zlib modules.
Putting REs in strings keeps the Python language simpler, but has one disadvantage which is the topic of the next section.
The Backslash Plague
As stated earlier, regular expressions use the backslash character ('\') to indicate special forms or to allow special characters to be used without invoking their special meaning. This conflicts with Python’s usage of the same character for the same purpose in string literals.
Let’s say you want to write a RE that matches the string \section, which might be found in a LaTeX file. To figure out what to write in the program code, start with the desired string to be matched. Next, you must escape any backslashes and other metacharacters by preceding them with a backslash, resulting in the string \\section. The string passed to re.compile() must therefore be \\section. However, to express this as a Python string literal, both backslashes must be escaped again.
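Characters | Stage |
---|---|
\section | Text string to be matched |
\\section | Escaped backslash for re.compile() |
"\\\\section" | Escaped backslashes for a string literal |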
In short, to match a literal backslash, one has to write '\\\\' as the RE string, because the regular expression must be \\, and each backslash must be expressed as \\ inside a regular Python string literal. In REs that feature backslashes repeatedly, this leads to lots of repeated backslashes and makes the resulting strings difficult to understand.
The solution is to use Python’s raw string notation for regular expressions; backslashes are not handled in any special way in a string literal prefixed with 'r', so r"\n" is a two-character string containing '\' and 'n', while "\n" is a one-character string containing a newline. Regular expressions will often be written in Python code using this raw string notation.
In addition, special escape sequences that are valid in regular expressions, but not valid as Python string literals, now result in a DeprecationWarning and will eventually become a SyntaxError, which means the sequences will be invalid if raw string notation or escaping the backslashes isn’t used.
Regular String | Raw string |
---|---|
"ab*" | r"ab*" |
"\\\\section" | r"\\section" |
"\\w+\\s+\\1" | r"\w+\s+\1" |
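The equivalence between doubled backslashes and raw strings can be verified in a few lines. The LaTeX-like sample text is an assumption for illustration.

```python
import re

# The cooked and raw spellings denote the same pattern string.
cooked = "\\\\section"
raw = r"\\section"
same_pattern = cooked == raw

# r"\n" is two characters (backslash, 'n'); "\n" is a single newline.
raw_len = len(r"\n")
cooked_len = len("\n")

# Hypothetical LaTeX-like input containing a literal backslash.
m = re.search(raw, r"see \section{Intro}")

print(same_pattern, raw_len, cooked_len, m.group(), m.span())
```

Writing the pattern once as a raw string avoids counting backslashes by hand, which is the main point of the section above.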
Performing Matches
Once you have an object representing a compiled regular expression, what do you do with it? Pattern objects have several methods and attributes. Only the most significant ones will be covered here; consult the re docs for a complete listing.
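Method/Attribute | Purpose |
---|---|
match() | Determine if the RE matches at the beginning of the string. |
search() | Scan through a string, looking for any location where this RE matches. |
findall() | Find all substrings where the RE matches, and return them as a list. |
finditer() | Find all substrings where the RE matches, and return them as an iterator. |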
match() and search() return None if no match can be found. If they’re successful, a match object instance is returned, containing information about the match: where it starts and ends, the substring it matched, and more.
You can learn about this by interactively experimenting with the re module. If you have tkinter available, you may also want to look at Tools/demo/redemo.py, a demonstration program included with the Python distribution. It allows you to enter REs and strings, and displays whether the RE matches or fails. redemo.py can be quite useful when trying to debug a complicated RE.
This HOWTO uses the standard Python interpreter for its examples. First, run the Python interpreter, import the re module, and compile a RE:
>>> import re
>>> p = re.compile('[a-z]+')
>>> p
re.compile('[a-z]+')
Now, you can try matching various strings against the RE [a-z]+. An empty string shouldn’t match at all, since + means ‘one or more repetitions’. match() should return None in this case, which will cause the interpreter to print no output. You can explicitly print the result of match() to make this clear.
>>> p.match("")
>>> print(p.match(""))
None
Now, let’s try it on a string that it should match, such as tempo. In this case, match() will return a match object, so you should store the result in a variable for later use.
>>> m = p.match('tempo')
>>> m
<re.Match object; span=(0, 5), match='tempo'>
Now you can query the match object for information about the matching string. Match object instances also have several methods and attributes; the most important ones are:
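Method/Attribute | Purpose |
---|---|
group() | Return the string matched by the RE |
start() | Return the starting position of the match |
end() | Return the ending position of the match |
span() | Return a tuple containing the (start, end) positions of the match |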
Trying these methods will soon clarify their meaning:
>>> m.group()
'tempo'
>>> m.start(), m.end()
(0, 5)
>>> m.span()
(0, 5)
group() returns the substring that was matched by the RE. start() and end() return the starting and ending index of the match. span() returns both start and end indexes in a single tuple. Since the match() method only checks if the RE matches at the start of a string, start() will always be zero. However, the search() method of patterns scans through the string, so the match may not start at zero in that case.
>>> print(p.match('::: message'))
None
>>> m = p.search('::: message'); print(m)
<re.Match object; span=(4, 11), match='message'>
>>> m.group()
'message'
>>> m.span()
(4, 11)
In actual programs, the most common style is to store the match object in a variable, and then check if it was None. This usually looks like:
p = re.compile( ... )
m = p.match( 'string goes here' )
if m:
    print('Match found: ', m.group())
else:
    print('No match')
Two pattern methods return all of the matches for a pattern. findall() returns a list of matching strings:
>>> p = re.compile(r'\d+')
>>> p.findall('12 drummers drumming, 11 pipers piping, 10 lords a-leaping')
['12', '11', '10']
The r prefix, making the literal a raw string literal, is needed in this example because escape sequences in a normal “cooked” string literal that are not recognized by Python, as opposed to regular expressions, now result in a DeprecationWarning and will eventually become a SyntaxError. See The Backslash Plague.
findall() has to create the entire list before it can be returned as the result. The finditer() method returns a sequence of match object instances as an iterator:
>>> iterator = p.finditer('12 drummers drumming, 11 ... 10 ...')
>>> iterator
<callable_iterator object at 0x...>
>>> for match in iterator:
...     print(match.span())
...
(0, 2)
(22, 24)
(29, 31)
Module-Level Functions
You don’t have to create a pattern object and call its methods; the re module also provides top-level functions called match(), search(), findall(), sub(), and so forth. These functions take the same arguments as the corresponding pattern method with the RE string added as the first argument, and still return either None or a match object instance.
>>> print(re.match(r'From\s+', 'Fromage amk'))
None
>>> re.match(r'From\s+', 'From amk Thu May 14 19:12:10 1998')
<re.Match object; span=(0, 5), match='From '>
Under the hood, these functions simply create a pattern object for you and call the appropriate method on it. They also store the compiled object in a cache, so future calls using the same RE won’t need to parse the pattern again and again.
Should you use these module-level functions, or should you get the pattern and call its methods yourself? If you’re accessing a regex within a loop, pre-compiling it will save a few function calls. Outside of loops, there’s not much difference thanks to the internal cache.
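The two styles can be compared in a short sketch. The list of lines and the variable names are assumptions for illustration; both approaches are expected to give the same results, and the only difference is where the compilation (or cache lookup) happens.

```python
import re

# Hypothetical input lines.
lines = ["From amk", "Fromage", "From  guido", "To: nobody"]

# Style 1: compile once, then reuse the pattern object inside the loop.
from_re = re.compile(r"From\s+")
compiled_hits = [line for line in lines if from_re.match(line)]

# Style 2: call the module-level function; re looks the pattern up in
# its internal cache on each call.
module_hits = [line for line in lines if re.match(r"From\s+", line)]

print(compiled_hits)
print(compiled_hits == module_hits)
```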
Compilation Flags
Compilation flags let you modify some aspects of how regular expressions work. Flags are available in the re module under two names, a long name such as IGNORECASE and a short, one-letter form such as I. (If you’re familiar with Perl’s pattern modifiers, the one-letter forms use the same letters; the short form of re.VERBOSE is re.X, for example.) Multiple flags can be specified by bitwise OR-ing them; re.I | re.M sets both the I and M flags, for example.
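A minimal sketch of combining flags follows. The pattern and sample text are assumptions for illustration; re.M (MULTILINE) is described in detail further down in this section.

```python
import re

# The short and long names refer to the same flag.
same_flag = re.I == re.IGNORECASE

# Hypothetical multi-line input.
text = "Spam\nSPAM\neggs"

# Combine case-insensitive and multi-line matching with bitwise OR.
both = re.compile(r"^spam$", re.I | re.M)
both_hits = both.findall(text)

# With only re.I, ^ and $ anchor to the whole string, so nothing matches.
only_i_hits = re.compile(r"^spam$", re.I).findall(text)

print(same_flag, both_hits, only_i_hits)
```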
Here’s a table of the available flags, followed by a more detailed explanation of each one.
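Flag | Meaning |
---|---|
ASCII, A | Makes several escapes like \w, \b, \s and \d match only on ASCII characters with the respective property. |
DOTALL, S | Make . match any character, including newlines. |
IGNORECASE, I | Do case-insensitive matches. |
LOCALE, L | Do a locale-aware match. |
MULTILINE, M | Multi-line matching, affecting ^ and $. |
VERBOSE, X | Enable verbose REs, which can be organized more cleanly and understandably. |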
I
IGNORECASE
Perform case-insensitive matching; character classes and literal strings will match letters by ignoring case. For example, [A-Z] will match lowercase letters, too. Full Unicode matching also works unless the ASCII flag is used to disable non-ASCII matches. When the Unicode patterns [a-z] or [A-Z] are used in combination with the IGNORECASE flag, they will match the 52 ASCII letters and 4 additional non-ASCII letters: ‘İ’ (U+0130, Latin capital letter I with dot above), ‘ı’ (U+0131, Latin small letter dotless i), ‘ſ’ (U+017F, Latin small letter long s) and ‘K’ (U+212A, Kelvin sign). Spam will match 'Spam', 'spam', 'spAM', or 'ſpam' (the latter is matched only in Unicode mode). This lowercasing doesn’t take the current locale into account; it will if you also set the LOCALE flag.
L
LOCALE
Make \w, \W, \b, \B and case-insensitive matching dependent on the current locale instead of the Unicode database.
Locales are a feature of the C library intended to help in writing programs that take account of language differences. For example, if you’re processing encoded French text, you’d want to be able to write \w+ to match words, but \w only matches the character class [A-Za-z] in bytes patterns; it won’t match bytes corresponding to é or ç. If your system is configured properly and a French locale is selected, certain C functions will tell the program that the byte corresponding to é should also be considered a letter. Setting the LOCALE flag when compiling a regular expression will cause the resulting compiled object to use these C functions for \w; this is slower, but also enables \w+ to match French words as you’d expect. The use of this flag is discouraged in Python 3 as the locale mechanism is very unreliable, it only handles one “culture” at a time, and it only works with 8-bit locales. Unicode matching is already enabled by default in Python 3 for Unicode (str) patterns, and it is able to handle different locales/languages.
M
MULTILINE
(^ and $ haven’t been explained yet; they’ll be introduced in section More Metacharacters.)
Usually ^ matches only at the beginning of the string, and $ matches only at the end of the string and immediately before the newline (if any) at the end of the string. When this flag is specified, ^ matches at the beginning of the string and at the beginning of each line within the string, immediately following each newline. Similarly, the $ metacharacter matches both at the end of the string and at the end of each line (immediately preceding each newline).
S
DOTALL
Makes the '.' special character match any character at all, including a newline; without this flag, '.' will match anything except a newline.
A
ASCII
Make \w, \W, \b, \B, \s and \S perform ASCII-only matching instead of full Unicode matching. This is only meaningful for Unicode patterns, and is ignored for byte patterns.
X
VERBOSE
This flag allows you to write regular expressions that are more readable by granting you more flexibility in how you can format them. When this flag has been specified, whitespace within the RE string is ignored, except when the whitespace is in a character class or preceded by an unescaped backslash; this lets you organize and indent the RE more clearly. This flag also lets you put comments within a RE that will be ignored by the engine; comments are marked by a '#' that’s neither in a character class nor preceded by an unescaped backslash.
For example, here’s a RE that uses re.VERBOSE; see how much easier it is to read?
charref = re.compile(r"""
 &[#]                # Start of a numeric entity reference
 (
     0[0-7]+         # Octal form
   | [0-9]+          # Decimal form
   | x[0-9a-fA-F]+   # Hexadecimal form
 )
 ;                   # Trailing semicolon
""", re.VERBOSE)
Without the verbose setting, the RE would look like this:
charref = re.compile("&#(0[0-7]+"
                     "|[0-9]+"
                     "|x[0-9a-fA-F]+);")
In the above example, Python’s automatic concatenation of string literals has been used to break up the RE into smaller pieces, but it’s still more difficult to understand than the version using re.VERBOSE.
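As a quick check that the verbose and compact spellings of the character-reference RE behave the same, the sketch below runs both against a handful of inputs. The sample strings and the helper first_group are assumptions for illustration.

```python
import re

verbose = re.compile(r"""
 &[#]                # Start of a numeric entity reference
 (
     0[0-7]+         # Octal form
   | [0-9]+          # Decimal form
   | x[0-9a-fA-F]+   # Hexadecimal form
 )
 ;                   # Trailing semicolon
""", re.VERBOSE)

compact = re.compile("&#(0[0-7]+|[0-9]+|x[0-9a-fA-F]+);")

def first_group(pattern, s):
    """Return group 1 of a match at the start of s, or None."""
    m = pattern.match(s)
    return m.group(1) if m else None

# Hypothetical inputs: decimal, hexadecimal, octal, and two non-matches.
samples = ["&#65;", "&#x41;", "&#0101;", "&amp;", "&#;"]
verbose_results = [first_group(verbose, s) for s in samples]
compact_results = [first_group(compact, s) for s in samples]

print(verbose_results)
print(verbose_results == compact_results)
```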
More Pattern Power
So far we’ve only covered a part of the features of regular expressions. In this section, we’ll cover some new metacharacters, and how to use groups to retrieve portions of the text that was matched.
More Metacharacters
There are some metacharacters that we haven’t covered yet. Most of them will be covered in this section.
Some of the remaining metacharacters to be discussed are zero-width assertions. They don’t cause the engine to advance through the string; instead, they consume no characters at all, and simply succeed or fail. For example, \b is an assertion that the current position is located at a word boundary; the position isn’t changed by the \b at all. This means that zero-width assertions should never be repeated, because if they match once at a given location, they can obviously be matched an infinite number of times.
|
Alternation, or the “or” operator. If A and B are regular expressions, A|B will match any string that matches either A or B. | has very low precedence in order to make it work reasonably when you’re alternating multi-character strings. Crow|Servo will match either 'Crow' or 'Servo', not 'Cro', a 'w' or an 'S', and 'ervo'.
To match a literal '|', use \|, or enclose it inside a character class, as in [|].
^
Matches at the beginning of lines. Unless the MULTILINE flag has been set, this will only match at the beginning of the string. In MULTILINE mode, this also matches immediately after each newline within the string.
For example, if you wish to match the word From only at the beginning of a line, the RE to use is ^From.
>>> print(re.search('^From', 'From Here to Eternity'))
<re.Match object; span=(0, 4), match='From'>
>>> print(re.search('^From', 'Reciting From Memory'))
None
To match a literal '^', use \^.
$
Matches at the end of a line, which is defined as either the end of the string, or any location followed by a newline character.
>>> print(re.search('}$', '{block}'))
<re.Match object; span=(6, 7), match='}'>
>>> print(re.search('}$', '{block} '))
None
>>> print(re.search('}$', '{block}\n'))
<re.Match object; span=(6, 7), match='}'>
To match a literal '$', use \$ or enclose it inside a character class, as in [$].
\A
Matches only at the start of the string. When not in MULTILINE mode, \A and ^ are effectively the same. In MULTILINE mode, they’re different: \A still matches only at the beginning of the string, but ^ may match at any location inside the string that follows a newline character.
\Z
Matches only at the end of the string.仅在字符串末尾匹配。\b
Word boundary. This is a zero-width assertion that matches only at the beginning or end of a word. A word is defined as a sequence of alphanumeric characters, so the end of a word is indicated by whitespace or a non-alphanumeric character.
The following example matches class only when it’s a complete word; it won’t match when it’s contained inside another word.
>>> p = re.compile(r'\bclass\b')
>>> print(p.search('no class at all'))
<re.Match object; span=(3, 8), match='class'>
>>> print(p.search('the declassified algorithm'))
None
>>> print(p.search('one subclass is'))
None
There are two subtleties you should remember when using this special sequence. First, this is the worst collision between Python’s string literals and regular expression sequences. In Python’s string literals, \b is the backspace character, ASCII value 8. If you’re not using raw strings, then Python will convert the \b to a backspace, and your RE won’t match as you expect it to. The following example looks the same as our previous RE, but omits the 'r' in front of the RE string.
>>> p = re.compile('\bclass\b')
>>> print(p.search('no class at all'))
None
>>> print(p.search('\b' + 'class' + '\b'))
<re.Match object; span=(0, 7), match='\x08class\x08'>
Second, inside a character class, where there’s no use for this assertion, \b represents the backspace character, for compatibility with Python’s string literals.
\B
Another zero-width assertion, this is the opposite of \b, only matching when the current position is not at a word boundary.
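As a rough illustration of how the anchors above differ, the following sketch runs ^, $, \A and \Z against a two-line string. The sample text and the variable names are made up for this example.

```python
import re

text = 'From: alice\nFrom: bob\n'

# Without MULTILINE, ^ matches only at the start of the string.
plain = re.findall(r'^From', text)

# With MULTILINE, ^ also matches right after each newline.
multi = re.findall(r'^From', text, re.MULTILINE)

# \A ignores MULTILINE and still matches only at the very start.
start_only = re.findall(r'\AFrom', text, re.MULTILINE)

# $ may match just before a trailing newline; \Z only at the true end.
dollar = re.search(r'bob$', text)
strict_end = re.search(r'bob\Z', text)

print(plain, multi, start_only)   # ['From'] ['From', 'From'] ['From']
print(dollar is not None, strict_end is not None)   # True False
```

If the trailing newline were stripped from the sample text, both of the last two searches would succeed.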
Grouping
Frequently you need to obtain more information than just whether the RE matched or not. Regular expressions are often used to dissect strings by writing a RE divided into several subgroups which match different components of interest. For example, an RFC-822 header line is divided into a header name and a value, separated by a ':', like this:
From: author@example.com
User-Agent: Thunderbird 1.5.0.9 (X11/20061227)
MIME-Version: 1.0
To: editor@example.com
This can be handled by writing a regular expression which matches an entire header line, and has one group which matches the header name, and another group which matches the header’s value.
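A minimal sketch of such an expression is shown here, assuming each header fits on a single line. The name header_re and the exact pattern are illustrative choices, not something prescribed by this document; the parentheses it relies on are explained in the rest of this section.

```python
import re

# Hypothetical pattern: group 1 is the header name, group 2 is the value.
header_re = re.compile(r'([^:\s]+)\s*:\s*(.*)')

lines = [
    'From: author@example.com',
    'User-Agent: Thunderbird 1.5.0.9 (X11/20061227)',
    'MIME-Version: 1.0',
    'To: editor@example.com',
]

parsed = []
for line in lines:
    m = header_re.match(line)
    if m:
        # Collect (name, value) pairs from the two groups.
        parsed.append((m.group(1), m.group(2)))

print(parsed[0])   # ('From', 'author@example.com')
```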
Groups are marked by the '(', ')' metacharacters. '(' and ')' have much the same meaning as they do in mathematical expressions; they group together the expressions contained inside them, and you can repeat the contents of a group with a repeating qualifier, such as *, +, ?, or {m,n}. For example, (ab)* will match zero or more repetitions of ab.
>>> p = re.compile('(ab)*')
>>> print(p.match('ababababab').span())
(0, 10)
Groups indicated with '(', ')' also capture the starting and ending index of the text that they match; this can be retrieved by passing an argument to group(), start(), end(), and span(). Groups are numbered starting with 0. Group 0 is always present; it’s the whole RE, so match object methods all have group 0 as their default argument. Later we’ll see how to express groups that don’t capture the span of text that they match.
>>> p = re.compile('(a)b')
>>> m = p.match('ab')
>>> m.group()
'ab'
>>> m.group(0)
'ab'
Subgroups are numbered from left to right, from 1 upward. Groups can be nested; to determine the number, just count the opening parenthesis characters, going from left to right.
>>> p = re.compile('(a(b)c)d')
>>> m = p.match('abcd')
>>> m.group(0)
'abcd'
>>> m.group(1)
'abc'
>>> m.group(2)
'b'
group() can be passed multiple group numbers at a time, in which case it will return a tuple containing the corresponding values for those groups.
>>> m.group(2,1,2)
('b', 'abc', 'b')
The groups() method returns a tuple containing the strings for all the subgroups, from 1 up to however many there are.
>>> m.groups()
('abc', 'b')
Backreferences in a pattern allow you to specify that the contents of an earlier capturing group must also be found at the current location in the string. For example, \1 will succeed if the exact contents of group 1 can be found at the current position, and fails otherwise. Remember that Python’s string literals also use a backslash followed by numbers to allow including arbitrary characters in a string, so be sure to use a raw string when incorporating backreferences in a RE.
For example, the following RE detects doubled words in a string.
>>> p = re.compile(r'\b(\w+)\s+\1\b')
>>> p.search('Paris in the the spring').group()
'the the'
Backreferences like this aren’t often useful for just searching through a string, since few text formats repeat data in this way, but you’ll soon find out that they’re very useful when performing string substitutions.
Non-capturing and Named Groups
Elaborate REs may use many groups, both to capture substrings of interest, and to group and structure the RE itself. In complex REs, it becomes difficult to keep track of the group numbers. There are two features which help with this problem. Both of them use a common syntax for regular expression extensions, so we’ll look at that first.
Perl 5 is well known for its powerful additions to standard regular expressions. For these new features the Perl developers couldn’t choose new single-keystroke metacharacters or new special sequences beginning with \ without making Perl’s regular expressions confusingly different from standard REs. If they chose & as a new metacharacter, for example, old expressions would be assuming that & was a regular character and wouldn’t have escaped it by writing \& or [&].
The solution chosen by the Perl developers was to use (?...) as the extension syntax. ? immediately after a parenthesis was a syntax error because the ? would have nothing to repeat, so this didn’t introduce any compatibility problems. The characters immediately after the ? indicate what extension is being used, so (?=foo) is one thing (a positive lookahead assertion) and (?:foo) is something else (a non-capturing group containing the subexpression foo).
Python supports several of Perl’s extensions and adds an extension syntax to Perl’s extension syntax. If the first character after the question mark is a P, you know that it’s an extension that’s specific to Python.
Now that we’ve looked at the general extension syntax, we can return to the features that simplify working with groups in complex REs.
Sometimes you’ll want to use a group to denote a part of a regular expression, but aren’t interested in retrieving the group’s contents. You can make this fact explicit by using a non-capturing group: (?:...), where you can replace the ... with any other regular expression.
>>> m = re.match("([abc])+", "abc")
>>> m.groups()
('c',)
>>> m = re.match("(?:[abc])+", "abc")
>>> m.groups()
()
Except for the fact that you can’t retrieve the contents of what the group matched, a non-capturing group behaves exactly the same as a capturing group; you can put anything inside it, repeat it with a repetition metacharacter such as *, and nest it within other groups (capturing or non-capturing). (?:...) is particularly useful when modifying an existing pattern, since you can add new groups without changing how all the other groups are numbered. It should be mentioned that there’s no performance difference in searching between capturing and non-capturing groups; neither form is any faster than the other.
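The point about group numbering can be sketched as follows. Both patterns and their names are invented for illustration: the revised pattern gains an optional prefix, and because the prefix sits in a non-capturing group, code that reads groups 1 and 2 keeps working.

```python
import re

# Original pattern: group 1 is the year, group 2 is the month.
date_re = re.compile(r'(\d{4})-(\d{2})')

# Revised pattern: an optional prefix is grouped with (?:...), so the
# year and month keep their group numbers.
revised_re = re.compile(r'(?:date|on)?\s*(\d{4})-(\d{2})')

before = date_re.match('2006-12').groups()
after = revised_re.match('date 2006-12').groups()
print(before, after)   # ('2006', '12') ('2006', '12')
```

Had the prefix been written as (date|on)?, the year would have moved to group 2 and the month to group 3.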
A more significant feature is named groups: instead of referring to them by numbers, groups can be referenced by a name.
The syntax for a named group is one of the Python-specific extensions: (?P<name>...). name is, obviously, the name of the group. Named groups behave exactly like capturing groups, and additionally associate a name with a group. The match object methods that deal with capturing groups all accept either integers that refer to the group by number or strings that contain the desired group’s name. Named groups are still given numbers, so you can retrieve information about a group in two ways:
>>> p = re.compile(r'(?P<word>\b\w+\b)')
>>> m = p.search( '(((( Lots of punctuation )))' )
>>> m.group('word')
'Lots'
>>> m.group(1)
'Lots'
Additionally, you can retrieve named groups as a dictionary with groupdict():
>>> m = re.match(r'(?P<first>\w+) (?P<last>\w+)', 'Jane Doe')
>>> m.groupdict()
{'first': 'Jane', 'last': 'Doe'}
Named groups are handy because they let you use easily-remembered names, instead of having to remember numbers. Here’s an example RE from the imaplib module:
InternalDate = re.compile(r'INTERNALDATE "'
r'(?P<day>[ 123][0-9])-(?P<mon>[A-Z][a-z][a-z])-'
r'(?P<year>[0-9][0-9][0-9][0-9])'
r' (?P<hour>[0-9][0-9]):(?P<min>[0-9][0-9]):(?P<sec>[0-9][0-9])'
r' (?P<zonen>[-+])(?P<zoneh>[0-9][0-9])(?P<zonem>[0-9][0-9])'
r'"')
It’s obviously much easier to retrieve m.group('zonem'), instead of having to remember to retrieve group 9.
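To see the named groups in action, the sketch below applies the pattern above to a sample string. The sample value is an assumption chosen to fit the pattern; it is not taken from this document.

```python
import re

InternalDate = re.compile(r'INTERNALDATE "'
        r'(?P<day>[ 123][0-9])-(?P<mon>[A-Z][a-z][a-z])-'
        r'(?P<year>[0-9][0-9][0-9][0-9])'
        r' (?P<hour>[0-9][0-9]):(?P<min>[0-9][0-9]):(?P<sec>[0-9][0-9])'
        r' (?P<zonen>[-+])(?P<zoneh>[0-9][0-9])(?P<zonem>[0-9][0-9])'
        r'"')

# Assumed sample input for illustration.
sample = 'INTERNALDATE "17-Jul-1996 02:44:25 -0700"'
m = InternalDate.match(sample)

# The same text is reachable by name or by number.
print(m.group('zonem'), m.group(9))   # 00 00

# groupdict() returns every named group at once.
fields = m.groupdict()
print(fields['day'], fields['mon'], fields['year'])   # 17 Jul 1996
```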
The syntax for backreferences in an expression such as (...)\1 refers to the number of the group. There’s naturally a variant that uses the group name instead of the number. This is another Python extension: (?P=name) indicates that the contents of the group called name should again be matched at the current point. The regular expression for finding doubled words, \b(\w+)\s+\1\b can also be written as \b(?P<word>\w+)\s+(?P=word)\b:
>>> p = re.compile(r'\b(?P<word>\w+)\s+(?P=word)\b')
>>> p.search('Paris in the the spring').group()
'the the'
Lookahead Assertions
Another zero-width assertion is the lookahead assertion. Lookahead assertions are available in both positive and negative form, and look like this:
(?=...)
Positive lookahead assertion. This succeeds if the contained regular expression, represented here by ..., successfully matches at the current location, and fails otherwise. But, once the contained expression has been tried, the matching engine doesn’t advance at all; the rest of the pattern is tried right where the assertion started.
(?!...)
Negative lookahead assertion. This is the opposite of the positive assertion; it succeeds if the contained expression doesn’t match at the current position in the string.
To make this concrete, let’s look at a case where a lookahead is useful. Consider a simple pattern to match a filename and split it apart into a base name and an extension, separated by a dot (.). For example, in news.rc, news is the base name, and rc is the filename’s extension.
The pattern to match this is quite simple:
.*[.].*$
Notice that the . needs to be treated specially because it’s a metacharacter, so it’s inside a character class to only match that specific character. Also notice the trailing $; this is added to ensure that all the rest of the string must be included in the extension. This regular expression matches foo.bar and autoexec.bat and sendmail.cf and printers.conf.
Now, consider complicating the problem a bit; what if you want to match filenames where the extension is not bat? Some incorrect attempts:
.*[.][^b].*$
The first attempt above tries to exclude bat by requiring that the first character of the extension is not a b. This is wrong, because the pattern also doesn’t match foo.bar.
.*[.]([^b]..|.[^a].|..[^t])$
The expression gets messier when you try to patch up the first solution by requiring one of the following cases to match: the first character of the extension isn’t b; the second character isn’t a; or the third character isn’t t. This accepts foo.bar and rejects autoexec.bat, but it requires a three-letter extension and won’t accept a filename with a two-letter extension such as sendmail.cf. We’ll complicate the pattern again in an effort to fix it.
.*[.]([^b].?.?|.[^a]?.?|..?[^t]?)$
In the third attempt, the second and third letters are all made optional in order to allow matching extensions shorter than three characters, such as sendmail.cf.
The pattern’s getting really complicated now, which makes it hard to read and understand. Worse, if the problem changes and you want to exclude both bat and exe as extensions, the pattern would get even more complicated and confusing.
A negative lookahead cuts through all this confusion:
.*[.](?!bat$)[^.]*$
The negative lookahead means: if the expression bat doesn’t match at this point, try the rest of the pattern; if bat$ does match, the whole pattern will fail. The trailing $ is required to ensure that something like sample.batch, where the extension only starts with bat, will be allowed. The [^.]* makes sure that the pattern works when there are multiple dots in the filename.
Excluding another filename extension is now easy; simply add it as an alternative inside the assertion. The following pattern excludes filenames that end in either bat or exe:
.*[.](?!bat$|exe$)[^.]*$
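One way to check the final pattern is to run it over a handful of filenames. The list of names below is invented for this sketch.

```python
import re

no_bat_or_exe = re.compile(r'.*[.](?!bat$|exe$)[^.]*$')

# Invented sample filenames.
names = ['foo.bar', 'autoexec.bat', 'sendmail.cf', 'sample.batch',
         'setup.exe', 'archive.tar.gz']

# Keep only the names whose extension is neither bat nor exe.
accepted = [name for name in names if no_bat_or_exe.match(name)]
print(accepted)
# ['foo.bar', 'sendmail.cf', 'sample.batch', 'archive.tar.gz']
```

Note that sample.batch is accepted because the $ inside the assertion requires bat to be the entire extension.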
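Modifying Strings
Up to this point, we’ve simply performed searches against a static string. Regular expressions are also commonly used to modify strings in various ways, using the following pattern methods:
Method/Attribute | Purpose |
---|---|
split() | Split the string into a list, splitting it wherever the RE matches |
sub() | Find all substrings where the RE matches, and replace them with a different string |
subn() | Does the same thing as sub(), but returns the new string and the number of replacements |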
Splitting Strings
The split() method of a pattern splits a string apart wherever the RE matches, returning a list of the pieces. It’s similar to the split() method of strings but provides much more generality in the delimiters that you can split by; string split() only supports splitting by whitespace or by a fixed string. As you’d expect, there’s a module-level re.split() function, too.
.split(string[, maxsplit=0])
Split string by the matches of the regular expression. If capturing parentheses are used in the RE, then their contents will also be returned as part of the resulting list. If maxsplit is nonzero, at most maxsplit splits are performed.
You can limit the number of splits made, by passing a value for maxsplit. When maxsplit is nonzero, at most maxsplit splits will be made, and the remainder of the string is returned as the final element of the list. In the following example, the delimiter is any sequence of non-alphanumeric characters.
>>> p = re.compile(r'\W+')
>>> p.split('This is a test, short and sweet, of split().')
['This', 'is', 'a', 'test', 'short', 'and', 'sweet', 'of', 'split', '']
>>> p.split('This is a test, short and sweet, of split().', 3)
['This', 'is', 'a', 'test, short and sweet, of split().']
Sometimes you’re not only interested in what the text between delimiters is, but also need to know what the delimiter was. If capturing parentheses are used in the RE, then their values are also returned as part of the list. Compare the following calls:
>>> p = re.compile(r'\W+')
>>> p2 = re.compile(r'(\W+)')
>>> p.split('This... is a test.')
['This', 'is', 'a', 'test', '']
>>> p2.split('This... is a test.')
['This', '... ', 'is', ' ', 'a', ' ', 'test', '.', '']
The module-level function re.split() adds the RE to be used as the first argument, but is otherwise the same.
>>> re.split(r'[\W]+', 'Words, words, words.')
['Words', 'words', 'words', '']
>>> re.split(r'([\W]+)', 'Words, words, words.')
['Words', ', ', 'words', ', ', 'words', '.', '']
>>> re.split(r'[\W]+', 'Words, words, words.', 1)
['Words', 'words, words.']
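As a further sketch of where maxsplit helps, the example below splits a made-up record into fields and then splits each field only once, so that the value part is left intact. The record format and the variable names are assumptions for illustration.

```python
import re

# Made-up record with inconsistent spacing around the delimiters.
record = 'name = Jane Doe; role=editor ;team = docs'

# Split on semicolons with optional surrounding whitespace.
fields = re.split(r'\s*;\s*', record)

# Split each field once on '=', so any later '=' would stay in the value.
pairs = dict(re.split(r'\s*=\s*', field, maxsplit=1) for field in fields)
print(pairs)
# {'name': 'Jane Doe', 'role': 'editor', 'team': 'docs'}
```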
Search and Replace
Another common task is to find all the matches for a pattern, and replace them with a different string. The sub() method takes a replacement value, which can be either a string or a function, and the string to be processed.
.sub(replacement, string[, count=0])
Returns the string obtained by replacing the leftmost non-overlapping occurrences of the RE in string by the replacement replacement. If the pattern isn’t found, string is returned unchanged.
The optional argument count is the maximum number of pattern occurrences to be replaced; count must be a non-negative integer. The default value of 0 means to replace all occurrences.
Here’s a simple example of using the sub() method. It replaces colour names with the word colour:
>>> p = re.compile('(blue|white|red)')
>>> p.sub('colour', 'blue socks and red shoes')
'colour socks and colour shoes'
>>> p.sub('colour', 'blue socks and red shoes', count=1)
'colour socks and red shoes'
The subn() method does the same work, but returns a 2-tuple containing the new string value and the number of replacements that were performed:
>>> p = re.compile('(blue|white|red)')
>>> p.subn('colour', 'blue socks and red shoes')
('colour socks and colour shoes', 2)
>>> p.subn('colour', 'no colours at all')
('no colours at all', 0)
Empty matches are replaced only when they’re not adjacent to a previous empty match.
>>> p = re.compile('x*')
>>> p.sub('-', 'abxd')
'-a-b--d-'
If replacement is a string, any backslash escapes in it are processed. That is, \n is converted to a single newline character, \r is converted to a carriage return, and so forth. Unknown escapes such as \& are left alone. Backreferences, such as \6, are replaced with the substring matched by the corresponding group in the RE. This lets you incorporate portions of the original text in the resulting replacement string.
This example matches the word section followed by a string enclosed in {, }, and changes section to subsection:
>>> p = re.compile('section{ ( [^}]* ) }', re.VERBOSE)
>>> p.sub(r'subsection{\1}','section{First} section{second}')
'subsection{First} subsection{second}'
There’s also a syntax for referring to named groups as defined by the (?P<name>...) syntax. \g<name> will use the substring matched by the group named name, and \g<number> uses the corresponding group number. \g<2> is therefore equivalent to \2, but isn’t ambiguous in a replacement string such as \g<2>0. (\20 would be interpreted as a reference to group 20, not a reference to group 2 followed by the literal character '0'.) The following substitutions are all equivalent, but use all three variations of the replacement string.
>>> p = re.compile('section{ (?P<name> [^}]* ) }', re.VERBOSE)
>>> p.sub(r'subsection{\1}','section{First}')
'subsection{First}'
>>> p.sub(r'subsection{\g<1>}','section{First}')
'subsection{First}'
>>> p.sub(r'subsection{\g<name>}','section{First}')
'subsection{First}'
replacement can also be a function, which gives you even more control. If replacement is a function, the function is called for every non-overlapping occurrence of pattern. On each call, the function is passed a match object argument for the match and can use this information to compute the desired replacement string and return it.
In the following example, the replacement function translates decimals into hexadecimal:
>>> def hexrepl(match):
...     "Return the hex string for a decimal number"
...     value = int(match.group())
...     return hex(value)
...
>>> p = re.compile(r'\d+')
>>> p.sub(hexrepl, 'Call 65490 for printing, 49152 for user code.')
'Call 0xffd2 for printing, 0xc000 for user code.'
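A replacement function can also read individual groups from the match object. The sketch below, with an invented pattern and function name, rewrites 'Last, First' pairs as 'First Last'.

```python
import re

def swap_names(match):
    """Return 'First Last' for a matched 'Last, First' pair."""
    return f"{match.group('first')} {match.group('last')}"

# Hypothetical pattern: two words separated by a comma and a space.
p = re.compile(r'(?P<last>\w+), (?P<first>\w+)')
result = p.sub(swap_names, 'Doe, Jane and Kuchling, Andrew')
print(result)   # Jane Doe and Andrew Kuchling
```

The same result could be had with the replacement string r'\g<first> \g<last>'; a function becomes worthwhile once the replacement needs real computation.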
When using the module-level re.sub() function, the pattern is passed as the first argument. The pattern may be provided as an object or as a string; if you need to specify regular expression flags, you can pass them through the flags keyword argument, use a pattern object as the first parameter, or use embedded modifiers in the pattern string, e.g. sub("(?i)b+", "x", "bbbb BBBB") returns 'x x'.
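The ways of supplying flags can be compared side by side. In this sketch the first two forms are the ones described above; the third relies on the flags keyword argument that re.sub() accepts in current Python 3 releases.

```python
import re

text = 'bbbb BBBB'

# 1. Embedded modifier inside the pattern string.
embedded = re.sub('(?i)b+', 'x', text)

# 2. A pattern object compiled with the flag.
compiled = re.compile('b+', re.IGNORECASE).sub('x', text)

# 3. The flags keyword argument of re.sub().
keyword = re.sub('b+', 'x', text, flags=re.IGNORECASE)

print(embedded, compiled, keyword)   # x x x x x x
```

Passing the flag by keyword matters: the fourth positional argument of re.sub() is count, not flags.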
Common Problems
Regular expressions are a powerful tool for some applications, but in some ways their behaviour isn’t intuitive and at times they don’t behave the way you may expect them to. This section will point out some of the most common pitfalls.
Use String Methods
Sometimes using the re module is a mistake. If you’re matching a fixed string, or a single character class, and you’re not using any re features such as the IGNORECASE flag, then the full power of regular expressions may not be required. Strings have several methods for performing operations with fixed strings and they’re usually much faster, because the implementation is a single small C loop that’s been optimized for the purpose, instead of the large, more generalized regular expression engine.
One example might be replacing a single fixed string with another one; for example, you might replace word with deed. re.sub() seems like the function to use for this, but consider the replace() method. Note that replace() will also replace word inside words, turning swordfish into sdeedfish, but the naive RE word would have done that, too. (To avoid performing the substitution on parts of words, the pattern would have to be \bword\b, in order to require that word have a word boundary on either side. This takes the job beyond replace()’s abilities.)
Another common task is deleting every occurrence of a single character from a string or replacing it with another single character. You might do this with something like re.sub('\n', ' ', S), but translate() is capable of doing both tasks and will be faster than any regular expression operation can be.
In short, before turning to the re module, consider whether your problem can be solved with a faster and simpler string method.
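The trade-offs described in this section can be sketched on one made-up string. No timing is attempted here; the example only shows which calls give the same result and where the RE is genuinely needed.

```python
import re

# Made-up sample text.
s = 'swordfish and a word\nor two'

# Fixed-string replacement: str.replace() and the naive RE agree.
same = s.replace('word', 'deed') == re.sub('word', 'deed', s)

# Whole-word replacement needs the RE with word boundaries.
whole_words = re.sub(r'\bword\b', 'deed', s)

# Single-character replacement: str.translate() with a table built by
# str.maketrans() gives the same result as re.sub('\n', ' ', s).
table = str.maketrans('\n', ' ')
one_line = s.translate(table)

print(same)              # True
print(repr(whole_words)) # 'swordfish and a deed\nor two'
print(one_line)          # swordfish and a word or two
```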
match() versus search()
The match() function only checks if the RE matches at the beginning of the string while search() will scan forward through the string for a match. It’s important to keep this distinction in mind. Remember, match() will only report a successful match which will start at 0; if the match wouldn’t start at zero, match() will not report it.
>>> print(re.match('super', 'superstition').span())
(0, 5)
>>> print(re.match('super', 'insuperable'))
None
On the other hand, search() will scan forward through the string, reporting the first match it finds.
>>> print(re.search('super', 'superstition').span())
(0, 5)
>>> print(re.search('super', 'insuperable').span())
(2, 7)
Sometimes you’ll be tempted to keep using re.match(), and just add .* to the front of your RE. Resist this temptation and use re.search() instead. The regular expression compiler does some analysis of REs in order to speed up the process of looking for a match. One such analysis figures out what the first character of a match must be; for example, a pattern starting with Crow must match starting with a 'C'. The analysis lets the engine quickly scan through the string looking for the starting character, only trying the full match if a 'C' is found.
Adding .* defeats this optimization, requiring scanning to the end of the string and then backtracking to find a match for the rest of the RE. Use re.search() instead.
Greedy versus Non-Greedy
When repeating a regular expression, as in a*, the resulting action is to consume as much of the string as possible. This fact often bites you when you’re trying to match a pair of balanced delimiters, such as the angle brackets surrounding an HTML tag. The naive pattern for matching a single HTML tag doesn’t work because of the greedy nature of .*.
>>> s = '<html><head><title>Title</title>'
>>> len(s)
32
>>> print(re.match('<.*>', s).span())
(0, 32)
>>> print(re.match('<.*>', s).group())
<html><head><title>Title</title>
The RE matches the '<' in '<html>', and the .* consumes the rest of the string. There’s still more left in the RE, though, and the > can’t match at the end of the string, so the regular expression engine has to backtrack character by character until it finds a match for the >. The final match extends from the '<' in '<html>' to the '>' in '</title>', which isn’t what you want.
In this case, the solution is to use the non-greedy qualifiers *?, +?, ??, or {m,n}?, which match as little text as possible. In the above example, the '>' is tried immediately after the first '<' matches, and when it fails, the engine advances a character at a time, retrying the '>' at every step. This produces just the right result:
>>> print(re.match('<.*?>', s).group())
<html>
(Note that parsing HTML or XML with regular expressions is painful. Quick-and-dirty patterns will handle common cases, but HTML and XML have special cases that will break the obvious regular expression; by the time you’ve written a regular expression that handles all of the possible cases, the patterns will be very complicated. Use an HTML or XML parser module for such tasks.)
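The difference shows up clearly with findall(), which collects every match. The sketch below reuses the string from the example above; the negated character class in the last call is an alternative not discussed in this section, included only for comparison.

```python
import re

s = '<html><head><title>Title</title>'

# Greedy: one match spanning the whole string.
greedy = re.findall('<.*>', s)

# Non-greedy: each tag is matched separately.
lazy = re.findall('<.*?>', s)

# A negated character class is another common way to stop at the first '>'.
negated = re.findall('<[^>]*>', s)

print(greedy)   # ['<html><head><title>Title</title>']
print(lazy)     # ['<html>', '<head>', '<title>', '</title>']
```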
Using re.VERBOSE
By now you’ve probably noticed that regular expressions are a very compact notation, but they’re not terribly readable. REs of moderate complexity can become lengthy collections of backslashes, parentheses, and metacharacters, making them difficult to read and understand.
For such REs, specifying the re.VERBOSE flag when compiling the regular expression can be helpful, because it allows you to format the regular expression more clearly.
The re.VERBOSE flag has several effects. Whitespace in the regular expression that isn’t inside a character class is ignored. This means that an expression such as dog | cat is equivalent to the less readable dog|cat, but [a b] will still match the characters 'a', 'b', or a space. In addition, you can also put comments inside a RE; comments extend from a # character to the next newline. When used with triple-quoted strings, this enables REs to be formatted more neatly:
pat = re.compile(r"""
\s* # Skip leading whitespace
(?P<header>[^:]+) # Header name
\s* : # Whitespace, and a colon
(?P<value>.*?) # The header's value -- *? used to
# lose the following trailing whitespace
\s*$ # Trailing whitespace to end-of-line
""", re.VERBOSE)
This is far more readable than:
pat = re.compile(r"\s*(?P<header>[^:]+)\s*:(?P<value>.*?)\s*$")
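As a quick check that the two spellings behave alike, the sketch below compiles both and applies them to one header line. The sample line and the name compact are assumptions for illustration.

```python
import re

pat = re.compile(r"""
 \s*                 # Skip leading whitespace
 (?P<header>[^:]+)   # Header name
 \s* :               # Whitespace, and a colon
 (?P<value>.*?)      # The header's value -- *? used to
                     # lose the following trailing whitespace
 \s*$                # Trailing whitespace to end-of-line
""", re.VERBOSE)

compact = re.compile(r"\s*(?P<header>[^:]+)\s*:(?P<value>.*?)\s*$")

# Assumed sample line with stray whitespace at both ends.
line = '  User-Agent: Thunderbird 1.5.0.9   '
m = pat.match(line)
m2 = compact.match(line)

# The space after the colon is part of the value; only trailing
# whitespace is dropped by the non-greedy .*? followed by \s*$.
print(repr(m.group('header')), repr(m.group('value')))
# 'User-Agent' ' Thunderbird 1.5.0.9'
print(m.groupdict() == m2.groupdict())   # True
```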
Feedback
Regular expressions are a complicated topic. Did this document help you understand them? Were there parts that were unclear, or problems you encountered that weren’t covered here? If so, please send suggestions for improvements to the author.
The most complete book on regular expressions is almost certainly Jeffrey Friedl’s Mastering Regular Expressions, published by O’Reilly. Unfortunately, it exclusively concentrates on Perl and Java’s flavours of regular expressions, and doesn’t contain any Python material at all, so it won’t be useful as a reference for programming in Python. (The first edition covered Python’s now-removed regex module, which won’t help you much.) Consider checking it out from your library.