re: Regular expression operations

Source code: Lib/re.py


This module provides regular expression matching operations similar to those found in Perl.

Both patterns and strings to be searched can be Unicode strings (str) as well as 8-bit strings (bytes). However, Unicode strings and 8-bit strings cannot be mixed: that is, you cannot match a Unicode string with a byte pattern or vice versa; similarly, when asking for a substitution, the replacement string must be of the same type as both the pattern and the search string.

Regular expressions use the backslash character ('\') to indicate special forms or to allow special characters to be used without invoking their special meaning. This collides with Python's usage of the same character for the same purpose in string literals; for example, to match a literal backslash, one might have to write '\\\\' as the pattern string, because the regular expression must be \\, and each backslash must be expressed as \\ inside a regular Python string literal. Also, note that any invalid escape sequence in a Python string literal now generates a DeprecationWarning, and in the future this will become a SyntaxError. This happens even if the sequence is a valid escape sequence for a regular expression.

The solution is to use Python's raw string notation for regular expression patterns; backslashes are not handled in any special way in a string literal prefixed with 'r'. So r"\n" is a two-character string containing '\' and 'n', while "\n" is a one-character string containing a newline. Usually patterns will be expressed in Python code using this raw string notation.
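As a small illustration of the point above, the following sketch compares a regular literal with a raw literal for the same pattern. The sample text and variable names are made up for the example.

```python
import re

# Both literals below produce the same two-character pattern: an escaped
# backslash, which matches one literal backslash in the searched string.
plain_pattern = '\\\\'   # regular literal: four backslashes in source
raw_pattern = r'\\'      # raw literal: two backslashes in source
same_pattern = plain_pattern == raw_pattern

path = 'C:\\temp'        # hypothetical sample text: C:, one backslash, temp
found = re.search(raw_pattern, path)
print(same_pattern, found.start())   # True 2

# r"\n" is a backslash followed by 'n'; "\n" is a single newline.
print(len(r"\n"), len("\n"))         # 2 1
```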

Most regular expression operations are available both as module-level functions and as methods on compiled regular expressions. The functions are shortcuts that don't require you to compile a regex object first, but they lack some fine-tuning parameters.

See also

The third-party regex module, which has an API compatible with the standard library re module, but offers additional functionality and more thorough Unicode support.

Regular Expression Syntax

A regular expression (or RE) specifies a set of strings that matches it; the functions in this module let you check if a particular string matches a given regular expression (or if a given regular expression matches a particular string, which comes down to the same thing).

Regular expressions can be concatenated to form new regular expressions; if A and B are both regular expressions, then AB is also a regular expression. In general, if a string p matches A and another string q matches B, the string pq will match AB. This holds unless A or B contains low-precedence operations, there are boundary conditions between A and B, or either has numbered group references. Thus, complex expressions can easily be constructed from simpler primitive expressions like the ones described here. For details of the theory and implementation of regular expressions, consult the Friedl book [Frie09], or almost any textbook about compiler construction.

A brief explanation of the format of regular expressions follows. For further information and a gentler presentation, consult the Regular Expression HOWTO.

Regular expressions can contain both special and ordinary characters. Most ordinary characters, like 'A', 'a', or '0', are the simplest regular expressions; they simply match themselves. You can concatenate ordinary characters, so last matches the string 'last'. (In the rest of this section, we'll write REs in this special style, usually without quotes, and strings to be matched 'in single quotes'.)

Some characters, like '|' or '(', are special. Special characters either stand for classes of ordinary characters, or affect how the regular expressions around them are interpreted.

Repetition qualifiers (*, +, ?, {m,n}, etc.) cannot be directly nested. This avoids ambiguity with the non-greedy modifier suffix ?, and with other modifiers in other implementations. To apply a second repetition to an inner repetition, parentheses may be used. For example, the expression (?:a{6})* matches any multiple of six 'a' characters.
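A minimal sketch of the nesting rule, assuming a current CPython where a{6}* is reported as a compile error; the variable names are illustrative only.

```python
import re

# Stacking one repetition directly on another is rejected at compile time.
try:
    re.compile(r'a{6}*')
    nested_ok = True
except re.error:
    nested_ok = False
print(nested_ok)   # False

# Wrapping the inner repetition in a non-capturing group makes it legal.
six_as = re.compile(r'(?:a{6})*')
print(bool(six_as.fullmatch('a' * 12)))   # True: 12 is a multiple of six
print(bool(six_as.fullmatch('a' * 7)))    # False: 7 is not
```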

The special characters are:

.

(Dot.) In the default mode, this matches any character except a newline. If the DOTALL flag has been specified, this matches any character including a newline.

^

(Caret.) Matches the start of the string, and in MULTILINE mode also matches immediately after each newline.

$

Matches the end of the string or just before the newline at the end of the string, and in MULTILINE mode also matches before a newline. foo matches both 'foo' and 'foobar', while the regular expression foo$ matches only 'foo'. More interestingly, searching for foo.$ in 'foo1\nfoo2\n' matches 'foo2' normally, but 'foo1' in MULTILINE mode; searching for a single $ in 'foo\n' will find two (empty) matches: one just before the newline, and one at the end of the string.
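The behaviour of $ described above can be checked directly. This is only a sketch using the same sample strings as the text.

```python
import re

text = 'foo1\nfoo2\n'

# Default mode: $ matches at the very end or just before a final newline.
default_hit = re.search(r'foo.$', text).group(0)
# MULTILINE mode: $ also matches before every newline.
multiline_hit = re.search(r'foo.$', text, re.MULTILINE).group(0)
print(default_hit, multiline_hit)   # foo2 foo1

# A bare $ finds two empty matches in 'foo\n'.
dollar_positions = [m.start() for m in re.finditer(r'$', 'foo\n')]
print(dollar_positions)             # [3, 4]
```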

*

Causes the resulting RE to match 0 or more repetitions of the preceding RE, as many repetitions as are possible. ab* will match 'a', 'ab', or 'a' followed by any number of 'b's.

+

Causes the resulting RE to match 1 or more repetitions of the preceding RE. ab+ will match 'a' followed by any non-zero number of 'b's; it will not match just 'a'.

?

Causes the resulting RE to match 0 or 1 repetitions of the preceding RE. ab? will match either 'a' or 'ab'.

*?, +?, ??

The '*', '+', and '?' qualifiers are all greedy; they match as much text as possible. Sometimes this behaviour isn't desired; if the RE <.*> is matched against '<a> b <c>', it will match the entire string, and not just '<a>'. Adding ? after the qualifier makes it perform the match in non-greedy or minimal fashion; as few characters as possible will be matched. Using the RE <.*?> will match only '<a>'.
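A short sketch contrasting the greedy and non-greedy forms on the string used above:

```python
import re

text = '<a> b <c>'

greedy = re.match(r'<.*>', text).group(0)    # takes as much as possible
minimal = re.match(r'<.*?>', text).group(0)  # takes as little as possible
print(greedy)    # <a> b <c>
print(minimal)   # <a>
```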

{m}

Specifies that exactly m copies of the previous RE should be matched; fewer matches cause the entire RE not to match. For example, a{6} will match exactly six 'a' characters, but not five.

{m,n}

Causes the resulting RE to match from m to n repetitions of the preceding RE, attempting to match as many repetitions as possible. For example, a{3,5} will match from 3 to 5 'a' characters. Omitting m specifies a lower bound of zero, and omitting n specifies an infinite upper bound. As an example, a{4,}b will match 'aaaab' or a thousand 'a' characters followed by a 'b', but not 'aaab'. The comma may not be omitted, or the modifier would be confused with the previously described form.

{m,n}?

Causes the resulting RE to match from m to n repetitions of the preceding RE, attempting to match as few repetitions as possible. This is the non-greedy version of the previous qualifier. For example, on the 6-character string 'aaaaaa', a{3,5} will match 5 'a' characters, while a{3,5}? will only match 3 characters.
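The counted forms can be exercised as follows. This is a sketch built from the examples in the text.

```python
import re

six = 'aaaaaa'
greedy_count = len(re.match(r'a{3,5}', six).group(0))    # as many as allowed
minimal_count = len(re.match(r'a{3,5}?', six).group(0))  # as few as allowed
print(greedy_count, minimal_count)   # 5 3

# Open upper bound: at least four 'a' characters, then 'b'.
at_least_four = re.compile(r'a{4,}b')
print(bool(at_least_four.fullmatch('aaaab')))           # True
print(bool(at_least_four.fullmatch('a' * 1000 + 'b')))  # True
print(bool(at_least_four.fullmatch('aaab')))            # False
```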

\

Either escapes special characters (permitting you to match characters like '*', '?', and so forth), or signals a special sequence; special sequences are discussed below.

If you're not using a raw string to express the pattern, remember that Python also uses the backslash as an escape sequence in string literals; if the escape sequence isn't recognized by Python's parser, the backslash and subsequent character are included in the resulting string. However, if Python would recognize the resulting sequence, the backslash should be repeated twice. This is complicated and hard to understand, so it's highly recommended that you use raw strings for all but the simplest expressions.

[]

Used to indicate a set of characters. In a set:

  • Characters can be listed individually, e.g. [amk] will match 'a', 'm', or 'k'.

  • Ranges of characters can be indicated by giving two characters and separating them by a '-', for example [a-z] will match any lowercase ASCII letter, [0-5][0-9] will match all the two-digit numbers from 00 to 59, and [0-9A-Fa-f] will match any hexadecimal digit. If - is escaped (e.g. [a\-z]) or if it's placed as the first or last character (e.g. [-a] or [a-]), it will match a literal '-'.

  • Special characters lose their special meaning inside sets. For example, [(+*)] will match any of the literal characters '(', '+', '*', or ')'.

  • Character classes such as \w or \S (defined below) are also accepted inside a set, although the characters they match depend on whether ASCII or LOCALE mode is in force.

  • Characters that are not within a range can be matched by complementing the set. If the first character of the set is '^', all the characters that are not in the set will be matched. For example, [^5] will match any character except '5', and [^^] will match any character except '^'. ^ has no special meaning if it's not the first character in the set.

  • To match a literal ']' inside a set, precede it with a backslash, or place it at the beginning of the set. For example, both [()[\]{}] and []()[{}] will match a right bracket, as well as a left bracket, braces, and parentheses.

  • Support of nested sets and set operations as in Unicode Technical Standard #18 might be added in the future. This would change the syntax, so to facilitate this change a FutureWarning will be raised in ambiguous cases for the time being. That includes sets starting with a literal '[' or containing literal character sequences '--', '&&', '~~', and '||'. To avoid a warning, escape them with a backslash.

Changed in version 3.7: FutureWarning is raised if a character set contains constructs that will change semantically in the future.
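The set rules listed above can be tried out with a few calls. This is a sketch; the sample strings are made up for the example.

```python
import re

# Individual characters and ranges.
listed = re.findall(r'[amk]', 'make')
in_range = bool(re.fullmatch(r'[0-5][0-9]', '59'))
out_of_range = bool(re.fullmatch(r'[0-5][0-9]', '60'))
print(listed, in_range, out_of_range)   # ['m', 'a', 'k'] True False

# A '-' placed last is literal; special characters are literal inside a set.
dash = re.findall(r'[a-]', 'a-b')
specials = re.findall(r'[(+*)]', '(1+2)*3')
print(dash, specials)   # ['a', '-'] ['(', '+', ')', '*']

# Complemented set, and a literal ']' placed first in the set.
not_five = re.findall(r'[^5]', '1525')
brackets = re.findall(r'[]()[{}]', 'f(x)[0]{}')
print(not_five, brackets)
```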

|

A|B, where A and B can be arbitrary REs, creates a regular expression that will match either A or B. An arbitrary number of REs can be separated by the '|' in this way. This can be used inside groups (see below) as well. As the target string is scanned, REs separated by '|' are tried from left to right. When one pattern completely matches, that branch is accepted. This means that once A matches, B will not be tested further, even if it would produce a longer overall match. In other words, the '|' operator is never greedy. To match a literal '|', use \|, or enclose it inside a character class, as in [|].
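A brief sketch of the left-to-right behaviour of '|', with illustrative sample strings:

```python
import re

# The first branch that matches wins, even when a later one is longer.
first_wins = re.match(r'a|ab', 'ab').group(0)
reordered = re.match(r'ab|a', 'ab').group(0)
print(first_wins, reordered)   # a ab

# A literal '|' written as a one-character class.
tokens = re.findall(r'cat|dog|[|]', 'cat|dog')
print(tokens)                  # ['cat', '|', 'dog']
```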

(...)

Matches whatever regular expression is inside the parentheses, and indicates the start and end of a group; the contents of a group can be retrieved after a match has been performed, and can be matched later in the string with the \number special sequence, described below. To match the literals '(' or ')', use \( or \), or enclose them inside a character class: [(], [)].

(?...)

This is an extension notation (a '?' following a '(' is not meaningful otherwise). The first character after the '?' determines what the meaning and further syntax of the construct is. Extensions usually do not create a new group; (?P<name>...) is the only exception to this rule. Following are the currently supported extensions.

(?aiLmsux)

(One or more letters from the set 'a', 'i', 'L', 'm', 's', 'u', 'x'.) The group matches the empty string; the letters set the corresponding flags: re.A (ASCII-only matching), re.I (ignore case), re.L (locale dependent), re.M (multi-line), re.S (dot matches all), re.U (Unicode matching), and re.X (verbose), for the entire regular expression. (The flags are described in Module Contents.) This is useful if you wish to include the flags as part of the regular expression, instead of passing a flag argument to the re.compile() function. Flags should be used first in the expression string.

(?:...)

A non-capturing version of regular parentheses. Matches whatever regular expression is inside the parentheses, but the substring matched by the group cannot be retrieved after performing a match or referenced later in the pattern.

(?aiLmsux-imsx:...)

(Zero or more letters from the set 'a', 'i', 'L', 'm', 's', 'u', 'x', optionally followed by '-' followed by one or more letters from the set 'i', 'm', 's', 'x'.) The letters set or remove the corresponding flags: re.A (ASCII-only matching), re.I (ignore case), re.L (locale dependent), re.M (multi-line), re.S (dot matches all), re.U (Unicode matching), and re.X (verbose), for the part of the expression. (The flags are described in Module Contents.)

The letters 'a', 'L' and 'u' are mutually exclusive when used as inline flags, so they can't be combined or follow '-'. Instead, when one of them appears in an inline group, it overrides the matching mode in the enclosing group. In Unicode patterns, (?a:...) switches to ASCII-only matching, and (?u:...) switches to Unicode matching (default). In bytes patterns, (?L:...) switches to locale-dependent matching, and (?a:...) switches to ASCII-only matching (default). This override is only in effect for the narrow inline group, and the original matching mode is restored outside of the group.

New in version 3.6.

Changed in version 3.7: The letters 'a', 'L' and 'u' can also be used in a group.
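A sketch of how scoped inline flags behave, assuming Python 3.7 or later so that 'a' is accepted inside a group. The patterns and sample strings are illustrative.

```python
import re

# Case-insensitive matching applies only inside the group.
scoped_on = re.compile(r'(?i:hello) World')
print(bool(scoped_on.fullmatch('HELLO World')))   # True
print(bool(scoped_on.fullmatch('HELLO world')))   # False

# A flag set for the whole pattern can be removed for one part.
scoped_off = re.compile(r'(?-i:Python) rocks', re.IGNORECASE)
print(bool(scoped_off.fullmatch('Python ROCKS')))  # True
print(bool(scoped_off.fullmatch('python rocks')))  # False

# ASCII-only \w inside the group, Unicode \w outside it.
mixed = re.compile(r'(?a:\w+)-\w+')
print(bool(mixed.fullmatch('abc-\u00e9')))   # True
print(bool(mixed.fullmatch('\u00e9-abc')))   # False
```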

(?P<name>...)

Similar to regular parentheses, but the substring matched by the group is accessible via the symbolic group name name. Group names must be valid Python identifiers, and each group name must be defined only once within a regular expression. A symbolic group is also a numbered group, just as if the group were not named.

Named groups can be referenced in three contexts. If the pattern is (?P<quote>['"]).*?(?P=quote) (i.e. matching a string quoted with either single or double quotes):

Context of reference to group "quote"

Ways to reference it

in the same pattern itself

  • (?P=quote) (as shown)

  • \1

when processing match object m

  • m.group('quote')

  • m.end('quote') (etc.)

in a string passed to the repl argument of re.sub()

  • \g<quote>

  • \g<1>

  • \1
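The three contexts can be sketched with the quote pattern from above; the sample strings are invented for the example.

```python
import re

quoted = re.compile(r'''(?P<quote>['"]).*?(?P=quote)''')

# Context 1: inside the pattern, by name (above) or by number (below).
by_number = re.compile(r'''(?P<quote>['"]).*?\1''')
print(by_number.search("an 'ok' day").group(0))   # 'ok'

# Context 2: on the match object.
m = quoted.search('say "hi" now')
print(m.group(0), m.group('quote'), m.end('quote'))   # "hi" " 5

# Context 3: in the repl string; all three forms name the same group.
tripled = quoted.sub(r'\g<quote>\g<1>\1', "x'y'z")
print(tripled)   # x'''z
```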

(?P=name)

A backreference to a named group; it matches whatever text was matched by the earlier group named name.

(?#...)

A comment; the contents of the parentheses are simply ignored.

(?=...)

Matches if ... matches next, but doesn't consume any of the string. This is called a lookahead assertion. For example, Isaac (?=Asimov) will match 'Isaac ' only if it's followed by 'Asimov'.

(?!...)

Matches if ... doesn't match next. This is a negative lookahead assertion. For example, Isaac (?!Asimov) will match 'Isaac ' only if it's not followed by 'Asimov'.
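A minimal sketch of both lookahead forms, using the examples from the text plus one invented counterexample ('Isaac Newton'):

```python
import re

positive = re.compile(r'Isaac (?=Asimov)')
negative = re.compile(r'Isaac (?!Asimov)')

# The lookahead is checked but not consumed: only 'Isaac ' is matched.
print(repr(positive.search('Isaac Asimov').group(0)))   # 'Isaac '
print(positive.search('Isaac Newton'))                  # None

print(repr(negative.search('Isaac Newton').group(0)))   # 'Isaac '
print(negative.search('Isaac Asimov'))                  # None
```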

(?<=...)

Matches if the current position in the string is preceded by a match for ... that ends at the current position. This is called a positive lookbehind assertion. (?<=abc)def will find a match in 'abcdef', since the lookbehind will back up 3 characters and check if the contained pattern matches. The contained pattern must only match strings of some fixed length, meaning that abc or a|b are allowed, but a* and a{3,4} are not. Note that patterns which start with positive lookbehind assertions will not match at the beginning of the string being searched; you will most likely want to use the search() function rather than the match() function:

>>> import re
>>> m = re.search('(?<=abc)def', 'abcdef')
>>> m.group(0)
'def'

This example looks for a word following a hyphen:

>>> m = re.search(r'(?<=-)\w+', 'spam-egg')
>>> m.group(0)
'egg'

Changed in version 3.5: Added support for group references of fixed length.

(?<!...)

Matches if the current position in the string is not preceded by a match for the contained pattern (...). This is called a negative lookbehind assertion. Similar to positive lookbehind assertions, the contained pattern must only match strings of some fixed length. Patterns which start with negative lookbehind assertions may match at the beginning of the string being searched.

(?(id/name)yes-pattern|no-pattern)

Will try to match with yes-pattern if the group with given id or name exists, and with no-pattern if it doesn't. no-pattern is optional and can be omitted. For example, (<)?(\w+@\w+(?:\.\w+)+)(?(1)>|$) is a poor email matching pattern, which will match with '<user@host.com>' as well as 'user@host.com', but not with '<user@host.com' nor 'user@host.com>'.
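The conditional pattern above can be tested against the four sample strings. This sketch uses match(), which anchors at the start of the string; search() could still find 'user@host.com' inside '<user@host.com'.

```python
import re

email = re.compile(r'(<)?(\w+@\w+(?:\.\w+)+)(?(1)>|$)')

candidates = ['<user@host.com>', 'user@host.com',
              '<user@host.com', 'user@host.com>']
# Group 1 (the optional '<') decides whether '>' or end-of-string is required.
results = [bool(email.match(c)) for c in candidates]
print(results)   # [True, True, False, False]
```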

The special sequences consist of '\' and a character from the list below. If the ordinary character is not an ASCII digit or an ASCII letter, then the resulting RE will match the second character. For example, \$ matches the character '$'.

\number

Matches the contents of the group of the same number. Groups are numbered starting from 1. For example, (.+) \1 matches 'the the' or '55 55', but not 'thethe' (note the space after the group). This special sequence can only be used to match one of the first 99 groups. If the first digit of number is 0, or number is 3 octal digits long, it will not be interpreted as a group match, but as the character with octal value number. Inside the '[' and ']' of a character class, all numeric escapes are treated as characters.
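A short sketch of numbered backreferences. The repeated-word search at the end is an illustrative use, not something the text above prescribes.

```python
import re

repeated = re.compile(r'(.+) \1')
print(bool(repeated.fullmatch('the the')))   # True
print(bool(repeated.fullmatch('55 55')))     # True
print(repeated.search('thethe'))             # None: no space to match

# Illustrative use: find words that are doubled by mistake.
doubled = re.findall(r'\b(\w+) \1\b', 'it is is the the end')
print(doubled)   # ['is', 'the']
```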

\A

Matches only at the start of the string.

\b

Matches the empty string, but only at the beginning or end of a word. A word is defined as a sequence of word characters. Note that formally, \b is defined as the boundary between a \w and a \W character (or vice versa), or between \w and the beginning/end of the string. This means that r'\bfoo\b' matches 'foo', 'foo.', '(foo)', 'bar foo baz' but not 'foobar' or 'foo3'.

By default Unicode alphanumerics are the ones used in Unicode patterns, but this can be changed by using the ASCII flag. Word boundaries are determined by the current locale if the LOCALE flag is used. Inside a character range, \b represents the backspace character, for compatibility with Python's string literals.

\B

Matches the empty string, but only when it is not at the beginning or end of a word. This means that r'py\B' matches 'python', 'py3', 'py2', but not 'py', 'py.', or 'py!'. \B is just the opposite of \b, so word characters in Unicode patterns are Unicode alphanumerics or the underscore, although this can be changed by using the ASCII flag. Word boundaries are determined by the current locale if the LOCALE flag is used.
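The word-boundary examples above can be verified with a sketch like this one:

```python
import re

whole_word = re.compile(r'\bfoo\b')
samples = ['foo', 'foo.', '(foo)', 'bar foo baz', 'foobar', 'foo3']
boundary_hits = [bool(whole_word.search(s)) for s in samples]
print(boundary_hits)   # [True, True, True, True, False, False]

inside_word = re.compile(r'py\B')
samples_b = ['python', 'py3', 'py2', 'py', 'py.', 'py!']
non_boundary_hits = [bool(inside_word.search(s)) for s in samples_b]
print(non_boundary_hits)   # [True, True, True, False, False, False]

# Inside a character class, \b means the backspace character (0x08).
backspace = bool(re.fullmatch(r'[\b]', '\x08'))
print(backspace)   # True
```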

\d
For Unicode (str) patterns:

Matches any Unicode decimal digit (that is, any character in Unicode character category [Nd]). This includes [0-9], and also many other digit characters. If the ASCII flag is used only [0-9] is matched.

For 8-bit (bytes) patterns:

Matches any decimal digit; this is equivalent to [0-9].
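A sketch of the difference between Unicode and ASCII digit matching. U+0663 (ARABIC-INDIC DIGIT THREE) is chosen here as one example of a non-ASCII character in category Nd.

```python
import re

sample = '7\u0663'   # ASCII digit 7 followed by ARABIC-INDIC DIGIT THREE

unicode_digits = re.findall(r'\d', sample)
ascii_digits = re.findall(r'\d', sample, re.ASCII)
print(len(unicode_digits), ascii_digits)   # 2 ['7']

# In a bytes pattern, \d is always equivalent to [0-9].
bytes_digits = re.findall(rb'\d', b'7a')
print(bytes_digits)   # [b'7']
```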

\D

Matches any character which is not a decimal digit. This is the opposite of \d. If the ASCII flag is used this becomes the equivalent of [^0-9].

\s
For Unicode (str) patterns:

Matches Unicode whitespace characters (which includes [ \t\n\r\f\v], and also many other characters, for example the non-breaking spaces mandated by typography rules in many languages). If the ASCII flag is used, only [ \t\n\r\f\v] is matched.

For 8-bit (bytes) patterns:

Matches characters considered whitespace in the ASCII character set; this is equivalent to [ \t\n\r\f\v].

\S

Matches any character which is not a whitespace character. This is the opposite of \s. If the ASCII flag is used this becomes the equivalent of [^ \t\n\r\f\v].

\w
For Unicode (str) patterns:

Matches Unicode word characters; this includes most characters that can be part of a word in any language, as well as numbers and the underscore. If the ASCII flag is used, only [a-zA-Z0-9_] is matched.

For 8-bit (bytes) patterns:

Matches characters considered alphanumeric in the ASCII character set; this is equivalent to [a-zA-Z0-9_]. If the LOCALE flag is used, matches characters considered alphanumeric in the current locale and the underscore.

\W

Matches any character which is not a word character. This is the opposite of \w. If the ASCII flag is used this becomes the equivalent of [^a-zA-Z0-9_]. If the LOCALE flag is used, matches characters which are neither alphanumeric in the current locale nor the underscore.

\Z

Matches only at the end of the string.

Most of the standard escapes supported by Python string literals are also accepted by the regular expression parser:

\a      \b      \f      \n
\N      \r      \t      \u
\U      \v      \x      \\

(Note that \b is used to represent word boundaries, and means "backspace" only inside character classes.)

'\u', '\U', and '\N' escape sequences are only recognized in Unicode patterns. In bytes patterns they are errors. Unknown escapes of ASCII letters are reserved for future use and treated as errors.

Octal escapes are included in a limited form. If the first digit is a 0, or if there are three octal digits, it is considered an octal escape. Otherwise, it is a group reference. As for string literals, octal escapes are always at most three digits in length.

Changed in version 3.3: The '\u' and '\U' escape sequences have been added.

Changed in version 3.6: Unknown escapes consisting of '\' and an ASCII letter now are errors.

Changed in version 3.8: The '\N{name}' escape sequence has been added. As in string literals, it expands to the named Unicode character (e.g. '\N{EM DASH}').

Module Contents

The module defines several functions, constants, and an exception. Some of the functions are simplified versions of the full featured methods for compiled regular expressions. Most non-trivial applications use the compiled form.

Changed in version 3.6: Flag constants are now instances of RegexFlag, which is a subclass of enum.IntFlag.

re.compile(pattern, flags=0)

Compile a regular expression pattern into a regular expression object, which can be used for matching using its match(), search() and other methods, described below.

The expression's behaviour can be modified by specifying a flags value. Values can be any of the following variables, combined using bitwise OR (the | operator).

The sequence

prog = re.compile(pattern)
result = prog.match(string)

is equivalent to

result = re.match(pattern, string)

but using re.compile() and saving the resulting regular expression object for reuse is more efficient when the expression will be used several times in a single program.

Note

The compiled versions of the most recent patterns passed to re.compile() and the module-level matching functions are cached, so programs that use only a few regular expressions at a time needn't worry about compiling regular expressions.
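A sketch of reusing a compiled pattern. The sample lines are invented, and the last two calls show one extra parameter (a start position) that the compiled method accepts but the module-level function does not.

```python
import re

number = re.compile(r'\d+')
lines = ['order 66', 'no digits here', 'room 101']   # hypothetical sample data

found = []
for line in lines:
    m = number.search(line)      # the same compiled object is reused
    if m:
        found.append(m.group(0))
print(found)   # ['66', '101']

# The module-level function and the compiled method agree.
same = re.match(r'\d+', '42abc').group(0) == number.match('42abc').group(0)

# Compiled methods take an optional start position; re.match() does not.
from_pos = number.match('x42', 1).group(0)
print(same, from_pos, re.match(r'\d+', 'x42'))   # True 42 None
```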

re.A
re.ASCII

Make \w, \W, \b, \B, \d, \D, \s and \S perform ASCII-only matching instead of full Unicode matching. This is only meaningful for Unicode patterns, and is ignored for byte patterns. Corresponds to the inline flag (?a).

Note that for backward compatibility, the re.U flag still exists (as well as its synonym re.UNICODE and its embedded counterpart (?u)), but these are redundant in Python 3 since matches are Unicode by default for strings (and Unicode matching isn't allowed for bytes).

re.DEBUG

Display debug information about compiled expression. No corresponding inline flag.

re.I
re.IGNORECASE

Perform case-insensitive matching; expressions like [A-Z] will also match lowercase letters. Full Unicode matching (such as Ü matching ü) also works unless the re.ASCII flag is used to disable non-ASCII matches. The current locale does not change the effect of this flag unless the re.LOCALE flag is also used. Corresponds to the inline flag (?i).

Note that when the Unicode patterns [a-z] or [A-Z] are used in combination with the IGNORECASE flag, they will match the 52 ASCII letters and 4 additional non-ASCII letters: 'İ' (U+0130, Latin capital letter I with dot above), 'ı' (U+0131, Latin small letter dotless i), 'ſ' (U+017F, Latin small letter long s) and 'K' (U+212A, Kelvin sign). If the ASCII flag is used, only letters 'a' to 'z' and 'A' to 'Z' are matched.
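The case-insensitive rules above can be probed with a sketch like the following, which uses escapes for the non-ASCII characters (U+00DC and U+00FC for Ü and ü, U+212A for the Kelvin sign).

```python
import re

# [A-Z] also matches lowercase letters under IGNORECASE.
lower_ok = bool(re.fullmatch(r'[A-Z]+', 'python', re.IGNORECASE))

# Full Unicode case folding works unless ASCII matching is requested.
umlaut = bool(re.fullmatch('\u00dc', '\u00fc', re.IGNORECASE))
umlaut_ascii = bool(re.fullmatch('\u00dc', '\u00fc', re.IGNORECASE | re.ASCII))

# The Kelvin sign is one of the four extra letters matched by [a-z].
kelvin = bool(re.fullmatch(r'[a-z]', '\u212a', re.IGNORECASE))
kelvin_ascii = bool(re.fullmatch(r'[a-z]', '\u212a', re.IGNORECASE | re.ASCII))

print(lower_ok, umlaut, umlaut_ascii, kelvin, kelvin_ascii)
```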

re.L
re.LOCALE

Make \w, \W, \b, \B and case-insensitive matching dependent on the current locale. This flag can be used only with bytes patterns. The use of this flag is discouraged: the locale mechanism is very unreliable, it only handles one "culture" at a time, and it only works with 8-bit locales. Unicode matching is already enabled by default in Python 3 for Unicode (str) patterns, and it is able to handle different locales/languages. Corresponds to the inline flag (?L).

Changed in version 3.6: re.LOCALE can be used only with bytes patterns and is not compatible with re.ASCII.

Changed in version 3.7: Compiled regular expression objects with the re.LOCALE flag no longer depend on the locale at compile time. Only the locale at matching time affects the result of matching.

re.M
re.MULTILINE

When specified, the pattern character '^' matches at the beginning of the string and at the beginning of each line (immediately following each newline); and the pattern character '$' matches at the end of the string and at the end of each line (immediately preceding each newline). By default, '^' matches only at the beginning of the string, and '$' only at the end of the string and immediately before the newline (if any) at the end of the string. Corresponds to the inline flag (?m).
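A short sketch of the difference, using an arbitrary three-line sample string:

```python
import re

text = "one\ntwo\nthree"

# Without MULTILINE, '^' matches only at the very start and '$' only at the very end.
first_only = re.findall(r"^\w+", text)
last_only = re.findall(r"\w+$", text)

# With MULTILINE, '^' and '$' also match at every line boundary.
every_start = re.findall(r"^\w+", text, re.MULTILINE)
every_end = re.findall(r"\w+$", text, re.MULTILINE)

print(first_only, last_only, every_start, every_end)
```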

re.S
re.DOTALL

Make the '.' special character match any character at all, including a newline; without this flag, '.' will match anything except a newline. Corresponds to the inline flag (?s).
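For illustration, the same greedy pattern applied to a two-line string with and without the flag (the sample text is made up):

```python
import re

text = "first line\nsecond line"

# By default '.' stops at the newline.
without_flag = re.match(r".+", text).group()

# With DOTALL, '.' also consumes the newline, so the match spans both lines.
with_flag = re.match(r".+", text, re.DOTALL).group()

print(repr(without_flag))
print(repr(with_flag))
```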

re.X
re.VERBOSE

This flag allows you to write regular expressions that look nicer and are more readable by allowing you to visually separate logical sections of the pattern and add comments. Whitespace within the pattern is ignored, except when in a character class, or when preceded by an unescaped backslash, or within tokens like *?, (?: or (?P<...>. When a line contains a # that is not in a character class and is not preceded by an unescaped backslash, all characters from the leftmost such # through the end of the line are ignored.

This means that the two following regular expression objects that match a decimal number are functionally equal:

a = re.compile(r"""\d +  # the integral part
                   \.    # the decimal point
                   \d *  # some fractional digits""", re.X)
b = re.compile(r"\d+\.\d*")

Corresponds to the inline flag (?x).

re.search(pattern, string, flags=0)

Scan through string looking for the first location where the regular expression pattern produces a match, and return a corresponding match object. Return None if no position in the string matches the pattern; note that this is different from finding a zero-length match at some point in the string.

re.match(pattern, string, flags=0)

If zero or more characters at the beginning of string match the regular expression pattern, return a corresponding match object. Return None if the string does not match the pattern; note that this is different from a zero-length match.

Note that even in MULTILINE mode, re.match() will only match at the beginning of the string and not at the beginning of each line.

If you want to locate a match anywhere in string, use search() instead (see also search() vs. match()).

re.fullmatch(pattern, string, flags=0)

If the whole string matches the regular expression pattern, return a corresponding match object. Return None if the string does not match the pattern; note that this is different from a zero-length match.

New in version 3.4.
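The three entry points differ only in where the match is allowed to sit. A minimal comparison on one sample string:

```python
import re

candidate = "123abc"

prefix = re.match(r"\d+", candidate)        # anchored at the start only
whole = re.fullmatch(r"\d+", candidate)     # must cover the entire string
anywhere = re.search(r"[a-z]+", candidate)  # may start at any position

print(prefix.group(), whole, anywhere.group())
```

Here fullmatch() returns None because the trailing letters are not digits, even though match() succeeds on the leading digits.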

re.split(pattern, string, maxsplit=0, flags=0)

Split string by the occurrences of pattern. If capturing parentheses are used in pattern, then the text of all groups in the pattern is also returned as part of the resulting list. If maxsplit is nonzero, at most maxsplit splits occur, and the remainder of the string is returned as the final element of the list.

>>> re.split(r'\W+', 'Words, words, words.')
['Words', 'words', 'words', '']
>>> re.split(r'(\W+)', 'Words, words, words.')
['Words', ', ', 'words', ', ', 'words', '.', '']
>>> re.split(r'\W+', 'Words, words, words.', 1)
['Words', 'words, words.']
>>> re.split('[a-f]+', '0a3B9', flags=re.IGNORECASE)
['0', '3', '9']

If there are capturing groups in the separator and it matches at the start of the string, the result will start with an empty string. The same holds for the end of the string:

>>> re.split(r'(\W+)', '...words, words...')
['', '...', 'words', ', ', 'words', '...', '']

That way, separator components are always found at the same relative indices within the result list.

Empty matches for the pattern split the string only when not adjacent to a previous empty match.

>>> re.split(r'\b', 'Words, words, words.')
['', 'Words', ', ', 'words', ', ', 'words', '.']
>>> re.split(r'\W*', '...words...')
['', '', 'w', 'o', 'r', 'd', 's', '', '']
>>> re.split(r'(\W*)', '...words...')
['', '...', '', '', 'w', '', 'o', '', 'r', '', 'd', '', 's', '...', '', '', '']

Changed in version 3.1: Added the optional flags argument.

Changed in version 3.7: Added support of splitting on a pattern that could match an empty string.

re.findall(pattern, string, flags=0)

Return all non-overlapping matches of pattern in string, as a list of strings or tuples. The string is scanned left-to-right, and matches are returned in the order found. Empty matches are included in the result.

The result depends on the number of capturing groups in the pattern. If there are no groups, return a list of strings matching the whole pattern. If there is exactly one group, return a list of strings matching that group. If multiple groups are present, return a list of tuples of strings matching the groups. Non-capturing groups do not affect the form of the result.

>>> re.findall(r'\bf[a-z]*', 'which foot or hand fell fastest')
['foot', 'fell', 'fastest']
>>> re.findall(r'(\w+)=(\d+)', 'set width=20 and height=10')
[('width', '20'), ('height', '10')]

Changed in version 3.7: Non-empty matches can now start just after a previous empty match.
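The three result shapes of findall() described above, including the effect of a non-capturing group, can be sketched on one made-up sentence:

```python
import re

text = "Mr. Smith met Ms. Jones"

# No capturing groups: each item is the whole match.
whole_matches = re.findall(r"(?:Mr|Ms)\. \w+", text)

# Exactly one capturing group: each item is that group's text.
surnames = re.findall(r"(?:Mr|Ms)\. (\w+)", text)

# Two capturing groups: each item is a tuple of the groups.
pairs = re.findall(r"(Mr|Ms)\. (\w+)", text)

print(whole_matches, surnames, pairs, sep="\n")
```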

re.finditer(pattern, string, flags=0)

Return an iterator yielding match objects over all non-overlapping matches for the RE pattern in string. The string is scanned left-to-right, and matches are returned in the order found. Empty matches are included in the result.

Changed in version 3.7: Non-empty matches can now start just after a previous empty match.

re.sub(pattern, repl, string, count=0, flags=0)

Return the string obtained by replacing the leftmost non-overlapping occurrences of pattern in string by the replacement repl. If the pattern isn’t found, string is returned unchanged. repl can be a string or a function; if it is a string, any backslash escapes in it are processed. That is, \n is converted to a single newline character, \r is converted to a carriage return, and so forth. Unknown escapes of ASCII letters are reserved for future use and treated as errors. Other unknown escapes such as \& are left alone. Backreferences, such as \6, are replaced with the substring matched by group 6 in the pattern. For example:

>>> re.sub(r'def\s+([a-zA-Z_][a-zA-Z_0-9]*)\s*\(\s*\):',
... r'static PyObject*\npy_\1(void)\n{',
... 'def myfunc():')
'static PyObject*\npy_myfunc(void)\n{'

If repl is a function, it is called for every non-overlapping occurrence of pattern. The function takes a single match object argument, and returns the replacement string. For example:

>>> def dashrepl(matchobj):
...     if matchobj.group(0) == '-': return ' '
...     else: return '-'
>>> re.sub('-{1,2}', dashrepl, 'pro----gram-files')
'pro--gram files'
>>> re.sub(r'\sAND\s', ' & ', 'Baked Beans And Spam', flags=re.IGNORECASE)
'Baked Beans & Spam'

The pattern may be a string or a pattern object.

The optional argument count is the maximum number of pattern occurrences to be replaced; count must be a non-negative integer. If omitted or zero, all occurrences will be replaced. Empty matches for the pattern are replaced only when not adjacent to a previous empty match, so sub('x*', '-', 'abxd') returns '-a-b--d-'.
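A brief sketch of count and of the empty-match rule stated above. The empty-match result assumes Python 3.7 or later, where the behaviour described in this section applies.

```python
import re

# Replace at most two digits; the third is left untouched.
limited = re.sub(r"\d", "#", "a1b2c3", count=2)

# With count omitted, every occurrence is replaced.
unlimited = re.sub(r"\d", "#", "a1b2c3")

# 'x*' can match the empty string, so replacements also appear between characters.
empty_matches = re.sub("x*", "-", "abxd")

print(limited, unlimited, empty_matches)
```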

In string-type repl arguments, in addition to the character escapes and backreferences described above, \g<name> will use the substring matched by the group named name, as defined by the (?P<name>...) syntax. \g<number> uses the corresponding group number; \g<2> is therefore equivalent to \2, but isn’t ambiguous in a replacement such as \g<2>0. \20 would be interpreted as a reference to group 20, not a reference to group 2 followed by the literal character '0'. The backreference \g<0> substitutes in the entire substring matched by the RE.
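The following sketch exercises the \g<name> and \g<number> forms. The sample patterns and strings are illustrative. The last case shows the ambiguity described above: \10 is read as group 10, which does not exist in a one-group pattern, so re.error is raised.

```python
import re

# Named backreferences reorder the two captured words.
swapped = re.sub(r"(?P<first>\w+) (?P<last>\w+)", r"\g<last>, \g<first>", "Isaac Newton")

# \g<1>0 means "group 1, then a literal 0".
appended = re.sub(r"(\d)", r"\g<1>0", "a5")

# \10 is interpreted as a reference to group 10, which this pattern lacks.
try:
    re.sub(r"(\d)", r"\10", "a5")
    ambiguous_error = None
except re.error as exc:
    ambiguous_error = exc

print(swapped, appended, ambiguous_error)
```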

Changed in version 3.1: Added the optional flags argument.

Changed in version 3.5: Unmatched groups are replaced with an empty string.

Changed in version 3.6: Unknown escapes in pattern consisting of '\' and an ASCII letter now are errors.

Changed in version 3.7: Unknown escapes in repl consisting of '\' and an ASCII letter now are errors.

Changed in version 3.7: Empty matches for the pattern are replaced when adjacent to a previous non-empty match.

re.subn(pattern, repl, string, count=0, flags=0)

Perform the same operation as sub(), but return a tuple (new_string, number_of_subs_made).
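For example, a minimal sketch that unpacks the returned tuple (the input strings are arbitrary):

```python
import re

# Three runs of digits are replaced, so the count is 3.
new_text, replacements = re.subn(r"\d+", "N", "a1b22c333")

# When nothing matches, the string comes back unchanged with a count of 0.
unchanged_text, no_replacements = re.subn(r"\d+", "N", "abc")

print(new_text, replacements, unchanged_text, no_replacements)
```

The count is handy when a caller needs to know whether any substitution happened without comparing the strings.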

Changed in version 3.1: Added the optional flags argument.

Changed in version 3.5: Unmatched groups are replaced with an empty string.

re.escape(pattern)

Escape special characters in pattern. This is useful if you want to match an arbitrary literal string that may have regular expression metacharacters in it. For example:

>>> print(re.escape('https://www.python.org'))
https://www\.python\.org
>>> legal_chars = string.ascii_lowercase + string.digits + "!#$%&'*+-.^_`|~:"
>>> print('[%s]+' % re.escape(legal_chars))
[abcdefghijklmnopqrstuvwxyz0123456789!\#\$%\&'\*\+\-\.\^_`\|\~:]+

>>> operators = ['+', '-', '*', '/', '**']
>>> print('|'.join(map(re.escape, sorted(operators, reverse=True))))
/|\-|\+|\*\*|\*

This function must not be used for the replacement string in sub() and subn(); only backslashes should be escaped. For example:

>>> digits_re = r'\d+'
>>> sample = '/usr/sbin/sendmail - 0 errors, 12 warnings'
>>> print(re.sub(digits_re, digits_re.replace('\\', r'\\'), sample))
/usr/sbin/sendmail - \d+ errors, \d+ warnings

Changed in version 3.3: The '_' character is no longer escaped.

Changed in version 3.7: Only characters that can have special meaning in a regular expression are escaped. As a result, '!', '"', '%', "'", ',', '/', ':', ';', '<', '=', '>', '@', and "`" are no longer escaped.

re.purge()

Clear the regular expression cache.

exception re.error(msg, pattern=None, pos=None)

Exception raised when a string passed to one of the functions here is not a valid regular expression (for example, it might contain unmatched parentheses) or when some other error occurs during compilation or matching. It is never an error if a string contains no match for a pattern. The error instance has the following additional attributes:

msg

The unformatted error message.

pattern

The regular expression pattern.

pos

The index in pattern where compilation failed (may be None).

lineno

The line corresponding to pos (may be None).

colno

The column corresponding to pos (may be None).

Changed in version 3.5: Added additional attributes.
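As a hedged sketch, the attributes above can be inspected by compiling a deliberately broken pattern. The pattern "(abc" and the details dictionary are illustrative choices.

```python
import re

details = None
try:
    re.compile("(abc")  # unmatched opening parenthesis
except re.error as exc:
    # Collect the documented attributes of the exception instance.
    details = {
        "msg": exc.msg,
        "pattern": exc.pattern,
        "pos": exc.pos,
        "lineno": exc.lineno,
        "colno": exc.colno,
    }

print(details)
```

Here pos points at the unterminated parenthesis, and lineno and colno are derived from it.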

Regular Expression Objects

Compiled regular expression objects support the following methods and attributes:

Pattern.search(string[, pos[, endpos]])

Scan through string looking for the first location where this regular expression produces a match, and return a corresponding match object. Return None if no position in the string matches the pattern; note that this is different from finding a zero-length match at some point in the string.

The optional second parameter pos gives an index in the string where the search is to start; it defaults to 0. This is not completely equivalent to slicing the string; the '^' pattern character matches at the real beginning of the string and at positions just after a newline, but not necessarily at the index where the search is to start.
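The difference between passing pos and slicing can be seen with a pattern anchored by '^'. This is a small sketch with an arbitrary two-character string.

```python
import re

starts_with_b = re.compile(r"^b")

# pos=1 starts the search at 'b', but '^' still refers to the real start of "ab".
with_pos = starts_with_b.search("ab", 1)

# Slicing creates a new string whose real beginning is 'b', so '^' matches.
with_slice = starts_with_b.search("ab"[1:])

print(with_pos, with_slice)
```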

The optional parameter endpos limits how far the string will be searched; it will be as if the string is endpos characters long, so only the characters from pos to endpos - 1 will be searched for a match. If endpos is less than pos, no match will be found; otherwise, if rx is a compiled regular expression object, rx.search(string, 0, 50) is equivalent to rx.search(string[:50], 0).

>>> pattern = re.compile("d")
>>> pattern.search("dog") # Match at index 0
<re.Match object; span=(0, 1), match='d'>
>>> pattern.search("dog", 1)  # No match; search doesn't include the "d"

Pattern.match(string[, pos[, endpos]])

If zero or more characters at the beginning of string match this regular expression, return a corresponding match object. Return None if the string does not match the pattern; note that this is different from a zero-length match.

The optional pos and endpos parameters have the same meaning as for the search() method.

>>> pattern = re.compile("o")
>>> pattern.match("dog") # No match as "o" is not at the start of "dog".
>>> pattern.match("dog", 1) # Match as "o" is the 2nd character of "dog".
<re.Match object; span=(1, 2), match='o'>

If you want to locate a match anywhere in string, use search() instead (see also search() vs. match()).

Pattern.fullmatch(string[, pos[, endpos]])

If the whole string matches this regular expression, return a corresponding match object. Return None if the string does not match the pattern; note that this is different from a zero-length match.

The optional pos and endpos parameters have the same meaning as for the search() method.

>>> pattern = re.compile("o[gh]")
>>> pattern.fullmatch("dog") # No match as "o" is not at the start of "dog".
>>> pattern.fullmatch("ogre") # No match as not the full string matches.
>>> pattern.fullmatch("doggie", 1, 3) # Matches within given limits.
<re.Match object; span=(1, 3), match='og'>

New in version 3.4.

Pattern.split(string, maxsplit=0)

Identical to the split() function, using the compiled pattern.

Pattern.findall(string[, pos[, endpos]])

Similar to the findall() function, using the compiled pattern, but also accepts optional pos and endpos parameters that limit the search region as for search().
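A minimal sketch of how pos and endpos narrow the region scanned by Pattern.findall(), using an arbitrary sample string:

```python
import re

digits = re.compile(r"\d")
sample = "1a2b3c"

everything = digits.findall(sample)     # scans the whole string
from_index_2 = digits.findall(sample, 2)  # scans sample[2:] positions only
first_three = digits.findall(sample, 0, 3)  # scans positions 0, 1 and 2

print(everything, from_index_2, first_three)
```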

Pattern.finditer(string[, pos[, endpos]])

Similar to the finditer() function, using the compiled pattern, but also accepts optional pos and endpos parameters that limit the search region as for search().

Pattern.sub(repl, string, count=0)

Identical to the sub() function, using the compiled pattern.

Pattern.subn(repl, string, count=0)

Identical to the subn() function, using the compiled pattern.

Pattern.flags

The regex matching flags. This is a combination of the flags given to compile(), any (?...) inline flags in the pattern, and implicit flags such as UNICODE if the pattern is a Unicode string.
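The three sources of flags can be checked with bitwise tests. In this sketch the pattern text is arbitrary and the variable names are illustrative.

```python
import re

explicit = re.compile(r"abc", re.IGNORECASE)  # flag passed to compile()
inline = re.compile(r"(?i)abc")               # flag set inside the pattern
as_bytes = re.compile(rb"abc")                # bytes pattern: no implicit UNICODE

explicit_has_ignorecase = bool(explicit.flags & re.IGNORECASE)
inline_has_ignorecase = bool(inline.flags & re.IGNORECASE)
str_has_unicode = bool(explicit.flags & re.UNICODE)
bytes_has_unicode = bool(as_bytes.flags & re.UNICODE)

print(explicit_has_ignorecase, inline_has_ignorecase, str_has_unicode, bytes_has_unicode)
```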

Pattern.groups

The number of capturing groups in the pattern.

Pattern.groupindex

A dictionary mapping any symbolic group names defined by (?P<id>) to group numbers. The dictionary is empty if no symbolic groups were used in the pattern.
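For illustration, a date-like pattern (made up for this sketch) with two named groups and one unnamed group:

```python
import re

date = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})-(\d{2})")

group_count = date.groups             # counts named and unnamed capturing groups
name_to_number = dict(date.groupindex)  # only named groups appear here

print(group_count, name_to_number)
```

The unnamed third group is counted in groups but has no entry in groupindex.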

Pattern.pattern

The pattern string from which the pattern object was compiled.

Changed in version 3.7: Added support of copy.copy() and copy.deepcopy(). Compiled regular expression objects are considered atomic.
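A quick sketch of what "atomic" means in practice, assuming Python 3.7 or later: both copy functions hand back the same object rather than a duplicate.

```python
import copy
import re

number = re.compile(r"\d+")

shallow = copy.copy(number)
deep = copy.deepcopy(number)

# Atomic objects are returned as-is by both copy functions.
print(shallow is number, deep is number)
```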

Match Objects

Match objects always have a boolean value of True. Since match() and search() return None when there is no match, you can test whether there was a match with a simple if statement:

match = re.search(pattern, string)
if match:
    process(match)

Match objects support the following methods and attributes:

Match.expand(template)

Return the string obtained by doing backslash substitution on the template string template, as done by the sub() method. Escapes such as \n are converted to the appropriate characters, and numeric backreferences (\1, \2) and named backreferences (\g<1>, \g<name>) are replaced by the contents of the corresponding group.

Changed in version 3.5: Unmatched groups are replaced with an empty string.
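A minimal sketch of expand(), with made-up patterns. The second case assumes Python 3.5 or later, where an unmatched group expands to an empty string.

```python
import re

m = re.match(r"(?P<first>\w+) (\w+)", "Isaac Newton")

# Numeric and named backreferences can be mixed in one template.
reordered = m.expand(r"\2, \g<first>")

# Group 1 does not participate in this match, so it expands to ''.
partial = re.match(r"(a)|(b)", "b")
with_unmatched = partial.expand(r"[\1][\2]")

print(reordered, with_unmatched)
```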

Match.group([group1, ...])

Returns one or more subgroups of the match. If there is a single argument, the result is a single string; if there are multiple arguments, the result is a tuple with one item per argument. Without arguments, group1 defaults to zero (the whole match is returned). If a groupN argument is zero, the corresponding return value is the entire matching string; if it is in the inclusive range [1..99], it is the string matching the corresponding parenthesized group. If a group number is negative or larger than the number of groups defined in the pattern, an IndexError exception is raised. If a group is contained in a part of the pattern that did not match, the corresponding result is None. If a group is contained in a part of the pattern that matched multiple times, the last match is returned.

>>> m = re.match(r"(\w+) (\w+)", "Isaac Newton, physicist")
>>> m.group(0) # The entire match
'Isaac Newton'
>>> m.group(1) # The first parenthesized subgroup.
'Isaac'
>>> m.group(2) # The second parenthesized subgroup.
'Newton'
>>> m.group(1, 2) # Multiple arguments give us a tuple.
('Isaac', 'Newton')

If the regular expression uses the (?P<name>...) syntax, the groupN arguments may also be strings identifying groups by their group name. If a string argument is not used as a group name in the pattern, an IndexError exception is raised.

A moderately complicated example:

>>> m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Malcolm Reynolds")
>>> m.group('first_name')
'Malcolm'
>>> m.group('last_name')
'Reynolds'

Named groups can also be referred to by their index:

>>> m.group(1)
'Malcolm'
>>> m.group(2)
'Reynolds'

If a group matches multiple times, only the last match is accessible:

>>> m = re.match(r"(..)+", "a1b2c3")  # Matches 3 times.
>>> m.group(1)                        # Returns only the last match.
'c3'

Match.__getitem__(g)

This is identical to m.group(g). This allows easier access to an individual group from a match:

>>> m = re.match(r"(\w+) (\w+)", "Isaac Newton, physicist")
>>> m[0] # The entire match
'Isaac Newton'
>>> m[1] # The first parenthesized subgroup.
'Isaac'
>>> m[2] # The second parenthesized subgroup.
'Newton'

New in version 3.6.
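Because m[g] is identical to m.group(g), it should also accept group names. A brief sketch, assuming Python 3.6 or later:

```python
import re

m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Malcolm Reynolds")

by_name = m["first_name"]  # same as m.group('first_name')
by_index = m[2]            # same as m.group(2)

print(by_name, by_index)
```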

Match.groups(default=None)

Return a tuple containing all the subgroups of the match, from 1 up to however many groups are in the pattern. The default argument is used for groups that did not participate in the match; it defaults to None.

For example:

>>> m = re.match(r"(\d+)\.(\d+)", "24.1632")
>>> m.groups()
('24', '1632')

If we make the decimal place and everything after it optional, not all groups might participate in the match. These groups will default to None unless the default argument is given:

>>> m = re.match(r"(\d+)\.?(\d+)?", "24")
>>> m.groups() # Second group defaults to None.
('24', None)
>>> m.groups('0') # Now, the second group defaults to '0'.
('24', '0')

Match.groupdict(default=None)

Return a dictionary containing all the named subgroups of the match, keyed by the subgroup name. The default argument is used for groups that did not participate in the match; it defaults to None. For example:

>>> m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Malcolm Reynolds")
>>> m.groupdict()
{'first_name': 'Malcolm', 'last_name': 'Reynolds'}

Match.start([group])
Match.end([group])

Return the indices of the start and end of the substring matched by group; group defaults to zero (meaning the whole matched substring). Return -1 if group exists but did not contribute to the match. For a match object m, and a group g that did contribute to the match, the substring matched by group g (equivalent to m.group(g)) is

m.string[m.start(g):m.end(g)]

Note that m.start(group) will equal m.end(group) if group matched a null string. For example, after m = re.search('b(c?)', 'cba'), m.start(0) is 1, m.end(0) is 2, m.start(1) and m.end(1) are both 2, and m.start(2) raises an IndexError exception.

An example that will remove remove_this from email addresses:

>>> email = "tony@tiremove_thisger.net"
>>> m = re.search("remove_this", email)
>>> email[:m.start()] + email[m.end():]
'tony@tiger.net'

Match.span([group])

For a match m, return the 2-tuple (m.start(group), m.end(group)). Note that if group did not contribute to the match, this is (-1, -1). group defaults to zero, the entire match.

Match.pos

The value of pos which was passed to the search() or match() method of a regex object. This is the index into the string at which the RE engine started looking for a match.

Match.endpos

The value of endpos which was passed to the search() or match() method of a regex object. This is the index into the string beyond which the RE engine will not go.
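A small sketch showing that pos and endpos record the search window, while span() reports where the match was actually found (the sample string is arbitrary):

```python
import re

letter_o = re.compile("o")

# Search only positions 3, 4 and 5 of "foo boo".
m = letter_o.search("foo boo", 3, 6)

window = (m.pos, m.endpos)
found_at = m.span()

print(window, found_at)
```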

Match.lastindex

The integer index of the last matched capturing group, or None if no group was matched at all. For example, the expressions (a)b, ((a)(b)), and ((ab)) will have lastindex == 1 if applied to the string 'ab', while the expression (a)(b) will have lastindex == 2, if applied to the same string.

Match.lastgroup

The name of the last matched capturing group, or None if the group didn’t have a name, or if no group was matched at all.
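The lastindex cases listed above, plus two lastgroup cases, can be checked with the sketch below. The named-group pattern is an illustrative addition.

```python
import re

# The four expressions from the text, all applied to 'ab'.
last_indexes = {
    pattern: re.match(pattern, "ab").lastindex
    for pattern in (r"(a)b", r"((a)(b))", r"((ab))", r"(a)(b)")
}

# No capturing groups at all: lastindex is None.
no_groups = re.match(r"ab", "ab").lastindex

named = re.compile(r"(?P<word>[a-z]+)(\d+)?")
name_when_named_is_last = named.match("abc").lastgroup     # only group 1 matched
name_when_unnamed_is_last = named.match("abc1").lastgroup  # unnamed group 2 matched last

print(last_indexes, no_groups, name_when_named_is_last, name_when_unnamed_is_last)
```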

Match.re

The regular expression object whose match() or search() method produced this match instance.

Match.string

The string passed to match() or search().

Changed in version 3.7: Added support of copy.copy() and copy.deepcopy(). Match objects are considered atomic.

Regular Expression Examples

Checking for a Pair

In this example, we’ll use the following helper function to display match objects a little more gracefully:

def displaymatch(match):
    if match is None:
        return None
    return '<Match: %r, groups=%r>' % (match.group(), match.groups())

Suppose you are writing a poker program where a player’s hand is represented as a 5-character string with each character representing a card, “a” for ace, “k” for king, “q” for queen, “j” for jack, “t” for 10, and “2” through “9” representing the card with that value.

To see if a given string is a valid hand, one could do the following:

>>> valid = re.compile(r"^[a2-9tjqk]{5}$")
>>> displaymatch(valid.match("akt5q")) # Valid.
"<Match: 'akt5q', groups=()>"
>>> displaymatch(valid.match("akt5e")) # Invalid.
>>> displaymatch(valid.match("akt")) # Invalid.
>>> displaymatch(valid.match("727ak")) # Valid.
"<Match: '727ak', groups=()>"

That last hand, "727ak", contained a pair, or two of the same valued cards. To match this with a regular expression, one could use backreferences as such:

>>> pair = re.compile(r".*(.).*\1")
>>> displaymatch(pair.match("717ak")) # Pair of 7s.
"<Match: '717', groups=('7',)>"
>>> displaymatch(pair.match("718ak")) # No pairs.
>>> displaymatch(pair.match("354aa")) # Pair of aces.
"<Match: '354aa', groups=('a',)>"

To find out what card the pair consists of, one could use the group() method of the match object in the following manner:

>>> pair = re.compile(r".*(.).*\1")
>>> pair.match("717ak").group(1)
'7'
>>> # Error because pair.match() returns None, which doesn't have a group() method:
>>> pair.match("718ak").group(1)
Traceback (most recent call last):
  File "<pyshell#23>", line 1, in <module>
    pair.match("718ak").group(1)
AttributeError: 'NoneType' object has no attribute 'group'

>>> pair.match("354aa").group(1)
'a'

Simulating scanf()

Python does not currently have an equivalent to scanf(). Regular expressions are generally more powerful, though also more verbose, than scanf() format strings. The table below offers some more-or-less equivalent mappings between scanf() format tokens and regular expressions.

scanf() Token        Regular Expression
%c                   .
%5c                  .{5}
%d                   [-+]?\d+
%e, %E, %f, %g       [-+]?(\d+(\.\d*)?|\.\d+)([eE][-+]?\d+)?
%i                   [-+]?(0[xX][\dA-Fa-f]+|0[0-7]*|\d+)
%o                   [-+]?[0-7]+
%s                   \S+
%u                   \d+
%x, %X               [-+]?(0[xX])?[\dA-Fa-f]+

To extract the filename and numbers from a string like

/usr/sbin/sendmail - 0 errors, 4 warnings

you would use a scanf() format like

%s - %d errors, %d warnings

The equivalent regular expression would be

(\S+) - (\d+) errors, (\d+) warnings
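Applied to the sample line above, the expression can be used roughly as follows. Converting the captured digits with int() is an added step that stands in for what scanf()'s %d would do.

```python
import re

line = "/usr/sbin/sendmail - 0 errors, 4 warnings"
m = re.search(r"(\S+) - (\d+) errors, (\d+) warnings", line)

# Groups are always strings; convert the numeric ones explicitly.
filename = m.group(1)
errors = int(m.group(2))
warnings = int(m.group(3))

print(filename, errors, warnings)
```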

search() vs. match()

Python offers two different primitive operations based on regular expressions: re.match() checks for a match only at the beginning of the string, while re.search() checks for a match anywhere in the string (this is what Perl does by default).

For example:

>>> re.match("c", "abcdef")    # No match
>>> re.search("c", "abcdef") # Match
<re.Match object; span=(2, 3), match='c'>

Regular expressions beginning with '^' can be used with search() to restrict the match to the beginning of the string:

>>> re.match("c", "abcdef")    # No match
>>> re.search("^c", "abcdef") # No match
>>> re.search("^a", "abcdef") # Match
<re.Match object; span=(0, 1), match='a'>

Note however that in MULTILINE mode match() only matches at the beginning of the string, whereas using search() with a regular expression beginning with '^' will match at the beginning of each line.

>>> re.match('X', 'A\nB\nX', re.MULTILINE)  # No match
>>> re.search('^X', 'A\nB\nX', re.MULTILINE) # Match
<re.Match object; span=(4, 5), match='X'>

Making a Phonebook

split() splits a string into a list delimited by the passed pattern. The method is invaluable for converting textual data into data structures that can be easily read and modified by Python as demonstrated in the following example that creates a phonebook.

First, here is the input. Normally it may come from a file; here we are using triple-quoted string syntax:

>>> text = """Ross McFluff: 834.345.1254 155 Elm Street
...
... Ronald Heathmore: 892.345.3428 436 Finley Avenue
... Frank Burger: 925.541.7625 662 South Dogwood Way
...
...
... Heather Albrecht: 548.326.4584 919 Park Place"""

The entries are separated by one or more newlines. Now we convert the string into a list with each nonempty line having its own entry:

>>> entries = re.split("\n+", text)
>>> entries
['Ross McFluff: 834.345.1254 155 Elm Street',
'Ronald Heathmore: 892.345.3428 436 Finley Avenue',
'Frank Burger: 925.541.7625 662 South Dogwood Way',
'Heather Albrecht: 548.326.4584 919 Park Place']

Finally, split each entry into a list with first name, last name, telephone number, and address. We use the maxsplit parameter of split() because the address has spaces, our splitting pattern, in it:

>>> [re.split(":? ", entry, 3) for entry in entries]
[['Ross', 'McFluff', '834.345.1254', '155 Elm Street'],
['Ronald', 'Heathmore', '892.345.3428', '436 Finley Avenue'],
['Frank', 'Burger', '925.541.7625', '662 South Dogwood Way'],
['Heather', 'Albrecht', '548.326.4584', '919 Park Place']]

The :? pattern matches the colon after the last name, so that it does not occur in the result list. With a maxsplit of 4, we could separate the house number from the street name:

>>> [re.split(":? ", entry, 4) for entry in entries]
[['Ross', 'McFluff', '834.345.1254', '155', 'Elm Street'],
['Ronald', 'Heathmore', '892.345.3428', '436', 'Finley Avenue'],
['Frank', 'Burger', '925.541.7625', '662', 'South Dogwood Way'],
['Heather', 'Albrecht', '548.326.4584', '919', 'Park Place']]
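To show how the split results can feed a data structure, here is a small sketch that turns each entry into a dictionary. The function name parse_phonebook and the field names in FIELDS are assumptions made for this illustration; they are not part of the re module.

```python
import re

text = """Ross McFluff: 834.345.1254 155 Elm Street

Ronald Heathmore: 892.345.3428 436 Finley Avenue
Frank Burger: 925.541.7625 662 South Dogwood Way


Heather Albrecht: 548.326.4584 919 Park Place"""

# Hypothetical field names, chosen for this illustration only.
FIELDS = ("first_name", "last_name", "phone", "address")

def parse_phonebook(raw):
    """Return one dict per nonempty line of raw."""
    entries = re.split(r"\n+", raw)
    # maxsplit=3 keeps the address, which contains spaces, in one piece.
    return [dict(zip(FIELDS, re.split(r":? ", entry, maxsplit=3)))
            for entry in entries]

phonebook = parse_phonebook(text)
print(phonebook[0]["phone"])    # 834.345.1254
print(phonebook[3]["address"])  # 919 Park Place
```

Dictionaries are only one possible target; a named tuple or a small class would work equally well.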

Text Munging

sub() replaces every occurrence of a pattern with a string or the result of a function. This example demonstrates using sub() with a function to “munge” text, that is, to randomize the order of all the characters in each word of a sentence except for the first and last characters:

>>> import random
>>> def repl(m):
...     inner_word = list(m.group(2))
...     random.shuffle(inner_word)
...     return m.group(1) + "".join(inner_word) + m.group(3)
...
>>> text = "Professor Abdolmalek, please report your absences promptly."
>>> re.sub(r"(\w)(\w+)(\w)", repl, text)
'Poefsrosr Aealmlobdk, pslaee reorpt your abnseces plmrptoy.'
>>> re.sub(r"(\w)(\w+)(\w)", repl, text)
'Pofsroser Aodlambelk, plasee reoprt yuor asnebces potlmrpy.'
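Because the replacement function shuffles randomly, each call gives a different result, as the two outputs above show. When a repeatable result is wanted (in a test, for instance), one option is to shuffle with a seeded random.Random instance. The wrapper munge() and its seed parameter below are hypothetical names introduced for this sketch:

```python
import random
import re

def munge(text, seed=None):
    """Shuffle the inner characters of each word; seed makes it repeatable."""
    rng = random.Random(seed)  # private generator, global state is untouched

    def repl(m):
        inner_word = list(m.group(2))
        rng.shuffle(inner_word)
        return m.group(1) + "".join(inner_word) + m.group(3)

    return re.sub(r"(\w)(\w+)(\w)", repl, text)

text = "Professor Abdolmalek, please report your absences promptly."
munged = munge(text, seed=0)
print(munged == munge(text, seed=0))  # True
```

The exact munged string depends on the seed and on the shuffle implementation, so it is not shown here.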

Finding all Adverbs

findall() matches all occurrences of a pattern, not just the first one as search() does. For example, if a writer wanted to find all of the adverbs in some text, they might use findall() in the following manner:

>>> text = "He was carefully disguised but captured quickly by police."
>>> re.findall(r"\w+ly\b", text)
['carefully', 'quickly']
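The shape of the list that findall() returns depends on how many capturing groups the pattern contains. The following sketch reuses the sentence above; the variable names whole, stems, and pairs are chosen only for illustration:

```python
import re

text = "He was carefully disguised but captured quickly by police."

# No groups: each item is the whole match.
whole = re.findall(r"\w+ly\b", text)
print(whole)   # ['carefully', 'quickly']

# One group: each item is the text of that group (here, the adverb's stem).
stems = re.findall(r"(\w+)ly\b", text)
print(stems)   # ['careful', 'quick']

# Several groups: each item is a tuple with one string per group.
pairs = re.findall(r"(\w+)(ly)\b", text)
print(pairs)   # [('careful', 'ly'), ('quick', 'ly')]
```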

Finding all Adverbs and their Positions

If one wants more information about the matches of a pattern than just the matched text, finditer() is useful because it provides match objects instead of strings. Continuing with the previous example, if a writer wanted to find all of the adverbs and their positions in some text, they would use finditer() in the following manner:

>>> text = "He was carefully disguised but captured quickly by police."
>>> for m in re.finditer(r"\w+ly\b", text):
...     print('%02d-%02d: %s' % (m.start(), m.end(), m.group(0)))
07-16: carefully
40-47: quickly
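The positions reported by a match object are ordinary string indices, so they can be stored and later used to slice the original text. A brief sketch, with positions as an illustrative variable name:

```python
import re

text = "He was carefully disguised but captured quickly by police."

# m.span() returns the (start, end) pair in a single call.
positions = [(m.group(0), m.span()) for m in re.finditer(r"\w+ly\b", text)]
print(positions)  # [('carefully', (7, 16)), ('quickly', (40, 47))]

# Slicing the text with a stored span gives back the matched word.
for word, (start, end) in positions:
    print(text[start:end] == word)  # True
```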

Raw String Notation

Raw string notation (r"text") keeps regular expressions sane. Without it, every backslash ('\') in a regular expression would have to be prefixed with another one to escape it. For example, the two following lines of code are functionally identical:

>>> re.match(r"\W(.)\1\W", " ff ")
<re.Match object; span=(0, 4), match=' ff '>
>>> re.match("\\W(.)\\1\\W", " ff ")
<re.Match object; span=(0, 4), match=' ff '>

When one wants to match a literal backslash, it must be escaped in the regular expression. With raw string notation, this means r"\\". Without raw string notation, one must use "\\\\", making the following lines of code functionally identical:

>>> re.match(r"\\", r"\\")
<re.Match object; span=(0, 1), match='\\'>
>>> re.match("\\\\", r"\\")
<re.Match object; span=(0, 1), match='\\'>
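The equivalence can also be checked on the string literals themselves, before any matching takes place. In this sketch the Windows-style path is a made-up example value:

```python
import re

# A raw literal keeps the backslash; a plain literal turns \n into a newline.
print(len(r"\n"), len("\n"))      # 2 1

# Both spellings produce exactly the same pattern string.
same_pattern = r"\W(.)\1\W" == "\\W(.)\\1\\W"
print(same_pattern)               # True

# The pattern for one literal backslash is two characters long either way.
print(len(r"\\"), len("\\\\"))    # 2 2

# Hypothetical path containing a single backslash.
path = r"C:\temp"
backslashes = re.findall(r"\\", path)
print(len(backslashes))           # 1
```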

Writing a Tokenizer

A tokenizer or scanner analyzes a string to categorize groups of characters. This is a useful first step in writing a compiler or interpreter.

The text categories are specified with regular expressions. The technique is to combine those into a single master regular expression and to loop over successive matches:

from typing import NamedTuple
import re

class Token(NamedTuple):
    type: str
    value: str
    line: int
    column: int

def tokenize(code):
    keywords = {'IF', 'THEN', 'ENDIF', 'FOR', 'NEXT', 'GOSUB', 'RETURN'}
    token_specification = [
        ('NUMBER',   r'\d+(\.\d*)?'),  # Integer or decimal number
        ('ASSIGN',   r':='),           # Assignment operator
        ('END',      r';'),            # Statement terminator
        ('ID',       r'[A-Za-z]+'),    # Identifiers
        ('OP',       r'[+\-*/]'),      # Arithmetic operators
        ('NEWLINE',  r'\n'),           # Line endings
        ('SKIP',     r'[ \t]+'),       # Skip over spaces and tabs
        ('MISMATCH', r'.'),            # Any other character
    ]
    tok_regex = '|'.join('(?P<%s>%s)' % pair for pair in token_specification)
    line_num = 1
    line_start = 0
    for mo in re.finditer(tok_regex, code):
        kind = mo.lastgroup
        value = mo.group()
        column = mo.start() - line_start
        if kind == 'NUMBER':
            value = float(value) if '.' in value else int(value)
        elif kind == 'ID' and value in keywords:
            kind = value
        elif kind == 'NEWLINE':
            line_start = mo.end()
            line_num += 1
            continue
        elif kind == 'SKIP':
            continue
        elif kind == 'MISMATCH':
            raise RuntimeError(f'{value!r} unexpected on line {line_num}')
        yield Token(kind, value, line_num, column)

statements = '''
    IF quantity THEN
        total := total + price * quantity;
        tax := price * 0.05;
    ENDIF;
'''

for token in tokenize(statements):
    print(token)

The tokenizer produces the following output:

Token(type='IF', value='IF', line=2, column=4)
Token(type='ID', value='quantity', line=2, column=7)
Token(type='THEN', value='THEN', line=2, column=16)
Token(type='ID', value='total', line=3, column=8)
Token(type='ASSIGN', value=':=', line=3, column=14)
Token(type='ID', value='total', line=3, column=17)
Token(type='OP', value='+', line=3, column=23)
Token(type='ID', value='price', line=3, column=25)
Token(type='OP', value='*', line=3, column=31)
Token(type='ID', value='quantity', line=3, column=33)
Token(type='END', value=';', line=3, column=41)
Token(type='ID', value='tax', line=4, column=8)
Token(type='ASSIGN', value=':=', line=4, column=12)
Token(type='ID', value='price', line=4, column=15)
Token(type='OP', value='*', line=4, column=21)
Token(type='NUMBER', value=0.05, line=4, column=23)
Token(type='END', value=';', line=4, column=27)
Token(type='ENDIF', value='ENDIF', line=5, column=4)
Token(type='END', value=';', line=5, column=9)
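The core of the technique, separated from line and column tracking, can be sketched in a few lines: each category becomes a named group, the groups are joined with |, and lastgroup reports which category produced a given match. The names spec and master, and the three categories, are assumptions made for this reduced illustration:

```python
import re

# Hypothetical, reduced specification: (category name, pattern) pairs.
spec = [
    ("NUMBER", r"\d+"),        # Integer
    ("ID",     r"[A-Za-z]+"),  # Identifier
    ("SKIP",   r"\s+"),        # Whitespace, dropped below
]

# One named group per category, joined into a single alternation.
master = "|".join(f"(?P<{name}>{pattern})" for name, pattern in spec)

# lastgroup is the name of the alternative that matched.
kinds = [(m.lastgroup, m.group())
         for m in re.finditer(master, "x1 42")
         if m.lastgroup != "SKIP"]
print(kinds)  # [('ID', 'x'), ('NUMBER', '1'), ('NUMBER', '42')]
```

Alternatives are tried from left to right, so the order of the pairs matters whenever two patterns could match at the same position.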
[Frie09] Friedl, Jeffrey. Mastering Regular Expressions. 3rd ed., O’Reilly Media, 2009. The third edition of the book no longer covers Python at all, but the first edition covered writing good regular expression patterns in great detail.