difflib: Helpers for computing deltas
Source code: Lib/difflib.py
This module provides classes and functions for comparing sequences. It can be used, for example, for comparing files, and can produce information about file differences in various formats, including HTML and context and unified diffs. For comparing directories and files, see also the filecmp module.
class difflib.SequenceMatcher
This is a flexible class for comparing pairs of sequences of any type, so long as the sequence elements are hashable. The basic algorithm predates, and is a little fancier than, an algorithm published in the late 1980s by Ratcliff and Obershelp under the hyperbolic name “gestalt pattern matching.” The idea is to find the longest contiguous matching subsequence that contains no “junk” elements; these “junk” elements are ones that are uninteresting in some sense, such as blank lines or whitespace. (Handling junk is an extension to the Ratcliff and Obershelp algorithm.) The same idea is then applied recursively to the pieces of the sequences to the left and to the right of the matching subsequence. This does not yield minimal edit sequences, but does tend to yield matches that “look right” to people.
Timing: The basic Ratcliff-Obershelp algorithm is cubic time in the worst case and quadratic time in the expected case. SequenceMatcher is quadratic time for the worst case and has expected-case behavior dependent in a complicated way on how many elements the sequences have in common; best case time is linear.
Automatic junk heuristic: SequenceMatcher supports a heuristic that automatically treats certain sequence items as junk. The heuristic counts how many times each individual item appears in the sequence. If an item’s duplicates (after the first one) account for more than 1% of the sequence and the sequence is at least 200 items long, this item is marked as “popular” and is treated as junk for the purpose of sequence matching. This heuristic can be turned off by setting the autojunk argument to False when creating the SequenceMatcher.
New in version 3.2: The autojunk parameter.
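The effect of the automatic junk heuristic can be seen with a small experiment. The sketch below uses made-up input (a 300-item string in which "a" and "b" each occur 150 times) and inspects the bpopular attribute with the heuristic on and off; the variable names are illustrative only.

```python
from difflib import SequenceMatcher

# Hypothetical input: the second sequence has 300 items, and both "a" and "b"
# occur far more often than 1% of the time, so the heuristic marks them popular.
b = "ab" * 150
a = "x" + b

with_heuristic = SequenceMatcher(None, a, b)                    # autojunk=True is the default
without_heuristic = SequenceMatcher(None, a, b, autojunk=False)

print(sorted(with_heuristic.bpopular))      # items set aside by the heuristic
print(sorted(without_heuristic.bpopular))   # nothing is set aside when it is disabled
print(with_heuristic.ratio(), without_heuristic.ratio())
```

With the heuristic enabled the two nearly identical strings score much lower, because the popular items are not used as match anchors. For long inputs built from a small alphabet, passing autojunk=False is often the more useful setting, at the cost of more work.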
class difflib.Differ
This is a class for comparing sequences of lines of text, and producing human-readable differences or deltas. Differ uses SequenceMatcher both to compare sequences of lines, and to compare sequences of characters within similar (near-matching) lines.
Each line of a Differ delta begins with a two-letter code:
Code    Meaning
'- '    line unique to sequence 1
'+ '    line unique to sequence 2
'  '    line common to both sequences
'? '    line not present in either input sequence
Lines beginning with ‘?’ attempt to guide the eye to intraline differences, and were not present in either input sequence. These lines can be confusing if the sequences contain tab characters.
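Because every delta line starts with one of these two-letter codes, a delta can be split by prefix. The following is a minimal sketch with made-up input lines; the names old, new, removed and added are illustrative.

```python
from difflib import Differ

old = ["alpha\n", "beta\n"]
new = ["alpha\n", "gamma\n"]
delta = list(Differ().compare(old, new))

# Slice off the two-character code to recover the bare line text.
removed = [line[2:] for line in delta if line.startswith("- ")]
added = [line[2:] for line in delta if line.startswith("+ ")]

print(removed)  # lines unique to the first sequence
print(added)    # lines unique to the second sequence
```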
class difflib.HtmlDiff
This class can be used to create an HTML table (or a complete HTML file containing the table) showing a side by side, line by line comparison of text with inter-line and intra-line change highlights. The table can be generated in either full or contextual difference mode.
The constructor for this class is:
__init__(tabsize=8, wrapcolumn=None, linejunk=None, charjunk=IS_CHARACTER_JUNK)
Initializes instance of HtmlDiff.
tabsize is an optional keyword argument to specify tab stop spacing and defaults to 8.
wrapcolumn is an optional keyword to specify the column number where lines are broken and wrapped; it defaults to None, in which case lines are not wrapped.
linejunk and charjunk are optional keyword arguments passed into ndiff() (used by HtmlDiff to generate the side by side HTML differences). See ndiff() documentation for argument default values and descriptions.
The following methods are public:
make_file(fromlines, tolines, fromdesc='', todesc='', context=False, numlines=5, *, charset='utf-8')
Compares fromlines and tolines (lists of strings) and returns a string which is a complete HTML file containing a table showing line by line differences with inter-line and intra-line changes highlighted.
fromdesc and todesc are optional keyword arguments to specify from/to file column header strings (both default to an empty string).
context and numlines are both optional keyword arguments. Set context to True when contextual differences are to be shown, else the default is False to show the full files. numlines defaults to 5. When context is True, numlines controls the number of context lines which surround the difference highlights. When context is False, numlines controls the number of lines which are shown before a difference highlight when using the “next” hyperlinks (setting to zero would cause the “next” hyperlinks to place the next difference highlight at the top of the browser without any leading context).
Note: fromdesc and todesc are interpreted as unescaped HTML and should be properly escaped while receiving input from untrusted sources.
Changed in version 3.5: charset keyword-only argument was added. The default charset of HTML document changed from 'ISO-8859-1' to 'utf-8'.
make_table(fromlines, tolines, fromdesc='', todesc='', context=False, numlines=5)
Compares fromlines and tolines (lists of strings) and returns a string which is a complete HTML table showing line by line differences with inter-line and intra-line changes highlighted.
The arguments for this method are the same as those for the make_file() method.
Tools/scripts/diff.py is a command-line front-end to this class and contains a good example of its use.
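As a rough illustration of make_table() and of the escaping note above, the sketch below builds a contextual table and passes the column headers through html.escape() first. The input lines and header strings are invented for the example.

```python
import difflib
import html

fromlines = ["spam\n", "eggs\n", "toast\n"]
tolines = ["spam\n", "ham\n", "toast\n"]

# Header text is inserted into the page as raw HTML, so escape anything
# that might come from an untrusted source before passing it in.
fromdesc = html.escape("<old menu>")
todesc = html.escape("<new menu>")

table = difflib.HtmlDiff(tabsize=4, wrapcolumn=60).make_table(
    fromlines, tolines, fromdesc=fromdesc, todesc=todesc,
    context=True, numlines=1)

print(table[:60])  # start of the generated <table> markup
```

make_file() accepts the same arguments and wraps the table in a complete HTML document, which is convenient when the result is written straight to a file.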
difflib.context_diff(a, b, fromfile='', tofile='', fromfiledate='', tofiledate='', n=3, lineterm='\n')
Compare a and b (lists of strings); return a delta (a generator generating the delta lines) in context diff format.
Context diffs are a compact way of showing just the lines that have changed plus a few lines of context. The changes are shown in a before/after style. The number of context lines is set by n which defaults to three.
By default, the diff control lines (those with *** or ---) are created with a trailing newline. This is helpful so that inputs created from io.IOBase.readlines() result in diffs that are suitable for use with io.IOBase.writelines() since both the inputs and outputs have trailing newlines.
For inputs that do not have trailing newlines, set the lineterm argument to "" so that the output will be uniformly newline free.
The context diff format normally has a header for filenames and modification times. Any or all of these may be specified using strings for fromfile, tofile, fromfiledate, and tofiledate. The modification times are normally expressed in the ISO 8601 format. If not specified, the strings default to blanks.
>>> import sys
>>> from difflib import context_diff
>>> s1 = ['bacon\n', 'eggs\n', 'ham\n', 'guido\n']
>>> s2 = ['python\n', 'eggy\n', 'hamster\n', 'guido\n']
>>> sys.stdout.writelines(context_diff(s1, s2, fromfile='before.py', tofile='after.py'))
*** before.py
--- after.py
***************
*** 1,4 ****
! bacon
! eggs
! ham
  guido
--- 1,4 ----
! python
! eggy
! hamster
  guido
See A command-line interface to difflib for a more detailed example.
difflib.get_close_matches(word, possibilities, n=3, cutoff=0.6)
Return a list of the best “good enough” matches. word is a sequence for which close matches are desired (typically a string), and possibilities is a list of sequences against which to match word (typically a list of strings).
Optional argument n (default 3) is the maximum number of close matches to return; n must be greater than 0.
Optional argument cutoff (default 0.6) is a float in the range [0, 1]. Possibilities that don’t score at least that similar to word are ignored.
The best (no more than n) matches among the possibilities are returned in a list, sorted by similarity score, most similar first.
>>> from difflib import get_close_matches
>>> get_close_matches('appel', ['ape', 'apple', 'peach', 'puppy'])
['apple', 'ape']
>>> import keyword
>>> get_close_matches('wheel', keyword.kwlist)
['while']
>>> get_close_matches('pineapple', keyword.kwlist)
[]
>>> get_close_matches('accept', keyword.kwlist)
['except']
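The n and cutoff arguments interact in a simple way: cutoff filters candidates, then n caps how many survive. A small sketch, assuming a hypothetical list of command names:

```python
from difflib import get_close_matches

# Hypothetical vocabulary, for example the subcommands of a CLI tool.
commands = ["status", "commit", "checkout", "cherry-pick"]

best = get_close_matches("comit", commands, n=1)         # keep only the top match
loose = get_close_matches("stat", commands)              # default cutoff of 0.6
strict = get_close_matches("stat", commands, cutoff=0.9) # demand a near-exact match

print(best, loose, strict)
```

Raising cutoff is a reasonable way to avoid suggesting a wrong command; here "stat" is similar enough to "status" for the default cutoff but not for 0.9.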
difflib.ndiff(a, b, linejunk=None, charjunk=IS_CHARACTER_JUNK)
Compare a and b (lists of strings); return a Differ-style delta (a generator generating the delta lines).
Optional keyword parameters linejunk and charjunk are filtering functions (or None):
linejunk: A function that accepts a single string argument, and returns true if the string is junk, or false if not. The default is None. There is also a module-level function IS_LINE_JUNK(), which filters out lines without visible characters, except for at most one pound character ('#'); however, the underlying SequenceMatcher class does a dynamic analysis of which lines are so frequent as to constitute noise, and this usually works better than using this function.
charjunk: A function that accepts a character (a string of length 1), and returns true if the character is junk, or false if not. The default is module-level function IS_CHARACTER_JUNK(), which filters out whitespace characters (a blank or tab; it’s a bad idea to include newline in this!).
Tools/scripts/ndiff.py is a command-line front-end to this function.
>>> from difflib import ndiff
>>> diff = ndiff('one\ntwo\nthree\n'.splitlines(keepends=True),
...              'ore\ntree\nemu\n'.splitlines(keepends=True))
>>> print(''.join(diff), end="")
- one
?  ^
+ ore
?  ^
- two
- three
?  -
+ tree
+ emu
difflib.restore(sequence, which)
Return one of the two sequences that generated a delta.
Given a sequence produced by Differ.compare() or ndiff(), extract lines originating from file 1 or 2 (parameter which), stripping off line prefixes.
Example:
>>> from difflib import ndiff, restore
>>> diff = ndiff('one\ntwo\nthree\n'.splitlines(keepends=True),
...              'ore\ntree\nemu\n'.splitlines(keepends=True))
>>> diff = list(diff) # materialize the generated delta into a list
>>> print(''.join(restore(diff, 1)), end="")
one
two
three
>>> print(''.join(restore(diff, 2)), end="")
ore
tree
emu
difflib.unified_diff(a, b, fromfile='', tofile='', fromfiledate='', tofiledate='', n=3, lineterm='\n')
Compare a and b (lists of strings); return a delta (a generator generating the delta lines) in unified diff format.
Unified diffs are a compact way of showing just the lines that have changed plus a few lines of context. The changes are shown in an inline style (instead of separate before/after blocks). The number of context lines is set by n which defaults to three.
By default, the diff control lines (those with ---, +++, or @@) are created with a trailing newline. This is helpful so that inputs created from io.IOBase.readlines() result in diffs that are suitable for use with io.IOBase.writelines() since both the inputs and outputs have trailing newlines.
For inputs that do not have trailing newlines, set the lineterm argument to "" so that the output will be uniformly newline free.
The unified diff format normally has a header for filenames and modification times. Any or all of these may be specified using strings for fromfile, tofile, fromfiledate, and tofiledate. The modification times are normally expressed in the ISO 8601 format. If not specified, the strings default to blanks.
>>> import sys
>>> from difflib import unified_diff
>>> s1 = ['bacon\n', 'eggs\n', 'ham\n', 'guido\n']
>>> s2 = ['python\n', 'eggy\n', 'hamster\n', 'guido\n']
>>> sys.stdout.writelines(unified_diff(s1, s2, fromfile='before.py', tofile='after.py'))
--- before.py
+++ after.py
@@ -1,4 +1,4 @@
-bacon
-eggs
-ham
+python
+eggy
+hamster
 guido
See A command-line interface to difflib for a more detailed example.
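The lineterm advice can be illustrated with inputs that carry no trailing newlines, for example the result of str.splitlines(). This is a minimal sketch; the file names a.txt and b.txt are placeholders.

```python
from difflib import unified_diff

# Lines without trailing newlines, as produced by str.splitlines().
old = "one\ntwo".splitlines()
new = "one\nthree".splitlines()

# lineterm="" keeps the control lines newline free as well,
# so the whole delta can be joined with "\n" in one step.
delta = list(unified_diff(old, new, fromfile="a.txt", tofile="b.txt", lineterm=""))
print("\n".join(delta))
```

The same argument works for context_diff(). Mixing newline-terminated control lines with bare content lines is a common source of oddly spaced output, which is what lineterm="" avoids.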
difflib.diff_bytes(dfunc, a, b, fromfile=b'', tofile=b'', fromfiledate=b'', tofiledate=b'', n=3, lineterm=b'\n')
Compare a and b (lists of bytes objects) using dfunc; yield a sequence of delta lines (also bytes) in the format returned by dfunc. dfunc must be a callable, typically either unified_diff() or context_diff().
Allows you to compare data with unknown or inconsistent encoding. All inputs except n must be bytes objects, not str. Works by losslessly converting all inputs (except n) to str, and calling dfunc(a, b, fromfile, tofile, fromfiledate, tofiledate, n, lineterm). The output of dfunc is then converted back to bytes, so the delta lines that you receive have the same unknown/inconsistent encodings as a and b.
New in version 3.5.
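A short sketch of diff_bytes() on inputs whose encodings disagree: the first list below holds a Latin-1 encoded line and the second holds the same text encoded as UTF-8. The data and the names old and new are invented for the example.

```python
import difflib

# The same word, "café", encoded two different ways.
old = [b"caf\xe9\n", b"tea\n"]        # Latin-1 bytes
new = [b"caf\xc3\xa9\n", b"tea\n"]    # UTF-8 bytes

delta = list(difflib.diff_bytes(difflib.unified_diff, old, new, b"old", b"new"))

for line in delta:
    print(line)  # every delta line is a bytes object
```

No decoding decision is needed: the original bytes pass through unchanged, so the delta can be written directly to a file opened in binary mode.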
difflib.IS_LINE_JUNK(line)
Return True for ignorable lines. The line line is ignorable if line is blank or contains a single '#', otherwise it is not ignorable. Used as a default for parameter linejunk in ndiff() in older versions.
difflib.IS_CHARACTER_JUNK(ch)
Return True for ignorable characters. The character ch is ignorable if ch is a space or tab, otherwise it is not ignorable. Used as a default for parameter charjunk in ndiff().
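The two junk predicates are ordinary functions and can be called directly to see what they classify as ignorable. The sample strings below are illustrative.

```python
from difflib import IS_LINE_JUNK, IS_CHARACTER_JUNK

# Lines: blank lines and lines holding only a single '#' are junk.
line_results = {
    "   \n": IS_LINE_JUNK("   \n"),
    "  #  \n": IS_LINE_JUNK("  #  \n"),
    "# comment\n": IS_LINE_JUNK("# comment\n"),
}

# Characters: only a space or a tab is junk; a newline is not.
char_results = {ch: IS_CHARACTER_JUNK(ch) for ch in (" ", "\t", "\n", "x")}

print(line_results)
print(char_results)
```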
See also
Pattern Matching: The Gestalt Approach
Discussion of a similar algorithm by John W. Ratcliff and D. E. Metzener. This was published in Dr. Dobb’s Journal in July, 1988.
SequenceMatcher Objects
The SequenceMatcher class has this constructor:
class difflib.SequenceMatcher(isjunk=None, a='', b='', autojunk=True)
Optional argument isjunk must be None (the default) or a one-argument function that takes a sequence element and returns true if and only if the element is “junk” and should be ignored. Passing None for isjunk is equivalent to passing lambda x: False; in other words, no elements are ignored. For example, pass:
lambda x: x in " \t"
if you’re comparing lines as sequences of characters, and don’t want to synch up on blanks or hard tabs.
The optional arguments a and b are sequences to be compared; both default to empty strings. The elements of both sequences must be hashable.
The optional argument autojunk can be used to disable the automatic junk heuristic.
New in version 3.2: The autojunk parameter.
SequenceMatcher objects get three data attributes: bjunk is the set of elements of b for which isjunk is True; bpopular is the set of non-junk elements considered popular by the heuristic (if it is not disabled); b2j is a dict mapping the remaining elements of b to a list of positions where they occur. All three are reset whenever b is reset with set_seqs() or set_seq2().
New in version 3.2: The bjunk and bpopular attributes.
SequenceMatcher objects have the following methods:
set_seqs(a, b)
Set the two sequences to be compared.
SequenceMatcher computes and caches detailed information about the second sequence, so if you want to compare one sequence against many sequences, use set_seq2() to set the commonly used sequence once and call set_seq1() repeatedly, once for each of the other sequences.
set_seq1(a)
Set the first sequence to be compared. The second sequence to be compared is not changed.
set_seq2(b)
Set the second sequence to be compared. The first sequence to be compared is not changed.
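The caching advice for the second sequence can be sketched as a loop that fixes the target once with set_seq2() and swaps candidates in with set_seq1(). The target word and candidate list are made up for the example.

```python
from difflib import SequenceMatcher

target = "apple"
candidates = ["ape", "apple", "peach"]

matcher = SequenceMatcher()
matcher.set_seq2(target)          # analysed once and cached

scores = {}
for candidate in candidates:
    matcher.set_seq1(candidate)   # cheap: only the first sequence changes
    scores[candidate] = matcher.ratio()

best = max(scores, key=scores.get)
print(best, scores)
```

Putting the fixed sequence in the second position matters only for speed; the loop would give similar scores with the roles swapped, but the cached analysis of the target would then be rebuilt on every iteration.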
find_longest_match(alo=0, ahi=None, blo=0, bhi=None)
Find longest matching block in a[alo:ahi] and b[blo:bhi].
If isjunk was omitted or None, find_longest_match() returns (i, j, k) such that a[i:i+k] is equal to b[j:j+k], where alo <= i <= i+k <= ahi and blo <= j <= j+k <= bhi. For all (i', j', k') meeting those conditions, the additional conditions k >= k', i <= i', and if i == i', j <= j' are also met. In other words, of all maximal matching blocks, return one that starts earliest in a, and of all those maximal matching blocks that start earliest in a, return the one that starts earliest in b.
>>> from difflib import SequenceMatcher
>>> s = SequenceMatcher(None, " abcd", "abcd abcd")
>>> s.find_longest_match(0, 5, 0, 9)
Match(a=0, b=4, size=5)
If isjunk was provided, first the longest matching block is determined as above, but with the additional restriction that no junk element appears in the block. Then that block is extended as far as possible by matching (only) junk elements on both sides. So the resulting block never matches on junk except as identical junk happens to be adjacent to an interesting match.
Here’s the same example as before, but considering blanks to be junk. That prevents ' abcd' from matching the ' abcd' at the tail end of the second sequence directly. Instead only the 'abcd' can match, and matches the leftmost 'abcd' in the second sequence:
>>> s = SequenceMatcher(lambda x: x==" ", " abcd", "abcd abcd")
>>> s.find_longest_match(0, 5, 0, 9)
Match(a=1, b=0, size=4)
If no blocks match, this returns (alo, blo, 0).
This method returns a named tuple Match(a, b, size).
Changed in version 3.9: Added default arguments.
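With the default arguments (available from Python 3.9 on, per the note above), the whole of both sequences is searched, and the bounds can be narrowed selectively by keyword. A small sketch reusing the strings from the example above:

```python
from difflib import SequenceMatcher

s = SequenceMatcher(None, " abcd", "abcd abcd")

whole = s.find_longest_match()                   # same as (0, len(a), 0, len(b))
explicit = s.find_longest_match(0, 5, 0, 9)
first_word = s.find_longest_match(alo=0, ahi=5, blo=0, bhi=4)  # only b[0:4]

print(whole, explicit, first_word)
```

Restricting bhi to 4 hides the trailing " abcd" of the second sequence, so only the leading "abcd" is available to match.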
get_matching_blocks()
Return list of triples describing non-overlapping matching subsequences. Each triple is of the form (i, j, n), and means that a[i:i+n] == b[j:j+n]. The triples are monotonically increasing in i and j.
The last triple is a dummy, and has the value (len(a), len(b), 0). It is the only triple with n == 0. If (i, j, n) and (i', j', n') are adjacent triples in the list, and the second is not the last triple in the list, then i+n < i' or j+n < j'; in other words, adjacent triples always describe non-adjacent equal blocks.
>>> from difflib import SequenceMatcher
>>> s = SequenceMatcher(None, "abxcd", "abcd")
>>> s.get_matching_blocks()
[Match(a=0, b=0, size=2), Match(a=3, b=2, size=2), Match(a=5, b=4, size=0)]
get_opcodes()
Return list of 5-tuples describing how to turn a into b. Each tuple is of the form (tag, i1, i2, j1, j2). The first tuple has i1 == j1 == 0, and remaining tuples have i1 equal to the i2 from the preceding tuple, and, likewise, j1 equal to the previous j2.
The tag values are strings, with these meanings:
Value        Meaning
'replace'    a[i1:i2] should be replaced by b[j1:j2].
'delete'     a[i1:i2] should be deleted. Note that j1 == j2 in this case.
'insert'     b[j1:j2] should be inserted at a[i1:i1]. Note that i1 == i2 in this case.
'equal'      a[i1:i2] == b[j1:j2] (the sub-sequences are equal).
For example:
>>> from difflib import SequenceMatcher
>>> a = "qabxcd"
>>> b = "abycdf"
>>> s = SequenceMatcher(None, a, b)
>>> for tag, i1, i2, j1, j2 in s.get_opcodes():
...     print('{:7}   a[{}:{}] --> b[{}:{}] {!r:>8} --> {!r}'.format(
...         tag, i1, i2, j1, j2, a[i1:i2], b[j1:j2]))
delete    a[0:1] --> b[0:0]      'q' --> ''
equal     a[1:3] --> b[0:2]     'ab' --> 'ab'
replace   a[3:4] --> b[2:3]      'x' --> 'y'
equal     a[4:6] --> b[3:5]     'cd' --> 'cd'
insert    a[6:6] --> b[5:6]       '' --> 'f'
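One way to read the opcode table is as an edit script: walking the tuples in order and copying from a or b as each tag dictates rebuilds b. The helper below, apply_opcodes, is a hypothetical name introduced for this sketch and is not part of difflib.

```python
from difflib import SequenceMatcher

def apply_opcodes(a, b):
    """Rebuild b from a by following the opcodes (illustrative helper)."""
    pieces = []
    for tag, i1, i2, j1, j2 in SequenceMatcher(None, a, b).get_opcodes():
        if tag == "equal":
            pieces.append(a[i1:i2])      # unchanged run, taken from a
        elif tag in ("replace", "insert"):
            pieces.append(b[j1:j2])      # new material, taken from b
        # 'delete' contributes nothing to the result
    return "".join(pieces)

rebuilt = apply_opcodes("qabxcd", "abycdf")
print(rebuilt)
```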
get_grouped_opcodes(n=3)
Return a generator of groups with up to n lines of context.
Starting with the groups returned by get_opcodes(), this method splits out smaller change clusters and eliminates intervening ranges which have no changes.
The groups are returned in the same format as get_opcodes().
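To see the grouping at work, the sketch below changes two widely separated items in a 20-item list and asks for one line of context. The input is synthetic.

```python
from difflib import SequenceMatcher

a = [str(i) for i in range(20)]
b = list(a)
b[2] = "x"      # first change
b[15] = "y"     # second change, far from the first

groups = list(SequenceMatcher(None, a, b).get_grouped_opcodes(n=1))

for group in groups:
    print(group)  # each group is a list of (tag, i1, i2, j1, j2) tuples
```

The long unchanged run between the two edits is trimmed to one line of context on each side, which splits the opcodes into two groups. This is the same clustering that produces the separate hunks of a unified or context diff.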
ratio()
Return a measure of the sequences’ similarity as a float in the range [0, 1].
Where T is the total number of elements in both sequences, and M is the number of matches, this is 2.0*M / T. Note that this is 1.0 if the sequences are identical, and 0.0 if they have nothing in common.
This is expensive to compute if get_matching_blocks() or get_opcodes() hasn’t already been called, in which case you may want to try quick_ratio() or real_quick_ratio() first to get an upper bound.
The three methods that return the ratio of matching to total characters can give different results due to differing levels of approximation, although quick_ratio() and real_quick_ratio() are always at least as large as ratio():
>>> from difflib import SequenceMatcher
>>> s = SequenceMatcher(None, "abcd", "bcde")
>>> s.ratio()
0.75
>>> s.quick_ratio()
0.75
>>> s.real_quick_ratio()
1.0
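Because the two quick variants are upper bounds on ratio(), they can serve as cheap early rejections before the expensive computation. The helper is_similar below is a hypothetical name for this sketch, and the cutoff of 0.6 is just the rule-of-thumb value.

```python
from difflib import SequenceMatcher

def is_similar(a, b, cutoff=0.6):
    """Return True if ratio() >= cutoff, trying the cheap bounds first."""
    s = SequenceMatcher(None, a, b)
    # Short-circuit evaluation: ratio() only runs if both upper bounds pass.
    return (s.real_quick_ratio() >= cutoff
            and s.quick_ratio() >= cutoff
            and s.ratio() >= cutoff)

# For "abcd" vs "bcde": M = 3 matching elements, T = 8, so 2.0*M / T = 0.75.
print(is_similar("abcd", "bcde"))
print(is_similar("abcd", "wxyz"))
```

For the second pair, real_quick_ratio() is 1.0 because the lengths agree, but quick_ratio() is 0.0 since no characters are shared, so ratio() is never computed.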
SequenceMatcher Examples
This example compares two strings, considering blanks to be “junk”:
>>> from difflib import SequenceMatcher
>>> s = SequenceMatcher(lambda x: x == " ",
...                     "private Thread currentThread;",
...                     "private volatile Thread currentThread;")
ratio() returns a float in [0, 1], measuring the similarity of the sequences. As a rule of thumb, a ratio() value over 0.6 means the sequences are close matches:
>>> print(round(s.ratio(), 3))
0.866
If you’re only interested in where the sequences match, get_matching_blocks() is handy:
>>> for block in s.get_matching_blocks():
...     print("a[%d] and b[%d] match for %d elements" % block)
a[0] and b[0] match for 8 elements
a[8] and b[17] match for 21 elements
a[29] and b[38] match for 0 elements
Note that the last tuple returned by get_matching_blocks() is always a dummy, (len(a), len(b), 0), and this is the only case in which the last tuple element (number of elements matched) is 0.
If you want to know how to change the first sequence into the second, use get_opcodes():
>>> for opcode in s.get_opcodes():
...     print("%6s a[%d:%d] b[%d:%d]" % opcode)
 equal a[0:8] b[0:8]
insert a[8:8] b[8:17]
 equal a[8:29] b[17:38]
See also
The get_close_matches() function in this module, which shows how simple code building on SequenceMatcher can be used to do useful work.
Simple version control recipe for a small application built with SequenceMatcher.
Differ Objects
Note that Differ-generated deltas make no claim to be minimal diffs. To the contrary, minimal diffs are often counter-intuitive, because they synch up anywhere possible, sometimes accidental matches 100 pages apart. Restricting synch points to contiguous matches preserves some notion of locality, at the occasional cost of producing a longer diff.
The Differ class has this constructor:
class difflib.Differ(linejunk=None, charjunk=None)
Optional keyword parameters linejunk and charjunk are for filter functions (or None):
linejunk: A function that accepts a single string argument, and returns true if the string is junk. The default is None, meaning that no line is considered junk.
charjunk: A function that accepts a single character argument (a string of length 1), and returns true if the character is junk. The default is None, meaning that no character is considered junk.
These junk-filtering functions speed up matching to find differences and do not cause any differing lines or characters to be ignored. Read the description of the find_longest_match() method’s isjunk parameter for an explanation.
Differ objects are used (deltas generated) via a single method:
compare(a, b)
Compare two sequences of lines, and generate the delta (a sequence of lines).
Each sequence must contain individual single-line strings ending with newlines. Such sequences can be obtained from the readlines() method of file-like objects. The delta generated also consists of newline-terminated strings, ready to be printed as-is via the writelines() method of a file-like object.
Differ Example
This example compares two texts. First we set up the texts, sequences of individual single-line strings ending with newlines (such sequences can also be obtained from the readlines() method of file-like objects):
>>> text1 = '''  1. Beautiful is better than ugly.
...   2. Explicit is better than implicit.
...   3. Simple is better than complex.
...   4. Complex is better than complicated.
... '''.splitlines(keepends=True)
>>> len(text1)
4
>>> text1[0][-1]
'\n'
>>> text2 = '''  1. Beautiful is better than ugly.
...   3.   Simple is better than complex.
...   4. Complicated is better than complex.
...   5. Flat is better than nested.
... '''.splitlines(keepends=True)
Next we instantiate a Differ object:
>>> from difflib import Differ
>>> d = Differ()
Note that when instantiating a Differ object we may pass functions to filter out line and character “junk.” See the Differ() constructor for details.
Finally, we compare the two:
>>> result = list(d.compare(text1, text2))
result is a list of strings, so let’s pretty-print it:
>>> from pprint import pprint
>>> pprint(result)
['    1. Beautiful is better than ugly.\n',
 '-   2. Explicit is better than implicit.\n',
 '-   3. Simple is better than complex.\n',
 '+   3.   Simple is better than complex.\n',
 '?     ++\n',
 '-   4. Complex is better than complicated.\n',
 '?            ^                     ---- ^\n',
 '+   4. Complicated is better than complex.\n',
 '?           ++++ ^                      ^\n',
 '+   5. Flat is better than nested.\n']
As a single multi-line string it looks like this:
>>> import sys
>>> sys.stdout.writelines(result)
    1. Beautiful is better than ugly.
-   2. Explicit is better than implicit.
-   3. Simple is better than complex.
+   3.   Simple is better than complex.
?     ++
-   4. Complex is better than complicated.
?            ^                     ---- ^
+   4. Complicated is better than complex.
?           ++++ ^                      ^
+   5. Flat is better than nested.
A command-line interface to difflib
This example shows how to use difflib to create a diff-like utility. It is also contained in the Python source distribution, as Tools/scripts/diff.py.
#!/usr/bin/env python3
""" Command line interface to difflib.py providing diffs in four formats:

* ndiff:    lists every line and highlights interline changes.
* context:  highlights clusters of changes in a before/after format.
* unified:  highlights clusters of changes in an inline format.
* html:     generates side by side comparison with change highlights.

"""

import sys, os, difflib, argparse
from datetime import datetime, timezone

def file_mtime(path):
    t = datetime.fromtimestamp(os.stat(path).st_mtime,
                               timezone.utc)
    return t.astimezone().isoformat()

def main():

    parser = argparse.ArgumentParser()
    parser.add_argument('-c', action='store_true', default=False,
                        help='Produce a context format diff (default)')
    parser.add_argument('-u', action='store_true', default=False,
                        help='Produce a unified format diff')
    parser.add_argument('-m', action='store_true', default=False,
                        help='Produce HTML side by side diff '
                             '(can use -c and -l in conjunction)')
    parser.add_argument('-n', action='store_true', default=False,
                        help='Produce a ndiff format diff')
    parser.add_argument('-l', '--lines', type=int, default=3,
                        help='Set number of context lines (default 3)')
    parser.add_argument('fromfile')
    parser.add_argument('tofile')
    options = parser.parse_args()

    n = options.lines
    fromfile = options.fromfile
    tofile = options.tofile

    fromdate = file_mtime(fromfile)
    todate = file_mtime(tofile)
    with open(fromfile) as ff:
        fromlines = ff.readlines()
    with open(tofile) as tf:
        tolines = tf.readlines()

    if options.u:
        diff = difflib.unified_diff(fromlines, tolines, fromfile, tofile, fromdate, todate, n=n)
    elif options.n:
        diff = difflib.ndiff(fromlines, tolines)
    elif options.m:
        diff = difflib.HtmlDiff().make_file(fromlines,tolines,fromfile,tofile,context=options.c,numlines=n)
    else:
        diff = difflib.context_diff(fromlines, tolines, fromfile, tofile, fromdate, todate, n=n)

    sys.stdout.writelines(diff)

if __name__ == '__main__':
    main()