difflib: Helpers for computing deltas

Source code: Lib/difflib.py


This module provides classes and functions for comparing sequences. It can be used, for example, to compare files, and can produce information about file differences in various formats, including HTML and context and unified diffs. For comparing directories and files, see also the filecmp module.

class difflib.SequenceMatcher

This is a flexible class for comparing pairs of sequences of any type, so long as the sequence elements are hashable. The basic algorithm predates, and is a little fancier than, an algorithm published in the late 1980s by Ratcliff and Obershelp under the hyperbolic name “gestalt pattern matching.” The idea is to find the longest contiguous matching subsequence that contains no “junk” elements; these “junk” elements are ones that are uninteresting in some sense, such as blank lines or whitespace. (Handling junk is an extension to the Ratcliff and Obershelp algorithm.) The same idea is then applied recursively to the pieces of the sequences to the left and to the right of the matching subsequence. This does not yield minimal edit sequences, but does tend to yield matches that “look right” to people.

Timing: The basic Ratcliff-Obershelp algorithm is cubic time in the worst case and quadratic time in the expected case. SequenceMatcher is quadratic time for the worst case and has expected-case behavior dependent in a complicated way on how many elements the sequences have in common; best case time is linear.

Automatic junk heuristic: SequenceMatcher supports a heuristic that automatically treats certain sequence items as junk. The heuristic counts how many times each individual item appears in the sequence. If an item’s duplicates (after the first one) account for more than 1% of the sequence and the sequence is at least 200 items long, this item is marked as “popular” and is treated as junk for the purpose of sequence matching. This heuristic can be turned off by setting the autojunk argument to False when creating the SequenceMatcher.

New in version 3.2: The autojunk parameter.
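
The effect of the heuristic can be observed through the bpopular attribute. The following is a minimal sketch, assuming an illustrative input of 250 distinct lines plus 10 blank lines (the data is made up for the demonstration):

```python
from difflib import SequenceMatcher

# 250 distinct lines plus 10 blank lines: 260 items, so the heuristic applies.
lines = [f"line {i}\n" for i in range(250)] + ["\n"] * 10

with_heuristic = SequenceMatcher(None, [], lines)
without_heuristic = SequenceMatcher(None, [], lines, autojunk=False)

print(with_heuristic.bpopular)     # the repeated blank line is flagged
print(without_heuristic.bpopular)  # nothing is flagged when disabled
```

Only the second sequence is analysed this way, which is why the first sequence can be left empty here.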

class difflib.Differ

This is a class for comparing sequences of lines of text, and producing human-readable differences or deltas. Differ uses SequenceMatcher both to compare sequences of lines, and to compare sequences of characters within similar (near-matching) lines.

Each line of a Differ delta begins with a two-letter code:

Code    Meaning
'- '    line unique to sequence 1
'+ '    line unique to sequence 2
'  '    line common to both sequences
'? '    line not present in either input sequence

Lines beginning with ‘?’ attempt to guide the eye to intraline differences, and were not present in either input sequence. These lines can be confusing if the sequences contain tab characters.

class difflib.HtmlDiff

This class can be used to create an HTML table (or a complete HTML file containing the table) showing a side by side, line by line comparison of text with inter-line and intra-line change highlights. The table can be generated in either full or contextual difference mode.

The constructor for this class is:

__init__(tabsize=8, wrapcolumn=None, linejunk=None, charjunk=IS_CHARACTER_JUNK)

Initializes an instance of HtmlDiff.

tabsize is an optional keyword argument to specify tab stop spacing and defaults to 8.

wrapcolumn is an optional keyword argument to specify the column number where lines are broken and wrapped; it defaults to None, in which case lines are not wrapped.

linejunk and charjunk are optional keyword arguments passed into ndiff() (used by HtmlDiff to generate the side by side HTML differences). See ndiff() documentation for argument default values and descriptions.

The following methods are public:

make_file(fromlines, tolines, fromdesc='', todesc='', context=False, numlines=5, *, charset='utf-8')

Compares fromlines and tolines (lists of strings) and returns a string which is a complete HTML file containing a table showing line by line differences with inter-line and intra-line changes highlighted.

fromdesc and todesc are optional keyword arguments to specify from/to file column header strings (both default to an empty string).

context and numlines are both optional keyword arguments. Set context to True when contextual differences are to be shown, else the default is False to show the full files. numlines defaults to 5. When context is True, numlines controls the number of context lines which surround the difference highlights. When context is False, numlines controls the number of lines which are shown before a difference highlight when using the “next” hyperlinks (setting it to zero would cause the “next” hyperlinks to place the next difference highlight at the top of the browser without any leading context).

Note

fromdesc and todesc are interpreted as unescaped HTML and should be properly escaped when the input comes from untrusted sources.

Changed in version 3.5: The charset keyword-only argument was added. The default charset of the HTML document changed from 'ISO-8859-1' to 'utf-8'.

make_table(fromlines, tolines, fromdesc='', todesc='', context=False, numlines=5)

Compares fromlines and tolines (lists of strings) and returns a string which is a complete HTML table showing line by line differences with inter-line and intra-line changes highlighted.

The arguments for this method are the same as those for the make_file() method.

Tools/scripts/diff.py is a command-line front-end to this class and contains a good example of its use.
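
As a rough illustration of the two public methods, the sketch below builds a table and a full page from two small, made-up line lists. The label text and the use of html.escape() for the header are assumptions chosen to show one way of following the note about untrusted input:

```python
import html
from difflib import HtmlDiff

old = ["alpha\n", "beta\n", "gamma\n"]
new = ["alpha\n", "BETA\n", "gamma\n"]

# Pretend this column label arrived from an untrusted source.
untrusted_label = "<b>old</b>"

# make_table() returns only the <table> element.
table = HtmlDiff(tabsize=4, wrapcolumn=60).make_table(
    old, new,
    fromdesc=html.escape(untrusted_label),  # escape before passing it in
    todesc="new",
)

# make_file() wraps the same kind of table in a complete HTML document.
page = HtmlDiff().make_file(old, new, fromdesc="old", todesc="new")
```

The table string can be embedded in an existing page, while the page string is ready to be written to a file.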

difflib.context_diff(a, b, fromfile='', tofile='', fromfiledate='', tofiledate='', n=3, lineterm='\n')

Compare a and b (lists of strings); return a delta (a generator generating the delta lines) in context diff format.

Context diffs are a compact way of showing just the lines that have changed plus a few lines of context. The changes are shown in a before/after style. The number of context lines is set by n which defaults to three.

By default, the diff control lines (those with *** or ---) are created with a trailing newline. This is helpful so that inputs created from io.IOBase.readlines() result in diffs that are suitable for use with io.IOBase.writelines() since both the inputs and outputs have trailing newlines.

For inputs that do not have trailing newlines, set the lineterm argument to "" so that the output will be uniformly newline free.

The context diff format normally has a header for filenames and modification times. Any or all of these may be specified using strings for fromfile, tofile, fromfiledate, and tofiledate. The modification times are normally expressed in the ISO 8601 format. If not specified, the strings default to blanks.
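
The header fields can be filled in as sketched below. The file names and timestamps are hypothetical values picked for the illustration; datetime.isoformat() is just one convenient way to produce ISO 8601 strings:

```python
from datetime import datetime, timezone
from difflib import context_diff

before = ["bacon\n", "eggs\n"]
after = ["bacon\n", "spam\n"]

# Hypothetical modification times, rendered as ISO 8601 strings.
before_time = datetime(2024, 1, 1, tzinfo=timezone.utc).isoformat()
after_time = datetime(2024, 1, 2, tzinfo=timezone.utc).isoformat()

delta = list(context_diff(before, after,
                          fromfile="before.txt", tofile="after.txt",
                          fromfiledate=before_time, tofiledate=after_time))
print("".join(delta), end="")
```

Each date is appended to its file name after a tab character on the corresponding header line.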

>>> import sys
>>> from difflib import *
>>> s1 = ['bacon\n', 'eggs\n', 'ham\n', 'guido\n']
>>> s2 = ['python\n', 'eggy\n', 'hamster\n', 'guido\n']
>>> sys.stdout.writelines(context_diff(s1, s2, fromfile='before.py', tofile='after.py'))
*** before.py
--- after.py
***************
*** 1,4 ****
! bacon
! eggs
! ham
  guido
--- 1,4 ----
! python
! eggy
! hamster
  guido

See A command-line interface to difflib for a more detailed example.

difflib.get_close_matches(word, possibilities, n=3, cutoff=0.6)

Return a list of the best “good enough” matches. word is a sequence for which close matches are desired (typically a string), and possibilities is a list of sequences against which to match word (typically a list of strings).

Optional argument n (default 3) is the maximum number of close matches to return; n must be greater than 0.

Optional argument cutoff (default 0.6) is a float in the range [0, 1]. Possibilities that don’t score at least that similar to word are ignored.

The best (no more than n) matches among the possibilities are returned in a list, sorted by similarity score, most similar first.

>>> get_close_matches('appel', ['ape', 'apple', 'peach', 'puppy'])
['apple', 'ape']
>>> import keyword
>>> get_close_matches('wheel', keyword.kwlist)
['while']
>>> get_close_matches('pineapple', keyword.kwlist)
[]
>>> get_close_matches('accept', keyword.kwlist)
['except']
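
The n and cutoff arguments described above can be explored with a short sketch like the following, which reuses the word list from the first example; the variable names are illustrative:

```python
from difflib import get_close_matches

candidates = ['ape', 'apple', 'peach', 'puppy']

# n caps the number of results; the best match comes first.
best_only = get_close_matches('appel', candidates, n=1)

# A stricter cutoff rejects candidates that the default (0.6) would accept.
strict = get_close_matches('appel', candidates, cutoff=0.9)

print(best_only)  # ['apple']
print(strict)     # []
```

Here 'apple' scores 0.8 against 'appel', so it passes the default cutoff but not a cutoff of 0.9.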

difflib.ndiff(a, b, linejunk=None, charjunk=IS_CHARACTER_JUNK)

Compare a and b (lists of strings); return a Differ-style delta (a generator generating the delta lines).

Optional keyword parameters linejunk and charjunk are filtering functions (or None):

linejunk: A function that accepts a single string argument, and returns true if the string is junk, or false if not. The default is None. There is also a module-level function IS_LINE_JUNK(), which filters out lines without visible characters, except for at most one pound character ('#'). However, the underlying SequenceMatcher class does a dynamic analysis of which lines are so frequent as to constitute noise, and this usually works better than using this function.

charjunk: A function that accepts a character (a string of length 1), and returns true if the character is junk, or false if not. The default is the module-level function IS_CHARACTER_JUNK(), which filters out whitespace characters (a blank or tab; it’s a bad idea to include newline in this!).

Tools/scripts/ndiff.py is a command-line front-end to this function.

>>> diff = ndiff('one\ntwo\nthree\n'.splitlines(keepends=True),
...              'ore\ntree\nemu\n'.splitlines(keepends=True))
>>> print(''.join(diff), end="")
- one
?  ^
+ ore
?  ^
- two
- three
?  -
+ tree
+ emu

difflib.restore(sequence, which)

Return one of the two sequences that generated a delta.

Given a sequence produced by Differ.compare() or ndiff(), extract lines originating from file 1 or 2 (parameter which), stripping off line prefixes.

Example:

>>> diff = ndiff('one\ntwo\nthree\n'.splitlines(keepends=True),
... 'ore\ntree\nemu\n'.splitlines(keepends=True))
>>> diff = list(diff) # materialize the generated delta into a list
>>> print(''.join(restore(diff, 1)), end="")
one
two
three
>>> print(''.join(restore(diff, 2)), end="")
ore
tree
emu

difflib.unified_diff(a, b, fromfile='', tofile='', fromfiledate='', tofiledate='', n=3, lineterm='\n')

Compare a and b (lists of strings); return a delta (a generator generating the delta lines) in unified diff format.

Unified diffs are a compact way of showing just the lines that have changed plus a few lines of context. The changes are shown in an inline style (instead of separate before/after blocks). The number of context lines is set by n which defaults to three.

By default, the diff control lines (those with ---, +++, or @@) are created with a trailing newline. This is helpful so that inputs created from io.IOBase.readlines() result in diffs that are suitable for use with io.IOBase.writelines() since both the inputs and outputs have trailing newlines.

For inputs that do not have trailing newlines, set the lineterm argument to "" so that the output will be uniformly newline free.
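
A small sketch of the newline-free case, assuming the input lines come from str.splitlines() (which drops line endings); the file names are placeholders:

```python
from difflib import unified_diff

# Lines without trailing newlines, e.g. from str.splitlines().
before = "bacon\neggs".splitlines()
after = "bacon\nspam".splitlines()

delta = list(unified_diff(before, after, fromfile="before.txt",
                          tofile="after.txt", lineterm=""))

# No delta line carries a newline, so join them explicitly for display.
print("\n".join(delta))
```

With lineterm="" the control lines match the content lines, so the whole delta can be joined with a single separator.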

The unified diff format normally has a header for filenames and modification times. Any or all of these may be specified using strings for fromfile, tofile, fromfiledate, and tofiledate. The modification times are normally expressed in the ISO 8601 format. If not specified, the strings default to blanks.

>>> s1 = ['bacon\n', 'eggs\n', 'ham\n', 'guido\n']
>>> s2 = ['python\n', 'eggy\n', 'hamster\n', 'guido\n']
>>> sys.stdout.writelines(unified_diff(s1, s2, fromfile='before.py', tofile='after.py'))
--- before.py
+++ after.py
@@ -1,4 +1,4 @@
-bacon
-eggs
-ham
+python
+eggy
+hamster
 guido

See A command-line interface to difflib for a more detailed example.

difflib.diff_bytes(dfunc, a, b, fromfile=b'', tofile=b'', fromfiledate=b'', tofiledate=b'', n=3, lineterm=b'\n')

Compare a and b (lists of bytes objects) using dfunc; yield a sequence of delta lines (also bytes) in the format returned by dfunc. dfunc must be a callable, typically either unified_diff() or context_diff().

Allows you to compare data with unknown or inconsistent encoding. All inputs except n must be bytes objects, not str. Works by losslessly converting all inputs (except n) to str, and calling dfunc(a, b, fromfile, tofile, fromfiledate, tofiledate, n, lineterm). The output of dfunc is then converted back to bytes, so the delta lines that you receive have the same unknown/inconsistent encodings as a and b.

New in version 3.5.
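
The sketch below compares the same word stored in two different encodings. The byte strings and file names are invented for the illustration:

```python
from difflib import diff_bytes, unified_diff

# The same word, encoded as Latin-1 in one file and UTF-8 in the other.
old = [b"caf\xe9\n"]
new = [b"caf\xc3\xa9\n"]

delta = list(diff_bytes(unified_diff, old, new, b"old.txt", b"new.txt"))
for line in delta:
    print(line)
```

The content lines come back byte-for-byte as they went in, so no decoding decision has to be made before diffing.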

difflib.IS_LINE_JUNK(line)

Return True for ignorable lines. The line line is ignorable if line is blank or contains a single '#', otherwise it is not ignorable. Used as a default for parameter linejunk in ndiff() in older versions.

difflib.IS_CHARACTER_JUNK(ch)

Return True for ignorable characters. The character ch is ignorable if ch is a space or tab, otherwise it is not ignorable. Used as a default for parameter charjunk in ndiff().
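
Both predicates can be called directly, which is a quick way to check what they consider ignorable. The sample inputs below are illustrative:

```python
from difflib import IS_LINE_JUNK, IS_CHARACTER_JUNK

line_results = {
    "\n": IS_LINE_JUNK("\n"),              # blank line
    "  #   \n": IS_LINE_JUNK("  #   \n"),  # a lone '#' with whitespace
    "hello\n": IS_LINE_JUNK("hello\n"),    # visible text
}
char_results = {ch: IS_CHARACTER_JUNK(ch) for ch in (" ", "\t", "\n", "x")}

print(line_results)
print(char_results)
```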

See also

Pattern Matching: The Gestalt Approach

Discussion of a similar algorithm by John W. Ratcliff and D. E. Metzener. This was published in Dr. Dobb’s Journal in July, 1988.

SequenceMatcher Objects

The SequenceMatcher class has this constructor:

class difflib.SequenceMatcher(isjunk=None, a='', b='', autojunk=True)

Optional argument isjunk must be None (the default) or a one-argument function that takes a sequence element and returns true if and only if the element is “junk” and should be ignored. Passing None for isjunk is equivalent to passing lambda x: False; in other words, no elements are ignored. For example, pass:

lambda x: x in " \t"

if you’re comparing lines as sequences of characters, and don’t want to synch up on blanks or hard tabs.

The optional arguments a and b are sequences to be compared; both default to empty strings. The elements of both sequences must be hashable.

The optional argument autojunk can be used to disable the automatic junk heuristic.

New in version 3.2: The autojunk parameter.

SequenceMatcher objects get three data attributes: bjunk is the set of elements of b for which isjunk is True; bpopular is the set of non-junk elements considered popular by the heuristic (if it is not disabled); b2j is a dict mapping the remaining elements of b to a list of positions where they occur. All three are reset whenever b is reset with set_seqs() or set_seq2().

New in version 3.2: The bjunk and bpopular attributes.
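
The three attributes can be inspected directly. This is a minimal sketch with made-up strings, treating blanks as junk:

```python
from difflib import SequenceMatcher

s = SequenceMatcher(lambda x: x == " ", "", "a b a")

first_bjunk = set(s.bjunk)        # {' '}: elements rejected by isjunk
first_bpopular = set(s.bpopular)  # set(): b is too short for the heuristic
first_b2j = dict(s.b2j)           # {'a': [0, 4], 'b': [2]}

# Replacing the second sequence recomputes all three attributes.
s.set_seq2("xyx")
second_b2j = dict(s.b2j)          # {'x': [0, 2], 'y': [1]}
```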

SequenceMatcher objects have the following methods:

set_seqs(a, b)

Set the two sequences to be compared.

SequenceMatcher computes and caches detailed information about the second sequence, so if you want to compare one sequence against many sequences, use set_seq2() to set the commonly used sequence once and call set_seq1() repeatedly, once for each of the other sequences.

set_seq1(a)

Set the first sequence to be compared. The second sequence to be compared is not changed.

set_seq2(b)

Set the second sequence to be compared. The first sequence to be compared is not changed.
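
The one-against-many pattern described under set_seqs() might look like the following sketch. The target word and candidate list are illustrative:

```python
from difflib import SequenceMatcher

target = "apple"
candidates = ["ape", "apple", "peach"]

matcher = SequenceMatcher()
matcher.set_seq2(target)  # analysed once; the result is cached

scores = {}
for candidate in candidates:
    matcher.set_seq1(candidate)  # cheap: cached data for target is reused
    scores[candidate] = matcher.ratio()

best = max(scores, key=scores.get)
print(best, scores[best])
```

Putting the fixed sequence in the second position is the point of the pattern; swapping the roles would redo the expensive analysis on every iteration.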

find_longest_match(alo=0, ahi=None, blo=0, bhi=None)

Find longest matching block in a[alo:ahi] and b[blo:bhi].

If isjunk was omitted or None, find_longest_match() returns (i, j, k) such that a[i:i+k] is equal to b[j:j+k], where alo <= i <= i+k <= ahi and blo <= j <= j+k <= bhi. For all (i', j', k') meeting those conditions, the additional conditions k >= k', i <= i', and if i == i', j <= j' are also met. In other words, of all maximal matching blocks, return one that starts earliest in a, and of all those maximal matching blocks that start earliest in a, return the one that starts earliest in b.

>>> s = SequenceMatcher(None, " abcd", "abcd abcd")
>>> s.find_longest_match(0, 5, 0, 9)
Match(a=0, b=4, size=5)

If isjunk was provided, first the longest matching block is determined as above, but with the additional restriction that no junk element appears in the block. Then that block is extended as far as possible by matching (only) junk elements on both sides. So the resulting block never matches on junk except as identical junk happens to be adjacent to an interesting match.

Here’s the same example as before, but considering blanks to be junk. That prevents ' abcd' from matching the ' abcd' at the tail end of the second sequence directly. Instead only the 'abcd' can match, and matches the leftmost 'abcd' in the second sequence:

>>> s = SequenceMatcher(lambda x: x==" ", " abcd", "abcd abcd")
>>> s.find_longest_match(0, 5, 0, 9)
Match(a=1, b=0, size=4)

If no blocks match, this returns (alo, blo, 0).

This method returns a named tuple Match(a, b, size).

Changed in version 3.9: Added default arguments.

get_matching_blocks()

Return list of triples describing non-overlapping matching subsequences. Each triple is of the form (i, j, n), and means that a[i:i+n] == b[j:j+n]. The triples are monotonically increasing in i and j.

The last triple is a dummy, and has the value (len(a), len(b), 0). It is the only triple with n == 0. If (i, j, n) and (i', j', n') are adjacent triples in the list, and the second is not the last triple in the list, then i+n < i' or j+n < j'; in other words, adjacent triples always describe non-adjacent equal blocks.

>>> s = SequenceMatcher(None, "abxcd", "abcd")
>>> s.get_matching_blocks()
[Match(a=0, b=0, size=2), Match(a=3, b=2, size=2), Match(a=5, b=4, size=0)]

get_opcodes()

Return list of 5-tuples describing how to turn a into b. Each tuple is of the form (tag, i1, i2, j1, j2). The first tuple has i1 == j1 == 0, and remaining tuples have i1 equal to the i2 from the preceding tuple, and, likewise, j1 equal to the previous j2.

The tag values are strings, with these meanings:

Value        Meaning
'replace'    a[i1:i2] should be replaced by b[j1:j2].
'delete'     a[i1:i2] should be deleted. Note that j1 == j2 in this case.
'insert'     b[j1:j2] should be inserted at a[i1:i1]. Note that i1 == i2 in this case.
'equal'      a[i1:i2] == b[j1:j2] (the sub-sequences are equal).

For example:

>>> a = "qabxcd"
>>> b = "abycdf"
>>> s = SequenceMatcher(None, a, b)
>>> for tag, i1, i2, j1, j2 in s.get_opcodes():
...     print('{:7}   a[{}:{}] --> b[{}:{}] {!r:>8} --> {!r}'.format(
...         tag, i1, i2, j1, j2, a[i1:i2], b[j1:j2]))
delete    a[0:1] --> b[0:0]      'q' --> ''
equal     a[1:3] --> b[0:2]     'ab' --> 'ab'
replace   a[3:4] --> b[2:3]      'x' --> 'y'
equal     a[4:6] --> b[3:5]     'cd' --> 'cd'
insert    a[6:6] --> b[5:6]       '' --> 'f'
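
Because the opcodes describe how to turn a into b, replaying them should reproduce b. The helper below is a hypothetical illustration of that idea, not part of difflib:

```python
from difflib import SequenceMatcher

def apply_opcodes(a, b):
    """Rebuild b from a by replaying the opcodes (hypothetical helper)."""
    pieces = []
    for tag, i1, i2, j1, j2 in SequenceMatcher(None, a, b).get_opcodes():
        if tag == 'equal':
            pieces.append(a[i1:i2])   # keep the shared run from a
        elif tag in ('replace', 'insert'):
            pieces.append(b[j1:j2])   # take the new material from b
        # 'delete': a[i1:i2] is dropped, so nothing is appended
    return ''.join(pieces)

rebuilt = apply_opcodes("qabxcd", "abycdf")
print(rebuilt)  # abycdf
```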

get_grouped_opcodes(n=3)

Return a generator of groups with up to n lines of context.

Starting with the groups returned by get_opcodes(), this method splits out smaller change clusters and eliminates intervening ranges which have no changes.

The groups are returned in the same format as get_opcodes().
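
To see the splitting in action, the sketch below uses two made-up 20-item lists that differ in two places far apart, with one line of context:

```python
from difflib import SequenceMatcher

a = [str(i) for i in range(20)]
b = list(a)
b[2] = "X"    # first change, near the start
b[15] = "Y"   # second change, far from the first

groups = list(SequenceMatcher(None, a, b).get_grouped_opcodes(n=1))
for group in groups:
    print(group)
```

The long unchanged run between the two edits is trimmed to one line of context on each side, so the changes land in two separate groups.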

ratio()

Return a measure of the sequences’ similarity as a float in the range [0, 1].

Where T is the total number of elements in both sequences, and M is the number of matches, this is 2.0*M / T. Note that this is 1.0 if the sequences are identical, and 0.0 if they have nothing in common.
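
The formula can be checked by hand from the matching blocks. In this sketch the strings are illustrative and the variable names mirror M and T:

```python
from difflib import SequenceMatcher

a, b = "abcd", "bcde"
s = SequenceMatcher(None, a, b)

matches = sum(block.size for block in s.get_matching_blocks())  # M
total = len(a) + len(b)                                         # T
by_hand = 2.0 * matches / total

print(matches, total, by_hand)
```

Here the only match is 'bcd', so M is 3, T is 8, and the result is 0.75.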

This is expensive to compute if get_matching_blocks() or get_opcodes() hasn’t already been called, in which case you may want to try quick_ratio() or real_quick_ratio() first to get an upper bound.

Note

Caution: The result of a ratio() call may depend on the order of the arguments. For instance:

>>> SequenceMatcher(None, 'tide', 'diet').ratio()
0.25
>>> SequenceMatcher(None, 'diet', 'tide').ratio()
0.5

quick_ratio()

Return an upper bound on ratio() relatively quickly.

real_quick_ratio()

Return an upper bound on ratio() very quickly.

The three methods that return the ratio of matching to total characters can give different results due to differing levels of approximation, although quick_ratio() and real_quick_ratio() are always at least as large as ratio():

>>> s = SequenceMatcher(None, "abcd", "bcde")
>>> s.ratio()
0.75
>>> s.quick_ratio()
0.75
>>> s.real_quick_ratio()
1.0
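
Since both quick methods are upper bounds, they can serve as cheap filters before the full computation. The helper below is a hypothetical sketch of that cascade; the name is_close and the 0.6 threshold are assumptions for illustration:

```python
from difflib import SequenceMatcher

def is_close(a, b, cutoff=0.6):
    """Hypothetical helper: True if ratio() would be at least cutoff."""
    s = SequenceMatcher(None, a, b)
    # Each check is cheaper than the next. "and" short-circuits, so
    # ratio() runs only when both upper bounds leave room for a match.
    return (s.real_quick_ratio() >= cutoff
            and s.quick_ratio() >= cutoff
            and s.ratio() >= cutoff)

print(is_close("abcd", "bcde"))
print(is_close("abcd", "wxyz"))
```

For "abcd" against "wxyz" the length-based bound alone is 1.0, but quick_ratio() is 0.0 because no characters are shared, so ratio() is never computed.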

SequenceMatcher Examples

This example compares two strings, considering blanks to be “junk”:

>>> s = SequenceMatcher(lambda x: x == " ",
...                     "private Thread currentThread;",
...                     "private volatile Thread currentThread;")

ratio() returns a float in [0, 1], measuring the similarity of the sequences. As a rule of thumb, a ratio() value over 0.6 means the sequences are close matches:

>>> print(round(s.ratio(), 3))
0.866

If you’re only interested in where the sequences match, get_matching_blocks() is handy:

>>> for block in s.get_matching_blocks():
...     print("a[%d] and b[%d] match for %d elements" % block)
a[0] and b[0] match for 8 elements
a[8] and b[17] match for 21 elements
a[29] and b[38] match for 0 elements

Note that the last tuple returned by get_matching_blocks() is always a dummy, (len(a), len(b), 0), and this is the only case in which the last tuple element (number of elements matched) is 0.

If you want to know how to change the first sequence into the second, use get_opcodes():

>>> for opcode in s.get_opcodes():
...     print("%6s a[%d:%d] b[%d:%d]" % opcode)
 equal a[0:8] b[0:8]
insert a[8:8] b[8:17]
 equal a[8:29] b[17:38]

Differ Objects

Note that Differ-generated deltas make no claim to be minimal diffs. To the contrary, minimal diffs are often counter-intuitive, because they synch up anywhere possible, sometimes on accidental matches 100 pages apart. Restricting synch points to contiguous matches preserves some notion of locality, at the occasional cost of producing a longer diff.

The Differ class has this constructor:

class difflib.Differ(linejunk=None, charjunk=None)

Optional keyword parameters linejunk and charjunk are for filter functions (or None):

linejunk: A function that accepts a single string argument, and returns true if the string is junk. The default is None, meaning that no line is considered junk.

charjunk: A function that accepts a single character argument (a string of length 1), and returns true if the character is junk. The default is None, meaning that no character is considered junk.

These junk-filtering functions speed up matching to find differences and do not cause any differing lines or characters to be ignored. Read the description of the find_longest_match() method’s isjunk parameter for an explanation.

Differ objects are used (deltas generated) via a single method:

compare(a, b)

Compare two sequences of lines, and generate the delta (a sequence of lines).

Each sequence must contain individual single-line strings ending with newlines. Such sequences can be obtained from the readlines() method of file-like objects. The delta generated also consists of newline-terminated strings, ready to be printed as-is via the writelines() method of a file-like object.
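
The readlines()/writelines() round trip can be sketched as follows, with io.StringIO objects standing in for real files (the contents are made up):

```python
import io
from difflib import Differ

# io.StringIO stands in for real files here.
old_file = io.StringIO("one\ntwo\n")
new_file = io.StringIO("one\n2\n")

delta = Differ().compare(old_file.readlines(), new_file.readlines())

out = io.StringIO()
out.writelines(delta)  # delta lines already end with newlines
report = out.getvalue()
print(report, end="")
```

Because every delta line is newline-terminated, writelines() needs no extra separators.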

Differ Example

This example compares two texts. First we set up the texts, sequences of individual single-line strings ending with newlines (such sequences can also be obtained from the readlines() method of file-like objects):

>>> text1 = '''  1. Beautiful is better than ugly.
...   2. Explicit is better than implicit.
...   3. Simple is better than complex.
...   4. Complex is better than complicated.
... '''.splitlines(keepends=True)
>>> len(text1)
4
>>> text1[0][-1]
'\n'
>>> text2 = '''  1. Beautiful is better than ugly.
...   3.   Simple is better than complex.
...   4. Complicated is better than complex.
...   5. Flat is better than nested.
... '''.splitlines(keepends=True)

Next we instantiate a Differ object:

>>> d = Differ()

Note that when instantiating a Differ object we may pass functions to filter out line and character “junk.” See the Differ() constructor for details.

Finally, we compare the two:

>>> result = list(d.compare(text1, text2))

result is a list of strings, so let’s pretty-print it:

>>> from pprint import pprint
>>> pprint(result)
['    1. Beautiful is better than ugly.\n',
 '-   2. Explicit is better than implicit.\n',
 '-   3. Simple is better than complex.\n',
 '+   3.   Simple is better than complex.\n',
 '?     ++\n',
 '-   4. Complex is better than complicated.\n',
 '?            ^                     ---- ^\n',
 '+   4. Complicated is better than complex.\n',
 '?           ++++ ^                      ^\n',
 '+   5. Flat is better than nested.\n']

As a single multi-line string it looks like this:

>>> import sys
>>> sys.stdout.writelines(result)
    1. Beautiful is better than ugly.
-   2. Explicit is better than implicit.
-   3. Simple is better than complex.
+   3.   Simple is better than complex.
?     ++
-   4. Complex is better than complicated.
?            ^                     ---- ^
+   4. Complicated is better than complex.
?           ++++ ^                      ^
+   5. Flat is better than nested.

A command-line interface to difflib

This example shows how to use difflib to create a diff-like utility. It is also contained in the Python source distribution, as Tools/scripts/diff.py.

#!/usr/bin/env python3
""" Command line interface to difflib.py providing diffs in four formats:

* ndiff:    lists every line and highlights interline changes.
* context:  highlights clusters of changes in a before/after format.
* unified:  highlights clusters of changes in an inline format.
* html:     generates side by side comparison with change highlights.

"""

import sys, os, difflib, argparse
from datetime import datetime, timezone

def file_mtime(path):
    # Modification time as an ISO 8601 string in the local timezone.
    t = datetime.fromtimestamp(os.stat(path).st_mtime,
                               timezone.utc)
    return t.astimezone().isoformat()

def main():

    parser = argparse.ArgumentParser()
    parser.add_argument('-c', action='store_true', default=False,
                        help='Produce a context format diff (default)')
    parser.add_argument('-u', action='store_true', default=False,
                        help='Produce a unified format diff')
    parser.add_argument('-m', action='store_true', default=False,
                        help='Produce HTML side by side diff '
                             '(can use -c and -l in conjunction)')
    parser.add_argument('-n', action='store_true', default=False,
                        help='Produce a ndiff format diff')
    parser.add_argument('-l', '--lines', type=int, default=3,
                        help='Set number of context lines (default 3)')
    parser.add_argument('fromfile')
    parser.add_argument('tofile')
    options = parser.parse_args()

    n = options.lines
    fromfile = options.fromfile
    tofile = options.tofile

    fromdate = file_mtime(fromfile)
    todate = file_mtime(tofile)
    with open(fromfile) as ff:
        fromlines = ff.readlines()
    with open(tofile) as tf:
        tolines = tf.readlines()

    # Pick the output format; context diff is the fallback.
    if options.u:
        diff = difflib.unified_diff(fromlines, tolines, fromfile, tofile, fromdate, todate, n=n)
    elif options.n:
        diff = difflib.ndiff(fromlines, tolines)
    elif options.m:
        diff = difflib.HtmlDiff().make_file(fromlines,tolines,fromfile,tofile,context=options.c,numlines=n)
    else:
        diff = difflib.context_diff(fromlines, tolines, fromfile, tofile, fromdate, todate, n=n)

    sys.stdout.writelines(diff)

if __name__ == '__main__':
    main()