unicodedata: Unicode Database


This module provides access to the Unicode Character Database (UCD), which defines character properties for all Unicode characters. The data contained in this database is compiled from UCD version 13.0.0.

The module uses the same names and symbols as defined by Unicode Standard Annex #44, “Unicode Character Database”. It defines the following functions:

unicodedata.lookup(name)

Look up a character by name. If a character with the given name is found, return the corresponding character. If not found, KeyError is raised.

Changed in version 3.3: Support for name aliases [1] and named sequences [2] has been added.
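The lookup behaviour described above can be sketched as follows. The helper `safe_lookup` is a hypothetical wrapper introduced only for illustration; it is not part of the module. The alias and named-sequence names used here are assumed to be present in the UCD data files cited in the footnotes.

```python
import unicodedata

def safe_lookup(name, fallback=None):
    """Hypothetical helper: return the result of lookup(), or fallback on KeyError."""
    try:
        return unicodedata.lookup(name)
    except KeyError:
        return fallback

plain = safe_lookup('LATIN SMALL LETTER A')                # ordinary character name
alias = safe_lookup('LATIN CAPITAL LETTER GHA')            # name alias for U+01A2
sequence = safe_lookup('LATIN SMALL LETTER R WITH TILDE')  # named sequence (two code points)
missing = safe_lookup('NO SUCH CHARACTER NAME')            # unknown name, fallback returned
```

Note that a named sequence yields a string of more than one code point, so callers should not assume the result has length 1.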

unicodedata.name(chr[, default])

Returns the name assigned to the character chr as a string. If no name is defined, default is returned, or, if not given, ValueError is raised.
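A brief sketch of the two failure modes: control characters such as the newline have no name, so they make a convenient test case. The placeholder string `'<unnamed>'` is an arbitrary choice for this example.

```python
import unicodedata

letter_name = unicodedata.name('a')                  # a named character
control_name = unicodedata.name('\n', '<unnamed>')   # no name defined: default is returned

# Without a default, a missing name raises ValueError.
try:
    unicodedata.name('\n')
    raised = False
except ValueError:
    raised = True
```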

unicodedata.decimal(chr[, default])

Returns the decimal value assigned to the character chr as an integer. If no such value is defined, default is returned, or, if not given, ValueError is raised.

unicodedata.digit(chr[, default])

Returns the digit value assigned to the character chr as an integer. If no such value is defined, default is returned, or, if not given, ValueError is raised.

unicodedata.numeric(chr[, default])

Returns the numeric value assigned to the character chr as a float. If no such value is defined, default is returned, or, if not given, ValueError is raised.
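The three numeric lookups differ in which characters they accept. The sketch below compares them side by side; `numeric_properties` is a hypothetical helper written for this example, and `None` is passed as the default so that a missing value does not raise.

```python
import unicodedata

def numeric_properties(ch):
    """Hypothetical helper: (decimal, digit, numeric) for ch, None where undefined."""
    return (
        unicodedata.decimal(ch, None),
        unicodedata.digit(ch, None),
        unicodedata.numeric(ch, None),
    )

ascii_nine = numeric_properties('9')         # ordinary decimal digit
superscript_two = numeric_properties('\u00b2')  # SUPERSCRIPT TWO: a digit, but not a decimal
one_half = numeric_properties('\u00bd')      # VULGAR FRACTION ONE HALF: numeric only
letter = numeric_properties('a')             # no numeric properties at all
```

In this sample, every character with a decimal value also has a digit value, and every character with a digit value also has a numeric value, but not the other way round.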

unicodedata.category(chr)

Returns the general category assigned to the character chr as a string.
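As an illustration, general categories can be tallied over a string. This is only a sketch; the sample text is arbitrary.

```python
import unicodedata
from collections import Counter

text = 'Hi 42!'

# Map each character to its two-letter general category and count occurrences.
category_counts = Counter(unicodedata.category(ch) for ch in text)

# 'Lu' = uppercase letter, 'Ll' = lowercase letter, 'Zs' = space separator,
# 'Nd' = decimal number, 'Po' = other punctuation.
```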

unicodedata.bidirectional(chr)

Returns the bidirectional class assigned to the character chr as a string. If no such value is defined, an empty string is returned.
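A few representative bidirectional classes, shown as a small sketch. The characters are chosen for illustration only.

```python
import unicodedata

bidi_classes = {
    'latin_a': unicodedata.bidirectional('A'),            # left-to-right letter
    'hebrew_alef': unicodedata.bidirectional('\u05d0'),   # right-to-left letter
    'ascii_one': unicodedata.bidirectional('1'),          # European number
    'arabic_zero': unicodedata.bidirectional('\u0660'),   # Arabic number
    'space': unicodedata.bidirectional(' '),              # whitespace
}
```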

unicodedata.combining(chr)

Returns the canonical combining class assigned to the character chr as an integer. Returns 0 if no combining class is defined.
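One possible use of the combining class is to tell base characters from combining marks. The sketch below inspects a decomposed "C with cedilla"; the variable names are illustrative.

```python
import unicodedata

decomposed = 'C\u0327'  # LATIN CAPITAL LETTER C followed by COMBINING CEDILLA

# Base characters report 0; combining marks report a non-zero class.
classes = [unicodedata.combining(ch) for ch in decomposed]

# Keep only the characters that are not combining marks.
base_only = ''.join(ch for ch in decomposed if unicodedata.combining(ch) == 0)
```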

unicodedata.east_asian_width(chr)

Returns the East Asian width assigned to the character chr as a string.
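A common use of this property is estimating how many terminal columns a string occupies. The function below is a rough sketch under a simplifying assumption: only wide ('W') and fullwidth ('F') characters are counted as two columns, and everything else, including combining marks and ambiguous-width characters, as one. `display_width` is a hypothetical helper, not part of the module.

```python
import unicodedata

def display_width(text):
    """Hypothetical helper: approximate column width of text in a monospace terminal."""
    width = 0
    for ch in text:
        # 'W' (wide) and 'F' (fullwidth) are assumed to take two columns.
        width += 2 if unicodedata.east_asian_width(ch) in ('W', 'F') else 1
    return width

narrow = unicodedata.east_asian_width('A')       # ASCII letter
wide = unicodedata.east_asian_width('\u4e2d')    # CJK ideograph
full = unicodedata.east_asian_width('\uff21')    # FULLWIDTH LATIN CAPITAL LETTER A
half = unicodedata.east_asian_width('\uff71')    # HALFWIDTH KATAKANA LETTER A
```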

unicodedata.mirrored(chr)

Returns the mirrored property assigned to the character chr as an integer. Returns 1 if the character has been identified as a “mirrored” character in bidirectional text, 0 otherwise.
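For instance, brackets and comparison signs are mirrored, while letters are not. A short sketch with an arbitrary sample string:

```python
import unicodedata

expression = 'f(x) < [y]'

# Collect the characters whose glyphs should be mirrored in right-to-left text.
mirrored_chars = [ch for ch in expression if unicodedata.mirrored(ch)]
```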

unicodedata.decomposition(chr)

Returns the character decomposition mapping assigned to the character chr as a string. An empty string is returned if no such mapping is defined.
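The returned mapping is a string of space-separated hexadecimal code points, optionally prefixed by a tag such as `<compat>` for compatibility decompositions. A small sketch:

```python
import unicodedata

canonical = unicodedata.decomposition('\u00c7')  # LATIN CAPITAL LETTER C WITH CEDILLA
compat = unicodedata.decomposition('\u2160')     # ROMAN NUMERAL ONE
none = unicodedata.decomposition('a')            # no mapping defined

# Turn the canonical mapping back into characters.
rebuilt = ''.join(chr(int(code, 16)) for code in canonical.split())
```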

unicodedata.normalize(form, unistr)

Return the normal form form for the Unicode string unistr. Valid values for form are ‘NFC’, ‘NFKC’, ‘NFD’, and ‘NFKD’.

The Unicode standard defines various normalization forms of a Unicode string, based on the definition of canonical equivalence and compatibility equivalence. In Unicode, several characters can be expressed in various ways. For example, the character U+00C7 (LATIN CAPITAL LETTER C WITH CEDILLA) can also be expressed as the sequence U+0043 (LATIN CAPITAL LETTER C) U+0327 (COMBINING CEDILLA).

For each character, there are two normal forms: normal form C and normal form D. Normal form D (NFD) is also known as canonical decomposition, and translates each character into its decomposed form. Normal form C (NFC) first applies a canonical decomposition, then composes pre-combined characters again.

In addition to these two forms, there are two additional normal forms based on compatibility equivalence. In Unicode, certain characters are supported which normally would be unified with other characters. For example, U+2160 (ROMAN NUMERAL ONE) is really the same thing as U+0049 (LATIN CAPITAL LETTER I). However, it is supported in Unicode for compatibility with existing character sets (e.g. gb2312).

The normal form KD (NFKD) will apply the compatibility decomposition, i.e. replace all compatibility characters with their equivalents. The normal form KC (NFKC) first applies the compatibility decomposition, followed by the canonical composition.

Even if two Unicode strings are normalized and look the same to a human reader, if one has combining characters and the other doesn’t, they may not compare equal.
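The four forms can be compared using the two characters mentioned above. This is a minimal sketch; the variable names are illustrative.

```python
import unicodedata

composed = '\u00c7'      # LATIN CAPITAL LETTER C WITH CEDILLA
decomposed = 'C\u0327'   # LATIN CAPITAL LETTER C + COMBINING CEDILLA
roman_one = '\u2160'     # ROMAN NUMERAL ONE

# The two spellings render alike but are different strings.
equal_raw = composed == decomposed

# NFD decomposes, NFC recomposes.
nfd = unicodedata.normalize('NFD', composed)
nfc = unicodedata.normalize('NFC', decomposed)

# Canonical forms leave compatibility characters alone; NFKD/NFKC replace them.
nfc_roman = unicodedata.normalize('NFC', roman_one)
nfkd_roman = unicodedata.normalize('NFKD', roman_one)
nfkc_roman = unicodedata.normalize('NFKC', roman_one)

# Normalizing both sides to the same form makes them compare equal.
equal_normalized = (unicodedata.normalize('NFC', composed)
                    == unicodedata.normalize('NFC', decomposed))
```

When comparing strings from different sources, normalizing both operands to the same form first is the usual approach.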

unicodedata.is_normalized(form, unistr)

Return whether the Unicode string unistr is in the normal form form. Valid values for form are ‘NFC’, ‘NFKC’, ‘NFD’, and ‘NFKD’.

New in version 3.8.
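A short sketch of checking a string before normalizing it, assuming Python 3.8 or later. `ensure_nfc` is a hypothetical helper written for this example.

```python
import unicodedata

def ensure_nfc(text):
    """Hypothetical helper: return text in NFC, normalizing only when needed."""
    if unicodedata.is_normalized('NFC', text):
        return text
    return unicodedata.normalize('NFC', text)

composed_is_nfc = unicodedata.is_normalized('NFC', '\u00c7')  # already composed
composed_is_nfd = unicodedata.is_normalized('NFD', '\u00c7')  # not decomposed
fixed = ensure_nfc('C\u0327')                                 # gets composed
```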

In addition, the module exposes the following constant:

unicodedata.unidata_version

The version of the Unicode database used in this module.

unicodedata.ucd_3_2_0

This is an object that has the same methods as the entire module, but uses the Unicode database version 3.2 instead, for applications that require this specific version of the Unicode database (such as IDNA).
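The difference between the two databases can be seen with a character that was assigned after Unicode 3.2. The sketch below uses U+1F600 (GRINNING FACE) as the example and assumes the running interpreter ships a database recent enough to include it.

```python
import unicodedata

old = unicodedata.ucd_3_2_0
emoji = '\U0001F600'  # GRINNING FACE, assigned after Unicode 3.2

current_version = unicodedata.unidata_version   # e.g. '13.0.0', depends on the Python build
old_version = old.unidata_version               # always the 3.2 database

current_category = unicodedata.category(emoji)  # a symbol in the current database
old_category = old.category(emoji)              # unassigned ('Cn') in the 3.2 database
old_name = old.name(emoji, None)                # no name in 3.2, so the default is returned
```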

Examples:

>>> import unicodedata
>>> unicodedata.lookup('LEFT CURLY BRACKET')
'{'
>>> unicodedata.name('/')
'SOLIDUS'
>>> unicodedata.decimal('9')
9
>>> unicodedata.decimal('a')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: not a decimal
>>> unicodedata.category('A') # 'L'etter, 'u'ppercase
'Lu'
>>> unicodedata.bidirectional('\u0660') # 'A'rabic, 'N'umber
'AN'

Footnotes

[1] https://www.unicode.org/Public/13.0.0/ucd/NameAliases.txt

[2] https://www.unicode.org/Public/13.0.0/ucd/NamedSequences.txt