codecs - Codec registry and base classes

Source code: Lib/codecs.py


This module defines base classes for standard Python codecs (encoders and decoders) and provides access to the internal Python codec registry, which manages the codec and error handling lookup process. Most standard codecs are text encodings, which encode text to bytes, but there are also codecs provided that encode text to text, and bytes to bytes. Custom codecs may encode and decode between arbitrary types, but some module features are restricted to use specifically with text encodings, or with codecs that encode to bytes.

The module defines the following functions for encoding and decoding with any codec:

codecs.encode(obj, encoding='utf-8', errors='strict')

Encodes obj using the codec registered for encoding.

Errors may be given to set the desired error handling scheme. The default error handler is 'strict' meaning that encoding errors raise ValueError (or a more codec specific subclass, such as UnicodeEncodeError). Refer to Codec Base Classes for more information on codec error handling.

codecs.decode(obj, encoding='utf-8', errors='strict')

Decodes obj using the codec registered for encoding.

Errors may be given to set the desired error handling scheme. The default error handler is 'strict' meaning that decoding errors raise ValueError (or a more codec specific subclass, such as UnicodeDecodeError). Refer to Codec Base Classes for more information on codec error handling.
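
A minimal usage sketch of these two functions (the comments show the expected values for the standard codecs used here):

    import codecs

    data = codecs.encode("résumé", encoding="utf-8")         # str -> bytes: b'r\xc3\xa9sum\xc3\xa9'
    text = codecs.decode(data, encoding="utf-8")             # bytes -> str: 'résumé'

    # Bytes-to-bytes codecs go through the same two functions:
    wrapped = codecs.encode(b"raw payload", "base64_codec")  # b'cmF3IHBheWxvYWQ=\n'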

The full details for each codec can also be looked up directly:

codecs.lookup(encoding)

Looks up the codec info in the Python codec registry and returns a CodecInfo object as defined below.

Encodings are first looked up in the registry's cache. If not found, the list of registered search functions is scanned. If no CodecInfo object is found, a LookupError is raised. Otherwise, the CodecInfo object is stored in the cache and returned to the caller.

class codecs.CodecInfo(encode, decode, streamreader=None, streamwriter=None, incrementalencoder=None, incrementaldecoder=None, name=None)

Codec details when looking up the codec registry. The constructor arguments are stored in attributes of the same name:

name

The name of the encoding.

encode
decode

The stateless encoding and decoding functions. These must be functions or methods which have the same interface as the encode() and decode() methods of Codec instances (see Codec Interface). The functions or methods are expected to work in a stateless mode.

incrementalencoder
incrementaldecoder

Incremental encoder and decoder classes or factory functions. These have to provide the interface defined by the base classes IncrementalEncoder and IncrementalDecoder, respectively. Incremental codecs can maintain state.

streamwriter
streamreader

Stream writer and reader classes or factory functions. These have to provide the interface defined by the base classes StreamWriter and StreamReader, respectively. Stream codecs can maintain state.
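
As a small illustration, the attributes described above can be inspected on the CodecInfo object returned by lookup() (a sketch; only the shown values matter):

    import codecs

    info = codecs.lookup("utf-8")
    print(info.name)                        # 'utf-8'
    print(info.encode("abc"))               # stateless encode: (b'abc', 3)
    decoder = info.incrementaldecoder()     # an IncrementalDecoder instance
    writer_factory = info.streamwriter      # the StreamWriter class/factory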

To simplify access to the various codec components, the module provides these additional functions which use lookup() for the codec lookup:

codecs.getencoder(encoding)

Look up the codec for the given encoding and return its encoder function.

Raises a LookupError in case the encoding cannot be found.

codecs.getdecoder(encoding)

Look up the codec for the given encoding and return its decoder function.

Raises a LookupError in case the encoding cannot be found.

codecs.getincrementalencoder(encoding)

Look up the codec for the given encoding and return its incremental encoder class or factory function.

Raises a LookupError in case the encoding cannot be found or the codec doesn't support an incremental encoder.

codecs.getincrementaldecoder(encoding)

Look up the codec for the given encoding and return its incremental decoder class or factory function.

Raises a LookupError in case the encoding cannot be found or the codec doesn't support an incremental decoder.

codecs.getreader(encoding)

Look up the codec for the given encoding and return its StreamReader class or factory function.

Raises a LookupError in case the encoding cannot be found.

codecs.getwriter(encoding)

Look up the codec for the given encoding and return its StreamWriter class or factory function.

Raises a LookupError in case the encoding cannot be found.

Custom codecs are made available by registering a suitable codec search function:

codecs.register(search_function)

Register a codec search function. Search functions are expected to take one argument, being the encoding name in all lower case letters with hyphens and spaces converted to underscores, and return a CodecInfo object. In case a search function cannot find a given encoding, it should return None.

Changed in version 3.9: Hyphens and spaces are converted to underscores.

codecs.unregister(search_function)

Unregister a codec search function and clear the registry's cache. If the search function is not registered, do nothing.

New in version 3.10.
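
A minimal sketch of registering a custom codec. The search function below only answers for the made-up name 'myrot13' and simply reuses the existing rot_13 codec; every name here is illustrative:

    import codecs

    _rot13 = codecs.lookup("rot_13")

    def _search(name):
        # The registry passes names lower-cased, with hyphens and spaces
        # already converted to underscores.
        if name == "myrot13":
            return codecs.CodecInfo(
                name="myrot13",
                encode=_rot13.encode,
                decode=_rot13.decode,
            )
        return None  # not handled by this search function

    codecs.register(_search)
    print(codecs.encode("hello", "myrot13"))   # 'uryyb'
    codecs.unregister(_search)                 # Python 3.10+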

While the builtin open() and the associated io module are the recommended approach for working with encoded text files, this module provides additional utility functions and classes that allow the use of a wider range of codecs when working with binary files:

codecs.open(filename, mode='r', encoding=None, errors='strict', buffering=-1)

Open an encoded file using the given mode and return an instance of StreamReaderWriter, providing transparent encoding/decoding. The default file mode is 'r', meaning to open the file in read mode.

Note

Underlying encoded files are always opened in binary mode. No automatic conversion of '\n' is done on reading and writing. The mode argument may be any binary mode acceptable to the built-in open() function; the 'b' is automatically added.

encoding specifies the encoding which is to be used for the file. Any encoding that encodes to and decodes from bytes is allowed, and the data types supported by the file methods depend on the codec used.

errors may be given to define the error handling. It defaults to 'strict' which causes a ValueError to be raised in case an encoding error occurs.

buffering has the same meaning as for the built-in open() function. It defaults to -1 which means that the default buffer size will be used.
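
A short sketch (the filename is hypothetical):

    import codecs

    f = codecs.open("example.txt", mode="w", encoding="utf-8")
    f.write("first line\nsecond line\n")   # written as UTF-8 bytes, no newline translation
    f.close()

    f = codecs.open("example.txt", encoding="utf-8", errors="strict")
    print(f.read())                        # 'first line\nsecond line\n'
    f.close()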

codecs.EncodedFile(file, data_encoding, file_encoding=None, errors='strict')

Return a StreamRecoder instance, a wrapped version of file which provides transparent transcoding. The original file is closed when the wrapped version is closed.

Data written to the wrapped file is decoded according to the given data_encoding and then written to the original file as bytes using file_encoding. Bytes read from the original file are decoded according to file_encoding, and the result is encoded using data_encoding.

If file_encoding is not given, it defaults to data_encoding.

errors may be given to define the error handling. It defaults to 'strict', which causes ValueError to be raised in case an encoding error occurs.
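
A minimal sketch using an in-memory stream: the caller works in UTF-8 while the underlying "file" stores Latin-1:

    import codecs
    import io

    backing = io.BytesIO()
    wrapped = codecs.EncodedFile(backing, data_encoding="utf-8", file_encoding="latin-1")

    wrapped.write("café".encode("utf-8"))   # caller supplies UTF-8 bytes ...
    print(backing.getvalue())               # ... stored as Latin-1: b'caf\xe9'

    backing.seek(0)
    print(wrapped.read())                   # read back as UTF-8 bytes: b'caf\xc3\xa9'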

codecs.iterencode(iterator, encoding, errors='strict', **kwargs)

Uses an incremental encoder to iteratively encode the input provided by iterator. This function is a generator. The errors argument (as well as any other keyword argument) is passed through to the incremental encoder.

This function requires that the codec accept text str objects to encode. Therefore it does not support bytes-to-bytes encoders such as base64_codec.

codecs.iterdecode(iterator, encoding, errors='strict', **kwargs)

Uses an incremental decoder to iteratively decode the input provided by iterator. This function is a generator. The errors argument (as well as any other keyword argument) is passed through to the incremental decoder.

This function requires that the codec accept bytes objects to decode. Therefore it does not support text-to-text encoders such as rot_13, although rot_13 may be used equivalently with iterencode().
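
A brief sketch of both generators (the exact chunking of the output may differ from the comment):

    import codecs

    parts = ["Hello, ", "wörld", "!"]
    encoded = list(codecs.iterencode(iter(parts), "utf-8"))
    # e.g. [b'Hello, ', b'w\xc3\xb6rld', b'!']

    text = "".join(codecs.iterdecode(iter(encoded), "utf-8"))
    print(text)                             # 'Hello, wörld!'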

The module also provides the following constants which are useful for reading and writing to platform dependent files:

codecs.BOM
codecs.BOM_BE
codecs.BOM_LE
codecs.BOM_UTF8
codecs.BOM_UTF16
codecs.BOM_UTF16_BE
codecs.BOM_UTF16_LE
codecs.BOM_UTF32
codecs.BOM_UTF32_BE
codecs.BOM_UTF32_LE

These constants define various byte sequences, being Unicode byte order marks (BOMs) for several encodings. They are used in UTF-16 and UTF-32 data streams to indicate the byte order used, and in UTF-8 as a Unicode signature. BOM_UTF16 is either BOM_UTF16_BE or BOM_UTF16_LE depending on the platform's native byte order, BOM is an alias for BOM_UTF16, BOM_LE for BOM_UTF16_LE and BOM_BE for BOM_UTF16_BE. The others represent the BOM in UTF-8 and UTF-32 encodings.
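
A small sketch of the kind of BOM sniffing these constants allow (a simplified check, not a complete autodetection routine):

    import codecs

    def sniff_utf_encoding(data: bytes) -> str:
        """Guess a UTF encoding from a leading BOM; fall back to plain UTF-8."""
        # Check the longer UTF-32 BOMs first: BOM_UTF32_LE starts with BOM_UTF16_LE.
        if data.startswith((codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)):
            return "utf-32"
        if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
            return "utf-16"
        if data.startswith(codecs.BOM_UTF8):
            return "utf-8-sig"
        return "utf-8"

    sample = codecs.BOM_UTF16_LE + "hi".encode("utf-16-le")
    print(sniff_utf_encoding(sample))       # 'utf-16'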

Codec Base Classes

The codecs module defines a set of base classes which define the interfaces for working with codec objects, and can also be used as the basis for custom codec implementations.

Each codec has to define four interfaces to make it usable as codec in Python: stateless encoder, stateless decoder, stream reader and stream writer. The stream reader and writers typically reuse the stateless encoder/decoder to implement the file protocols. Codec authors also need to define how the codec will handle encoding and decoding errors.

Error Handlers

To simplify and standardize error handling, codecs may implement different error handling schemes by accepting the errors string argument. The following string values are defined and implemented by all standard Python codecs:

'strict'
Raise UnicodeError (or a subclass); this is the default. Implemented in strict_errors().

'ignore'
Ignore the malformed data and continue without further notice. Implemented in ignore_errors().

The following error handlers are only applicable to text encodings:

'replace'
Replace with a suitable replacement marker; Python will use the official U+FFFD REPLACEMENT CHARACTER for the built-in codecs on decoding, and '?' on encoding. Implemented in replace_errors().

'xmlcharrefreplace'
Replace with the appropriate XML character reference (only for encoding). Implemented in xmlcharrefreplace_errors().

'backslashreplace'
Replace with backslashed escape sequences. Implemented in backslashreplace_errors().

'namereplace'
Replace with \N{...} escape sequences (only for encoding). Implemented in namereplace_errors().

'surrogateescape'
On decoding, replace each byte with an individual surrogate code ranging from U+DC80 to U+DCFF. This code will then be turned back into the same byte when the 'surrogateescape' error handler is used when encoding the data. (See PEP 383 for more.)
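
A brief sketch of the 'surrogateescape' round trip on bytes that are not valid UTF-8:

    raw = b"caf\xe9"                                        # Latin-1 bytes, not valid UTF-8
    text = raw.decode("utf-8", errors="surrogateescape")    # 'caf\udce9': the bad byte becomes U+DCE9
    assert text.encode("utf-8", errors="surrogateescape") == raw   # the original bytes come back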

In addition, the following error handler is specific to the given codecs:

'surrogatepass'
Applicable to: utf-8, utf-16, utf-32, utf-16-be, utf-16-le, utf-32-be, utf-32-le
Allow encoding and decoding of surrogate codes. These codecs normally treat the presence of surrogates as an error.

New in version 3.1: The 'surrogateescape' and 'surrogatepass' error handlers.

Changed in version 3.4: The 'surrogatepass' error handler now works with utf-16* and utf-32* codecs.

New in version 3.5: The 'namereplace' error handler.

Changed in version 3.5: The 'backslashreplace' error handler now works with decoding and translating.

The set of allowed values can be extended by registering a new named error handler:

codecs.register_error(name, error_handler)

Register the error handling function error_handler under the name name. The error_handler argument will be called during encoding and decoding in case of an error, when name is specified as the errors parameter.

For encoding, error_handler will be called with a UnicodeEncodeError instance, which contains information about the location of the error. The error handler must either raise this or a different exception, or return a tuple with a replacement for the unencodable part of the input and a position where encoding should continue. The replacement may be either str or bytes. If the replacement is bytes, the encoder will simply copy them into the output buffer. If the replacement is a string, the encoder will encode the replacement. Encoding continues on original input at the specified position. Negative position values will be treated as being relative to the end of the input string. If the resulting position is out of bound an IndexError will be raised.

Decoding and translating works similarly, except UnicodeDecodeError or UnicodeTranslateError will be passed to the handler and that the replacement from the error handler will be put into the output directly.
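
A minimal sketch of a custom encoding error handler; the handler name 'qmarks' is made up for this example:

    import codecs

    def qmarks(exc):
        if isinstance(exc, UnicodeEncodeError):
            # Replace the whole unencodable slice and resume right after it.
            return ("?" * (exc.end - exc.start), exc.end)
        raise exc

    codecs.register_error("qmarks", qmarks)
    print("späm".encode("ascii", errors="qmarks"))   # b'sp?m'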

Previously registered error handlers (including the standard error handlers) can be looked up by name:

codecs.lookup_error(name)

Return the error handler previously registered under the name name.

Raises a LookupError in case the handler cannot be found.

The following standard error handlers are also made available as module level functions:

codecs.strict_errors(exception)

Implements the 'strict' error handling: each encoding or decoding error raises a UnicodeError.

codecs.replace_errors(exception)

Implements the 'replace' error handling (for text encodings only): substitutes '?' for encoding errors (to be encoded by the codec), and '\ufffd' (the Unicode replacement character) for decoding errors.

codecs.ignore_errors(exception)

Implements the 'ignore' error handling: malformed data is ignored and encoding or decoding is continued without further notice.

codecs.xmlcharrefreplace_errors(exception)

Implements the 'xmlcharrefreplace' error handling (for encoding with text encodings only): the unencodable character is replaced by an appropriate XML character reference.

codecs.backslashreplace_errors(exception)

Implements the 'backslashreplace' error handling (for text encodings only): malformed data is replaced by a backslashed escape sequence.

codecs.namereplace_errors(exception)

Implements the 'namereplace' error handling (for encoding with text encodings only): the unencodable character is replaced by a \N{...} escape sequence.

New in version 3.5.

Stateless Encoding and Decoding

The base Codec class defines these methods which also define the function interfaces of the stateless encoder and decoder:

Codec.encode(input[, errors])

Encodes the object input and returns a tuple (output object, length consumed). For instance, text encoding converts a string object to a bytes object using a particular character set encoding (e.g., cp1252 or iso-8859-1).

The errors argument defines the error handling to apply. It defaults to 'strict' handling.

The method may not store state in the Codec instance. Use StreamWriter for codecs which have to keep state in order to make encoding efficient.

The encoder must be able to handle zero length input and return an empty object of the output object type in this situation.

Codec.decode(input[, errors])

Decodes the object input and returns a tuple (output object, length consumed). For instance, for a text encoding, decoding converts a bytes object encoded using a particular character set encoding to a string object.

For text encodings and bytes-to-bytes codecs, input must be a bytes object or one which provides the read-only buffer interface, for example, buffer objects and memory mapped files.

The errors argument defines the error handling to apply. It defaults to 'strict' handling.

The method may not store state in the Codec instance. Use StreamReader for codecs which have to keep state in order to make decoding efficient.

The decoder must be able to handle zero length input and return an empty object of the output object type in this situation.

Incremental Encoding and Decoding

The IncrementalEncoder and IncrementalDecoder classes provide the basic interface for incremental encoding and decoding. Encoding/decoding the input isn't done with one call to the stateless encoder/decoder function, but with multiple calls to the encode()/decode() method of the incremental encoder/decoder. The incremental encoder/decoder keeps track of the encoding/decoding process during method calls.

The joined output of calls to the encode()/decode() method is the same as if all the single inputs were joined into one, and this input was encoded/decoded with the stateless encoder/decoder.
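
A small sketch showing that chunked incremental decoding matches a single stateless call, even when a multi-byte sequence is split across chunks:

    import codecs

    data = "héllo".encode("utf-8")                  # b'h\xc3\xa9llo'
    chunks = [data[:2], data[2:]]                   # split inside the two-byte 'é' sequence

    decoder = codecs.getincrementaldecoder("utf-8")()
    pieces = [decoder.decode(chunk) for chunk in chunks]
    pieces.append(decoder.decode(b"", final=True))  # flush any remaining state
    assert "".join(pieces) == data.decode("utf-8")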

IncrementalEncoder Objects

The IncrementalEncoder class is used for encoding an input in multiple steps. It defines the following methods which every incremental encoder must define in order to be compatible with the Python codec registry.

class codecs.IncrementalEncoder(errors='strict')

Constructor for an IncrementalEncoder instance.

All incremental encoders must provide this constructor interface. They are free to add additional keyword arguments, but only the ones defined here are used by the Python codec registry.

The IncrementalEncoder may implement different error handling schemes by providing the errors keyword argument. See Error Handlers for possible values.

The errors argument will be assigned to an attribute of the same name. Assigning to this attribute makes it possible to switch between different error handling strategies during the lifetime of the IncrementalEncoder object.

encode(object[, final])

Encodes object (taking the current state of the encoder into account) and returns the resulting encoded object. If this is the last call to encode() final must be true (the default is false).

reset()

Reset the encoder to the initial state. The output is discarded: call .encode(object, final=True), passing an empty byte or text string if necessary, to reset the encoder and to get the output.

getstate()

Return the current state of the encoder which must be an integer. The implementation should make sure that 0 is the most common state. (States that are more complicated than integers can be converted into an integer by marshaling/pickling the state and encoding the bytes of the resulting string into an integer.)

setstate(state)

Set the state of the encoder to state. state must be an encoder state returned by getstate().

IncrementalDecoder Objects

The IncrementalDecoder class is used for decoding an input in multiple steps. It defines the following methods which every incremental decoder must define in order to be compatible with the Python codec registry.

class codecs.IncrementalDecoder(errors='strict')

Constructor for an IncrementalDecoder instance.

All incremental decoders must provide this constructor interface. They are free to add additional keyword arguments, but only the ones defined here are used by the Python codec registry.

The IncrementalDecoder may implement different error handling schemes by providing the errors keyword argument. See Error Handlers for possible values.

The errors argument will be assigned to an attribute of the same name. Assigning to this attribute makes it possible to switch between different error handling strategies during the lifetime of the IncrementalDecoder object.

decode(object[, final])

Decodes object (taking the current state of the decoder into account) and returns the resulting decoded object. If this is the last call to decode() final must be true (the default is false). If final is true the decoder must decode the input completely and must flush all buffers. If this isn't possible (e.g. because of incomplete byte sequences at the end of the input) it must initiate error handling just like in the stateless case (which might raise an exception).

reset()

Reset the decoder to the initial state.

getstate()

Return the current state of the decoder. This must be a tuple with two items, the first must be the buffer containing the still undecoded input. The second must be an integer and can be additional state info. (The implementation should make sure that 0 is the most common additional state info.) If this additional state info is 0 it must be possible to set the decoder to the state which has no input buffered and 0 as the additional state info, so that feeding the previously buffered input to the decoder returns it to the previous state without producing any output. (Additional state info that is more complicated than integers can be converted into an integer by marshaling/pickling the info and encoding the bytes of the resulting string into an integer.)

setstate(state)

Set the state of the decoder to state. state must be a decoder state returned by getstate().

Stream Encoding and Decoding

The StreamWriter and StreamReader classes provide generic working interfaces which can be used to implement new encoding submodules very easily. See encodings.utf_8 for an example of how this is done.

StreamWriter Objects

The StreamWriter class is a subclass of Codec and defines the following methods which every stream writer must define in order to be compatible with the Python codec registry.

class codecs.StreamWriter(stream, errors='strict')

Constructor for a StreamWriter instance.

All stream writers must provide this constructor interface. They are free to add additional keyword arguments, but only the ones defined here are used by the Python codec registry.

The stream argument must be a file-like object open for writing text or binary data, as appropriate for the specific codec.

The StreamWriter may implement different error handling schemes by providing the errors keyword argument. See Error Handlers for the standard error handlers the underlying stream codec may support.

The errors argument will be assigned to an attribute of the same name. Assigning to this attribute makes it possible to switch between different error handling strategies during the lifetime of the StreamWriter object.

write(object)

Writes the object's contents encoded to the stream.

writelines(list)

Writes the concatenated iterable of strings to the stream (possibly by reusing the write() method). Infinite or very large iterables are not supported. The standard bytes-to-bytes codecs do not support this method.

reset()

Resets the codec buffers used for keeping internal state.

Calling this method should ensure that the data on the output is put into a clean state that allows appending of new fresh data without having to rescan the whole stream to recover state.

In addition to the above methods, the StreamWriter must also inherit all other methods and attributes from the underlying stream.

StreamReader Objects

The StreamReader class is a subclass of Codec and defines the following methods which every stream reader must define in order to be compatible with the Python codec registry.

class codecs.StreamReader(stream, errors='strict')

Constructor for a StreamReader instance.

All stream readers must provide this constructor interface. They are free to add additional keyword arguments, but only the ones defined here are used by the Python codec registry.

The stream argument must be a file-like object open for reading text or binary data, as appropriate for the specific codec.

The StreamReader may implement different error handling schemes by providing the errors keyword argument. See Error Handlers for the standard error handlers the underlying stream codec may support.

The errors argument will be assigned to an attribute of the same name. Assigning to this attribute makes it possible to switch between different error handling strategies during the lifetime of the StreamReader object.

The set of allowed values for the errors argument can be extended with register_error().

read([size[, chars[, firstline]]])

Decodes data from the stream and returns the resulting object.

The chars argument indicates the number of decoded code points or bytes to return. The read() method will never return more data than requested, but it might return less, if there is not enough available.

The size argument indicates the approximate maximum number of encoded bytes or code points to read for decoding. The decoder can modify this setting as appropriate. The default value -1 indicates to read and decode as much as possible. This parameter is intended to prevent having to decode huge files in one step.

The firstline flag indicates that it would be sufficient to only return the first line, if there are decoding errors on later lines.

The method should use a greedy read strategy meaning that it should read as much data as is allowed within the definition of the encoding and the given size, e.g. if optional encoding endings or state markers are available on the stream, these should be read too.

readline([size[, keepends]])

Read one line from the input stream and return the decoded data.

size, if given, is passed as size argument to the stream's read() method.

If keepends is false line-endings will be stripped from the lines returned.

readlines([sizehint[, keepends]])

Read all lines available on the input stream and return them as a list of lines.

Line-endings are implemented using the codec's decode() method and are included in the list entries if keepends is true.

sizehint, if given, is passed as the size argument to the stream's read() method.

reset()

Resets the codec buffers used for keeping internal state.

Note that no stream repositioning should take place. This method is primarily intended to be able to recover from decoding errors.

In addition to the above methods, the StreamReader must also inherit all other methods and attributes from the underlying stream.

StreamReaderWriter Objects

The StreamReaderWriter is a convenience class that allows wrapping streams which work in both read and write modes.

The design is such that one can use the factory functions returned by the lookup() function to construct the instance.

class codecs.StreamReaderWriter(stream, Reader, Writer, errors='strict')

Creates a StreamReaderWriter instance. stream must be a file-like object. Reader and Writer must be factory functions or classes providing the StreamReader and StreamWriter interface resp. Error handling is done in the same way as defined for the stream readers and writers.

StreamReaderWriter instances define the combined interfaces of the StreamReader and StreamWriter classes. They inherit all other methods and attributes from the underlying stream.

StreamRecoder Objects

The StreamRecoder translates data from one encoding to another, which is sometimes useful when dealing with different encoding environments.

The design is such that one can use the factory functions returned by the lookup() function to construct the instance.

class codecs.StreamRecoder(stream, encode, decode, Reader, Writer, errors='strict')

Creates a StreamRecoder instance which implements a two-way conversion: encode and decode work on the frontend (the data visible to code calling read() and write()), while Reader and Writer work on the backend (the data in stream).

You can use these objects to do transparent transcodings, e.g., from Latin-1 to UTF-8 and back.

The stream argument must be a file-like object.

The encode and decode arguments must adhere to the Codec interface. Reader and Writer must be factory functions or classes providing objects of the StreamReader and StreamWriter interface respectively.

Error handling is done in the same way as defined for the stream readers and writers.

StreamRecoder instances define the combined interfaces of the StreamReader and StreamWriter classes. They inherit all other methods and attributes from the underlying stream.

Encodings and Unicode

Strings are stored internally as sequences of code points in range 0x0–0x10FFFF. (See PEP 393 for more details about the implementation.) Once a string object is used outside of CPU and memory, endianness and how these arrays are stored as bytes become an issue. As with other codecs, serialising a string into a sequence of bytes is known as encoding, and recreating the string from the sequence of bytes is known as decoding. There are a variety of different text serialisation codecs, which are collectively referred to as text encodings.

The simplest text encoding (called 'latin-1' or 'iso-8859-1') maps the code points 0–255 to the bytes 0x0–0xff, which means that a string object that contains code points above U+00FF can't be encoded with this codec. Doing so will raise a UnicodeEncodeError that looks like the following (although the details of the error message may differ): UnicodeEncodeError: 'latin-1' codec can't encode character '\u1234' in position 3: ordinal not in range(256).

There's another group of encodings (the so called charmap encodings) that choose a different subset of all Unicode code points and how these code points are mapped to the bytes 0x0–0xff. To see how this is done simply open e.g. encodings/cp1252.py (which is an encoding that is used primarily on Windows). There's a string constant with 256 characters that shows you which character is mapped to which byte value.

All of these encodings can only encode 256 of the 1114112 code points defined in Unicode. A simple and straightforward way that can store each Unicode code point, is to store each code point as four consecutive bytes. There are two possibilities: store the bytes in big endian or in little endian order. These two encodings are called UTF-32-BE and UTF-32-LE respectively. Their disadvantage is that if e.g. you use UTF-32-BE on a little endian machine you will always have to swap bytes on encoding and decoding. UTF-32 avoids this problem: bytes will always be in natural endianness. When these bytes are read by a CPU with a different endianness, then bytes have to be swapped though. To be able to detect the endianness of a UTF-16 or UTF-32 byte sequence, there's the so called BOM ("Byte Order Mark"). This is the Unicode character U+FEFF. This character can be prepended to every UTF-16 or UTF-32 byte sequence. The byte swapped version of this character (0xFFFE) is an illegal character that may not appear in a Unicode text. So when the first character in a UTF-16 or UTF-32 byte sequence appears to be a U+FFFE the bytes have to be swapped on decoding. Unfortunately the character U+FEFF had a second purpose as a ZERO WIDTH NO-BREAK SPACE: a character that has no width and doesn't allow a word to be split. It can e.g. be used to give hints to a ligature algorithm. With Unicode 4.0 using U+FEFF as a ZERO WIDTH NO-BREAK SPACE has been deprecated (with U+2060 (WORD JOINER) assuming this role). Nevertheless Unicode software still must be able to handle U+FEFF in both roles: as a BOM it's a device to determine the storage layout of the encoded bytes, and vanishes once the byte sequence has been decoded into a string; as a ZERO WIDTH NO-BREAK SPACE it's a normal character that will be decoded like any other.

There's another encoding that is able to encode the full range of Unicode characters: UTF-8. UTF-8 is an 8-bit encoding, which means there are no issues with byte order in UTF-8. Each byte in a UTF-8 byte sequence consists of two parts: marker bits (the most significant bits) and payload bits. The marker bits are a sequence of zero to four 1 bits followed by a 0 bit. Unicode characters are encoded like this (with x being payload bits, which when concatenated give the Unicode character):

Range                     | Encoding
U-00000000 … U-0000007F   | 0xxxxxxx
U-00000080 … U-000007FF   | 110xxxxx 10xxxxxx
U-00000800 … U-0000FFFF   | 1110xxxx 10xxxxxx 10xxxxxx
U-00010000 … U-0010FFFF   | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

The least significant bit of the Unicode character is the rightmost x bit.

As UTF-8 is an 8-bit encoding no BOM is required and any U+FEFF character in the decoded string (even if it's the first character) is treated as a ZERO WIDTH NO-BREAK SPACE.
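
A quick worked example of the table above, using the euro sign U+20AC, which falls in the three-byte row:

    # Split the 16 payload bits of U+20AC as 4 + 6 + 6 and add the marker bits.
    cp = 0x20AC
    expected = bytes([
        0b1110_0000 | (cp >> 12),                # 1110xxxx -> 0xE2
        0b1000_0000 | ((cp >> 6) & 0b11_1111),   # 10xxxxxx -> 0x82
        0b1000_0000 | (cp & 0b11_1111),          # 10xxxxxx -> 0xAC
    ])
    assert expected == "\u20ac".encode("utf-8")  # b'\xe2\x82\xac'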

Without external information it's impossible to reliably determine which encoding was used for encoding a string. Each charmap encoding can decode any random byte sequence. However that's not possible with UTF-8, as UTF-8 byte sequences have a structure that doesn't allow arbitrary byte sequences. To increase the reliability with which a UTF-8 encoding can be detected, Microsoft invented a variant of UTF-8 (that Python 2.5 calls "utf-8-sig") for its Notepad program: Before any of the Unicode characters is written to the file, a UTF-8 encoded BOM (which looks like this as a byte sequence: 0xef, 0xbb, 0xbf) is written. As it's rather improbable that any charmap encoded file starts with these byte values (which would e.g. map to

LATIN SMALL LETTER I WITH DIAERESIS
RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK
INVERTED QUESTION MARK

in iso-8859-1), this increases the probability that a utf-8-sig encoding can be correctly guessed from the byte sequence. So here the BOM is not used to be able to determine the byte order used for generating the byte sequence, but as a signature that helps in guessing the encoding. On encoding the utf-8-sig codec will write 0xef, 0xbb, 0xbf as the first three bytes to the file. On decoding utf-8-sig will skip those three bytes if they appear as the first three bytes in the file. In UTF-8, the use of the BOM is discouraged and should generally be avoided.
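
For example (a short sketch):

    encoded = "spam".encode("utf-8-sig")
    assert encoded == b"\xef\xbb\xbfspam"            # the UTF-8 encoded BOM is prepended
    assert encoded.decode("utf-8-sig") == "spam"     # utf-8-sig skips the BOM
    assert encoded.decode("utf-8") == "\ufeffspam"   # plain utf-8 keeps it as ZERO WIDTH NO-BREAK SPACE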

Standard Encodings

Python comes with a number of codecs built-in, either implemented as C functions or with dictionaries as mapping tables. The following table lists the codecs by name, together with a few common aliases, and the languages for which the encoding is likely used. Neither the list of aliases nor the list of languages is meant to be exhaustive. Notice that spelling alternatives that only differ in case or use a hyphen instead of an underscore are also valid aliases; therefore, e.g. 'utf-8' is a valid alias for the 'utf_8' codec.
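
For instance, all of the following spellings resolve to the same codec:

    import codecs

    assert codecs.lookup("UTF-8").name == "utf-8"
    assert codecs.lookup("utf_8").name == "utf-8"
    assert codecs.lookup("U8").name == "utf-8"   # 'U8' is a registered alias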

CPython implementation detail: Some common encodings can bypass the codecs lookup machinery to improve performance. These optimization opportunities are only recognized by CPython for a limited set of (case insensitive) aliases: utf-8, utf8, latin-1, latin1, iso-8859-1, iso8859-1, mbcs (Windows only), ascii, us-ascii, utf-16, utf16, utf-32, utf32, and the same using underscores instead of dashes. Using alternative aliases for these encodings may result in slower execution.

Changed in version 3.6: Optimization opportunity recognized for us-ascii.

Many of the character sets support the same languages. They vary in individual characters (e.g. whether the EURO SIGN is supported or not), and in the assignment of characters to code positions. For the European languages in particular, the following variants typically exist:

  • an ISO 8859 codeset

  • a Microsoft Windows code page, which is typically derived from an 8859 codeset, but replaces control characters with additional graphic characters

  • an IBM EBCDIC code page

  • an IBM PC code page, which is ASCII compatible

Codec | Aliases | Languages
ascii | 646, us-ascii | English
big5 | big5-tw, csbig5 | Traditional Chinese
big5hkscs | big5-hkscs, hkscs | Traditional Chinese
cp037 | IBM037, IBM039 | English
cp273 | 273, IBM273, csIBM273 | German (New in version 3.4)
cp424 | EBCDIC-CP-HE, IBM424 | Hebrew
cp437 | 437, IBM437 | English
cp500 | EBCDIC-CP-BE, EBCDIC-CP-CH, IBM500 | Western Europe
cp720 | | Arabic
cp737 | | Greek
cp775 | IBM775 | Baltic languages
cp850 | 850, IBM850 | Western Europe
cp852 | 852, IBM852 | Central and Eastern Europe
cp855 | 855, IBM855 | Bulgarian, Byelorussian, Macedonian, Russian, Serbian
cp856 | | Hebrew
cp857 | 857, IBM857 | Turkish
cp858 | 858, IBM858 | Western Europe
cp860 | 860, IBM860 | Portuguese
cp861 | 861, CP-IS, IBM861 | Icelandic
cp862 | 862, IBM862 | Hebrew
cp863 | 863, IBM863 | Canadian
cp864 | IBM864 | Arabic
cp865 | 865, IBM865 | Danish, Norwegian
cp866 | 866, IBM866 | Russian
cp869 | 869, CP-GR, IBM869 | Greek
cp874 | | Thai
cp875 | | Greek
cp932 | 932, ms932, mskanji, ms-kanji | Japanese
cp949 | 949, ms949, uhc | Korean
cp950 | 950, ms950 | Traditional Chinese
cp1006 | | Urdu
cp1026 | ibm1026 | Turkish
cp1125 | 1125, ibm1125, cp866u, ruscii | Ukrainian (New in version 3.4)
cp1140 | ibm1140 | Western Europe
cp1250 | windows-1250 | Central and Eastern Europe
cp1251 | windows-1251 | Bulgarian, Byelorussian, Macedonian, Russian, Serbian
cp1252 | windows-1252 | Western Europe
cp1253 | windows-1253 | Greek
cp1254 | windows-1254 | Turkish
cp1255 | windows-1255 | Hebrew
cp1256 | windows-1256 | Arabic
cp1257 | windows-1257 | Baltic languages
cp1258 | windows-1258 | Vietnamese
euc_jp | eucjp, ujis, u-jis | Japanese
euc_jis_2004 | jisx0213, eucjis2004 | Japanese
euc_jisx0213 | eucjisx0213 | Japanese
euc_kr | euckr, korean, ksc5601, ks_c-5601, ks_c-5601-1987, ksx1001, ks_x-1001 | Korean
gb2312 | chinese, csiso58gb231280, euc-cn, euccn, eucgb2312-cn, gb2312-1980, gb2312-80, iso-ir-58 | Simplified Chinese
gbk | 936, cp936, ms936 | Unified Chinese
gb18030 | gb18030-2000 | Unified Chinese
hz | hzgb, hz-gb, hz-gb-2312 | Simplified Chinese
iso2022_jp | csiso2022jp, iso2022jp, iso-2022-jp | Japanese
iso2022_jp_1 | iso2022jp-1, iso-2022-jp-1 | Japanese
iso2022_jp_2 | iso2022jp-2, iso-2022-jp-2 | Japanese, Korean, Simplified Chinese, Western Europe, Greek
iso2022_jp_2004 | iso2022jp-2004, iso-2022-jp-2004 | Japanese
iso2022_jp_3 | iso2022jp-3, iso-2022-jp-3 | Japanese
iso2022_jp_ext | iso2022jp-ext, iso-2022-jp-ext | Japanese
iso2022_kr | csiso2022kr, iso2022kr, iso-2022-kr | Korean
latin_1 | iso-8859-1, iso8859-1, 8859, cp819, latin, latin1, L1 | Western Europe
iso8859_2 | iso-8859-2, latin2, L2 | Central and Eastern Europe
iso8859_3 | iso-8859-3, latin3, L3 | Esperanto, Maltese
iso8859_4 | iso-8859-4, latin4, L4 | Baltic languages
iso8859_5 | iso-8859-5, cyrillic | Bulgarian, Byelorussian, Macedonian, Russian, Serbian
iso8859_6 | iso-8859-6, arabic | Arabic
iso8859_7 | iso-8859-7, greek, greek8 | Greek
iso8859_8 | iso-8859-8, hebrew | Hebrew
iso8859_9 | iso-8859-9, latin5, L5 | Turkish
iso8859_10 | iso-8859-10, latin6, L6 | Nordic languages
iso8859_11 | iso-8859-11, thai | Thai languages
iso8859_13 | iso-8859-13, latin7, L7 | Baltic languages
iso8859_14 | iso-8859-14, latin8, L8 | Celtic languages
iso8859_15 | iso-8859-15, latin9, L9 | Western Europe
iso8859_16 | iso-8859-16, latin10, L10 | South-Eastern Europe
johab | cp1361, ms1361 | Korean
koi8_r | | Russian
koi8_t | | Tajik (New in version 3.5)
koi8_u | | Ukrainian
kz1048 | kz_1048, strk1048_2002, rk1048 | Kazakh (New in version 3.5)
mac_cyrillic | maccyrillic | Bulgarian, Byelorussian, Macedonian, Russian, Serbian
mac_greek | macgreek | Greek
mac_iceland | maciceland | Icelandic
mac_latin2 | maclatin2, maccentraleurope, mac_centeuro | Central and Eastern Europe
mac_roman | macroman, macintosh | Western Europe
mac_turkish | macturkish | Turkish
ptcp154 | csptcp154, pt154, cp154, cyrillic-asian | Kazakh
shift_jis | csshiftjis, shiftjis, sjis, s_jis | Japanese
shift_jis_2004 | shiftjis2004, sjis_2004, sjis2004 | Japanese
shift_jisx0213 | shiftjisx0213, sjisx0213, s_jisx0213 | Japanese
utf_32 | U32, utf32 | all languages
utf_32_be | UTF-32BE | all languages
utf_32_le | UTF-32LE | all languages
utf_16 | U16, utf16 | all languages
utf_16_be | UTF-16BE | all languages
utf_16_le | UTF-16LE | all languages
utf_7 | U7, unicode-1-1-utf-7 | all languages
utf_8 | U8, UTF, utf8, cp65001 | all languages
utf_8_sig | | all languages

Changed in version 3.4: The utf-16* and utf-32* encoders no longer allow surrogate code points (U+D800–U+DFFF) to be encoded. The utf-32* decoders no longer decode byte sequences that correspond to surrogate code points.

Changed in version 3.8: cp65001 is now an alias to utf_8.

Python Specific Encodings

A number of predefined codecs are specific to Python, so their codec names have no meaning outside Python. These are listed in the tables below based on the expected input and output types (note that while text encodings are the most common use case for codecs, the underlying codec infrastructure supports arbitrary data transforms rather than just text encodings). For asymmetric codecs, the stated meaning describes the encoding direction.

Text Encodings

The following codecs provide str to bytes encoding and bytes-like object to str decoding, similar to the Unicode text encodings.

idna
Implement RFC 3490, see also encodings.idna. Only errors='strict' is supported.

mbcs (aliases: ansi, dbcs)
Windows only: Encode the operand according to the ANSI codepage (CP_ACP).

oem
Windows only: Encode the operand according to the OEM codepage (CP_OEMCP).
New in version 3.6.

palmos
Encoding of PalmOS 3.5.

punycode
Implement RFC 3492. Stateful codecs are not supported.

raw_unicode_escape
Latin-1 encoding with \uXXXX and \UXXXXXXXX for other code points. Existing backslashes are not escaped in any way. It is used in the Python pickle protocol.

undefined
Raise an exception for all conversions, even empty strings. The error handler is ignored.

unicode_escape
Encoding suitable as the contents of a Unicode literal in ASCII-encoded Python source code, except that quotes are not escaped. Decode from Latin-1 source code. Beware that Python source code actually uses UTF-8 by default.

Changed in version 3.8: "unicode_internal" codec is removed.

Binary Transforms

The following codecs provide binary transforms: bytes-like object to bytes mappings. They are not supported by bytes.decode() (which only produces str output). A short usage example follows the table.

base64_codec (aliases: base64, base_64) [1]
Convert the operand to multiline MIME base64 (the result always includes a trailing '\n').
Encoder / decoder: base64.encodebytes() / base64.decodebytes()
Changed in version 3.4: accepts any bytes-like object as input for encoding and decoding.

bz2_codec (alias: bz2)
Compress the operand using bz2.
Encoder / decoder: bz2.compress() / bz2.decompress()

hex_codec (alias: hex)
Convert the operand to hexadecimal representation, with two digits per byte.
Encoder / decoder: binascii.b2a_hex() / binascii.a2b_hex()

quopri_codec (aliases: quopri, quotedprintable, quoted_printable)
Convert the operand to MIME quoted printable.
Encoder / decoder: quopri.encode() with quotetabs=True / quopri.decode()

uu_codec (alias: uu)
Convert the operand using uuencode.
Encoder / decoder: uu.encode() / uu.decode()

zlib_codec (aliases: zip, zlib)
Compress the operand using gzip.
Encoder / decoder: zlib.compress() / zlib.decompress()

[1] In addition to bytes-like objects, 'base64_codec' also accepts ASCII-only instances of str for decoding.

New in version 3.2: Restoration of the binary transforms.

Changed in version 3.4: Restoration of the aliases for the binary transforms.
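
A brief example of one of these transforms:

    import codecs

    wrapped = codecs.encode(b"spam and eggs", "base64_codec")
    print(wrapped)                                   # b'c3BhbSBhbmQgZWdncw==\n'
    print(codecs.decode(wrapped, "base64_codec"))    # b'spam and eggs'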

Text Transforms

The following codec provides a text transform: a str to str mapping. It is not supported by str.encode() (which only produces bytes output). A short example follows the table.

rot_13 (alias: rot13)
Return the Caesar-cypher encryption of the operand.

New in version 3.2: Restoration of the rot_13 text transform.

Changed in version 3.4: Restoration of the rot13 alias.
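
For example:

    import codecs

    assert codecs.encode("hello", "rot_13") == "uryyb"
    assert codecs.decode("uryyb", "rot13") == "hello"   # the 'rot13' alias also works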

encodings.idna - Internationalized Domain Names in Applications

This module implements RFC 3490 (Internationalized Domain Names in Applications) and RFC 3492 (Nameprep: A Stringprep Profile for Internationalized Domain Names (IDN)). It builds upon the punycode encoding and stringprep.

If you need the IDNA 2008 standard from RFC 5891 and RFC 5895, use the third-party idna module.

These RFCs together define a protocol to support non-ASCII characters in domain names. A domain name containing non-ASCII characters (such as www.Alliancefrançaise.nu) is converted into an ASCII-compatible encoding (ACE, such as www.xn--alliancefranaise-npb.nu). The ACE form of the domain name is then used in all places where arbitrary characters are not allowed by the protocol, such as DNS queries, HTTP Host fields, and so on. This conversion is carried out in the application; if possible invisible to the user: the application should transparently convert Unicode domain labels to IDNA on the wire, and convert back ACE labels to Unicode before presenting them to the user.

Python supports this conversion in several ways: the idna codec performs conversion between Unicode and ACE, separating an input string into labels based on the separator characters defined in section 3.1 of RFC 3490 and converting each label to ACE as required, and conversely separating an input byte string into labels based on the . separator and converting any ACE labels found into Unicode. Furthermore, the socket module transparently converts Unicode host names to ACE, so that applications need not be concerned about converting host names themselves when they pass them to the socket module. On top of that, modules that have host names as function parameters, such as http.client and ftplib, accept Unicode host names (http.client then also transparently sends an IDNA hostname in the Host field if it sends that field at all).
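
For example, using the domain name mentioned above (a short sketch):

    domain = "www.Alliancefrançaise.nu"
    ace = domain.encode("idna")     # b'www.xn--alliancefranaise-npb.nu'
    back = ace.decode("idna")       # 'www.alliancefrançaise.nu' (labels are nameprepped on encoding)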

When receiving host names from the wire (such as in reverse name lookup), no automatic conversion to Unicode is performed: applications wishing to present such host names to the user should decode them to Unicode.

The module encodings.idna also implements the nameprep procedure, which performs certain normalizations on host names, to achieve case-insensitivity of international domain names, and to unify similar characters. The nameprep functions can be used directly if desired.

encodings.idna.nameprep(label)

Return the nameprepped version of label. The implementation currently assumes query strings, so AllowUnassigned is true.

encodings.idna.ToASCII(label)

Convert a label to ASCII, as specified in RFC 3490. UseSTD3ASCIIRules is assumed to be false.

encodings.idna.ToUnicode(label)

Convert a label to Unicode, as specified in RFC 3490.

encodings.mbcs - Windows ANSI codepage

This module implements the ANSI codepage (CP_ACP).

Availability: Windows only.

Changed in version 3.3: Support any error handler.

Changed in version 3.2: Before 3.2, the errors argument was ignored; 'replace' was always used to encode, and 'ignore' to decode.

encodings.utf_8_sig - UTF-8 codec with BOM signature

This module implements a variant of the UTF-8 codec. On encoding, a UTF-8 encoded BOM will be prepended to the UTF-8 encoded bytes. For the stateful encoder this is only done once (on the first write to the byte stream). On decoding, an optional UTF-8 encoded BOM at the start of the data will be skipped.