codecs — Codec registry and base classes

Source code: Lib/codecs.py

This module defines base classes for standard Python codecs (encoders and decoders) and provides access to the internal Python codec registry, which manages the codec and error handling lookup process. Most standard codecs are text encodings, which encode text to bytes, but there are also codecs provided that encode text to text, and bytes to bytes. Custom codecs may encode and decode between arbitrary types, but some module features are restricted to use specifically with text encodings, or with codecs that encode to bytes.
The module defines the following functions for encoding and decoding with any codec:

codecs.encode(obj, encoding='utf-8', errors='strict')
   Encodes obj using the codec registered for encoding.

   errors may be given to set the desired error handling scheme. The default error handler is 'strict', meaning that encoding errors raise ValueError (or a more codec specific subclass, such as UnicodeEncodeError). Refer to Codec Base Classes for more information on codec error handling.
codecs.decode(obj, encoding='utf-8', errors='strict')
   Decodes obj using the codec registered for encoding.

   errors may be given to set the desired error handling scheme. The default error handler is 'strict', meaning that decoding errors raise ValueError (or a more codec specific subclass, such as UnicodeDecodeError). Refer to Codec Base Classes for more information on codec error handling.
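As a quick illustration (a minimal sketch, not part of the reference text), both functions work with text codecs as well as with bytes-to-bytes codecs such as zlib_codec:

   import codecs

   # Text codec: str -> bytes and back.
   data = codecs.encode("caf\u00e9", encoding="utf-8")
   assert codecs.decode(data, "utf-8", errors="strict") == "caf\u00e9"

   # Bytes-to-bytes codec: compression through the codec machinery.
   packed = codecs.encode(b"hello", "zlib_codec")
   assert codecs.decode(packed, "zlib_codec") == b"hello"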
The full details for each codec can also be looked up directly:

codecs.lookup(encoding)
   Looks up the codec info in the Python codec registry and returns a CodecInfo object as defined below.

   Encodings are first looked up in the registry's cache. If not found, the list of registered search functions is scanned. If no CodecInfo object is found, a LookupError is raised. Otherwise, the CodecInfo object is stored in the cache and returned to the caller.
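A short sketch of what a lookup returns; the attributes used here are described in the CodecInfo entry below:

   import codecs

   info = codecs.lookup("utf-8")
   print(info.name)                          # 'utf-8'
   encoded, consumed = info.encode("snowman: \u2603")
   print(encoded, consumed)                  # b'snowman: \xe2\x98\x83' 10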
class codecs.CodecInfo(encode, decode, streamreader=None, streamwriter=None, incrementalencoder=None, incrementaldecoder=None, name=None)
   Codec details when looking up the codec registry. The constructor arguments are stored in attributes of the same name:

   name
      The name of the encoding.

   encode
   decode
      The stateless encoding and decoding functions. These must be functions or methods which have the same interface as the encode() and decode() methods of Codec instances (see Codec Interface). The functions or methods are expected to work in a stateless mode.

   incrementalencoder
   incrementaldecoder
      Incremental encoder and decoder classes or factory functions. These have to provide the interface defined by the base classes IncrementalEncoder and IncrementalDecoder, respectively. Incremental codecs can maintain state.

   streamwriter
   streamreader
      Stream writer and reader classes or factory functions. These have to provide the interface defined by the base classes StreamWriter and StreamReader, respectively. Stream codecs can maintain state.
To simplify access to the various codec components, the module provides these additional functions which use lookup() for the codec lookup:
codecs.getencoder(encoding)
   Look up the codec for the given encoding and return its encoder function.

   Raises a LookupError in case the encoding cannot be found.

codecs.getdecoder(encoding)
   Look up the codec for the given encoding and return its decoder function.

   Raises a LookupError in case the encoding cannot be found.

codecs.getincrementalencoder(encoding)
   Look up the codec for the given encoding and return its incremental encoder class or factory function.

   Raises a LookupError in case the encoding cannot be found or the codec doesn't support an incremental encoder.

codecs.getincrementaldecoder(encoding)
   Look up the codec for the given encoding and return its incremental decoder class or factory function.

   Raises a LookupError in case the encoding cannot be found or the codec doesn't support an incremental decoder.

codecs.getreader(encoding)
   Look up the codec for the given encoding and return its StreamReader class or factory function.

   Raises a LookupError in case the encoding cannot be found.

codecs.getwriter(encoding)
   Look up the codec for the given encoding and return its StreamWriter class or factory function.

   Raises a LookupError in case the encoding cannot be found.
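For example (a minimal sketch using an in-memory stream), a writer factory returned by getwriter() can wrap any binary file-like object:

   import codecs
   import io

   buf = io.BytesIO()
   writer = codecs.getwriter("utf-8")(buf)    # StreamWriter around the byte buffer
   writer.write("gr\u00fc\u00df dich")
   print(buf.getvalue())                      # b'gr\xc3\xbc\xc3\x9f dich'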
Custom codecs are made available by registering a suitable codec search function:

codecs.register(search_function)
   Register a codec search function. Search functions are expected to take one argument, being the encoding name in all lower case letters with hyphens and spaces converted to underscores, and return a CodecInfo object. In case a search function cannot find a given encoding, it should return None.

   Changed in version 3.9: Hyphens and spaces are converted to underscore.

codecs.unregister(search_function)
   Unregister a codec search function and clear the registry's cache. If the search function is not registered, do nothing.

   New in version 3.10.
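A sketch of a search function; the alias name "myutf8" is made up for illustration and simply reuses the existing UTF-8 CodecInfo:

   import codecs

   def search_function(name):
       if name == "myutf8":          # names arrive lowercased, '-'/' ' become '_'
           return codecs.lookup("utf-8")
       return None                   # let other search functions try

   codecs.register(search_function)
   print(b"\xc3\xa9".decode("myutf8"))   # 'é'
   codecs.unregister(search_function)    # unregister() requires Python 3.10+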
While the builtin open() and the associated io module are the recommended approach for working with encoded text files, this module provides additional utility functions and classes that allow the use of a wider range of codecs when working with binary files:
codecs.open(filename, mode='r', encoding=None, errors='strict', buffering=-1)
   Open an encoded file using the given mode and return an instance of StreamReaderWriter, providing transparent encoding/decoding. The default file mode is 'r', meaning to open the file in read mode.

   Note: Underlying encoded files are always opened in binary mode. No automatic conversion of '\n' is done on reading and writing.

   The mode argument may be any binary mode acceptable to the built-in open() function; the 'b' is automatically added.

   encoding specifies the encoding which is to be used for the file. Any encoding that encodes to and decodes from bytes is allowed, and the data types supported by the file methods depend on the codec used.

   errors may be given to define the error handling. It defaults to 'strict' which causes a ValueError to be raised in case an encoding error occurs.

   buffering has the same meaning as for the built-in open() function. It defaults to -1 which means that the default buffer size will be used.
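A brief sketch (the file name 'demo-utf16.txt' is only an example); the returned object reads and writes str while the underlying file holds the encoded bytes:

   import codecs

   with codecs.open("demo-utf16.txt", mode="w", encoding="utf-16") as f:
       f.write("first line\n")

   with codecs.open("demo-utf16.txt", mode="r", encoding="utf-16") as f:
       print(f.read())    # 'first line\n'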
codecs.EncodedFile(file, data_encoding, file_encoding=None, errors='strict')
   Return a StreamRecoder instance, a wrapped version of file which provides transparent transcoding. The original file is closed when the wrapped version is closed.

   Data written to the wrapped file is decoded according to the given data_encoding and then written to the original file as bytes using file_encoding. Bytes read from the original file are decoded according to file_encoding, and the result is encoded using data_encoding.

   If file_encoding is not given, it defaults to data_encoding.

   errors may be given to define the error handling. It defaults to 'strict', which causes ValueError to be raised in case an encoding error occurs.
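For instance (a minimal sketch with an in-memory stream standing in for a real file), a Latin-1 backing stream can be presented to the program as UTF-8 data:

   import codecs
   import io

   backing = io.BytesIO("h\u00e4\u00dflich".encode("latin-1"))
   wrapped = codecs.EncodedFile(backing, data_encoding="utf-8", file_encoding="latin-1")
   print(wrapped.read())    # b'h\xc3\xa4\xc3\x9flich' (UTF-8 on the program side)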
codecs.iterencode(iterator, encoding, errors='strict', **kwargs)
   Uses an incremental encoder to iteratively encode the input provided by iterator. This function is a generator. The errors argument (as well as any other keyword argument) is passed through to the incremental encoder.

   This function requires that the codec accept text str objects to encode. Therefore it does not support bytes-to-bytes encoders such as base64_codec.
codecs.iterdecode(iterator, encoding, errors='strict', **kwargs)
   Uses an incremental decoder to iteratively decode the input provided by iterator. This function is a generator. The errors argument (as well as any other keyword argument) is passed through to the incremental decoder.

   This function requires that the codec accept bytes objects to decode. Therefore it does not support text-to-text encoders such as rot_13, although rot_13 may be used equivalently with iterencode().
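A round-trip sketch: chunks of text go in, encoded chunks come out, and iterdecode() reverses the process:

   import codecs

   chunks = ["\u00fcber", " ", "alles"]
   encoded = list(codecs.iterencode(chunks, "utf-8"))     # list of bytes objects
   decoded = "".join(codecs.iterdecode(encoded, "utf-8"))
   assert decoded == "\u00fcber alles"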
The module also provides the following constants which are useful for reading and writing to platform dependent files:

codecs.BOM
codecs.BOM_BE
codecs.BOM_LE
codecs.BOM_UTF8
codecs.BOM_UTF16
codecs.BOM_UTF16_BE
codecs.BOM_UTF16_LE
codecs.BOM_UTF32
codecs.BOM_UTF32_BE
codecs.BOM_UTF32_LE
   These constants define various byte sequences, being Unicode byte order marks (BOMs) for several encodings. They are used in UTF-16 and UTF-32 data streams to indicate the byte order used, and in UTF-8 as a Unicode signature. BOM_UTF16 is either BOM_UTF16_BE or BOM_UTF16_LE depending on the platform's native byte order, BOM is an alias for BOM_UTF16, BOM_LE for BOM_UTF16_LE and BOM_BE for BOM_UTF16_BE. The others represent the BOM in UTF-8 and UTF-32 encodings.
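As a small sketch, the constants can be used to detect the byte order of UTF-16 data produced with the platform's native order:

   import codecs

   raw = "hi".encode("utf-16")              # native order, BOM included
   if raw.startswith(codecs.BOM_UTF16_LE):
       print("little-endian data")
   elif raw.startswith(codecs.BOM_UTF16_BE):
       print("big-endian data")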
Codec Base Classes

The codecs module defines a set of base classes which define the interfaces for working with codec objects, and can also be used as the basis for custom codec implementations.

Each codec has to define four interfaces to make it usable as codec in Python: stateless encoder, stateless decoder, stream reader and stream writer. The stream reader and writers typically reuse the stateless encoder/decoder to implement the file protocols. Codec authors also need to define how the codec will handle encoding and decoding errors.
Error Handlers

To simplify and standardize error handling, codecs may implement different error handling schemes by accepting the errors string argument. The following string values are defined and implemented by all standard Python codecs:

Value | Meaning
---|---
'strict' | Raise UnicodeError (or a subclass); this is the default. Implemented in strict_errors().
'ignore' | Ignore the malformed data and continue without further notice. Implemented in ignore_errors().
'replace' | Replace with a suitable replacement marker; Python uses the official U+FFFD REPLACEMENT CHARACTER for the built-in codecs on decoding, and '?' on encoding. Implemented in replace_errors().
'backslashreplace' | Replace with backslashed escape sequences. Implemented in backslashreplace_errors().

The following error handlers are only applicable to text encodings:

Value | Meaning
---|---
'surrogateescape' | On decoding, replace each byte with an individual surrogate code ranging from U+DC80 to U+DCFF. This code is turned back into the same byte when the 'surrogateescape' error handler is used when encoding the data. (See PEP 383 for more.)
'xmlcharrefreplace' | Replace with the appropriate XML character reference (only for encoding). Implemented in xmlcharrefreplace_errors().
'namereplace' | Replace with \N{...} escape sequences (only for encoding). Implemented in namereplace_errors().

In addition, the following error handler is specific to the given codecs:

Value | Codecs | Meaning
---|---|---
'surrogatepass' | utf-8, utf-16, utf-32, utf-16-be, utf-16-le, utf-32-be, utf-32-le | Allow encoding and decoding of surrogate code points, which these codecs otherwise treat as an error.
New in version 3.1: The 'surrogateescape' and 'surrogatepass' error handlers.

Changed in version 3.4: The 'surrogatepass' error handler now works with utf-16* and utf-32* codecs.

New in version 3.5: The 'namereplace' error handler.

Changed in version 3.5: The 'backslashreplace' error handler now works with decoding and translating.
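A quick sketch of the difference between several handlers on an encoding error:

   s = "Spr\u00fcnge \u2603"
   print(s.encode("ascii", errors="replace"))            # b'Spr?nge ?'
   print(s.encode("ascii", errors="backslashreplace"))   # b'Spr\\xfcnge \\u2603'
   print(s.encode("ascii", errors="xmlcharrefreplace"))  # b'Spr&#252;nge &#9731;'
   print(s.encode("ascii", errors="ignore"))             # b'Sprnge '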
The set of allowed values can be extended by registering a new named error handler:

codecs.register_error(name, error_handler)
   Register the error handling function error_handler under the name name. The error_handler argument will be called during encoding and decoding in case of an error, when name is specified as the errors parameter.

   For encoding, error_handler will be called with a UnicodeEncodeError instance, which contains information about the location of the error. The error handler must either raise this or a different exception, or return a tuple with a replacement for the unencodable part of the input and a position where encoding should continue. The replacement may be either str or bytes. If the replacement is bytes, the encoder will simply copy them into the output buffer. If the replacement is a string, the encoder will encode the replacement. Encoding continues on original input at the specified position. Negative position values will be treated as being relative to the end of the input string. If the resulting position is out of bound an IndexError will be raised.

   Decoding and translating works similarly, except UnicodeDecodeError or UnicodeTranslateError will be passed to the handler and that the replacement from the error handler will be put into the output directly.
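A sketch of a custom handler; the handler name "codepoint" and the formatting it applies are made up for illustration:

   import codecs

   def codepoint_errors(exc):
       if isinstance(exc, UnicodeEncodeError):
           bad = exc.object[exc.start:exc.end]
           replacement = "".join("U+%04X" % ord(ch) for ch in bad)
           return replacement, exc.end          # (replacement, resume position)
       raise exc

   codecs.register_error("codepoint", codepoint_errors)
   print("x\u2603y".encode("ascii", errors="codepoint"))   # b'xU+2603y'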
Previously registered error handlers (including the standard error handlers) can be looked up by name:

codecs.lookup_error(name)
   Return the error handler previously registered under the name name.

   Raises a LookupError in case the handler cannot be found.
The following standard error handlers are also made available as module level functions:

codecs.strict_errors(exception)
   Implements the 'strict' error handling: each encoding or decoding error raises a UnicodeError.

codecs.replace_errors(exception)
   Implements the 'replace' error handling (for text encodings only): substitutes '?' for encoding errors (to be encoded by the codec), and '\ufffd' (the Unicode replacement character) for decoding errors.

codecs.ignore_errors(exception)
   Implements the 'ignore' error handling: malformed data is ignored and encoding or decoding is continued without further notice.

codecs.xmlcharrefreplace_errors(exception)
   Implements the 'xmlcharrefreplace' error handling (for encoding with text encodings only): the unencodable character is replaced by an appropriate XML character reference.

codecs.backslashreplace_errors(exception)
   Implements the 'backslashreplace' error handling (for text encodings only): malformed data is replaced by a backslashed escape sequence.

codecs.namereplace_errors(exception)
   Implements the 'namereplace' error handling (for encoding with text encodings only): the unencodable character is replaced by a \N{...} escape sequence.

   New in version 3.5.
Stateless Encoding and Decoding

The base Codec class defines these methods which also define the function interfaces of the stateless encoder and decoder:

Codec.encode(input[, errors])
   Encodes the object input and returns a tuple (output object, length consumed). For instance, text encoding converts a string object to a bytes object using a particular character set encoding (e.g., cp1252 or iso-8859-1).

   The errors argument defines the error handling to apply. It defaults to 'strict' handling.

   The method may not store state in the Codec instance. Use StreamWriter for codecs which have to keep state in order to make encoding efficient.

   The encoder must be able to handle zero length input and return an empty object of the output object type in this situation.

Codec.decode(input[, errors])
   Decodes the object input and returns a tuple (output object, length consumed). For instance, for a text encoding, decoding converts a bytes object encoded using a particular character set encoding to a string object.

   For text encodings and bytes-to-bytes codecs, input must be a bytes object or one which provides the read-only buffer interface, for example buffer objects and memory mapped files.

   The errors argument defines the error handling to apply. It defaults to 'strict' handling.

   The method may not store state in the Codec instance. Use StreamReader for codecs which have to keep state in order to make decoding efficient.

   The decoder must be able to handle zero length input and return an empty object of the output object type in this situation.
Incremental Encoding and Decoding

The IncrementalEncoder and IncrementalDecoder classes provide the basic interface for incremental encoding and decoding. Encoding/decoding the input isn't done with one call to the stateless encoder/decoder function, but with multiple calls to the encode()/decode() method of the incremental encoder/decoder. The incremental encoder/decoder keeps track of the encoding/decoding process during method calls.

The joined output of calls to the encode()/decode() method is the same as if all the single inputs were joined into one, and this input was encoded/decoded with the stateless encoder/decoder.
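As a sketch of the incremental interface, a UTF-8 decoder can be fed a byte stream in arbitrary chunks, including chunks that split a multi-byte sequence:

   import codecs

   decoder = codecs.getincrementaldecoder("utf-8")()
   parts = [b"sn", b"\xe2", b"\x98", b"\x83", b"man"]    # '\u2603' split over chunks
   text = "".join(decoder.decode(chunk) for chunk in parts)
   text += decoder.decode(b"", final=True)
   print(text)    # 'sn\u2603man'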
IncrementalEncoder Objects

The IncrementalEncoder class is used for encoding an input in multiple steps. It defines the following methods which every incremental encoder must define in order to be compatible with the Python codec registry.

class codecs.IncrementalEncoder(errors='strict')
   Constructor for an IncrementalEncoder instance.

   All incremental encoders must provide this constructor interface. They are free to add additional keyword arguments, but only the ones defined here are used by the Python codec registry.

   The IncrementalEncoder may implement different error handling schemes by providing the errors keyword argument. See Error Handlers for possible values.

   The errors argument will be assigned to an attribute of the same name. Assigning to this attribute makes it possible to switch between different error handling strategies during the lifetime of the IncrementalEncoder object.

   encode(object[, final])
      Encodes object (taking the current state of the encoder into account) and returns the resulting encoded object. If this is the last call to encode() final must be true (the default is false).

   reset()
      Reset the encoder to the initial state. The output is discarded: call .encode(object, final=True), passing an empty byte or text string if necessary, to reset the encoder and to get the output.

   getstate()
      Return the current state of the encoder which must be an integer. The implementation should make sure that 0 is the most common state. (States that are more complicated than integers can be converted into an integer by marshaling/pickling the state and encoding the bytes of the resulting string into an integer.)

   setstate(state)
      Set the state of the encoder to state. state must be an encoder state returned by getstate().
IncrementalDecoder Objects

The IncrementalDecoder class is used for decoding an input in multiple steps. It defines the following methods which every incremental decoder must define in order to be compatible with the Python codec registry.

class codecs.IncrementalDecoder(errors='strict')
   Constructor for an IncrementalDecoder instance.

   All incremental decoders must provide this constructor interface. They are free to add additional keyword arguments, but only the ones defined here are used by the Python codec registry.

   The IncrementalDecoder may implement different error handling schemes by providing the errors keyword argument. See Error Handlers for possible values.

   The errors argument will be assigned to an attribute of the same name. Assigning to this attribute makes it possible to switch between different error handling strategies during the lifetime of the IncrementalDecoder object.

   decode(object[, final])
      Decodes object (taking the current state of the decoder into account) and returns the resulting decoded object. If this is the last call to decode() final must be true (the default is false). If final is true the decoder must decode the input completely and must flush all buffers. If this isn't possible (e.g. because of incomplete byte sequences at the end of the input) it must initiate error handling just like in the stateless case (which might raise an exception).

   reset()
      Reset the decoder to the initial state.

   getstate()
      Return the current state of the decoder. This must be a tuple with two items, the first must be the buffer containing the still undecoded input. The second must be an integer and can be additional state info. (The implementation should make sure that 0 is the most common additional state info.) If this additional state info is 0 it must be possible to set the decoder to the state which has no input buffered and 0 as the additional state info, so that feeding the previously buffered input to the decoder returns it to the previous state without producing any output. (Additional state info that is more complicated than integers can be converted into an integer by marshaling/pickling the info and encoding the bytes of the resulting string into an integer.)

   setstate(state)
      Set the state of the decoder to state. state must be a decoder state returned by getstate().
Stream Encoding and Decoding

The StreamWriter and StreamReader classes provide generic working interfaces which can be used to implement new encoding submodules very easily. See encodings.utf_8 for an example of how this is done.
StreamWriter Objects

The StreamWriter class is a subclass of Codec and defines the following methods which every stream writer must define in order to be compatible with the Python codec registry.

class codecs.StreamWriter(stream, errors='strict')
   Constructor for a StreamWriter instance.

   All stream writers must provide this constructor interface. They are free to add additional keyword arguments, but only the ones defined here are used by the Python codec registry.

   The stream argument must be a file-like object open for writing text or binary data, as appropriate for the specific codec.

   The StreamWriter may implement different error handling schemes by providing the errors keyword argument. See Error Handlers for the standard error handlers the underlying stream codec may support.

   The errors argument will be assigned to an attribute of the same name. Assigning to this attribute makes it possible to switch between different error handling strategies during the lifetime of the StreamWriter object.

   write(object)
      Writes the object's contents encoded to the stream.

   writelines(list)
      Writes the concatenated iterable of strings to the stream (possibly by reusing the write() method). Infinite or very large iterables are not supported. The standard bytes-to-bytes codecs do not support this method.

   reset()
      Resets the codec buffers used for keeping internal state.

      Calling this method should ensure that the data on the output is put into a clean state that allows appending of new fresh data without having to rescan the whole stream to recover state.

In addition to the above methods, the StreamWriter must also inherit all other methods and attributes from the underlying stream.
StreamReader Objects

The StreamReader class is a subclass of Codec and defines the following methods which every stream reader must define in order to be compatible with the Python codec registry.

class codecs.StreamReader(stream, errors='strict')
   Constructor for a StreamReader instance.

   All stream readers must provide this constructor interface. They are free to add additional keyword arguments, but only the ones defined here are used by the Python codec registry.

   The stream argument must be a file-like object open for reading text or binary data, as appropriate for the specific codec.

   The StreamReader may implement different error handling schemes by providing the errors keyword argument. See Error Handlers for the standard error handlers the underlying stream codec may support.

   The errors argument will be assigned to an attribute of the same name. Assigning to this attribute makes it possible to switch between different error handling strategies during the lifetime of the StreamReader object.

   The set of allowed values for the errors argument can be extended with register_error().

   read([size[, chars[, firstline]]])
      Decodes data from the stream and returns the resulting object.

      The chars argument indicates the number of decoded code points or bytes to return. The read() method will never return more data than requested, but it might return less, if there is not enough available.

      The size argument indicates the approximate maximum number of encoded bytes or code points to read for decoding. The decoder can modify this setting as appropriate. The default value -1 indicates to read and decode as much as possible. This parameter is intended to prevent having to decode huge files in one step.

      The firstline flag indicates that it would be sufficient to only return the first line, if there are decoding errors on later lines.

      The method should use a greedy read strategy, meaning that it should read as much data as is allowed within the definition of the encoding and the given size, e.g. if optional encoding endings or state markers are available on the stream, these should be read too.

   readline([size[, keepends]])
      Read one line from the input stream and return the decoded data.

      size, if given, is passed as size argument to the stream's read() method.

      If keepends is false line-endings will be stripped from the lines returned.

   readlines([sizehint[, keepends]])
      Read all lines available on the input stream and return them as a list of lines.

      Line-endings are implemented using the codec's decode() method and are included in the list entries if keepends is true.

      sizehint, if given, is passed as the size argument to the stream's read() method.

   reset()
      Resets the codec buffers used for keeping internal state.

      Note that no stream repositioning should take place. This method is primarily intended to be able to recover from decoding errors.

In addition to the above methods, the StreamReader must also inherit all other methods and attributes from the underlying stream.
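A brief sketch pairing getreader() with an in-memory byte stream:

   import codecs
   import io

   raw = io.BytesIO("zeile eins\nzeile zwei\n".encode("utf-8"))
   reader = codecs.getreader("utf-8")(raw)
   print(reader.readline())    # 'zeile eins\n'
   print(reader.read())        # 'zeile zwei\n'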
StreamReaderWriter Objects

The StreamReaderWriter is a convenience class that allows wrapping streams which work in both read and write modes.

The design is such that one can use the factory functions returned by the lookup() function to construct the instance.

class codecs.StreamReaderWriter(stream, Reader, Writer, errors='strict')
   Creates a StreamReaderWriter instance. stream must be a file-like object. Reader and Writer must be factory functions or classes providing the StreamReader and StreamWriter interface resp. Error handling is done in the same way as defined for the stream readers and writers.

StreamReaderWriter instances define the combined interfaces of StreamReader and StreamWriter classes. They inherit all other methods and attributes from the underlying stream.
StreamRecoder Objects

The StreamRecoder translates data from one encoding to another, which is sometimes useful when dealing with different encoding environments.

The design is such that one can use the factory functions returned by the lookup() function to construct the instance.

class codecs.StreamRecoder(stream, encode, decode, Reader, Writer, errors='strict')
   Creates a StreamRecoder instance which implements a two-way conversion: encode and decode work on the frontend (the data visible to code calling read() and write()), while Reader and Writer work on the backend (the data in stream).

   You can use these objects to do transparent transcodings, e.g., from Latin-1 to UTF-8 and back.

   The stream argument must be a file-like object.

   The encode and decode arguments must adhere to the Codec interface. Reader and Writer must be factory functions or classes providing objects of the StreamReader and StreamWriter interface respectively.

   Error handling is done in the same way as defined for the stream readers and writers.

StreamRecoder instances define the combined interfaces of StreamReader and StreamWriter classes. They inherit all other methods and attributes from the underlying stream.
Encodings and Unicode

Strings are stored internally as sequences of code points in range 0x0-0x10FFFF. (See PEP 393 for more details about the implementation.) Once a string object is used outside of CPU and memory, endianness and how these arrays are stored as bytes become an issue. As with other codecs, serialising a string into a sequence of bytes is known as encoding, and recreating the string from the sequence of bytes is known as decoding. There are a variety of different text serialisation codecs, which are collectively referred to as text encodings.
The simplest text encoding (called 'latin-1' or 'iso-8859-1') maps the code points 0-255 to the bytes 0x0-0xff, which means that a string object that contains code points above U+00FF can't be encoded with this codec. Doing so will raise a UnicodeEncodeError that looks like the following (although the details of the error message may differ): UnicodeEncodeError: 'latin-1' codec can't encode character '\u1234' in position 3: ordinal not in range(256).
There's another group of encodings (the so called charmap encodings) that choose a different subset of all Unicode code points and how these code points are mapped to the bytes 0x0-0xff. To see how this is done simply open e.g. encodings/cp1252.py (which is an encoding that is used primarily on Windows). There's a string constant with 256 characters that shows you which character is mapped to which byte value.
All of these encodings can only encode 256 of the 1114112 code points defined in Unicode. A simple and straightforward way that can store each Unicode code point, is to store each code point as four consecutive bytes. There are two possibilities: store the bytes in big endian or in little endian order. These two encodings are called UTF-32-BE and UTF-32-LE respectively. Their disadvantage is that if e.g. you use UTF-32-BE on a little endian machine you will always have to swap bytes on encoding and decoding. UTF-32 avoids this problem: bytes will always be in natural endianness. When these bytes are read by a CPU with a different endianness, then bytes have to be swapped though. To be able to detect the endianness of a UTF-16 or UTF-32 byte sequence, there's the so called BOM ("Byte Order Mark"). This is the Unicode character U+FEFF. This character can be prepended to every UTF-16 or UTF-32 byte sequence. The byte swapped version of this character (0xFFFE) is an illegal character that may not appear in a Unicode text. So when the first character in a UTF-16 or UTF-32 byte sequence appears to be a U+FFFE the bytes have to be swapped on decoding. Unfortunately the character U+FEFF had a second purpose as a ZERO WIDTH NO-BREAK SPACE: a character that has no width and doesn't allow a word to be split. It can e.g. be used to give hints to a ligature algorithm. With Unicode 4.0 using U+FEFF as a ZERO WIDTH NO-BREAK SPACE has been deprecated (with U+2060 (WORD JOINER) assuming this role). Nevertheless Unicode software still must be able to handle U+FEFF in both roles: as a BOM it's a device to determine the storage layout of the encoded bytes, and vanishes once the byte sequence has been decoded into a string; as a ZERO WIDTH NO-BREAK SPACE it's a normal character that will be decoded like any other.
There's another encoding that is able to encode the full range of Unicode characters: UTF-8. UTF-8 is an 8-bit encoding, which means there are no issues with byte order in UTF-8. Each byte in a UTF-8 byte sequence consists of two parts: marker bits (the most significant bits) and payload bits. The marker bits are a sequence of zero to four 1 bits followed by a 0 bit. Unicode characters are encoded like this (with x being payload bits, which when concatenated give the Unicode character):

Range | Encoding
---|---
U-00000000 ... U-0000007F | 0xxxxxxx
U-00000080 ... U-000007FF | 110xxxxx 10xxxxxx
U-00000800 ... U-0000FFFF | 1110xxxx 10xxxxxx 10xxxxxx
U-00010000 ... U-0010FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

The least significant bit of the Unicode character is the rightmost x bit.
As UTF-8 is an 8-bit encoding no BOM is required and any U+FEFF character in the decoded string (even if it's the first character) is treated as a ZERO WIDTH NO-BREAK SPACE.

Without external information it's impossible to reliably determine which encoding was used for encoding a string. Each charmap encoding can decode any random byte sequence. However that's not possible with UTF-8, as UTF-8 byte sequences have a structure that doesn't allow arbitrary byte sequences. To increase the reliability with which a UTF-8 encoding can be detected, Microsoft invented a variant of UTF-8 (that Python 2.5 calls "utf-8-sig") for its Notepad program: Before any of the Unicode characters is written to the file, a UTF-8 encoded BOM (which looks like this as a byte sequence: 0xef, 0xbb, 0xbf) is written. As it's rather improbable that any charmap encoded file starts with these byte values (which would e.g. map to

   LATIN SMALL LETTER I WITH DIAERESIS
   RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK
   INVERTED QUESTION MARK

in iso-8859-1), this increases the probability that a utf-8-sig encoding can be correctly guessed from the byte sequence. So here the BOM is not used to be able to determine the byte order used for generating the byte sequence, but as a signature that helps in guessing the encoding. On encoding the utf-8-sig codec will write 0xef, 0xbb, 0xbf as the first three bytes to the file. On decoding utf-8-sig will skip those three bytes if they appear as the first three bytes in the file. In UTF-8, the use of the BOM is discouraged and should generally be avoided.
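A small sketch of the signature variant alongside plain UTF-8:

   data = "text".encode("utf-8-sig")
   print(data)                        # b'\xef\xbb\xbftext'
   print(data.decode("utf-8-sig"))    # 'text'         (BOM stripped)
   print(data.decode("utf-8"))        # '\ufefftext'   (BOM kept as U+FEFF)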
Standard Encodings

Python comes with a number of codecs built-in, either implemented as C functions or with dictionaries as mapping tables. The following table lists the codecs by name, together with a few common aliases, and the languages for which the encoding is likely used. Neither the list of aliases nor the list of languages is meant to be exhaustive. Notice that spelling alternatives that only differ in case or use a hyphen instead of an underscore are also valid aliases; therefore, e.g. 'utf-8' is a valid alias for the 'utf_8' codec.

CPython implementation detail: Some common encodings can bypass the codecs lookup machinery to improve performance. These optimization opportunities are only recognized by CPython for a limited set of (case insensitive) aliases: utf-8, utf8, latin-1, latin1, iso-8859-1, iso8859-1, mbcs (Windows only), ascii, us-ascii, utf-16, utf16, utf-32, utf32, and the same using underscores instead of dashes. Using alternative aliases for these encodings may result in slower execution.

Changed in version 3.6: Optimization opportunity recognized for us-ascii.

Many of the character sets support the same languages. They vary in individual characters (e.g. whether the EURO SIGN is supported or not), and in the assignment of characters to code positions. For the European languages in particular, the following variants typically exist:

- an ISO 8859 codeset
- a Microsoft Windows code page, which is typically derived from an 8859 codeset, but replaces control characters with additional graphic characters
- an IBM EBCDIC code page
- an IBM PC code page, which is ASCII compatible
Codec | Aliases | Languages
---|---|---
ascii | 646, us-ascii |
big5 | big5-tw, csbig5 |
big5hkscs | big5-hkscs, hkscs |
cp037 | IBM037, IBM039 |
cp273 | 273, IBM273, csIBM273 |
cp424 | EBCDIC-CP-HE, IBM424 |
cp437 | 437, IBM437 |
cp500 | EBCDIC-CP-BE, EBCDIC-CP-CH, IBM500 |
cp720 | |
cp737 | |
cp775 | IBM775 |
cp850 | 850, IBM850 |
cp852 | 852, IBM852 |
cp855 | 855, IBM855 |
cp856 | |
cp857 | 857, IBM857 |
cp858 | 858, IBM858 |
cp860 | 860, IBM860 |
cp861 | 861, CP-IS, IBM861 |
cp862 | 862, IBM862 |
cp863 | 863, IBM863 |
cp864 | IBM864 |
cp865 | 865, IBM865 |
cp866 | 866, IBM866 |
cp869 | 869, CP-GR, IBM869 |
cp874 | |
cp875 | |
cp932 | 932, ms932, mskanji, ms-kanji |
cp949 | 949, ms949, uhc |
cp950 | 950, ms950 |
cp1006 | |
cp1026 | ibm1026 |
cp1125 | 1125, ibm1125, cp866u, ruscii |
cp1140 | ibm1140 |
cp1250 | windows-1250 |
cp1251 | windows-1251 |
cp1252 | windows-1252 |
cp1253 | windows-1253 |
cp1254 | windows-1254 |
cp1255 | windows-1255 |
cp1256 | windows-1256 |
cp1257 | windows-1257 |
cp1258 | windows-1258 |
euc_jp | eucjp, ujis, u-jis |
euc_jis_2004 | jisx0213, eucjis2004 |
euc_jisx0213 | eucjisx0213 |
euc_kr | euckr, korean, ksc5601, ks_c-5601, ks_c-5601-1987, ksx1001, ks_x-1001 |
gb2312 | chinese, csiso58gb231280, euc-cn, euccn, eucgb2312-cn, gb2312-1980, gb2312-80, iso-ir-58 |
gbk | 936, cp936, ms936 |
gb18030 | gb18030-2000 |
hz | hzgb, hz-gb, hz-gb-2312 |
iso2022_jp | csiso2022jp, iso2022jp, iso-2022-jp |
iso2022_jp_1 | iso2022jp-1, iso-2022-jp-1 |
iso2022_jp_2 | iso2022jp-2, iso-2022-jp-2 |
iso2022_jp_2004 | iso2022jp-2004, iso-2022-jp-2004 |
iso2022_jp_3 | iso2022jp-3, iso-2022-jp-3 |
iso2022_jp_ext | iso2022jp-ext, iso-2022-jp-ext |
iso2022_kr | csiso2022kr, iso2022kr, iso-2022-kr |
latin_1 | iso-8859-1, iso8859-1, 8859, cp819, latin, latin1, L1 |
iso8859_2 | iso-8859-2, latin2, L2 |
iso8859_3 | iso-8859-3, latin3, L3 |
iso8859_4 | iso-8859-4, latin4, L4 |
iso8859_5 | iso-8859-5, cyrillic |
iso8859_6 | iso-8859-6, arabic |
iso8859_7 | iso-8859-7, greek, greek8 |
iso8859_8 | iso-8859-8, hebrew | Hebrew
iso8859_9 | iso-8859-9, latin5, L5 | Turkish
iso8859_10 | iso-8859-10, latin6, L6 | Nordic languages
iso8859_11 | iso-8859-11, thai | Thai languages
iso8859_13 | iso-8859-13, latin7, L7 | Baltic languages
iso8859_14 | iso-8859-14, latin8, L8 | Celtic languages
iso8859_15 | iso-8859-15, latin9, L9 | Western Europe
iso8859_16 | iso-8859-16, latin10, L10 | South-Eastern Europe
johab | cp1361, ms1361 | Korean
koi8_r | | Russian
koi8_t | | Tajik
koi8_u | | Ukrainian
kz1048 | kz_1048, strk1048_2002, rk1048 | Kazakh
mac_cyrillic | maccyrillic |
mac_greek | macgreek | Greek
mac_iceland | maciceland | Icelandic
mac_latin2 | maclatin2, maccentraleurope, mac_centeuro | Central and Eastern Europe
mac_roman | macroman, macintosh | Western Europe
mac_turkish | macturkish | Turkish
ptcp154 | csptcp154, pt154, cp154, cyrillic-asian | Kazakh
shift_jis | csshiftjis, shiftjis, sjis, s_jis | Japanese
shift_jis_2004 | shiftjis2004, sjis_2004, sjis2004 | Japanese
shift_jisx0213 | shiftjisx0213, sjisx0213, s_jisx0213 | Japanese
utf_32 | U32, utf32 | all languages
utf_32_be | UTF-32BE | all languages
utf_32_le | UTF-32LE | all languages
utf_16 | U16, utf16 | all languages
utf_16_be | UTF-16BE | all languages
utf_16_le | UTF-16LE | all languages
utf_7 | U7, unicode-1-1-utf-7 | all languages
utf_8 | U8, UTF, utf8, cp65001 | all languages
utf_8_sig | | all languages
Changed in version 3.4: The utf-16* and utf-32* encoders no longer allow surrogate code points (U+D800-U+DFFF) to be encoded. The utf-32* decoders no longer decode byte sequences that correspond to surrogate code points.

Changed in version 3.8: cp65001 is now an alias to utf_8.
Python Specific Encodings

A number of predefined codecs are specific to Python, so their codec names have no meaning outside Python. These are listed in the tables below based on the expected input and output types (note that while text encodings are the most common use case for codecs, the underlying codec infrastructure supports arbitrary data transforms rather than just text encodings). For asymmetric codecs, the stated meaning describes the encoding direction.

Text Encodings

The following codecs provide str to bytes encoding and bytes-like object to str decoding, similar to the Unicode text encodings.

Codec | Aliases | Meaning
---|---|---
idna | |
mbcs | ansi, dbcs |
oem | |
palmos | |
punycode | |
raw_unicode_escape | |
undefined | |
unicode_escape | |

Changed in version 3.8: "unicode_internal" codec is removed.
Binary Transforms

The following codecs provide binary transforms: bytes-like object to bytes mappings. They are not supported by bytes.decode() (which only produces str output).

Codec | Aliases | Meaning | Encoder / decoder
---|---|---|---
base64_codec [1] | base64, base_64 | |
bz2_codec | bz2 | |
hex_codec | hex | |
quopri_codec | quopri, quotedprintable, quoted_printable | |
uu_codec | uu | |
zlib_codec | zip, zlib | |

[1] In addition to bytes-like objects, 'base64_codec' also accepts ASCII-only instances of str for decoding.

New in version 3.2: Restoration of the binary transforms.

Changed in version 3.4: Restoration of the aliases for the binary transforms.
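For example, a sketch of round-tripping bytes through one of these transforms via codecs.encode()/codecs.decode():

   import codecs

   payload = b"binary\x00data"
   b64 = codecs.encode(payload, "base64_codec")
   print(b64)                                  # b'YmluYXJ5AGRhdGE=\n'
   print(codecs.decode(b64, "base64_codec"))   # b'binary\x00data'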
Text Transforms

The following codec provides a text transform: a str to str mapping. It is not supported by str.encode() (which only produces bytes output).

Codec | Aliases | Meaning
---|---|---
rot_13 | rot13 | Return the Caesar-cypher encryption of the operand.

New in version 3.2: Restoration of the rot_13 text transform.

Changed in version 3.4: Restoration of the rot13 alias.
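A quick sketch; note that with a text transform both directions stay str:

   import codecs

   print(codecs.encode("Hello, World!", "rot_13"))   # 'Uryyb, Jbeyq!'
   print(codecs.decode("Uryyb, Jbeyq!", "rot_13"))   # 'Hello, World!'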
encodings.idna — Internationalized Domain Names in Applications

This module implements RFC 3490 (Internationalized Domain Names in Applications) and RFC 3492 (Nameprep: A Stringprep Profile for Internationalized Domain Names (IDN)). It builds upon the punycode encoding and stringprep.

If you need the IDNA 2008 standard from RFC 5891 and RFC 5895, use the third-party idna module.

These RFCs together define a protocol to support non-ASCII characters in domain names. A domain name containing non-ASCII characters (such as www.Alliancefrançaise.nu) is converted into an ASCII-compatible encoding (ACE, such as www.xn--alliancefranaise-npb.nu). The ACE form of the domain name is then used in all places where arbitrary characters are not allowed by the protocol, such as DNS queries, HTTP Host fields, and so on. This conversion is carried out in the application; if possible invisible to the user: The application should transparently convert Unicode domain labels to IDNA on the wire, and convert back ACE labels to Unicode before presenting them to the user.

Python supports this conversion in several ways: the idna codec performs conversion between Unicode and ACE, separating an input string into labels based on the separator characters defined in section 3.1 of RFC 3490 and converting each label to ACE as required, and conversely separating an input byte string into labels based on the . separator and converting any ACE labels found into unicode. Furthermore, the socket module transparently converts Unicode host names to ACE, so that applications need not be concerned about converting host names themselves when they pass them to the socket module. On top of that, modules that have host names as function parameters, such as http.client and ftplib, accept Unicode host names (http.client then also transparently sends an IDNA hostname in the Host field if it sends that field at all).

When receiving host names from the wire (such as in reverse name lookup), no automatic conversion to Unicode is performed: applications wishing to present such host names to the user should decode them to Unicode.

The module encodings.idna also implements the nameprep procedure, which performs certain normalizations on host names, to achieve case-insensitivity of international domain names, and to unify similar characters. The nameprep functions can be used directly if desired.

encodings.idna.nameprep(label)
   Return the nameprepped version of label. The implementation currently assumes query strings, so AllowUnassigned is true.
encodings.mbcs — Windows ANSI codepage

This module implements the ANSI codepage (CP_ACP).

Availability: Windows only.

Changed in version 3.3: Support any error handler.

Changed in version 3.2: Before 3.2, the errors argument was ignored; 'replace' was always used to encode, and 'ignore' to decode.
encodings.utf_8_sig — UTF-8 codec with BOM signature

This module implements a variant of the UTF-8 codec. On encoding, a UTF-8 encoded BOM will be prepended to the UTF-8 encoded bytes. For the stateful encoder this is only done once (on the first write to the byte stream). On decoding, an optional UTF-8 encoded BOM at the start of the data will be skipped.