codecs

Convert a string to and from various encodings.

The basic supported encodings are roughly as specified in the WHATWG Encoding Standard, but more are also supported unless restriction to web encodings is explicitly specified.

Most encodings supported by Python are implemented, but idna and punycode currently are not. Note, however, that Python makes x-mac-japanese an alias of shift_jis; this has not been done here. Also note that the way encoding names are associated with variants differs somewhat from Python's, partly because this package follows WHATWG: this affects most CJK codecs (e.g. Python treats shift_jis and ms-kanji differently, while this package does not), but also e.g. "ISO-8859-1".
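
The Python side of this comparison can be checked with Python's own standard library. The sketch below is Python, not this package's API; it only illustrates the Python behaviour being contrasted here, namely that Python resolves shift_jis and ms-kanji to different codecs and keeps iso-8859-1 separate from windows-1252.

```python
import codecs

# Python resolves the two labels to two different codecs.
sjis_name = codecs.lookup("shift_jis").name     # Python's plain Shift JIS codec
mskanji_name = codecs.lookup("ms-kanji").name   # alias of Python's cp932 codec

# The two Python codecs disagree on some byte sequences, e.g. 0x81 0x60.
wave_sjis = b"\x81\x60".decode("shift_jis")     # U+301C WAVE DASH
wave_cp932 = b"\x81\x60".decode("cp932")        # U+FF5E FULLWIDTH TILDE

# Python also keeps iso-8859-1 distinct from windows-1252:
# byte 0x80 is a C1 control in the former and the euro sign in the latter.
c1_control = b"\x80".decode("iso-8859-1")
euro_sign = b"\x80".decode("windows-1252")

print(sjis_name, mskanji_name)
print(hex(ord(wave_sjis)), hex(ord(wave_cp932)))
print(hex(ord(c1_control)), hex(ord(euro_sign)))
```

Under this package, by contrast, each pair of labels is described above as sharing one mapping.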

Main entry points for the package are codecs.infrastructure.encode, codecs.infrastructure.decode and codecs.infrastructure.lookup, all three of which are also available as e.g. codecs.encode for convenience.

The list of codecs (not an exhaustive list of labels, nor close to one) is as follows.

Single-byte extended ASCII encodings:

Major label(s) Meaning
cp437 8-bit United States (DOS)
cp720 8-bit Arabic Letters and Box Drawing (DOS)
cp737 8-bit Greek and Box Drawing (DOS)
cp775 8-bit Baltic Rim (DOS)
cp850 8-bit Western Europe and Canada (DOS)
cp852 8-bit Central European (DOS)
cp855 8-bit Balkan Cyrillic (DOS)
cp856 8-bit Hebrew (DOS)
cp857 8-bit Turkish (DOS)
cp858 8-bit Western Europe and Canada with Euro (DOS)
cp860 8-bit European Portuguese (DOS)
cp861 8-bit Icelandic (DOS)
cp862 8-bit Hebrew and Box Drawing (DOS)
cp863 8-bit Quebecois French (DOS)
cp864 8-bit Arabic Positional Forms (DOS)
cp865 8-bit Continental Nordic (DOS)
cp866, ibm866 8-bit Russian Cyrillic (DOS)
cp869 8-bit Greek (DOS)
cp1006 8-bit Urdu
cp1125 8-bit Ukrainian Cyrillic (DOS)
ecma-43-dv, cp367, csascii "8-bit Plain ASCII", i.e. ASCII without backspace composition, and with high bit unused. Note: most ASCII labels are mapped to Windows-1252, per WHATWG.
hp-roman8 8-bit Roman (HP)
iso-8859-2 8-bit Central European (ISO)
iso-8859-3 8-bit South European (Maltese/Esperanto)
iso-8859-4 8-bit North European
iso-8859-5 8-bit Cyrillic (ISO)
iso-8859-6 8-bit Arabic (ASMO/ISO)
iso-8859-7 8-bit Greek (ISO)
iso-8859-8, iso-8859-8-i 8-bit Hebrew (without vowel points). Although some, but not all, of the labels using this mapping request legacy visual-order behaviour (e.g. iso-8859-8, iso-8859-8-e or even visual, but not e.g. iso-8859-8-i), bidirectional conversion for any given markup format is beyond the scope of this package: determining from the label whether legacy visual-order behaviour should be used, and responding if so, should be implemented separately if needed.
iso-8859-10 8-bit Nordic
iso-8859-13 8-bit Baltic Rim (ISO)
iso-8859-14 8-bit Celtic
iso-8859-15 8-bit New Western European
iso-8859-16 8-bit South-Eastern European (ISO)
koi8-r 8-bit Russian Cyrillic (KOI8)
koi8-u, koi8-ru 8-bit Ruthenian/Ukrainian/Belarusian Cyrillic (KOI8)
koi8-t 8-bit Tajik Cyrillic
kz1048 8-bit Kazakh Cyrillic
macintosh 8-bit Roman (Macintosh)
palmos PalmOS code page
ptcp154 8-bit Asian Cyrillic (Paratype)
windows-874, iso-8859-11, tis-620, cp874 8-bit Thai
windows-1250 8-bit Central European (Windows)
windows-1251 8-bit Cyrillic (Windows)
windows-1252, ascii, iso-8859-1, latin1 8-bit Western European. This follows the WHATWG specification as to which mappings are associated with which labels. Note: Python's latin1 is sometimes used to round-trip arbitrary sensu stricto extended ASCII data; in Kuroko, it is better to use x-user-defined for that.
windows-1253 8-bit Greek (Windows)
windows-1254, iso-8859-9 8-bit Turkish
windows-1255 8-bit Hebrew (logical with vowel points)
windows-1256 8-bit Arabic (Windows)
windows-1257 8-bit Baltic Rim (Windows)
windows-1258 8-bit Vietnamese (Windows). Basic codec: encoder will accept text in the form generated by the decoder, but neither NFC nor NFD normalised forms. This follows both Python and WHATWG behaviour. Conversion of text in NFC or NFD forms to encodable form may need to be done in a separate step before using the encoder.
x-mac-arabic 8-bit Arabic (Macintosh)
x-mac-ce 8-bit Central European (Macintosh)
x-mac-croatian 8-bit Gajica
x-mac-cyrillic 8-bit Cyrillic (Macintosh)
x-mac-farsi 8-bit Persian (Macintosh)
x-mac-greek 8-bit Greek (Macintosh)
x-mac-icelandic 8-bit Icelandic (Macintosh)
x-mac-romanian 8-bit Romanian (Macintosh)
x-mac-turkish 8-bit Turkish (Macintosh)
x-user-defined 8-bit User Defined (ASCII based variant: using U+0000–007F, U+F780–F7FF)
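
The windows-1258 note in the table above says the behaviour follows Python's, so it can be illustrated with Python's own cp1258 codec. This is a sketch using Python's standard library rather than this package; the helper name `encodable` is made up for the example.

```python
import unicodedata

raw = b"\xe2\xec"                # a-circumflex, then combining acute accent
decoded = raw.decode("cp1258")   # U+00E2 U+0301: neither NFC nor NFD

def encodable(text):
    """Return True if Python's cp1258 encoder accepts the text."""
    try:
        text.encode("cp1258")
        return True
    except UnicodeEncodeError:
        return False

nfc = unicodedata.normalize("NFC", decoded)   # U+1EA5, a single precomposed letter
nfd = unicodedata.normalize("NFD", decoded)   # U+0061 U+0302 U+0301

# Only the form produced by the decoder can be encoded again.
results = (encodable(decoded), encodable(nfc), encodable(nfd))
print(results)
```

Text arriving in NFC or NFD therefore has to be converted to the decoder's mixed form in a separate step before encoding.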

Single-byte symbol or dingbat font encodings:

Major label(s) Meaning
cp042 8-bit User Defined (variant using U+0000–001F, U+F020–F0FF). Windows uses that mapping for symbol fonts in some contexts.

8-bit multi-byte Unicode codecs:

Major label(s) Meaning
cesu-8, utf8mb3, utf8-ucs2 CESU-8 (to UTF-16 as UTF-8 is to UTF-32). Mostly for interoperability with existing systems that use it.
gb18030 Chinese GB18030, WHATWG version. Not technically a full UTF in this implementation, since one PUA character is changed to an ideographic space per WHATWG.
utf-8, utf8mb4, utf8-ucs4 UTF-8 without a byte order mark
utf-8-sig UTF-8 with a byte order mark
utf-16 UTF-16 with byte order mark, little endian if missing
utf-16be UTF-16, big endian, no byte order mark
utf-16le UTF-16, little endian, no byte order mark
utf-32 UTF-32 with byte order mark (though byte order can usually also be detected in its absence)
utf-32be UTF-32, big endian, no byte order mark
utf-32le UTF-32, little endian, no byte order mark
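
The description of CESU-8 as "to UTF-16 as UTF-8 is to UTF-32" can be made concrete with a small sketch. Python has no built-in CESU-8 codec, so the function below (`cesu8_encode`, a name invented for this example) is a minimal illustration of the scheme and not this package's implementation: each UTF-16 code unit, including each half of a surrogate pair, is encoded as if it were a code point in UTF-8.

```python
def cesu8_encode(text):
    """Encode text as CESU-8: UTF-8 applied to UTF-16 code units."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp >= 0x10000:
            # Split into a UTF-16 surrogate pair, then encode each half
            # as a three-byte UTF-8 sequence.
            cp -= 0x10000
            high = 0xD800 | (cp >> 10)
            low = 0xDC00 | (cp & 0x3FF)
            for unit in (high, low):
                out += chr(unit).encode("utf-8", "surrogatepass")
        else:
            # Code points in the Basic Multilingual Plane match UTF-8.
            out += ch.encode("utf-8", "surrogatepass")
    return bytes(out)

emoji = "\U0001F600"
print(emoji.encode("utf-8").hex())   # four-byte UTF-8 form
print(cesu8_encode(emoji).hex())     # two three-byte sequences
```

For text within the Basic Multilingual Plane the two encodings produce identical bytes; they differ only for supplementary characters.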

8-bit multi-byte legacy CJK codecs:

Major label(s) Meaning
big5, big5-eten Traditional Chinese Big-5, ETen version, condoning HKSCS extensions when decoding.
big5-hkscs Traditional Chinese Big-5 with HKSCS extensions in both directions.
big5-nonetenkana, big5-tw Traditional Chinese Big-5, with BIG5.TXT (non-ETen) layout for kana and Cyrillic.
euc-jp, x-euc-jp Japanese EUC-JP, with Microsoft extensions, permitting JIS X 0212 only when decoding.
euc-jp-full Japanese EUC-JP, with Microsoft extensions, permitting JIS X 0212 in both directions.
euc-jisx0213, euc-jis-2004 Japanese EUC-JP, with JIS X 0213 mappings and extensions.
euc-kr, uhc, windows-949 Korean Unified Hangul Code (superset of EUC-KR, encodes KS C 5601).
gbk, gb2312 Chinese GBK (GB2312 extension), condoning GB18030 when decoding.
johab, johab-ascii Korean Johab (ASCII-compatible stateless standard version)
shift_jis, ms-kanji, windows-31j Japanese Shift JIS (Windows compatible version)
shift-jisx0213, shift-jis-2004 Japanese Shift JIS (JIS X 0213 version)
x-mac-chinesesimp Simplified Chinese GB2312, Macintosh version
x-mac-chinesetrad Traditional Chinese Big5, Macintosh version
x-mac-korean Korean HangulTalk (Macintosh encoding, another superset of EUC-KR)

7-bit stateful codecs:

Major label(s) Meaning
hz-gb-2312 HZ (Usenet Simplified Chinese) encoding
iso-2022-cn 7-bit stateful Chinese (Simplified and Traditional)
iso-2022-jp 7-bit stateful Japanese, web version
iso-2022-jp-ext 7-bit stateful Japanese, including JIS X 0212 and preserving katakana width
iso-2022-jp-1 7-bit stateful Japanese, including JIS X 0212
iso-2022-jp-2 7-bit stateful Multilingual (Japanese, Korean, Greek, Simplified Chinese, Western European)
iso-2022-jp-3 7-bit stateful Japanese, including JIS X 0213 (2000 edition format)
iso-2022-jp-2004 7-bit stateful Japanese, including JIS X 0213 (2004 edition format)
iso-2022-kr 7-bit stateful Korean
jis_encoding 7-bit stateful Japanese, comprehensive version
utf-7 A largely obsolete scheme for mixing ASCII and Base64'd UTF-16BE in e-mail. Included mostly for Python parity.
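
What "7-bit stateful" means in practice can be seen with Python's standard codecs for two of the labels above. This is a Python sketch rather than this package's API, and the exact bytes generated by this package's encoders may differ in detail.

```python
# ISO-2022-JP switches character sets with escape sequences and uses
# only bytes below 0x80.
encoded = "a\u3042b".encode("iso2022_jp")
# ESC $ B designates JIS X 0208, ESC ( B switches back to ASCII.
print(encoded)

seven_bit = all(byte < 0x80 for byte in encoded)

# UTF-7 wraps Base64'd UTF-16BE between "+" and "-".
pound = b"+AKM-".decode("utf-7")   # U+00A3 POUND SIGN
print(seven_bit, pound)
```

Because the meaning of a byte depends on the preceding escape sequences, such streams cannot safely be split or searched at arbitrary byte offsets.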

EBCDIC codecs:

Major label(s) Meaning
cp037 EBCDIC Default (United States, Netherlands, Portugal, Brazil, Australia, New Zealand, Canadian ESA/390)
cp273 EBCDIC German
cp424 EBCDIC Hebrew
cp500 EBCDIC "International" (Belgium, Switzerland, Canadian AS/400)
cp875 EBCDIC Greek
cp933, ibm-933, ibm-1364, johab-ebcdic EBCDIC Korean (Johab, IBM stateful version for EBCDIC)
cp1026 EBCDIC Turkish
cp1140 EBCDIC with Euro Sign

Codecs with unusual behaviour:

Major label(s) Meaning
inverse-base64 Base64 with inverse semantics to preserve type correctness (encoder reads, decoder creates). Error handler is ignored.
inverse-base64hqx Same, but using the BinHex4 alphabet (note: does not in and of itself create the BinHex4 format)
inverse-base64uu Same, but using the uuencode alphabet (note: does not in and of itself create the uuencode format)
inverse-quopri Quoted-Printable, with inverse semantics (encoder reads, decoder creates). Error handler is ignored.
japanese Attempts to detect the encoding of a Japanese document (like the unified "Japanese" option now offered by some browsers' encoding override menus), and raises ValueError if it cannot. Not intended to be used in the encode direction, but will behave as utf-8-sig in that case.
undefined, replacement Represents data for which encoding/decoding must not be attempted. Following WHATWG (and differing from Python), error handlers are accepted, though only by the decoder: the encoder will ignore them.
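
The "inverse semantics" of the inverse-base64 family in the table above can be sketched with Python's base64 module. The two function names are hypothetical and this is not the package's implementation; it only shows why the directions are swapped. An encoder must map a string to bytes, so it reads Base64 text, while a decoder must map bytes to a string, so it creates Base64 text.

```python
import base64

def inverse_base64_encode(text: str) -> bytes:
    """str -> bytes: read Base64 text and return the bytes it represents."""
    return base64.b64decode(text)

def inverse_base64_decode(data: bytes) -> str:
    """bytes -> str: create Base64 text from arbitrary bytes."""
    return base64.b64encode(data).decode("ascii")

text = inverse_base64_decode(b"hello")
print(text)
print(inverse_base64_encode(text))
```

How the actual codecs treat whitespace, padding or invalid input is not covered by this sketch.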

Notes on error conditions in the ISO-2022-JP family

Like most codecs, the ISO-2022-JP family will generate errors in place of sequences which it cannot interpret. However, they will also generate errors over certain sequences which have no immediate effect on the output stream, so as to prevent them being used to mask syntax that would otherwise be sanitised or escaped (with errors="replace", this causes a U+FFFD to be inserted, thus preserving the interruption in any syntax). That such errors are generated at all is per WHATWG; the specific circumstances in which they are generated, however, deliberately vary somewhat from the WHATWG specification.
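
The masking problem can be demonstrated with Python's iso2022_jp decoder, which accepts a switch to double-byte mode followed immediately by a switch back. This sketch uses Python's standard library only, to show the hazard that the error behaviour described above is meant to counter; it does not exercise this package.

```python
# ESC $ B immediately followed by ESC ( B has no effect on the decoded text,
# but it hides the tag from a byte-level search.
raw = b"<\x1b$B\x1b(Bscript>"

naive_match = b"<script" in raw          # the byte-level filter sees nothing
decoded = raw.decode("iso2022_jp")       # Python decodes without error

print(naive_match, decoded)
```

A decoder that reports the zero-effect sequence as an error would, with errors="replace", leave a U+FFFD inside the tag name instead of silently restoring it.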

The iso-2022-jp codec is somewhat more pedantic than the WHATWG approach, and is intended to accept a strict subset of what the WHATWG approach accepts (but a superset of what the WHATWG and Python codecs generate for a single non-concatenated stream), excluding cases that are unlikely to occur in reality except as masking sequences. The WHATWG approach follows UTR #36 in forbidding exactly those cases which RFC 1468 does not permit, meaning that it forbids some cases that often occur in reality as a result of concatenation and are usually benign, and also permits some cases that are less likely to occur in reality and less likely to be benign. The jis_encoding codec (which is also used for decoding, but not for encoding, the iso-2022-jp* labels except for iso-2022-jp itself) permits the cases that result from concatenation but otherwise behaves the same as the iso-2022-jp codec in this regard; this, however, means that Shift Out and Shift In, which are not interpreted by the iso-2022-jp codec but are by the jis_encoding codec, are not currently checked for zero-effect use.

In reaction to the WHATWG approach, UTC L2/20-202 defines two "end states", the first of which forbids no such cases, and the second of which forbids certain additional cases while also being more lenient on the WHATWG-forbidden cases that actually occur in practice. However, end state 2 forbids the ordinary output of Python's iso-2022-jp codec under certain circumstances (see bold in table below), and still misses some obvious zero-effect switching sequences (notably, it permits ASCII→ASCII switches anywhere). Accordingly, neither codec follows end state 2 exactly.

All of that having been said, this is purely for the sake of theoretical consistency, and absolutely should not be used as an excuse to sanitise or escape text while it is encoded as ISO-2022-JP (nor any other stateful encoding). Accommodating established and plausible variation in encoders means that some masking sequences may still be possible; furthermore, the jis_encoding decoder does not penalise zero-effect Shift In characters. All sanitisation or escaping of data received in ISO-2022-JP must be carried out over the Unicode stream. Furthermore, using the ISO-2022-JP family or other stateful encodings inside an ASCII-delimited structure such as JSON should be avoided if possible; even if it cannot be avoided, one must not substitute an untrusted ISO-2022-JP* or JIS_Encoding sequence straight into an ASCII structure without verifying that it in fact returns to Shift In state (if applicable), with ASCII designated, at the end of the sequence, with no trailing single-shifts.
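
The final check described above might be sketched as follows. This is a deliberately conservative Python illustration, not part of this package: the function name `returns_to_ascii` is hypothetical, it tracks only the G0 designation, Shift Out/Shift In and trailing single shifts, and it does not validate the content of the stream in any other way.

```python
ESC, SO, SI = 0x1B, 0x0E, 0x0F

def returns_to_ascii(data: bytes) -> bool:
    """Return True only if a 7-bit ISO 2022 stream ends with ASCII designated
    to G0, in Shift In state, and with no trailing single shift."""
    g0_is_ascii = True       # streams are assumed to start in ASCII
    shifted_out = False
    pending_single_shift = False
    i = 0
    while i < len(data):
        byte = data[i]
        if byte == ESC:
            # Collect intermediate bytes (0x20-0x2F), then one final byte.
            j = i + 1
            while j < len(data) and 0x20 <= data[j] <= 0x2F:
                j += 1
            if j >= len(data) or not 0x30 <= data[j] <= 0x7E:
                return False                      # truncated or malformed
            intermediates, final = data[i + 1:j], data[j]
            if intermediates == b"" and final in (0x4E, 0x4F):
                pending_single_shift = True       # ESC N / ESC O
            elif intermediates == b"(":
                g0_is_ascii = (final == 0x42)     # only ESC ( B is ASCII
            elif intermediates in (b"$", b"$("):
                g0_is_ascii = False               # multi-byte set in G0
            i = j + 1
            continue
        if byte == SO:
            shifted_out = True
        elif byte == SI:
            shifted_out = False
        pending_single_shift = False              # some byte followed the shift
        i += 1
    return g0_is_ascii and not shifted_out and not pending_single_shift

print(returns_to_ascii(b'\x1b$B$"\x1b(B'))   # switches back before the end
print(returns_to_ascii(b'\x1b$B$"'))         # left in double-byte mode
```

Passing this check does not make byte-level sanitisation safe; it only guards against a sequence leaving the surrounding ASCII structure in a non-ASCII state.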

A summary of the differing behaviours is listed in the table below:

Approach | DB→ASCII→DB | SB→DB→SB | ASCII→JISCII | JISCII→ASCII | ASCII→ASCII | JISCII→JISCII | DB→JISCII
--- | --- | --- | --- | --- | --- | --- | ---
iso-2022-jp | No Good | No Good | Only at start/end or next to 5C/7E or C0 control code; generated before 5C/7E | Only at start/end or next to 5C/7E or C0 control code; generated before 5C/7E | Only at start | No Good | Okay
jis_encoding | Okay | No Good | Only at start/end or next to 5C/7E or C0 control code; generated before 5C/7E | Only at start/end or next to 5C/7E or C0 control code; generated after 5C/7E | Only at start | No Good | Okay
Python | Okay | Okay | Okay; generated before 5C/7E | Okay; generated after 5C/7E | Okay | Okay | Okay
WHATWG | No Good | No Good | Okay; generated before 5C/7E | Okay; generated before 5C/7E | Okay | Okay | Okay
End State 1 | Okay | Okay | Okay | Okay | Okay | Okay | Okay
End State 2 | Okay | No Good | Only at end or before 5C/7E | Only at end or before 5C/7E | Okay | Only before 5C/7E | Only before 5C/7E

Package contents