Convert a string to and from various encodings.
The basic supported encodings are roughly as specified in the WHATWG Encoding Standard, but more are also supported unless restriction to web encodings is explicitly specified.
Most encodings supported by Python are implemented, but not currently idna or punycode. Note however that Python makes x-mac-japanese an alias of shift_jis; this has not been done here. Also note that the way encoding names are associated with variants differs somewhat from Python's, partly due to following WHATWG: this affects most CJK codecs (e.g. Python treats shift_jis and ms-kanji differently, while this package does not), but also e.g. "ISO-8859-1".
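The Python side of this comparison can be checked with Python's standard library alone. The following is a minimal sketch that uses only Python's own `codecs` module, not this package: it shows that Python resolves ms-kanji and shift_jis to different codecs, and that Python's iso-8859-1 keeps the C1 controls where a mapping to Windows-1252 does not. The variable names are illustrative only.

```python
import codecs

# Python resolves these two labels to two different codecs.
sjis_name = codecs.lookup("shift_jis").name    # Python's JIS X 0208-based codec
mskanji_name = codecs.lookup("ms-kanji").name  # Python's Windows code page 932 codec

# U+2460 (circled digit one) is a vendor extension: in Python, only the
# Windows variant can encode it.
circled_one = "\u2460"
windows_bytes = circled_one.encode("ms-kanji")
try:
    circled_one.encode("shift_jis")
    strict_sjis_accepts = True
except UnicodeEncodeError:
    strict_sjis_accepts = False

# Byte 0x80 is a C1 control under Python's iso-8859-1, but the euro sign
# under windows-1252.
as_latin1 = b"\x80".decode("iso-8859-1")
as_cp1252 = b"\x80".decode("windows-1252")

print(sjis_name, mskanji_name, strict_sjis_accepts)
print(hex(ord(as_latin1)), hex(ord(as_cp1252)))
```

In this package, by contrast, the lists below place shift_jis with ms-kanji, and iso-8859-1 with windows-1252.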
Main entry points for the package are codecs.infrastructure.encode, codecs.infrastructure.decode and codecs.infrastructure.lookup, all three of which are also available as e.g. codecs.encode for convenience.
The list of codecs (not an exhaustive list of labels, nor close to one) is as follows.
Single-byte extended ASCII encodings:
Major label(s) | Meaning |
---|---|
cp437 | 8-bit United States (DOS) |
cp720 | 8-bit Arabic Letters and Box Drawing (DOS) |
cp737 | 8-bit Greek and Box Drawing (DOS) |
cp775 | 8-bit Baltic Rim (DOS) |
cp850 | 8-bit Western Europe and Canada (DOS) |
cp852 | 8-bit Central European (DOS) |
cp855 | 8-bit Balkan Cyrillic (DOS) |
cp856 | 8-bit Hebrew (DOS) |
cp857 | 8-bit Turkish (DOS) |
cp858 | 8-bit Western Europe and Canada with Euro (DOS) |
cp860 | 8-bit European Portuguese (DOS) |
cp861 | 8-bit Icelandic (DOS) |
cp862 | 8-bit Hebrew and Box Drawing (DOS) |
cp863 | 8-bit Quebecois French (DOS) |
cp864 | 8-bit Arabic Positional Forms (DOS) |
cp865 | 8-bit Continental Nordic (DOS) |
cp866 , ibm866 | 8-bit Russian Cyrillic (DOS) |
cp869 | 8-bit Greek (DOS) |
cp1006 | 8-bit Urdu |
cp1125 | 8-bit Ukrainian Cyrillic (DOS) |
ecma-43-dv , cp367 , csascii | "8-bit Plain ASCII", i.e. ASCII without backspace composition, and with high bit unused. Note: most ASCII labels are mapped to Windows-1252, per WHATWG. |
hp-roman8 | 8-bit Roman (HP) |
iso-8859-2 | 8-bit Central European (ISO) |
iso-8859-3 | 8-bit South European (Maltese/Esperanto) |
iso-8859-4 | 8-bit North European |
iso-8859-5 | 8-bit Cyrillic (ISO) |
iso-8859-6 | 8-bit Arabic (ASMO/ISO) |
iso-8859-7 | 8-bit Greek (ISO) |
iso-8859-8 , iso-8859-8-i | 8-bit Hebrew (without vowel points). Although some, but not all, of the labels using this mapping request legacy visual-order behaviour (e.g. iso-8859-8 , iso-8859-8-e or even visual , but not e.g. iso-8859-8-i ), bidirectional conversion for any given markup format is beyond the scope of this package: determining from the label whether legacy visual-order behaviour should be used, and responding if so, should be implemented separately if needed. |
iso-8859-10 | 8-bit Nordic |
iso-8859-13 | 8-bit Baltic Rim (ISO) |
iso-8859-14 | 8-bit Celtic |
iso-8859-15 | 8-bit New Western European |
iso-8859-16 | 8-bit South-Eastern European (ISO) |
koi8-r | 8-bit Russian Cyrillic (KOI8) |
koi8-u , koi8-ru | 8-bit Ruthenian/Ukrainian/Belarusian Cyrillic (KOI8) |
koi8-t | 8-bit Tajik Cyrillic |
kz1048 | 8-bit Kazakh Cyrillic |
macintosh | 8-bit Roman (Macintosh) |
palmos | PalmOS code page |
ptcp154 | 8-bit Asian Cyrillic (Paratype) |
windows-874 , iso-8859-11 , tis-620 , cp874 | 8-bit Thai |
windows-1250 | 8-bit Central European (Windows) |
windows-1251 | 8-bit Cyrillic (Windows) |
windows-1252 , ascii , iso-8859-1 , latin1 | 8-bit Western European. This is in accordance with WHATWG specification in re which mappings to associate with which labels. Note: Python's latin1 is sometimes used to round-trip arbitrary sensu stricto extended ASCII data; in Kuroko, it is better to use x-user-defined for that. |
windows-1253 | 8-bit Greek (Windows) |
windows-1254 , iso-8859-9 | 8-bit Turkish |
windows-1255 | 8-bit Hebrew (logical with vowel points) |
windows-1256 | 8-bit Arabic (Windows) |
windows-1257 | 8-bit Baltic Rim (Windows) |
windows-1258 | 8-bit Vietnamese (Windows). Basic codec: encoder will accept text in the form generated by the decoder, but neither NFC nor NFD normalised forms. This follows both Python and WHATWG behaviour. Conversion of text in NFC or NFD forms to encodable form may need to be done in a separate step before using the encoder. |
x-mac-arabic | 8-bit Arabic (Macintosh) |
x-mac-ce | 8-bit Central European (Macintosh) |
x-mac-croatian | 8-bit Gajica |
x-mac-cyrillic | 8-bit Cyrillic (Macintosh) |
x-mac-farsi | 8-bit Persian (Macintosh) |
x-mac-greek | 8-bit Greek (Macintosh) |
x-mac-icelandic | 8-bit Icelandic (Macintosh) |
x-mac-romanian | 8-bit Romanian (Macintosh) |
x-mac-turkish | 8-bit Turkish (Macintosh) |
x-user-defined | 8-bit User Defined (ASCII based variant: using U+0000–007F, U+F780–F7FF) |
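
The windows-1258 note above says that the encoder accepts decoder-form text but neither NFC nor NFD. Since Python is stated to behave the same way, the point can be illustrated with Python's own cp1258 codec. This is a sketch using Python's standard library, not this package; the `encodable` helper is hypothetical.

```python
import unicodedata

# "a with breve and grave" as Python's cp1258 decoder produces it:
# precomposed a-breve (U+0103) followed by a combining grave (U+0300).
decoded = b"\xe3\xcc".decode("cp1258")

nfc = unicodedata.normalize("NFC", decoded)  # one code point, U+1EB1
nfd = unicodedata.normalize("NFD", decoded)  # a, combining breve, combining grave

def encodable(text: str) -> bool:
    """Hypothetical helper: report whether Python's cp1258 can encode text."""
    try:
        text.encode("cp1258")
        return True
    except UnicodeEncodeError:
        return False

# Only the decoder's own form is accepted by the encoder.
print(encodable(decoded), encodable(nfc), encodable(nfd))
```

Text arriving in NFC or NFD therefore needs a separate conversion step into this mixed form before encoding.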
Single-byte symbol or dingbat font encodings:
Major label(s) | Meaning |
---|---|
cp042 | 8-bit User Defined (variant using U+0000–001F, U+F020–F0FF). Windows uses that mapping for symbol fonts in some contexts. |
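
The two user-defined layouts can be sketched directly from the ranges given in the tables. The functions below are a hypothetical Python illustration, not this package's implementation; in particular, the assumption that each byte maps to the code point at the same offset within its range is inferred from the ranges alone.

```python
def decode_x_user_defined(data: bytes) -> str:
    # Assumed layout: 0x00-0x7F -> U+0000-007F, 0x80-0xFF -> U+F780-F7FF.
    return "".join(chr(b) if b < 0x80 else chr(0xF780 + (b - 0x80)) for b in data)

def encode_x_user_defined(text: str) -> bytes:
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp < 0x80:
            out.append(cp)
        elif 0xF780 <= cp <= 0xF7FF:
            out.append(cp - 0xF780 + 0x80)
        else:
            raise ValueError(f"U+{cp:04X} is outside the x-user-defined ranges")
    return bytes(out)

def decode_symbol_font(data: bytes) -> str:
    # Assumed cp042-style layout: 0x00-0x1F -> U+0000-001F, 0x20-0xFF -> U+F020-F0FF.
    return "".join(chr(b) if b < 0x20 else chr(0xF000 + b) for b in data)

# Every byte value survives a decode/encode round trip.
all_bytes = bytes(range(256))
round_tripped = encode_x_user_defined(decode_x_user_defined(all_bytes))
print(round_tripped == all_bytes)
```

This round-trip property is why x-user-defined is suggested above as the substitute for the way Python's latin1 is sometimes used.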
8-bit multi-byte Unicode codecs:
Major label(s) | Meaning |
---|---|
cesu-8 , utf8mb3 , utf8-ucs2 | CESU-8 (to UTF-16 as UTF-8 is to UTF-32). Mostly for interoperability with existing systems that use it. |
gb18030 | Chinese GB18030, WHATWG version. Not technically a full UTF in this implementation, since one PUA character is changed to an ideographic space per WHATWG. |
utf-8 , utf8mb4 , utf8-ucs4 | UTF-8 without a byte order mark |
utf-8-sig | UTF-8 with a byte order mark |
utf-16 | UTF-16 with byte order mark, little endian if missing |
utf-16be | UTF-16, big endian, no byte order mark |
utf-16le | UTF-16, little endian, no byte order mark |
utf-32 | UTF-32 with byte order mark (though byte order can usually also be detected in its absence) |
utf-32be | UTF-32, big endian, no byte order mark |
utf-32le | UTF-32, little endian, no byte order mark |
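
The description of CESU-8 as "to UTF-16 as UTF-8 is to UTF-32" means that each UTF-16 code unit, surrogates included, is written out as if it were a code point in UTF-8. The following is a minimal Python sketch of that idea using Python's `surrogatepass` error handler; the function names are hypothetical and this is not how this package implements the codec.

```python
import struct

def cesu8_encode(text: str) -> bytes:
    # Split into UTF-16 code units, then write each unit in UTF-8 form.
    utf16 = text.encode("utf-16-be")
    units = struct.unpack(f">{len(utf16) // 2}H", utf16)
    return b"".join(chr(u).encode("utf-8", "surrogatepass") for u in units)

def cesu8_decode(data: bytes) -> str:
    # Read UTF-8 forms as 16-bit units (surrogates allowed), then pair them up.
    # Four-byte UTF-8 sequences are not valid CESU-8 and fail in struct.pack.
    units = [ord(ch) for ch in data.decode("utf-8", "surrogatepass")]
    utf16 = struct.pack(f">{len(units)}H", *units)
    return utf16.decode("utf-16-be")

emoji = "\U0001F600"
print(cesu8_encode(emoji).hex(" "))    # six bytes: two encoded surrogates
print(emoji.encode("utf-8").hex(" "))  # four bytes in standard UTF-8
```

Text confined to the Basic Multilingual Plane comes out byte-for-byte identical to UTF-8; only supplementary characters differ.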
8-bit multi-byte legacy CJK codecs:
Major label(s) | Meaning |
---|---|
big5 , big5-eten | Traditional Chinese Big-5, ETen version, condoning HKSCS extensions when decoding. |
big5-hkscs | Traditional Chinese Big-5 with HKSCS extensions in both directions. |
big5-nonetenkana , big5-tw | Traditional Chinese Big-5, with BIG5.TXT (non-ETen) layout for kana and Cyrillic. |
euc-jp , x-euc-jp | Japanese EUC-JP, with Microsoft extensions, permitting JIS X 0212 only when decoding. |
euc-jp-full | Japanese EUC-JP, with Microsoft extensions, permitting JIS X 0212 in both directions. |
euc-jisx0213 , euc-jis-2004 | Japanese EUC-JP, with JIS X 0213 mappings and extensions. |
euc-kr , uhc , windows-949 | Korean Unified Hangul Code (superset of EUC-KR, encodes KS C 5601). |
gbk , gb2312 | Chinese GBK (GB2312 extension), condoning GB18030 when decoding. |
johab , johab-ascii | Korean Johab (ASCII-compatible stateless standard version) |
shift_jis , ms-kanji , windows-31j | Japanese Shift JIS (Windows compatible version) |
shift-jisx0213 , shift-jis-2004 | Japanese Shift JIS (JIS X 0213 version) |
x-mac-chinesesimp | Simplified Chinese GB2312, Macintosh version |
x-mac-chinesetrad | Traditional Chinese Big5, Macintosh version |
x-mac-korean | Korean HangulTalk (Macintosh encoding, another superset of EUC-KR) |
7-bit stateful codecs:
Major label(s) | Meaning |
---|---|
hz-gb-2312 | HZ (Usenet Simplified Chinese) encoding |
iso-2022-cn | 7-bit stateful Chinese (Simplified and Traditional) |
iso-2022-jp | 7-bit stateful Japanese, web version |
iso-2022-jp-ext | 7-bit stateful Japanese, including JIS X 0212 and preserving katakana width |
iso-2022-jp-1 | 7-bit stateful Japanese, including JIS X 0212 |
iso-2022-jp-2 | 7-bit stateful Multilingual (Japanese, Korean, Greek, Simplified Chinese, Western European) |
iso-2022-jp-3 | 7-bit stateful Japanese, including JIS X 0213 (2000 edition format) |
iso-2022-jp-2004 | 7-bit stateful Japanese, including JIS X 0213 (2004 edition format) |
iso-2022-kr | 7-bit stateful Korean |
jis_encoding | 7-bit stateful Japanese, comprehensive version |
utf-7 | A largely obsolete scheme for mixing ASCII and Base64'd UTF-16BE in e-mail. Included mostly for Python parity. |
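
To make "7-bit stateful" concrete, the sketch below uses Python's own iso2022_jp and utf-7 codecs (not this package) to show the escape sequences and shift characters that carry the state. The sample strings are arbitrary; the UTF-7 one is the example from RFC 2152.

```python
# Two kanji (U+65E5 U+672C). ESC $ B designates JIS X 0208 and ESC ( B
# returns to ASCII; the bytes in between are pairs of 7-bit bytes.
japanese = "\u65e5\u672c"
iso2022 = japanese.encode("iso2022_jp")

# UTF-7 switches into Base64'd UTF-16BE with "+" and back out again.
mixed = "A\u2262\u0391."
utf7 = mixed.encode("utf-7")

print(iso2022)
print(utf7)
print(all(b < 0x80 for b in iso2022 + utf7))  # every byte fits in 7 bits
```

Note that the same byte value can mean different things depending on the state in force, which is what the error-condition notes further down are concerned with.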
EBCDIC codecs:
Major label(s) | Meaning |
---|---|
cp037 | EBCDIC Default (United States, Netherlands, Portugal, Brazil, Australia, New Zealand, Canadian ESA/390) |
cp273 | EBCDIC German |
cp424 | EBCDIC Hebrew |
cp500 | EBCDIC "International" (Belgium, Switzerland, Canadian AS/400) |
cp875 | EBCDIC Greek |
cp933 , ibm-933 , ibm-1364 , johab-ebcdic | EBCDIC Korean (Johab, IBM stateful version for EBCDIC) |
cp1026 | EBCDIC Turkish |
cp1140 | EBCDIC with Euro Sign |
Codecs with unusual behaviour:
Major label(s) | Meaning |
---|---|
inverse-base64 | Base64 with inverse semantics to preserve type correctness (encoder reads, decoder creates). Error handler is ignored. |
inverse-base64hqx | Same, but using the BinHex4 alphabet (note: does not in and of itself create the BinHex4 format) |
inverse-base64uu | Same, but using the uuencode alphabet (note: does not in and of itself create the uuencode format) |
inverse-quopri | Quoted-Printable, with inverse semantics (encoder reads, decoder creates). Error handler is ignored. |
japanese | Attempts to detect the encoding of a Japanese document (like the unified "Japanese" option now offered by some browsers' encoding override menus), and raises ValueError if it cannot. Not intended to be used in the encode direction, but will behave as utf-8-sig in that case. |
undefined , replacement | Represents data for which encoding/decoding must not be attempted. Following WHATWG (and differing from Python), error handlers are accepted, though only by the decoder: the encoder will ignore them. |
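
The "inverse semantics" of the inverse-base64 family can be sketched as follows: the encoder takes Base64 text and yields the bytes it represents, while the decoder takes bytes and yields Base64 text, so that encoding still goes from string to bytes and decoding from bytes to string. This is a hypothetical Python illustration built on Python's `base64` module, not this package's implementation; exact output details such as line wrapping are not specified above and may differ.

```python
import base64

def inverse_base64_encode(text: str) -> bytes:
    # "Encoder reads": string of Base64 text in, raw bytes out.
    return base64.b64decode(text.encode("ascii"))

def inverse_base64_decode(data: bytes) -> str:
    # "Decoder creates": raw bytes in, string of Base64 text out.
    return base64.b64encode(data).decode("ascii")

text = inverse_base64_decode(b"hello")
print(text)
print(inverse_base64_encode(text))
```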
Notes on error conditions in the ISO-2022-JP family
Like most codecs, the ISO-2022-JP family will generate errors in place of sequences which it cannot interpret. However, they will also generate errors over certain sequences which have no immediate effect on the stream being outputted, so as to prevent them being used for masking syntax that would otherwise be sanitised or escaped (with errors="replace", this will cause a U+FFFD to be inserted, thus preserving the interruption in any syntax). That this happens at all is per WHATWG; the specific circumstances in which it happens, however, deliberately vary somewhat from the WHATWG specification.
The iso-2022-jp codec is somewhat more pedantic than the WHATWG approach, and is intended to accept a strict subset of what the WHATWG approach accepts (but a superset of what the WHATWG and Python codecs generate for a single non-concatenated stream), excluding cases that are unlikely to occur in reality except as masking sequences. The WHATWG approach follows UTR #36 in forbidding exactly those cases which RFC 1468 does not permit, meaning that it forbids some cases that often occur in reality as a result of concatenation and are usually benign, and also permits some cases that are less likely to occur in reality and less likely to be benign. The jis_encoding codec (which is also used for decoding, but not for encoding, the iso-2022-jp* labels except for iso-2022-jp itself) permits the cases that result from concatenation but otherwise behaves the same as the iso-2022-jp codec in this regard; this, however, means that Shift Out and Shift In, which are not interpreted by the iso-2022-jp codec but are by the jis_encoding codec, are not currently checked for zero-effect use.
In reaction to the WHATWG approach, UTC L2/20-202 defines two "end states", the first of which forbids no such cases, and the second of which forbids certain additional cases while also being more lenient on the WHATWG-forbidden cases that actually occur in practice. However, end state 2 forbids the ordinary output of Python's iso-2022-jp codec under certain circumstances (see bold in table below), and still misses some obvious zero-effect switching sequences (notably, it permits ASCII→ASCII switches anywhere). Accordingly, neither codec follows end state 2 exactly.
All of that said, this is purely for the sake of theoretical consistency, and absolutely should not be used as an excuse to sanitise or escape text while it is encoded as ISO-2022-JP (nor any other stateful encoding). Accommodating established and plausible variation in encoders means that some masking sequences may still be possible; furthermore, the jis_encoding decoder does not penalise zero-effect Shift In characters. All sanitisation or escaping of data received in ISO-2022-JP must be carried out over the Unicode stream. Furthermore, using the ISO-2022-JP family or other stateful encodings inside an ASCII-delimited structure such as JSON should be avoided if possible; even if it cannot be avoided, one must not substitute an untrusted ISO-2022-JP* or JIS_Encoding sequence straight into an ASCII structure without verifying that it in fact returns to Shift In state (if applicable), with ASCII designated, at the end of the sequence, with no trailing single-shifts.
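The reason byte-level sanitisation is unsafe can be shown with Python's own iso2022_jp codec: a byte that looks like an ASCII backslash may be half of a double-byte character. The sketch below also includes a deliberately simplified, hypothetical `final_designation` helper that tracks only the four RFC 1468 designations; it ignores Shift Out/Shift In, single-shifts and the other escape sequences of the wider family, so it illustrates the end-of-sequence check rather than providing a complete validator.

```python
ESC = 0x1B
# Only the RFC 1468 designations; an assumption made to keep the sketch short.
DESIGNATIONS = {
    b"(B": "ascii",
    b"(J": "jis-roman",
    b"$@": "jis-c-6226",
    b"$B": "jis-x-0208",
}

def final_designation(data: bytes) -> str:
    """Hypothetical helper: return the set designated when the stream ends."""
    state = "ascii"
    i = 0
    while i < len(data):
        if data[i] == ESC:
            seq = data[i + 1:i + 3]
            if seq not in DESIGNATIONS:
                raise ValueError(f"unsupported escape sequence at offset {i}")
            state = DESIGNATIONS[seq]
            i += 3
        else:
            i += 1
    return state

encoded = "\u65e5\u672c".encode("iso2022_jp")  # two kanji
decoded = encoded.decode("iso2022_jp")

print(0x5C in encoded)                  # a backslash-valued byte is present...
print("\\" in decoded)                  # ...but the text has no backslash
print(final_designation(encoded))       # complete stream returns to ASCII
print(final_designation(encoded[:-3]))  # truncated stream does not
```

Escaping the 0x5C byte before decoding would corrupt the second character, whereas escaping over the decoded string leaves it alone.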
A summary of the differing behaviours is listed in the table below:
Approach | DB→ASCII→DB | SB→DB→SB | ASCII→JISCII | JISCII→ASCII | ASCII→ASCII | JISCII→JISCII | DB→JISCII |
---|---|---|---|---|---|---|---|
iso-2022-jp | No Good | No Good | Only at start/end or next to 5C/7E or C0 control code; generated before 5C/7E | Only at start/end or next to 5C/7E or C0 control code; generated before 5C/7E | Only at start | No Good | Okay |
jis_encoding | Okay | No Good | Only at start/end or next to 5C/7E or C0 control code; generated before 5C/7E | Only at start/end or next to 5C/7E or C0 control code; generated after 5C/7E | Only at start | No Good | Okay |
Python | Okay | Okay | Okay; generated before 5C/7E | Okay; generated after 5C/7E | Okay | Okay | Okay |
WHATWG | No Good | No Good | Okay; generated before 5C/7E | Okay; generated before 5C/7E | Okay | Okay | Okay |
End State 1 | Okay | Okay | Okay | Okay | Okay | Okay | Okay |
End State 2 | Okay | No Good | Only at end or before 5C/7E | Only at end or before 5C/7E | Okay | Only before 5C/7E | Only before 5C/7E |
Package contents