From 96f31ef00d71acaa73b886960f86e1ecd4970cf7 Mon Sep 17 00:00:00 2001
From: Prhmma
Date: Sun, 12 Oct 2025 00:27:02 +0100
Subject: [PATCH 1/9] gh-128571: Document UTF-16/32 native byte order

---
 Doc/library/codecs.rst | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)

diff --git a/Doc/library/codecs.rst b/Doc/library/codecs.rst
index 8c5c87a7ef16e4..009ac980d7adca 100644
--- a/Doc/library/codecs.rst
+++ b/Doc/library/codecs.rst
@@ -1000,6 +1000,23 @@ byte sequence. The byte swapped version of this character (``0xFFFE``) is an
 illegal character that may not appear in a Unicode text. So when the
 first character in a ``UTF-16`` or ``UTF-32`` byte sequence
 appears to be a ``U+FFFE`` the bytes have to be swapped on decoding.
+
+.. note::
+
+   **Python UTF-16 and UTF-32 Codec Behavior**
+
+   Python's ``UTF-16`` and ``UTF-32`` codecs (when used without an explicit
+   byte order suffix like ``-BE`` or ``-LE``) follow the platform's native
+   byte order when no BOM is present. This differs from the Unicode Standard
+   specification, which states that UTF-16 and UTF-32 encoding schemes should
+   default to big-endian byte order when no BOM is present and no higher-level
+   protocol specifies the byte order.
+
+   This behavior was chosen for practical compatibility reasons, as it avoids
+   byte swapping on the most common platforms, but developers should be aware
+   of this difference when exchanging data with systems that strictly follow
+   the Unicode specification.
+
 Unfortunately the character ``U+FEFF`` had a second purpose as
 a ``ZERO WIDTH NO-BREAK SPACE``: a character that has no width and doesn't allow
 a word to be split. It can e.g. be used to give hints to a ligature algorithm.
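When exchanging data with systems that apply the Unicode Standard default, the difference described in the note can be sidestepped by resolving the byte order explicitly. The following is a minimal sketch and not part of the patch: ``decode_utf16_strict`` is a hypothetical helper name, and the only library features it assumes are the ``codecs.BOM_UTF16_BE`` and ``codecs.BOM_UTF16_LE`` constants and the explicit ``utf-16-be`` / ``utf-16-le`` codecs.

```python
import codecs


def decode_utf16_strict(data: bytes) -> str:
    """Decode UTF-16, honouring a BOM if present and otherwise assuming
    big-endian order (hypothetical helper, not a standard library API)."""
    if data.startswith(codecs.BOM_UTF16_BE):
        # Big-endian BOM: strip it and decode the rest as big-endian.
        return data[len(codecs.BOM_UTF16_BE):].decode("utf-16-be")
    if data.startswith(codecs.BOM_UTF16_LE):
        # Little-endian BOM: strip it and decode the rest as little-endian.
        return data[len(codecs.BOM_UTF16_LE):].decode("utf-16-le")
    # No BOM: fall back to big-endian instead of the platform's native order.
    return data.decode("utf-16-be")


# Big-endian bytes without a BOM decode the same way on every platform.
payload = "codec".encode("utf-16-be")
print(decode_utf16_strict(payload))

# A BOM, when present, still selects the byte order.
print(decode_utf16_strict(codecs.BOM_UTF16_LE + "codec".encode("utf-16-le")))
```

The same pattern would apply to UTF-32 with the corresponding ``codecs.BOM_UTF32_*`` constants. Note that the UTF-32 little-endian BOM begins with the same two bytes as the UTF-16 one, so a helper handling both encodings would need to be told which one to expect.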
From 9a5ee89987fd07f73dee21390632bfcebef62848 Mon Sep 17 00:00:00 2001
From: Prhmma
Date: Sun, 12 Oct 2025 11:16:35 +0100
Subject: [PATCH 2/9] Removed the note and improved existing description based on the discussion in the issue

---
 Doc/library/codecs.rst | 40 ++++++++++++++--------------------------
 1 file changed, 14 insertions(+), 26 deletions(-)

diff --git a/Doc/library/codecs.rst b/Doc/library/codecs.rst
index 009ac980d7adca..84816d737275f1 100644
--- a/Doc/library/codecs.rst
+++ b/Doc/library/codecs.rst
@@ -990,32 +990,20 @@ code point, is to store each code point as four consecutive bytes. There are two
 possibilities: store the bytes in big endian or in little endian order. These
 two encodings are called ``UTF-32-BE`` and ``UTF-32-LE`` respectively. Their
 disadvantage is that if e.g. you use ``UTF-32-BE`` on a little endian machine you
-will always have to swap bytes on encoding and decoding. ``UTF-32`` avoids this
-problem: bytes will always be in natural endianness. When these bytes are read
-by a CPU with a different endianness, then bytes have to be swapped though. To
-be able to detect the endianness of a ``UTF-16`` or ``UTF-32`` byte sequence,
-there's the so called BOM ("Byte Order Mark"). This is the Unicode character
-``U+FEFF``. This character can be prepended to every ``UTF-16`` or ``UTF-32``
-byte sequence. The byte swapped version of this character (``0xFFFE``) is an
-illegal character that may not appear in a Unicode text. So when the
-first character in a ``UTF-16`` or ``UTF-32`` byte sequence
-appears to be a ``U+FFFE`` the bytes have to be swapped on decoding.
-
-.. note::
-
-   **Python UTF-16 and UTF-32 Codec Behavior**
-
-   Python's ``UTF-16`` and ``UTF-32`` codecs (when used without an explicit
-   byte order suffix like ``-BE`` or ``-LE``) follow the platform's native
-   byte order when no BOM is present. This differs from the Unicode Standard
-   specification, which states that UTF-16 and UTF-32 encoding schemes should
-   default to big-endian byte order when no BOM is present and no higher-level
-   protocol specifies the byte order.
-
-   This behavior was chosen for practical compatibility reasons, as it avoids
-   byte swapping on the most common platforms, but developers should be aware
-   of this difference when exchanging data with systems that strictly follow
-   the Unicode specification.
+will always have to swap bytes on encoding and decoding.
+Python's ``UTF-32`` codec avoids this problem by using the platform's native byte
+order when no BOM is present. The plain ``UTF-16`` codec (without a ``-BE`` or
+``-LE`` suffix) behaves the same way. Python follows prevailing platform
+practice so native-endian data round-trips without redundant byte swapping,
+even though the Unicode Standard defaults to big-endian when the byte order is
+unspecified.When these bytes are read by a CPU with a different endianness,
+then bytes have to be swapped though. To be able to detect the endianness of a
+``UTF-16`` or ``UTF-32`` byte sequence, there's the so called BOM ("Byte Order Mark").
+This is the Unicode character ``U+FEFF``. This character can be prepended to every
+``UTF-16`` or ``UTF-32`` byte sequence. The byte swapped version of this character
+(``0xFFFE``) is an illegal character that may not appear in a Unicode text.
+So when the first character in a ``UTF-16`` or ``UTF-32`` byte sequence appears to be
+a ``U+FFFE`` the bytes have to be swapped on decoding.
 
 Unfortunately the character ``U+FEFF`` had a second purpose as
 a ``ZERO WIDTH NO-BREAK SPACE``: a character that has no width and doesn't allow

From 171869f4d61b4bfda0bead627174976509249cc3 Mon Sep 17 00:00:00 2001
From: Parham MohammadAlizadeh
Date: Fri, 17 Oct 2025 11:09:29 +0100
Subject: [PATCH 3/9] Update Doc/library/codecs.rst

Co-authored-by: Stan Ulbrych <89152624+StanFromIreland@users.noreply.github.com>
---
 Doc/library/codecs.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/Doc/library/codecs.rst b/Doc/library/codecs.rst
index 84816d737275f1..cedd77a34918fb 100644
--- a/Doc/library/codecs.rst
+++ b/Doc/library/codecs.rst
@@ -996,7 +996,7 @@ order when no BOM is present. The plain ``UTF-16`` codec (without a ``-BE`` or
 ``-LE`` suffix) behaves the same way. Python follows prevailing platform
 practice so native-endian data round-trips without redundant byte swapping,
 even though the Unicode Standard defaults to big-endian when the byte order is
-unspecified.When these bytes are read by a CPU with a different endianness,
+unspecified. When these bytes are read by a CPU with a different endianness,
 then bytes have to be swapped though. To be able to detect the endianness of a
 ``UTF-16`` or ``UTF-32`` byte sequence, there's the so called BOM ("Byte Order Mark").
 This is the Unicode character ``U+FEFF``. This character can be prepended to every

From bc661bce666989e94816ae48897ad8c625606df2 Mon Sep 17 00:00:00 2001
From: Parham MohammadAlizadeh
Date: Fri, 17 Oct 2025 11:10:04 +0100
Subject: [PATCH 4/9] Update Doc/library/codecs.rst

Co-authored-by: Stan Ulbrych <89152624+StanFromIreland@users.noreply.github.com>
---
 Doc/library/codecs.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/Doc/library/codecs.rst b/Doc/library/codecs.rst
index cedd77a34918fb..a3e11bb4149385 100644
--- a/Doc/library/codecs.rst
+++ b/Doc/library/codecs.rst
@@ -994,7 +994,7 @@ will always have to swap bytes on encoding and decoding.
 Python's ``UTF-32`` codec avoids this problem by using the platform's native byte
 order when no BOM is present. The plain ``UTF-16`` codec (without a ``-BE`` or
 ``-LE`` suffix) behaves the same way. Python follows prevailing platform
-practice so native-endian data round-trips without redundant byte swapping,
+practice, so native-endian data round-trips without redundant byte swapping,
 even though the Unicode Standard defaults to big-endian when the byte order is
 unspecified. When these bytes are read by a CPU with a different endianness,
 then bytes have to be swapped though. To be able to detect the endianness of a

From 9c93da5f2a848228425e6af8ba8511917a51841f Mon Sep 17 00:00:00 2001
From: Parham MohammadAlizadeh
Date: Fri, 17 Oct 2025 11:10:35 +0100
Subject: [PATCH 5/9] Update Doc/library/codecs.rst

Co-authored-by: Stan Ulbrych <89152624+StanFromIreland@users.noreply.github.com>
---
 Doc/library/codecs.rst | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/Doc/library/codecs.rst b/Doc/library/codecs.rst
index a3e11bb4149385..7911543301af51 100644
--- a/Doc/library/codecs.rst
+++ b/Doc/library/codecs.rst
@@ -1002,8 +1002,8 @@ then bytes have to be swapped though. To be able to detect the endianness of a
 This is the Unicode character ``U+FEFF``. This character can be prepended to every
 ``UTF-16`` or ``UTF-32`` byte sequence. The byte swapped version of this character
 (``0xFFFE``) is an illegal character that may not appear in a Unicode text.
-So when the first character in a ``UTF-16`` or ``UTF-32`` byte sequence appears to be
-a ``U+FFFE`` the bytes have to be swapped on decoding.
+When the first character of a ``UTF-16`` or ``UTF-32`` byte sequence is
+``U+FFFE``, the bytes have to be swapped on decoding.
 
 Unfortunately the character ``U+FEFF`` had a second purpose as
 a ``ZERO WIDTH NO-BREAK SPACE``: a character that has no width and doesn't allow

From 92044308144680060995a6df1a0396d0c4b0a10e Mon Sep 17 00:00:00 2001
From: Parham MohammadAlizadeh
Date: Fri, 17 Oct 2025 11:10:56 +0100
Subject: [PATCH 6/9] Update Doc/library/codecs.rst

Co-authored-by: Stan Ulbrych <89152624+StanFromIreland@users.noreply.github.com>
---
 Doc/library/codecs.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/Doc/library/codecs.rst b/Doc/library/codecs.rst
index 7911543301af51..fb584993ecab02 100644
--- a/Doc/library/codecs.rst
+++ b/Doc/library/codecs.rst
@@ -998,7 +998,7 @@ practice, so native-endian data round-trips without redundant byte swapping,
 even though the Unicode Standard defaults to big-endian when the byte order is
 unspecified. When these bytes are read by a CPU with a different endianness,
 then bytes have to be swapped though. To be able to detect the endianness of a
-``UTF-16`` or ``UTF-32`` byte sequence, there's the so called BOM ("Byte Order Mark").
+``UTF-16`` or ``UTF-32`` byte sequence, a BOM ("Byte Order Mark") is used.
 This is the Unicode character ``U+FEFF``. This character can be prepended to every
 ``UTF-16`` or ``UTF-32`` byte sequence. The byte swapped version of this character
 (``0xFFFE``) is an illegal character that may not appear in a Unicode text.

From b0fa2ba63ff5c06151ceba862d4271c582f47bb8 Mon Sep 17 00:00:00 2001
From: Parham MohammadAlizadeh
Date: Fri, 17 Oct 2025 11:11:11 +0100
Subject: [PATCH 7/9] Update Doc/library/codecs.rst

Co-authored-by: Stan Ulbrych <89152624+StanFromIreland@users.noreply.github.com>
---
 Doc/library/codecs.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/Doc/library/codecs.rst b/Doc/library/codecs.rst
index fb584993ecab02..6e3cf99c2ddb62 100644
--- a/Doc/library/codecs.rst
+++ b/Doc/library/codecs.rst
@@ -997,7 +997,7 @@ order when no BOM is present. The plain ``UTF-16`` codec (without a ``-BE`` or
 practice, so native-endian data round-trips without redundant byte swapping,
 even though the Unicode Standard defaults to big-endian when the byte order is
 unspecified. When these bytes are read by a CPU with a different endianness,
-then bytes have to be swapped though. To be able to detect the endianness of a
+the bytes have to be swapped. To be able to detect the endianness of a
 ``UTF-16`` or ``UTF-32`` byte sequence, a BOM ("Byte Order Mark") is used.
 This is the Unicode character ``U+FEFF``. This character can be prepended to every
 ``UTF-16`` or ``UTF-32`` byte sequence. The byte swapped version of this character

From ff82ed51dfc33d8a368483dca73e15818b948231 Mon Sep 17 00:00:00 2001
From: Prhmma
Date: Fri, 17 Oct 2025 11:22:08 +0100
Subject: [PATCH 8/9] changing e.g. to for example

---
 Doc/library/codecs.rst | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/Doc/library/codecs.rst b/Doc/library/codecs.rst
index 6e3cf99c2ddb62..bf4ae822d95543 100644
--- a/Doc/library/codecs.rst
+++ b/Doc/library/codecs.rst
@@ -989,8 +989,8 @@ defined in Unicode. A simple and straightforward way that can store each Unicode
 code point, is to store each code point as four consecutive bytes. There are two
 possibilities: store the bytes in big endian or in little endian order. These
 two encodings are called ``UTF-32-BE`` and ``UTF-32-LE`` respectively. Their
-disadvantage is that if e.g. you use ``UTF-32-BE`` on a little endian machine you
-will always have to swap bytes on encoding and decoding.
+disadvantage is that if, for example, you use ``UTF-32-BE`` on a little endian
+machine you will always have to swap bytes on encoding and decoding.
 Python's ``UTF-32`` codec avoids this problem by using the platform's native byte
 order when no BOM is present. The plain ``UTF-16`` codec (without a ``-BE`` or
 ``-LE`` suffix) behaves the same way. Python follows prevailing platform

From 8fda4360bf9ac18ecdb6f82b1392d6630b29f41a Mon Sep 17 00:00:00 2001
From: Parham MohammadAlizadeh
Date: Sat, 18 Oct 2025 16:15:14 +0100
Subject: [PATCH 9/9] Update Doc/library/codecs.rst - deduplication

Co-authored-by: Stan Ulbrych <89152624+StanFromIreland@users.noreply.github.com>
---
 Doc/library/codecs.rst | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/Doc/library/codecs.rst b/Doc/library/codecs.rst
index 2c9545e3c5e303..2a5994b11d83d9 100644
--- a/Doc/library/codecs.rst
+++ b/Doc/library/codecs.rst
@@ -991,9 +991,9 @@ possibilities: store the bytes in big endian or in little endian order. These
 two encodings are called ``UTF-32-BE`` and ``UTF-32-LE`` respectively. Their
 disadvantage is that if, for example, you use ``UTF-32-BE`` on a little endian
 machine you will always have to swap bytes on encoding and decoding.
-Python's ``UTF-32`` codec avoids this problem by using the platform's native byte
-order when no BOM is present. The plain ``UTF-16`` codec (without a ``-BE`` or
-``-LE`` suffix) behaves the same way. Python follows prevailing platform
+Python's ``UTF-16`` and ``UTF-32`` codecs avoid this problem by using the
+platform's native byte order when no BOM is present.
+Python follows prevailing platform
 practice, so native-endian data round-trips without redundant byte swapping,
 even though the Unicode Standard defaults to big-endian when the byte order is
 unspecified. When these bytes are read by a CPU with a different endianness,
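The final wording of the series can be checked interactively. The following is a minimal sketch and not part of the patch: it uses only ``sys.byteorder`` and the built-in ``utf-16`` codecs, and the variable names are illustrative.

```python
import sys

# Suffix of the explicit-endian codec that matches this machine, and its opposite.
native = "le" if sys.byteorder == "little" else "be"
foreign = "be" if native == "le" else "le"

text = "BOM"

# Encoding with the plain codec writes a BOM followed by native-order data.
encoded = text.encode("utf-16")
native_bom = "\ufeff".encode(f"utf-16-{native}")
starts_with_native_bom = encoded.startswith(native_bom)

# Without a BOM, the plain codec assumes native byte order when decoding.
decoded_native = text.encode(f"utf-16-{native}").decode("utf-16")

# Foreign-order data without a BOM is therefore misread rather than swapped.
decoded_foreign = text.encode(f"utf-16-{foreign}").decode("utf-16")

print(starts_with_native_bom, decoded_native == text, decoded_foreign == text)
```

The last comparison is the case the documentation change warns about: BOM-less data produced on a machine with the opposite byte order decodes without an error here, but to different characters, so an explicit ``-BE`` or ``-LE`` codec is the safer choice when the byte order is known from context.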