From 96f31ef00d71acaa73b886960f86e1ecd4970cf7 Mon Sep 17 00:00:00 2001
From: Prhmma <prhmma@gmail.com>
Date: Sun, 12 Oct 2025 00:27:02 +0100
Subject: [PATCH 1/9] gh-128571: Document UTF-16/32 native byte order

---
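Reviewer note (not part of the commit): a minimal sketch of the behavior
this patch documents, assuming CPython's built-in codecs. The variable
names are illustrative only. BOM-less input passed to the plain
``utf-16`` codec is interpreted in the platform's byte order, so the same
bytes decode differently on machines of opposite endianness.

```python
import sys

text = "A"

# Bytes for "A" in each explicit byte order; neither carries a BOM.
be = text.encode("utf-16-be")  # 0x00 0x41
le = text.encode("utf-16-le")  # 0x41 0x00

# Pick which of the two matches this machine's byte order.
native = le if sys.byteorder == "little" else be
foreign = be if sys.byteorder == "little" else le

# With no BOM present, the plain codec assumes native byte order.
decoded_native = native.decode("utf-16")

# Opposite-order data is not rejected; it is silently misread as U+4100.
decoded_foreign = foreign.decode("utf-16")

print(repr(decoded_native), hex(ord(decoded_foreign)))
```

A decoder that follows the Unicode Standard's big-endian default would
read the ``be`` bytes as "A" on every platform, which is the difference
the new text points out.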
 Doc/library/codecs.rst | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)

diff --git a/Doc/library/codecs.rst b/Doc/library/codecs.rst
index 8c5c87a7ef16e4..009ac980d7adca 100644
--- a/Doc/library/codecs.rst
+++ b/Doc/library/codecs.rst
@@ -1000,6 +1000,23 @@ byte sequence. The byte swapped version of this character (``0xFFFE``) is an
 illegal character that may not appear in a Unicode text. So when the
 first character in a ``UTF-16`` or ``UTF-32`` byte sequence
 appears to be a ``U+FFFE`` the bytes have to be swapped on decoding.
+
+.. note::
+
+   **Python UTF-16 and UTF-32 Codec Behavior**
+
+   Python's ``UTF-16`` and ``UTF-32`` codecs (when used without an explicit
+   byte order suffix like ``-BE`` or ``-LE``) follow the platform's native
+   byte order when no BOM is present. This differs from the Unicode Standard
+   specification, which states that UTF-16 and UTF-32 encoding schemes should
+   default to big-endian byte order when no BOM is present and no higher-level
+   protocol specifies the byte order.
+
+   This behavior was chosen for practical compatibility reasons, as it avoids
+   byte swapping on the most common platforms, but developers should be aware
+   of this difference when exchanging data with systems that strictly follow
+   the Unicode specification.
+
 Unfortunately the character ``U+FEFF`` had a second purpose as
 a ``ZERO WIDTH NO-BREAK SPACE``: a character that has no width and doesn't allow
 a word to be split. It can e.g. be used to give hints to a ligature algorithm.

From 9a5ee89987fd07f73dee21390632bfcebef62848 Mon Sep 17 00:00:00 2001
From: Prhmma <prhmma@gmail.com>
Date: Sun, 12 Oct 2025 11:16:35 +0100
Subject: [PATCH 2/9] Remove the note and improve the existing description
 based on the discussion in the issue

---
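Reviewer note (not part of the commit): a short sketch, assuming CPython's
built-in codecs, of the BOM handling the rewritten paragraph describes.
When a BOM is present, the plain ``utf-16`` codec follows it rather than
the platform's byte order, and it writes a BOM when encoding. Variable
names are illustrative only.

```python
import codecs

text = "A"

# Prepend an explicit BOM to data stored in each byte order.
be_with_bom = codecs.BOM_UTF16_BE + text.encode("utf-16-be")
le_with_bom = codecs.BOM_UTF16_LE + text.encode("utf-16-le")

# The plain codec reads the BOM, selects the matching byte order,
# and does not include the BOM in the decoded string.
from_be = be_with_bom.decode("utf-16")
from_le = le_with_bom.decode("utf-16")

# Encoding with the plain codec emits a native-order BOM first.
encoded = text.encode("utf-16")
has_native_bom = encoded[:2] == codecs.BOM_UTF16

print(from_be, from_le, has_native_bom)
```

Because the encoder always writes a BOM, data produced by Python's plain
``utf-16`` codec stays unambiguous; only BOM-less input from other
sources depends on the platform.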
 Doc/library/codecs.rst | 40 ++++++++++++++--------------------------
 1 file changed, 14 insertions(+), 26 deletions(-)

diff --git a/Doc/library/codecs.rst b/Doc/library/codecs.rst
index 009ac980d7adca..84816d737275f1 100644
--- a/Doc/library/codecs.rst
+++ b/Doc/library/codecs.rst
@@ -990,32 +990,20 @@ code point, is to store each code point as four consecutive bytes. There are two
 possibilities: store the bytes in big endian or in little endian order. These
 two encodings are called ``UTF-32-BE`` and ``UTF-32-LE`` respectively. Their
 disadvantage is that if e.g. you use ``UTF-32-BE`` on a little endian machine you
-will always have to swap bytes on encoding and decoding. ``UTF-32`` avoids this
-problem: bytes will always be in natural endianness. When these bytes are read
-by a CPU with a different endianness, then bytes have to be swapped though. To
-be able to detect the endianness of a ``UTF-16`` or ``UTF-32`` byte sequence,
-there's the so called BOM ("Byte Order Mark"). This is the Unicode character
-``U+FEFF``. This character can be prepended to every ``UTF-16`` or ``UTF-32``
-byte sequence. The byte swapped version of this character (``0xFFFE``) is an
-illegal character that may not appear in a Unicode text. So when the
-first character in a ``UTF-16`` or ``UTF-32`` byte sequence
-appears to be a ``U+FFFE`` the bytes have to be swapped on decoding.
-
-.. note::
-
-   **Python UTF-16 and UTF-32 Codec Behavior**
-
-   Python's ``UTF-16`` and ``UTF-32`` codecs (when used without an explicit
-   byte order suffix like ``-BE`` or ``-LE``) follow the platform's native
-   byte order when no BOM is present. This differs from the Unicode Standard
-   specification, which states that UTF-16 and UTF-32 encoding schemes should
-   default to big-endian byte order when no BOM is present and no higher-level
-   protocol specifies the byte order.
-
-   This behavior was chosen for practical compatibility reasons, as it avoids
-   byte swapping on the most common platforms, but developers should be aware
-   of this difference when exchanging data with systems that strictly follow
-   the Unicode specification.
+will always have to swap bytes on encoding and decoding.
+Python's ``UTF-32`` codec avoids this problem by using the platform's native byte
+order when no BOM is present. The plain ``UTF-16`` codec (without a ``-BE`` or
+``-LE`` suffix) behaves the same way. Python follows prevailing platform
+practice so native-endian data round-trips without redundant byte swapping,
+even though the Unicode Standard defaults to big-endian when the byte order is
+unspecified.When these bytes are read by a CPU with a different endianness,
+then bytes have to be swapped though. To be able to detect the endianness of a
+``UTF-16`` or ``UTF-32`` byte sequence, there's the so called BOM ("Byte Order Mark").
+This is the Unicode character ``U+FEFF``. This character can be prepended to every
+``UTF-16`` or ``UTF-32`` byte sequence. The byte swapped version of this character
+(``0xFFFE``) is an illegal character that may not appear in a Unicode text.
+So when the first character in a ``UTF-16`` or ``UTF-32`` byte sequence appears to be
+a ``U+FFFE`` the bytes have to be swapped on decoding.
 
 Unfortunately the character ``U+FEFF`` had a second purpose as
 a ``ZERO WIDTH NO-BREAK SPACE``: a character that has no width and doesn't allow

From 171869f4d61b4bfda0bead627174976509249cc3 Mon Sep 17 00:00:00 2001
From: Parham MohammadAlizadeh <prhmma@gmail.com>
Date: Fri, 17 Oct 2025 11:09:29 +0100
Subject: [PATCH 3/9] Update Doc/library/codecs.rst

Co-authored-by: Stan Ulbrych <89152624+StanFromIreland@users.noreply.github.com>
---
 Doc/library/codecs.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/Doc/library/codecs.rst b/Doc/library/codecs.rst
index 84816d737275f1..cedd77a34918fb 100644
--- a/Doc/library/codecs.rst
+++ b/Doc/library/codecs.rst
@@ -996,7 +996,7 @@ order when no BOM is present. The plain ``UTF-16`` codec (without a ``-BE`` or
 ``-LE`` suffix) behaves the same way. Python follows prevailing platform
 practice so native-endian data round-trips without redundant byte swapping,
 even though the Unicode Standard defaults to big-endian when the byte order is
-unspecified.When these bytes are read by a CPU with a different endianness,
+unspecified. When these bytes are read by a CPU with a different endianness,
 then bytes have to be swapped though. To be able to detect the endianness of a
 ``UTF-16`` or ``UTF-32`` byte sequence, there's the so called BOM ("Byte Order Mark").
 This is the Unicode character ``U+FEFF``. This character can be prepended to every

From bc661bce666989e94816ae48897ad8c625606df2 Mon Sep 17 00:00:00 2001
From: Parham MohammadAlizadeh <prhmma@gmail.com>
Date: Fri, 17 Oct 2025 11:10:04 +0100
Subject: [PATCH 4/9] Update Doc/library/codecs.rst

Co-authored-by: Stan Ulbrych <89152624+StanFromIreland@users.noreply.github.com>
---
 Doc/library/codecs.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/Doc/library/codecs.rst b/Doc/library/codecs.rst
index cedd77a34918fb..a3e11bb4149385 100644
--- a/Doc/library/codecs.rst
+++ b/Doc/library/codecs.rst
@@ -994,7 +994,7 @@ will always have to swap bytes on encoding and decoding.
 Python's ``UTF-32`` codec avoids this problem by using the platform's native byte
 order when no BOM is present. The plain ``UTF-16`` codec (without a ``-BE`` or
 ``-LE`` suffix) behaves the same way. Python follows prevailing platform
-practice so native-endian data round-trips without redundant byte swapping,
+practice, so native-endian data round-trips without redundant byte swapping,
 even though the Unicode Standard defaults to big-endian when the byte order is
 unspecified. When these bytes are read by a CPU with a different endianness,
 then bytes have to be swapped though. To be able to detect the endianness of a

From 9c93da5f2a848228425e6af8ba8511917a51841f Mon Sep 17 00:00:00 2001
From: Parham MohammadAlizadeh <prhmma@gmail.com>
Date: Fri, 17 Oct 2025 11:10:35 +0100
Subject: [PATCH 5/9] Update Doc/library/codecs.rst

Co-authored-by: Stan Ulbrych <89152624+StanFromIreland@users.noreply.github.com>
---
 Doc/library/codecs.rst | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/Doc/library/codecs.rst b/Doc/library/codecs.rst
index a3e11bb4149385..7911543301af51 100644
--- a/Doc/library/codecs.rst
+++ b/Doc/library/codecs.rst
@@ -1002,8 +1002,8 @@ then bytes have to be swapped though. To be able to detect the endianness of a
 This is the Unicode character ``U+FEFF``. This character can be prepended to every
 ``UTF-16`` or ``UTF-32`` byte sequence. The byte swapped version of this character
 (``0xFFFE``) is an illegal character that may not appear in a Unicode text.
-So when the first character in a ``UTF-16`` or ``UTF-32`` byte sequence appears to be
-a ``U+FFFE`` the bytes have to be swapped on decoding.
+When the first character of a ``UTF-16`` or ``UTF-32`` byte sequence is
+``U+FFFE``, the bytes have to be swapped on decoding.
 
 Unfortunately the character ``U+FEFF`` had a second purpose as
 a ``ZERO WIDTH NO-BREAK SPACE``: a character that has no width and doesn't allow

From 92044308144680060995a6df1a0396d0c4b0a10e Mon Sep 17 00:00:00 2001
From: Parham MohammadAlizadeh <prhmma@gmail.com>
Date: Fri, 17 Oct 2025 11:10:56 +0100
Subject: [PATCH 6/9] Update Doc/library/codecs.rst

Co-authored-by: Stan Ulbrych <89152624+StanFromIreland@users.noreply.github.com>
---
 Doc/library/codecs.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/Doc/library/codecs.rst b/Doc/library/codecs.rst
index 7911543301af51..fb584993ecab02 100644
--- a/Doc/library/codecs.rst
+++ b/Doc/library/codecs.rst
@@ -998,7 +998,7 @@ practice, so native-endian data round-trips without redundant byte swapping,
 even though the Unicode Standard defaults to big-endian when the byte order is
 unspecified. When these bytes are read by a CPU with a different endianness,
 then bytes have to be swapped though. To be able to detect the endianness of a
-``UTF-16`` or ``UTF-32`` byte sequence, there's the so called BOM ("Byte Order Mark").
+``UTF-16`` or ``UTF-32`` byte sequence, a BOM ("Byte Order Mark") is used.
 This is the Unicode character ``U+FEFF``. This character can be prepended to every
 ``UTF-16`` or ``UTF-32`` byte sequence. The byte swapped version of this character
 (``0xFFFE``) is an illegal character that may not appear in a Unicode text.

From b0fa2ba63ff5c06151ceba862d4271c582f47bb8 Mon Sep 17 00:00:00 2001
From: Parham MohammadAlizadeh <prhmma@gmail.com>
Date: Fri, 17 Oct 2025 11:11:11 +0100
Subject: [PATCH 7/9] Update Doc/library/codecs.rst

Co-authored-by: Stan Ulbrych <89152624+StanFromIreland@users.noreply.github.com>
---
 Doc/library/codecs.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/Doc/library/codecs.rst b/Doc/library/codecs.rst
index fb584993ecab02..6e3cf99c2ddb62 100644
--- a/Doc/library/codecs.rst
+++ b/Doc/library/codecs.rst
@@ -997,7 +997,7 @@ order when no BOM is present. The plain ``UTF-16`` codec (without a ``-BE`` or
 practice, so native-endian data round-trips without redundant byte swapping,
 even though the Unicode Standard defaults to big-endian when the byte order is
 unspecified. When these bytes are read by a CPU with a different endianness,
-then bytes have to be swapped though. To be able to detect the endianness of a
+the bytes have to be swapped. To be able to detect the endianness of a
 ``UTF-16`` or ``UTF-32`` byte sequence, a BOM ("Byte Order Mark") is used.
 This is the Unicode character ``U+FEFF``. This character can be prepended to every
 ``UTF-16`` or ``UTF-32`` byte sequence. The byte swapped version of this character

From ff82ed51dfc33d8a368483dca73e15818b948231 Mon Sep 17 00:00:00 2001
From: Prhmma <prhmma@gmail.com>
Date: Fri, 17 Oct 2025 11:22:08 +0100
Subject: [PATCH 8/9] Change "e.g." to "for example"

---
 Doc/library/codecs.rst | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/Doc/library/codecs.rst b/Doc/library/codecs.rst
index 6e3cf99c2ddb62..bf4ae822d95543 100644
--- a/Doc/library/codecs.rst
+++ b/Doc/library/codecs.rst
@@ -989,8 +989,8 @@ defined in Unicode. A simple and straightforward way that can store each Unicode
 code point, is to store each code point as four consecutive bytes. There are two
 possibilities: store the bytes in big endian or in little endian order. These
 two encodings are called ``UTF-32-BE`` and ``UTF-32-LE`` respectively. Their
-disadvantage is that if e.g. you use ``UTF-32-BE`` on a little endian machine you
-will always have to swap bytes on encoding and decoding.
+disadvantage is that if, for example, you use ``UTF-32-BE`` on a little endian
+machine you will always have to swap bytes on encoding and decoding.
 Python's ``UTF-32`` codec avoids this problem by using the platform's native byte
 order when no BOM is present. The plain ``UTF-16`` codec (without a ``-BE`` or
 ``-LE`` suffix) behaves the same way. Python follows prevailing platform

From 8fda4360bf9ac18ecdb6f82b1392d6630b29f41a Mon Sep 17 00:00:00 2001
From: Parham MohammadAlizadeh <prhmma@gmail.com>
Date: Sat, 18 Oct 2025 16:15:14 +0100
Subject: [PATCH 9/9] Update Doc/library/codecs.rst - deduplication

Co-authored-by: Stan Ulbrych <89152624+StanFromIreland@users.noreply.github.com>
---
 Doc/library/codecs.rst | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/Doc/library/codecs.rst b/Doc/library/codecs.rst
index 2c9545e3c5e303..2a5994b11d83d9 100644
--- a/Doc/library/codecs.rst
+++ b/Doc/library/codecs.rst
@@ -991,9 +991,9 @@ possibilities: store the bytes in big endian or in little endian order. These
 two encodings are called ``UTF-32-BE`` and ``UTF-32-LE`` respectively. Their
 disadvantage is that if, for example, you use ``UTF-32-BE`` on a little endian
 machine you will always have to swap bytes on encoding and decoding.
-Python's ``UTF-32`` codec avoids this problem by using the platform's native byte
-order when no BOM is present. The plain ``UTF-16`` codec (without a ``-BE`` or
-``-LE`` suffix) behaves the same way. Python follows prevailing platform
+Python's ``UTF-16`` and ``UTF-32`` codecs avoid this problem by using the
+platform's native byte order when no BOM is present.
+Python follows prevailing platform
 practice, so native-endian data round-trips without redundant byte swapping,
 even though the Unicode Standard defaults to big-endian when the byte order is
 unspecified. When these bytes are read by a CPU with a different endianness,