-
-
Notifications
You must be signed in to change notification settings - Fork 33.8k
gh-101178: Add Ascii85, base85, and Z85 support to binascii #102753
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
It's a year later, and Z85 support has been added to For reference, this is the benchmark run that led me to do so. # After merging main but before adding Z85 support to this PR
(cpython-b85) $ python bench_b85.py 64
b64encode : 67108864 b in 0.121 s (527.435 MB/s) using 42.667 MB
b64decode : 89478488 b in 0.309 s (276.188 MB/s) using 56.889 MB
a85encode : 67108864 b in 0.297 s (215.150 MB/s) using 0.000 MB
a85decode : 83886080 b in 0.205 s (390.751 MB/s) using 0.000 MB
b85encode : 67108864 b in 0.106 s (604.359 MB/s) using 0.000 MB
b85decode : 83886080 b in 0.204 s (393.040 MB/s) using 0.000 MB
z85encode : 67108864 b in 0.204 s (313.610 MB/s) using 80.000 MB
z85decode : 83886080 b in 0.300 s (266.670 MB/s) using 100.000 MBThe existing Z85 implementation translates from the standard base85 alphabet to Z85 after the fact and within Python, so it was already benefiting from this PR but with substantial performance and memory usage overhead. That overhead is now gone. |
71f1955 to
7b4aba1
Compare
Add Ascii85, base85, and Z85 encoders and decoders to `binascii`, replacing the existing pure Python implementations in `base64`. No API or documentation changes are necessary with respect to `base64.a85encode()`, `b85encode()`, etc., and all existing unit tests for those functions continue to pass without modification. Note that attempting to decode Ascii85 or base85 data of length 1 mod 5 (after accounting for Ascii85 quirks) now produces an error, as no encoder would emit such data. This should be the only significant externally visible difference compared to the old implementation. Resolves: pythongh-101178
7b4aba1 to
05ae5ad
Compare
|
PR has been rebased onto main at 78cfee6 with squashing. |
I believe you have to document this change. |
Fair point, I could do that. In case anyone argues for keeping the old behavior (silently ignoring length 1 mod 5), I won't do it just yet. |
If we were strictly following PEP-0399, _base64 would be a C module for accelerated functions in base64. Due to historical reasons, those should actually go in binascii instead. We still want to preserve the existing Python code in base64. Parting out facilities for accessing the C functions into a module named _base64 shouldn't risk a naming conflict and will simplify testing.
This is done differently to PEP-0399 to minimize the number of changed lines.
As we're now keeping the existing Python base 85 functions, the C implementations should behave exactly the same, down to exception type and wording. It is also no longer an error to try to decode data of length 1 mod 5.
|
The PR has been updated to preserve the existing base 85 Python functions in |
Importing update_wrapper() from functools to copy attributes is expensive. Do it ourselves for only the most relevant ones.
This requires some code duplication, but oh well.
Using a decorator complicates function signature introspection.
Do we really need to test the legacy API twice?
|
PR was accidentally closed due to misclicking on mobile. There should be a confirmation dialog or something 😅 |
|
Thank you for your PR, @kangtastic. This looks very interesting. The speed was increased by 100 times, this is impressive. Initially we did not bother to implement these codecs in C, because they are non-standard, and the expected use was just for few kilobytes. But it seems people use them for larger data. I resolved conflicts and fixed tests and support of features added since the last update. I am planning to finish the review and land that code in Python. I managed to speed up Ascii85 encoding by 3 times, it can be almost so fast as Base85. If you don't mind, I'll apply this optimization here. As for general organization of the code, I think that introduction of the Also, instead of the boolean z85 parameter, it is better to add separate functions. It could be even better to generalize the functions to work with an arbitrary alphabet, but it is not easy for decoders. If we are going this way, we have to generate the decoding table in Python, or implement caching in C. |
Thanks for reviewing, @serhiy-storchaka.
No objections from me. Interested to see how you did it.
Yes, I originally removed the pure-Python base-85-related function implementations in
Regarding generalization for arbitrary alphabets, I didn't think it was necessary given that only Ascii85, Base85, and Z85 are in common use, and Ascii85 is different enough that there are really only two alphabets to be generalized. How about separate functions for Base85 and Z85 only on the Python API side? I'm inclined to keep them combined on the C side to avoid code duplication (with wrapper functions for the z85 parameter). |
picnixz
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few comments. I'd appreciate, as it's new code, that PEP-7 is followed.
|
|
||
|
|
||
| def _bytes_from_encode_data(b): | ||
| return b if isinstance(b, bytes_types) else memoryview(b).tobytes() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
memoryview(b).tobytes() may raise a generic TypeError here, whereas it is in _bytes_from_decode_data. Should we do it as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The discussion by @serhiy-storchaka seems to have been resolved with the conclusion that the pure-Python base-85-related functions from base64 may be removed.
I'll therefore turn base64.a85encode(), etc. into wrappers for the new binascii.b2a_ascii85(), etc. as I had it originally, similarly to how e.g. base64.b64encode() is a wrapper for binascii.b2a_base64().
_base64.py will be removed, since it will no longer be necessary to test against both Python and C implementations of the base-85-related functions. So that makes your concerns for this specific file moot.
However, to follow your comment here a bit:
The type conversion logic is replicated from the existing _85encode() helper function, which itself seems to have replicated the logic from the existing _b32encode() helper function.
The logic in _b32encode(), in turn, seems to be an attempt to replicate the existing binascii facility for input type conversions.
The new additions (base64.a85encode(), etc.) use the existing binascii facility, which raises a more descriptive TypeError instead of a generic one.
Can you or someone else verify that the C type conversion logic in binascii and the Python logic in base64 are equivalent, or at least equivalent enough?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh so it was something that was already there. Ok, sorry for the noise. I was more worried about the fact that a TypeError was caught during decoding but not during encoding.
|
|
||
| # Functions in binascii raise binascii.Error instead of ValueError. | ||
|
|
||
| def a85encode(b, *, foldspaces=False, wrapcol=0, pad=False, adobe=False): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| def a85encode(b, *, foldspaces=False, wrapcol=0, pad=False, adobe=False): | |
| def a85encode(b, /, *, foldspaces=False, wrapcol=0, pad=False, adobe=False): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, the existing Python implementation does not use positional-only parameters. Changing this is a separate issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Arf, ok :( I thought it was incorrectly c/c.
| raise ValueError(e) from None | ||
|
|
||
|
|
||
| def a85decode(b, *, foldspaces=False, adobe=False, ignorechars=b' \t\n\r\v'): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| def a85decode(b, *, foldspaces=False, adobe=False, ignorechars=b' \t\n\r\v'): | |
| def a85decode(b, /, *, foldspaces=False, adobe=False, ignorechars=b' \t\n\r\v'): |
| state = get_binascii_state(module); | ||
| if (state != NULL) { | ||
| PyErr_SetString(state->Error, "Ascii85 overflow"); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can be NULL only if:
- the state is really missing (
mod->md_state == NULL) - 'module' is not a module
I don't know when it could be possible for the state to be missing (maybe at interpreter's finalizaation?).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know when it could be possible for the state to be missing (maybe at interpreter's finalizaation?).
Apparently it's possible for state to be missing if an error is already set.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah but this is only if the error is a badargument (that is we didn't give a module).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I did it because every other PyErr_SetString(state->) or PyErr_Format(state->) in binascii does it 😅
Maybe someone knows if it's OK to do assert(state != NULL); instead, like some other modules. Separate issue?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I think it's better to do it in a separate issue.
|
|
||
| return PyBytesWriter_FinishWithPointer(writer, bin_data); | ||
|
|
||
| error_end: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd suggest naming this error instead of error_end.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will do.
It may be better to separate them for PGO though I have no idea whether there will be an impact or not (this needs to be measured). If you're worried about parts of the code being duplicated, it can be refactored into macros (or into smaller functions). However, we should avoid the user-interface exposing different flavors with boolean switches (we usually try to avoid this, as illustrated by examples of |
Co-authored-by: Bénédikt Tran <10796600+picnixz@users.noreply.github.com>
I pushed these changes. The code is similar to See also #143216 -- it reuses the same code for wrapping line in
Yes, of course. |
|
Hi @picnixz, thanks for reviewing.
I did attempt to follow PEP-7; did you have something specific in mind? If it's about the old-style variable declarations at the top of each C function, the reason for that was to match much of the rest of But since @serhiy-storchaka partially undid that in 0df9a40 (apparently following ca99af3) I don't mind cleaning up the style a bit more. |
|
It's mainly to avoid Py_ssize_t out_len = 5 * ((bin_len + 3) / 4);
if (wrap) out_len += 4;
if (!pad && (bin_len % 4)) out_len -= 4 - (bin_len % 4);
if (width && out_len) out_len += (out_len - 1) / width;I prefer clear |
Will do. |
Synopsis
Add Ascii85 and base85 encoder and decoder functions implemented in C to
binasciiand use them to greatly improve the performance and reduce the memory usage of the existing Ascii85, base85, and Z85 codec functions inbase64.No API or documentation changes are necessary with respect to any functions in
base64, and all existing unit tests for those functions continue to pass without modification.Resolves: gh-101178
Discussion
The base85-related functions in
base64are now wrappers for the new functions inbinascii, as envisioned in the docs:Parting out Ascii85 from base85 and Z85 was warranted in my testing despite the code duplication due to the various performance-murdering special cases in Ascii85.
Comments and questions are welcome.
Benchmarks
Updated April 20, 2025.
The old pure-Python implementation is two orders of magnitude slower and uses over O(40n) temporary memory.