Commit df97a73

willingc and pradyunsg authored
Updating CPython Internals (#1188)
* Move bytecode section out of compiler doc
* Edits to compiler sections for readability
* Add references for Objects and Specializing Adaptive Interpreter
* add description of PEP
* Add note for moved section on Introducing New Bytecode
* fix space
* Update internals/compiler.rst

  Co-authored-by: Pradyun Gedam <pradyunsg@gmail.com>
* Update internals/compiler.rst

  Co-authored-by: Pradyun Gedam <pradyunsg@gmail.com>

---------

Co-authored-by: Pradyun Gedam <pradyunsg@gmail.com>
1 parent 3793c45 commit df97a73

File tree

2 files changed

+105
-75
lines changed

internals/compiler.rst

Lines changed: 46 additions & 75 deletions
@@ -11,33 +11,40 @@ Abstract
1111

1212
In CPython, the compilation from source code to bytecode involves several steps:
1313

14-
1. Tokenize the source code (:cpy-file:`Parser/tokenizer.c`)
14+
1. Tokenize the source code (:cpy-file:`Parser/tokenizer.c`).
1515
2. Parse the stream of tokens into an Abstract Syntax Tree
16-
(:cpy-file:`Parser/parser.c`)
17-
3. Transform AST into an instruction sequence (:cpy-file:`Python/compile.c`)
18-
4. Construct a Control Flow Graph and apply optimizations to it (:cpy-file:`Python/flowgraph.c`)
19-
5. Emit bytecode based on the Control Flow Graph (:cpy-file:`Python/assemble.c`)
16+
(:cpy-file:`Parser/parser.c`).
17+
3. Transform AST into an instruction sequence (:cpy-file:`Python/compile.c`).
18+
4. Construct a Control Flow Graph and apply optimizations to it (:cpy-file:`Python/flowgraph.c`).
19+
5. Emit bytecode based on the Control Flow Graph (:cpy-file:`Python/assemble.c`).
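
The steps listed above can be loosely observed from Python itself. The following is a minimal sketch, not a view into the C implementation: it assumes the public ``tokenize``, ``ast``, and ``dis`` modules are an acceptable stand-in for the internal stages, and the built-in ``compile()`` performs steps 3 to 5 in one call without exposing the intermediate instruction sequence or control flow graph. The variable names are illustrative only.

```python
import ast
import dis
import io
import tokenize

source = "x = 1 + 2\n"

# Step 1 (approximation): tokenize with the public tokenize module.
# This is not guaranteed to match Parser/tokenizer.c token for token.
tokens = list(tokenize.generate_tokens(io.StringIO(source).readline))
token_names = [tokenize.tok_name[tok.type] for tok in tokens]

# Step 2: parse the source into an Abstract Syntax Tree.
tree = ast.parse(source)

# Steps 3-5: compile the AST into a code object containing bytecode.
# The instruction sequence and CFG are internal and not visible here.
code = compile(tree, filename="<example>", mode="exec")

print(token_names)
print(ast.dump(tree))
dis.dis(code)
```

The exact tokens, AST dump, and disassembly differ between Python versions, so no expected output is shown.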
2020

21-
The purpose of this document is to outline how these steps of the process work.
21+
This document outlines how these steps of the process work.
2222

23-
This document does not touch on how parsing works beyond what is needed
24-
to explain what is needed for compilation. It is also not exhaustive
25-
in terms of the how the entire system works. You will most likely need
26-
to read some source to have an exact understanding of all details.
23+
This document only describes parsing in enough depth to explain what is needed
24+
for understanding compilation. This document provides a detailed, though not
25+
exhaustive, view of how the entire system works. You will most likely need
26+
to read some source code to have an exact understanding of all details.
2727

2828

2929
Parsing
3030
=======
3131

3232
As of Python 3.9, Python's parser is a PEG parser of a somewhat
33-
unusual design (since its input is a stream of tokens rather than a
34-
stream of characters as is more common with PEG parsers).
33+
unusual design. It is unusual in the sense that the parser's input is a stream
34+
of tokens rather than a stream of characters, which is more common with PEG
35+
parsers.
3536

3637
The grammar file for Python can be found in
3738
:cpy-file:`Grammar/python.gram`. The definitions for literal tokens
3839
(such as ``:``, numbers, etc.) can be found in :cpy-file:`Grammar/Tokens`.
3940
Various C files, including :cpy-file:`Parser/parser.c`, are generated from
40-
these (see :ref:`grammar`).
41+
these.
42+
43+
.. seealso::
44+
45+
:ref:`parser` for a detailed description of the parser.
46+
47+
:ref:`grammar` for a detailed description of the grammar.
4148

4249

4350
Abstract syntax trees (AST)
@@ -133,9 +140,9 @@ Memory management
133140
=================
134141

135142
Before discussing the actual implementation of the compiler, a discussion of
136-
how memory is handled is in order. To make memory management simple, an arena
137-
is used. This means that a memory is pooled in a single location for easy
138-
allocation and removal. What this gives us is the removal of explicit memory
143+
how memory is handled is in order. To make memory management simple, an **arena**
144+
is used that pools memory in a single location for easy
145+
allocation and removal. This enables the removal of explicit memory
139146
deallocation. Because memory allocation for all needed memory in the compiler
140147
registers that memory with the arena, a single call to free the arena is all
141148
that is needed to completely free all memory used by the compiler.
@@ -153,8 +160,8 @@ used. That freeing is done with ``PyArena_Free()``. This only needs to be
153160
called in strategic areas where the compiler exits.
154161

155162
As stated above, in general you should not have to worry about memory
156-
management when working on the compiler. The technical details have been
157-
designed to be hidden from you for most cases.
163+
management when working on the compiler. The technical details of memory
164+
management have been designed to be hidden from you for most cases.
158165

159166
The only exception comes about when managing a PyObject. Since the rest
160167
of Python uses reference counting, there is extra support added
@@ -173,7 +180,7 @@ The AST is generated from source code using the function
173180
After some checks, a helper function in :cpy-file:`Parser/parser.c` begins applying
174181
production rules on the source code it receives; converting source code to
175182
tokens and matching these tokens recursively to their corresponding rule. The
176-
rule's corresponding rule function is called on every match. These rule
183+
production rule's corresponding rule function is called on every match. These rule
177184
functions follow the format :samp:`xx_rule`, where *xx* is the grammar rule
178185
that the function handles and is automatically derived from
179186
:cpy-file:`Grammar/python.gram`
@@ -293,7 +300,7 @@ number is passed as the last parameter to each ``stmt_ty`` function.
293300
Control flow graphs
294301
===================
295302

296-
A *control flow graph* (often referenced by its acronym, CFG) is a
303+
A **control flow graph** (often referenced by its acronym, **CFG**) is a
297304
directed graph that models the flow of a program. A node of a CFG is
298305
not an individual bytecode instruction, but instead represents a
299306
sequence of bytecode instructions that always execute sequentially.
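
To make the idea of a node as a straight-line run of instructions concrete, here is a rough sketch that groups the final bytecode of a function into basic blocks. It is an assumption-laden approximation: it works on already assembled bytecode through the public ``dis`` module rather than on the compiler's internal CFG in ``Python/flowgraph.c``, it only starts a new block at jump targets and after jump instructions, and it ignores block endings caused by returns, raises, or exception handlers. The names ``sample`` and ``basic_blocks`` are hypothetical.

```python
import dis


def sample(x):
    if x > 0:
        y = 1
    else:
        y = -1
    return y


def basic_blocks(func):
    """Group the instructions of func into approximate basic blocks."""
    instructions = list(dis.get_instructions(func))
    jump_opcodes = set(dis.hasjrel) | set(dis.hasjabs)

    # A "leader" is the first instruction of a block.
    leaders = set()
    if instructions:
        leaders.add(instructions[0].offset)
    for index, ins in enumerate(instructions):
        if ins.is_jump_target:
            leaders.add(ins.offset)
        if ins.opcode in jump_opcodes and index + 1 < len(instructions):
            leaders.add(instructions[index + 1].offset)

    # Walk the instructions in order, opening a new block at each leader.
    blocks = []
    for ins in instructions:
        if ins.offset in leaders:
            blocks.append([])
        blocks[-1].append(ins)
    return blocks


for number, block in enumerate(basic_blocks(sample)):
    print(f"block {number}: {[ins.opname for ins in block]}")
```

The opcode names printed for each block depend on the Python version in use.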
@@ -441,60 +448,6 @@ flattening and then a ``PyCodeObject`` is created. All of this is
441448
handled by calling ``assemble()``.
442449

443450

444-
Introducing new bytecode
445-
========================
446-
447-
Sometimes a new feature requires a new opcode. But adding new bytecode is
448-
not as simple as just suddenly introducing new bytecode in the AST ->
449-
bytecode step of the compiler. Several pieces of code throughout Python depend
450-
on having correct information about what bytecode exists.
451-
452-
First, you must choose a name, implement the bytecode in
453-
:cpy-file:`Python/bytecodes.c`, and add a documentation entry in
454-
:cpy-file:`Doc/library/dis.rst`. Then run ``make regen-cases`` to
455-
assign a number for it (see :cpy-file:`Include/opcode_ids.h`) and
456-
regenerate a number of files with the actual implementation of the
457-
bytecodes (:cpy-file:`Python/generated_cases.c.h`) and additional
458-
files with metadata about them.
459-
460-
With a new bytecode you must also change what is called the magic number for
461-
.pyc files. The variable ``MAGIC_NUMBER`` in
462-
:cpy-file:`Lib/importlib/_bootstrap_external.py` contains the number.
463-
Changing this number will lead to all .pyc files with the old ``MAGIC_NUMBER``
464-
to be recompiled by the interpreter on import. Whenever ``MAGIC_NUMBER`` is
465-
changed, the ranges in the ``magic_values`` array in :cpy-file:`PC/launcher.c`
466-
must also be updated. Changes to :cpy-file:`Lib/importlib/_bootstrap_external.py`
467-
will take effect only after running ``make regen-importlib``. Running this
468-
command before adding the new bytecode target to :cpy-file:`Python/bytecodes.c`
469-
(followed by ``make regen-cases``) will result in an error. You should only run
470-
``make regen-importlib`` after the new bytecode target has been added.
471-
472-
.. note:: On Windows, running the ``./build.bat`` script will automatically
473-
regenerate the required files without requiring additional arguments.
474-
475-
Finally, you need to introduce the use of the new bytecode. Altering
476-
:cpy-file:`Python/compile.c`, :cpy-file:`Python/bytecodes.c` will be the
477-
primary places to change. Optimizations in :cpy-file:`Python/flowgraph.c`
478-
may also need to be updated.
479-
If the new opcode affects a control flow or the block stack, you may have
480-
to update the ``frame_setlineno()`` function in :cpy-file:`Objects/frameobject.c`.
481-
:cpy-file:`Lib/dis.py` may need an update if the new opcode interprets its
482-
argument in a special way (like ``FORMAT_VALUE`` or ``MAKE_FUNCTION``).
483-
484-
If you make a change here that can affect the output of bytecode that
485-
is already in existence and you do not change the magic number constantly, make
486-
sure to delete your old .py(c|o) files! Even though you will end up changing
487-
the magic number if you change the bytecode, while you are debugging your work
488-
you will be changing the bytecode output without constantly bumping up the
489-
magic number. This means you end up with stale .pyc files that will not be
490-
recreated.
491-
Running ``find . -name '*.py[co]' -exec rm -f '{}' +`` should delete all .pyc
492-
files you have, forcing new ones to be created and thus allow you test out your
493-
new bytecode properly. Run ``make regen-importlib`` for updating the
494-
bytecode of frozen importlib files. You have to run ``make`` again after this
495-
for recompiling generated C files.
496-
497-
498451
Code objects
499452
============
500453

@@ -613,12 +566,30 @@ Important files
613566

614567
* :cpy-file:`Lib/opcode.py`: Master list of bytecode; if this file is
615568
modified you must modify several other files accordingly
616-
(see "`Introducing New Bytecode`_")
617569

618570
* :cpy-file:`Lib/importlib/_bootstrap_external.py`: Home of the magic number
619571
(named ``MAGIC_NUMBER``) for bytecode versioning.
620572

621573

574+
Objects
575+
=======
576+
577+
* :cpy-file:`Objects/locations.md`: Describes the location table
578+
* :cpy-file:`Objects/frame_layout.md`: Describes the frame stack
579+
* :cpy-file:`Objects/object_layout.md`: Describes object layout for 3.11 and later
580+
* :cpy-file:`Objects/exception_handling_notes.txt`: Exception handling notes
581+
582+
583+
Specializing Adaptive Interpreter
584+
=================================
585+
586+
Adding a specializing, adaptive interpreter to CPython will bring significant
587+
performance improvements. These documents provide more information:
588+
589+
* :pep:`659`: Specializing Adaptive Interpreter
590+
* :cpy-file:`Python/adaptive.md`: Adding or extending a family of adaptive instructions
591+
592+
622593
References
623594
==========
624595

internals/interpreter.rst

Lines changed: 59 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -312,3 +312,62 @@ Other topics
312312
- Tracing
313313
- Setting the current lineno (debugger-induced jumps)
314314
- Specialization, inline caches etc.
315+
316+
317+
Introducing new bytecode
318+
========================
319+
320+
.. note::
321+
322+
This section is relevant if you are adding a new bytecode to the interpreter.
323+
324+
325+
Sometimes a new feature requires a new opcode. But adding new bytecode is
326+
not as simple as just suddenly introducing new bytecode in the AST ->
327+
bytecode step of the compiler. Several pieces of code throughout Python depend
328+
on having correct information about what bytecode exists.
329+
330+
First, you must choose a name, implement the bytecode in
331+
:cpy-file:`Python/bytecodes.c`, and add a documentation entry in
332+
:cpy-file:`Doc/library/dis.rst`. Then run ``make regen-cases`` to
333+
assign a number for it (see :cpy-file:`Include/opcode_ids.h`) and
334+
regenerate a number of files with the actual implementation of the
335+
bytecodes (:cpy-file:`Python/generated_cases.c.h`) and additional
336+
files with metadata about them.
337+
338+
With a new bytecode you must also change what is called the magic number for
339+
.pyc files. The variable ``MAGIC_NUMBER`` in
340+
:cpy-file:`Lib/importlib/_bootstrap_external.py` contains the number.
341+
Changing this number will cause all .pyc files with the old ``MAGIC_NUMBER``
342+
to be recompiled by the interpreter on import. Whenever ``MAGIC_NUMBER`` is
343+
changed, the ranges in the ``magic_values`` array in :cpy-file:`PC/launcher.c`
344+
must also be updated. Changes to :cpy-file:`Lib/importlib/_bootstrap_external.py`
345+
will take effect only after running ``make regen-importlib``. Running this
346+
command before adding the new bytecode target to :cpy-file:`Python/bytecodes.c`
347+
(followed by ``make regen-cases``) will result in an error. You should only run
348+
``make regen-importlib`` after the new bytecode target has been added.
349+
350+
.. note:: On Windows, running the ``./build.bat`` script will automatically
351+
regenerate the required files without requiring additional arguments.
352+
353+
Finally, you need to introduce the use of the new bytecode. The primary
354+
places to change are :cpy-file:`Python/compile.c` and :cpy-file:`Python/bytecodes.c`.
355+
Optimizations in :cpy-file:`Python/flowgraph.c`
356+
may also need to be updated.
357+
If the new opcode affects control flow or the block stack, you may have
358+
to update the ``frame_setlineno()`` function in :cpy-file:`Objects/frameobject.c`.
359+
:cpy-file:`Lib/dis.py` may need an update if the new opcode interprets its
360+
argument in a special way (like ``FORMAT_VALUE`` or ``MAKE_FUNCTION``).
361+
362+
If you make a change here that can affect the output of bytecode that
363+
is already in existence and you do not change the magic number constantly, make
364+
sure to delete your old .py(c|o) files! Even though you will end up changing
365+
the magic number if you change the bytecode, while you are debugging your work
366+
you will be changing the bytecode output without constantly bumping up the
367+
magic number. This means you end up with stale .pyc files that will not be
368+
recreated.
369+
Running ``find . -name '*.py[co]' -exec rm -f '{}' +`` should delete all .pyc
370+
files you have, forcing new ones to be created and thus allowing you to test out your
371+
new bytecode properly. Run ``make regen-importlib`` to update the
372+
bytecode of frozen importlib files. You have to run ``make`` again after this
373+
to recompile generated C files.
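
Where ``find`` is not available, the same cleanup of stale compiled files can be sketched in Python. This is an illustrative equivalent of the ``find`` command above, not part of the CPython tooling; the function name ``remove_compiled_files`` is hypothetical, and the demonstration runs in a temporary directory so that it does not touch a real checkout.

```python
import pathlib
import tempfile


def remove_compiled_files(root):
    """Delete every *.pyc and *.pyo file under root; return the deleted paths."""
    removed = []
    for pattern in ("*.pyc", "*.pyo"):
        # Materialize the matches first so deletion does not disturb the walk.
        for path in list(pathlib.Path(root).rglob(pattern)):
            if path.is_file():
                path.unlink()
                removed.append(path)
    return removed


with tempfile.TemporaryDirectory() as tmp:
    cache = pathlib.Path(tmp) / "pkg" / "__pycache__"
    cache.mkdir(parents=True)
    (cache / "mod.cpython-313.pyc").write_bytes(b"stale")
    (cache.parent / "mod.py").write_text("x = 1\n")

    removed_names = sorted(p.name for p in remove_compiled_files(tmp))
    remaining = sorted(p.name for p in pathlib.Path(tmp).rglob("*") if p.is_file())

print(removed_names, remaining)
```

To use it on a source tree, call ``remove_compiled_files()`` with the path to the checkout.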
