> Fast tokenization and structural analysis of
any programming language in Python

Programming Language Processing (PLP) brings the capabilities of modern NLP systems to the world of programming languages.
To achieve high-performance PLP systems, existing methods often take advantage of the fully defined nature of programming languages. In particular, the syntactic structure can be exploited to gain knowledge about programs.

**code.tokenize** provides easy access to the syntactic structure of a program. The tokenizer converts a program into a sequence of program tokens ready for further end-to-end processing.
By relating each token to an AST node, it is possible to extend the program representation easily with further syntactic information.
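To make the token-sequence idea concrete, the sketch below flattens a Python function into its lexical token strings using only the standard library. This is a simplified stand-in for illustration (the helper name `to_token_sequence` is ours, not part of the library); code.tokenize produces a comparable sequence while additionally linking each token to its AST node.

```python
import io
import tokenize

def to_token_sequence(source):
    """Flatten a program into its lexical token strings.

    Stdlib stand-in for illustration only; code.tokenize does this with
    a real parser and keeps the link from each token to its AST node.
    """
    return [
        tok.string
        for tok in tokenize.generate_tokens(io.StringIO(source).readline)
        if tok.string.strip()  # drop NEWLINE/INDENT/ENDMARKER placeholders
    ]

print(to_token_sequence("def f(x): return x + 1"))
# ['def', 'f', '(', 'x', ')', ':', 'return', 'x', '+', '1']
```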

Languages in the `native` class support all features
of this library and are extensively tested. `advanced` languages are tested but do not support the full feature set. Languages of the `basic` class are not tested and
only support the feature set of the backend. They can still be used for tokenization and AST parsing.

## How to contribute

**Your language is not natively supported by code.tokenize, or does the tokenization seem to be incorrect?** Then change it!

While code.tokenize is developed mainly as a helper library for internal research projects, we welcome pull requests of any sort, whether a new feature or a bug fix.

**Want to help test more languages?**
Our goal is to support as many languages as possible at a `native` level. However, languages at the `basic` level are completely untested. You can help by testing `basic` languages and reporting issues in the tokenization process!

## Release history

* 0.2.0
    * Major API redesign!
    * CHANGE: AST parsing is now done by an external library: [code_ast](https://github.com/cedricrupb/code_ast)
    * CHANGE: Visitor pattern instead of custom tokenizer
    * CHANGE: Custom visitors for language-dependent tokenization
* 0.1.0
    * The first proper release
    * CHANGE: Language-specific tokenizer configuration
Distributed under the MIT license. See ``LICENSE`` for more information.
This project was developed as part of our research related to:

```bibtex
@inproceedings{richter2022tssb,
  title={TSSB-3M: Mining single statement bugs at massive scale},
  author={Richter, Cedric and Wehrheim, Heike},
  booktitle={MSR},
  year={2022}
}
```

We thank the developers of the [tree-sitter](https://tree-sitter.github.io/tree-sitter/) library. Without tree-sitter, this project would not be possible.


In the following, we benchmark the runtime of **code.tokenize** for parsing Python functions. To obtain a realistic set of Python code for PLP, we employ
the Python portion of the [CodeSearchNet](https://github.com/github/CodeSearchNet) corpus. The corpus includes more than 500K Python functions
annotated for training.

## Environment
We benchmark the following implementation:

```python
import code_tokenize as ctok

ctok.tokenize(
    source_code,
    lang='python',
    syntax_error='raise'
)
```

Therefore, we skip all instances that contain syntax errors.
For benchmarking, we employ a MacBook Pro M1 with 8 GB RAM.
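The timing loop can be sketched as follows. This is our own minimal harness, not the actual benchmark script; in the real setup, `tokenize_fn` would be the `ctok.tokenize` call shown above, which raises on syntax errors so that those instances are skipped.

```python
import time

def benchmark(tokenize_fn, samples):
    """Time tokenization per sample, skipping sources that fail to parse."""
    timings = []  # (token_count, seconds) per successfully parsed sample
    for source in samples:
        try:
            start = time.perf_counter()
            tokens = tokenize_fn(source)
            timings.append((len(tokens), time.perf_counter() - start))
        except SyntaxError:
            continue  # mirrors syntax_error='raise': skip erroneous instances
    return timings

# Illustration with a trivial whitespace tokenizer as a stand-in:
timings = benchmark(str.split, ["def f(x): return x", "def g(): pass"])
```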
## Results
We start by plotting the mean runtime of the tokenizer in relation to the size of the Python function (in number of tokens). To determine the size of a program, we count the tokens in the pretokenized code. For brevity, we show results for functions below 1024 tokens (since this is the typical size of functions employed in PLP).
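The aggregation behind such a plot can be sketched as below. The function name and bucket width are our own choices for illustration, not taken from the benchmark code.

```python
from collections import defaultdict

def mean_runtime_by_size(timings, bucket=128):
    """Average runtime per token-count bucket (the x-axis of the plot)."""
    groups = defaultdict(list)
    for n_tokens, seconds in timings:
        groups[(n_tokens // bucket) * bucket].append(seconds)
    return {size: sum(s) / len(s) for size, s in sorted(groups.items())}

# Three hypothetical (token_count, seconds) measurements:
curve = mean_runtime_by_size([(100, 0.001), (120, 0.003), (300, 0.004)])
```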

We observe that the time for tokenization scales linearly with the number of tokens in the Python function. Even large functions with up to 1024 tokens can be tokenized within 10ms.

Note: The plot only shows runtimes for function implementations that are parsed without an error (Python 2 functions will likely produce an error). However, functions that raise an exception run in a similar time window.
## Complete set
Below is the uncut version of the diagram. Even for large-scale functions with more than 25K tokens, the tokenizer does not take much longer than 100ms.