Commit 2705388: Merge pull request #3 from cedricrupb/code_ast

code_tokenize v0.2.0: Major API redesign

2 parents: 84e1b0a + 5f5dc55

25 files changed: +376 / -538 lines

.gitignore

Lines changed: 3 additions & 0 deletions

@@ -130,3 +130,6 @@ dmypy.json
 
 # Project specific ignore
 build/
+
+data/
+.DS_Store

README.md

Lines changed: 25 additions & 2 deletions

@@ -6,8 +6,8 @@
 > Fast tokenization and structural analysis of
 any programming language in Python
 
-Programminng Language Processing (PLP) brings the capabilities of modern NLP systems to the world of programming languages.
-To achieve high performance PLP systems, existing methods often take advantage of the fully defined nature of programminng languages. Especially the syntactical structure can be exploited to gain knowledge about programs.
+Programming Language Processing (PLP) brings the capabilities of modern NLP systems to the world of programming languages.
+To achieve high performance PLP systems, existing methods often take advantage of the fully defined nature of programming languages. Especially the syntactical structure can be exploited to gain knowledge about programs.
 
 **code.tokenize** provides easy access to the syntactic structure of a program. The tokenizer converts a program into a sequence of program tokens ready for further end-to-end processing.
 By relating each token to an AST node, it is possible to extend the program representation easily with further syntactic information.

@@ -75,7 +75,20 @@ Languages in the `native` class support all features
 of this library and are extensively tested. `advanced` languages are tested but do not support the full feature set. Languages of the `basic` class are not tested and
 only support the feature set of the backend. They can still be used for tokenization and AST parsing.
 
+## How to contribute
+**Is your language not natively supported by code.tokenize, or does the tokenization seem incorrect?** Then change it!
+
+While code.tokenize is developed mainly as a helper library for internal research projects, we welcome pull requests of any sort, whether a new feature or a bug fix.
+
+**Want to help test more languages?**
+Our goal is to support as many languages as possible at a `native` level. However, languages on the `basic` level are completely untested. You can help by testing `basic` languages and reporting issues in the tokenization process!
+
 ## Release history
+* 0.2.0
+    * Major API redesign!
+    * CHANGE: AST parsing is now done by an external library: [code_ast](https://github.com/cedricrupb/code_ast)
+    * CHANGE: Visitor pattern instead of custom tokenizer
+    * CHANGE: Custom visitors for language dependent tokenization
 * 0.1.0
    * The first proper release
    * CHANGE: Language specific tokenizer configuration

@@ -95,4 +108,14 @@ happens.
 
 Distributed under the MIT license. See ``LICENSE`` for more information.
 
+This project was developed as part of our research related to:
+```bibtex
+@inproceedings{richter2022tssb,
+  title={TSSB-3M: Mining single statement bugs at massive scale},
+  author={Cedric Richter and Heike Wehrheim},
+  booktitle={MSR},
+  year={2022}
+}
+```
+
 We thank the developers of the [tree-sitter](https://tree-sitter.github.io/tree-sitter/) library. Without tree-sitter this project would not be possible.
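For readers skimming the diff, here is a minimal usage sketch of the tokenizer the README describes. The call signature mirrors the one benchmarked in benchmark/README.md below; the printed token sequence is illustrative, not verbatim output.

```python
# Minimal sketch of the code.tokenize API described in the README.
# The commented output is illustrative only.
import code_tokenize as ctok

tokens = ctok.tokenize(
    "def my_func():\n    print('Hello World')",
    lang = 'python'
)

print(tokens)  # e.g. a sequence roughly like: def, my_func, (, ), :, print, ...
```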

benchmark/README.md

Lines changed: 40 additions & 0 deletions

@@ -0,0 +1,40 @@
# Benchmarking

In the following, we benchmark the runtime of **code.tokenize** for parsing Python functions. To obtain a realistic set of Python code for PLP, we employ
the Python portion of the [CodeSearchNet](https://github.com/github/CodeSearchNet) corpus. The corpus includes more than 500K Python functions
annotated for training.

## Environment
We benchmark the following tokenizer call:
```python
import code_tokenize as ctok

ctok.tokenize(
    source_code,
    lang = 'python',
    syntax_error = 'raise'
)
```
Since syntax errors raise an exception, we skip all instances that contain syntax errors.

For benchmarking, we employ a MacBook Pro M1 with 8GB RAM.

## Results
We start by plotting the mean runtime of the tokenizer in relation
to the size of the Python function (in number of tokens). To determine the size of a program, we count the tokens in the pretokenized code. For brevity, we show results for functions below 1024 tokens (since this is the typical size of functions employed in PLP).

<p align="center">
<img height="150" src="https://github.com/cedricrupb/code_tokenize/raw/main/benchmark/runtime_raise.png" />
</p>

We observe that the time for tokenization scales linearly with the number of tokens in the Python function. Even large functions with up to 1024 tokens can be tokenized within 10ms.
Note: The plot only shows runtimes for function implementations that are parsed without an error (Python 2 functions will likely produce an error). However, functions that raise an exception also run in a similar time window.


## Complete set
Below is the uncut version of the diagram. Even for large-scale functions with
more than 25K tokens, the tokenizer does not take much longer than 100ms.

<p align="center">
<img height="150" src="https://github.com/cedricrupb/code_tokenize/raw/main/benchmark/runtime_all.png" />
</p>
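As a rough illustration of how such per-function runtimes could be collected, here is a hedged sketch. The `functions` iterable, the use of `len()` on the returned token sequence, and the timing harness itself are assumptions for illustration (the benchmark above measures size on the pretokenized code); only the tokenize call is taken from the setup above.

```python
# Hypothetical timing harness for the benchmark described above.
# `functions` is assumed to be an iterable of Python source strings
# (e.g. loaded from the CodeSearchNet corpus); it is not defined here.
import time
import code_tokenize as ctok

def time_tokenization(functions):
    """Collect (token count, runtime in seconds) pairs for a list of sources."""
    results = []
    for source_code in functions:
        try:
            start  = time.perf_counter()
            tokens = ctok.tokenize(source_code, lang = 'python', syntax_error = 'raise')
            results.append((len(tokens), time.perf_counter() - start))
        except Exception:
            continue  # instances with syntax errors are skipped, as in the benchmark
    return results
```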

benchmark/runtime_all.png

14.7 KB

benchmark/runtime_raise.png

31.6 KB

code_tokenize/__init__.py

Lines changed: 9 additions & 3 deletions

@@ -1,7 +1,8 @@
 
-from .parsers import ASTParser
-from .config import load_from_lang_config
+from code_ast.parsers import ASTParser
+
 from .tokenizer import tokenize_tree
+from .lang import load_from_lang_config
 
 import logging as logger
 

@@ -38,6 +39,11 @@ def tokenize(source_code, lang = "guess", **kwargs):
         ignore: Ignores syntax errors. Helpful for parsing code snippets.
         Default: raise
 
+    visitors : list[Visitor]
+        Optional list of visitors that should be executed during tokenization
+        Since code is tokenized by traversing the parsed AST, visitors
+        can be used to run further AST based analyses.
+
     Returns
     -------
     TokenSequence

@@ -59,7 +65,7 @@ def tokenize(source_code, lang = "guess", **kwargs):
     parser = ASTParser(config.lang)
     tree, code = parser.parse(source_code)
 
-    return tokenize_tree(config, tree.root_node, code)
+    return tokenize_tree(config, tree.root_node, code, visitors = config.visitors)
 
 
 
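To make the new `visitors` argument concrete, a heavily hedged sketch follows. The `StringCounter` class is hypothetical; the example assumes that, as in the language configs added in this commit, the list holds visitor classes, that the tokenizer instantiates them with its internal node handler (as `LeafVisitor.__init__` suggests), and that supplying the list replaces the default `[LeafVisitor]`, which is why the sketch subclasses `LeafVisitor` to keep normal token emission intact.

```python
# Sketch: forwarding a custom analysis via the new `visitors` keyword.
# StringCounter is a hypothetical example class following the pattern of the
# language-specific visitors added in this commit.
import code_tokenize as ctok
from code_tokenize.lang.base_visitors import LeafVisitor

class StringCounter(LeafVisitor):
    strings_seen = 0  # class-level counter, since instances are created internally

    def visit_string(self, node):
        StringCounter.strings_seen += 1      # side analysis on top of tokenization
        return super().visit_string(node)    # still emit the literal as one token

tokens = ctok.tokenize(
    "s = 'a' + 'b'",
    lang     = 'python',
    visitors = [StringCounter]
)
print(StringCounter.strings_seen)  # expected 2, if both literals are visited as string nodes
```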

code_tokenize/config.py

Lines changed: 6 additions & 26 deletions

@@ -1,6 +1,8 @@
-import os
+
 import json
 
+from .lang.base_visitors import LeafVisitor
+
 
 class TokenizationConfig:
     """Helper object to translate arguments of tokenize to config object"""

@@ -9,14 +11,15 @@ def __init__(self, lang, **kwargs):
         self.lang = lang
         self.syntax_error = "raise" # Options: raise, warn, ignore
 
-        self.ident_tokens = False # Whether to represent indentations and newlines (Helpful for script languages like Python)
+        self.indent_tokens = False # Whether to represent indentations and newlines (Helpful for script languages like Python)
+        self.num_whitespaces_for_indent = 4
 
         # A list of all statement node types defined in the language
         self.statement_types = [
             "*_statement", "*_definition", "*_declaration"
         ]
 
-        self.path_handler = None # A dictionary that maps path handlers to AST node types
+        self.visitors = [LeafVisitor] # Visitor classes which should be run during analysis
 
         self.update(kwargs)
 

@@ -51,26 +54,3 @@ def load_from_config(config_path, **kwargs):
 
     return TokenizationConfig(**config)
 
-
-def _get_config_path():
-    current_path = os.path.abspath(__file__)
-
-    while len(current_path) > 0 and os.path.basename(current_path) != "code_tokenize":
-        current_path = os.path.dirname(current_path)
-    parent_path = os.path.dirname(current_path)
-
-    return os.path.join(parent_path, "lang_configs")
-
-
-def load_from_lang_config(lang, **kwargs):
-    """Automatically bootstrap config from language specific config"""
-    config_path = _get_config_path()
-    config_path = os.path.join(config_path, "%s.json" % lang)
-
-    if os.path.exists(config_path):
-        kwargs["lang"] = lang
-        return load_from_config(config_path, **kwargs)
-
-    return TokenizationConfig(lang, **kwargs)
-
-
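A small sketch of how the options above can be overridden, assuming (as `__init__` and its call to `update(kwargs)` suggest) that keyword arguments map directly onto the config attributes:

```python
# Sketch: overriding the defaults defined above via keyword arguments.
# Assumes update() assigns matching attributes, as its use in __init__ suggests.
from code_tokenize.config import TokenizationConfig

config = TokenizationConfig(
    'python',
    indent_tokens = True,   # renamed from ident_tokens in this commit
    syntax_error  = 'warn'
)

print(config.indent_tokens)               # True
print(config.num_whitespaces_for_indent)  # 4 (new default added in this commit)
```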

code_tokenize/lang/__init__.py

Lines changed: 23 additions & 0 deletions

@@ -0,0 +1,23 @@

from ..config import TokenizationConfig

from .python import create_tokenization_config as pytok_config
from .java import create_tokenization_config as jvtok_config
from .go import create_tokenization_config as gotok_config
from .js import create_tokenization_config as jstok_config
from .php import create_tokenization_config as phptok_config
from .ruby import create_tokenization_config as rubytok_config


def load_from_lang_config(lang, **kwargs):

    if lang == "python" : base_config = pytok_config()
    elif lang == "java" : base_config = jvtok_config()
    elif lang == "go" : base_config = gotok_config()
    elif lang == "javascript" : base_config = jstok_config()
    elif lang == "php" : base_config = phptok_config()
    elif lang == "ruby" : base_config = rubytok_config()
    else : base_config = TokenizationConfig(lang)

    base_config.update(kwargs)
    return base_config
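A short usage sketch for the new dispatch function; keyword arguments flow through `base_config.update`, so they override the language defaults:

```python
# Sketch: language-specific defaults plus user overrides.
from code_tokenize.lang import load_from_lang_config

config = load_from_lang_config('go', syntax_error = 'ignore')

print(config.lang)             # 'go'
print(config.syntax_error)     # 'ignore' (user override)
print(config.statement_types)  # Go-specific defaults, see lang/go/__init__.py below
```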
code_tokenize/lang/base_visitors.py

Lines changed: 17 additions & 0 deletions

@@ -0,0 +1,17 @@
from code_ast import ASTVisitor

# Basic visitor -----------------------------------------------------------

class LeafVisitor(ASTVisitor):

    def __init__(self, node_handler):
        self.node_handler = node_handler

    def visit_string(self, node):
        self.node_handler(node)
        return False

    def visit(self, node):
        if node.child_count == 0:
            self.node_handler(node)
            return False

code_tokenize/lang/go/__init__.py

Lines changed: 30 additions & 0 deletions

@@ -0,0 +1,30 @@

from ...config import TokenizationConfig
from ...tokens import NewlineToken

from ..base_visitors import LeafVisitor


# Tokenization config ----------------------------------------------------------------

def create_tokenization_config():
    return TokenizationConfig(
        lang = 'go',
        statement_types = ["*_statement", "*_declaration"],
        visitors = [GoLeafVisitor],
        indent_tokens = False
    )

# Custom leaf visitor ----------------------------------------------------------------

class GoLeafVisitor(LeafVisitor):

    def visit_interpreted_string_literal(self, node):
        self.node_handler(node)
        return False

    def visit(self, node):
        if node.type == "\n":
            self.node_handler.handle_token(NewlineToken(self.node_handler.config))
            return False
        return super().visit(node)
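Finally, a hedged usage sketch for the Go configuration above; the exact token sequence, including how newline tokens are rendered, is illustrative:

```python
# Sketch: tokenizing Go code with the configuration above.
# Newline nodes are emitted as NewlineToken instances by GoLeafVisitor.
import code_tokenize as ctok

tokens = ctok.tokenize(
    "package main\n\nfunc add(a int, b int) int {\n\treturn a + b\n}\n",
    lang = 'go'
)
print(tokens)
```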
