------------------------------------------------

> In short: Fast tokenization and structural analysis of any programming language

Programming Language Processing (PLP) brings the capabilities of modern NLP systems to the world of programming languages.
To build high-performance PLP systems, existing methods often take advantage of the fully defined nature of programming languages. In particular, the syntactic structure can be exploited to gain knowledge about programs.

**code.tokenize** provides easy access to the syntactic structure of a program. The tokenizer converts a program into a sequence of program tokens ready for further end-to-end processing.
By relating each token to an AST node, the program representation can easily be extended with further syntactic information.

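The idea of linking each token to its AST node can be sketched in plain Python. The classes below are purely illustrative and are not **code.tokenize**'s actual API; they only demonstrate why the token-to-node link is useful:

```python
from dataclasses import dataclass

# Illustrative sketch only: these classes are NOT part of code.tokenize's
# API. They merely show how a token can carry a reference to the AST node
# it was derived from.

@dataclass
class ASTNode:
    type: str          # e.g. "identifier", "call"
    children: list

@dataclass
class Token:
    text: str          # surface form, e.g. "my_func"
    ast_node: ASTNode  # syntactic context of this token

node = ASTNode(type="identifier", children=[])
token = Token(text="my_func", ast_node=node)

# The AST link lets downstream processing query syntactic information
# for every token, e.g. its node type:
print(token.ast_node.type)  # -> identifier
```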
## Installation
The package is tested under Python 3. It can be installed via:
```
pip install code-tokenize
```

## Usage
**code.tokenize** can tokenize nearly any program in just a few lines of code:
```
import code_tokenize as ctok

# Python
ctok.tokenize(
    '''
    def my_func():
        print("Hello World")
    ''',
    lang = "python")

# Output: [def, my_func, (, ), :, #NEWLINE#, ...]

# Java
ctok.tokenize(
    '''
    public static void main(String[] args){
        System.out.println("Hello World");
    }
    ''',
    lang = "java",
    syntax_error = "ignore")

# Output: [public, static, void, main, (, String, [, ], args, ), {, System, ...]

# JavaScript
ctok.tokenize(
    '''
    alert("Hello World");
    ''',
    lang = "javascript",
    syntax_error = "ignore")

# Output: [alert, (, "Hello World", ), ;]
```

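The `syntax_error` argument in the Java and JavaScript examples controls what happens when the parser hits a malformed program. A minimal sketch of such a policy switch follows; only `"ignore"` appears in the examples above, so the other mode names are assumptions, not necessarily the library's actual options:

```python
import warnings

# Hypothetical sketch of a syntax_error policy switch. Only "ignore" is
# taken from the usage examples above; "raise" and "warn" are assumed
# mode names for illustration.
def handle_syntax_error(message, syntax_error="raise"):
    if syntax_error == "raise":
        raise SyntaxError(message)   # abort tokenization
    elif syntax_error == "warn":
        warnings.warn(message)       # continue, but notify the caller
    elif syntax_error == "ignore":
        pass                         # tokenize the program as-is
    else:
        raise ValueError("Unknown syntax_error policy: %s" % syntax_error)

# With "ignore", a malformed program still yields tokens:
handle_syntax_error("missing semicolon", syntax_error="ignore")
```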
## Supported languages
**code.tokenize** employs [tree-sitter](https://tree-sitter.github.io/tree-sitter/) as a backend. Therefore, in principle, any language supported by tree-sitter is also supported by a tokenizer in **code.tokenize**.

For some languages, this library supports additional features that are not directly supported by tree-sitter.
We therefore distinguish between three language classes and support the following language identifiers:

- `native`: python
- `advanced`: java
- `basic`: javascript, go, ruby, cpp, c, swift, rust, ...

Languages in the `native` class support all features of this library and are extensively tested. `advanced` languages are tested but do not support the full feature set. Languages in the `basic` class are not tested and only support the feature set of the backend. They can still be used for tokenization and AST parsing.

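The three-way classification above can be expressed as a simple lookup table. This is only an illustrative sketch: the mapping is taken from the list above, while the dictionary and helper function are hypothetical, not part of the library:

```python
# Illustrative sketch: the class names come from the list above; this
# dictionary and helper are NOT part of code.tokenize's API.
SUPPORT_CLASS = {
    "python": "native",
    "java": "advanced",
    "javascript": "basic", "go": "basic", "ruby": "basic",
    "cpp": "basic", "c": "basic", "swift": "basic", "rust": "basic",
}

def support_class(lang):
    # Any other tree-sitter language falls back to "basic" support.
    return SUPPORT_CLASS.get(lang, "basic")

print(support_class("python"))   # -> native
print(support_class("haskell"))  # -> basic
```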
## Release history
* 0.1.0
  * The first proper release
  * CHANGE: Language-specific tokenizer configuration
  * CHANGE: Basic analyses of the program structure and token roles
  * CHANGE: Documentation
* 0.0.1
  * Work in progress

## Project Info
The goal of this project is to provide developers in the programming language processing community with easy access to program tokenization and AST parsing. The library is currently developed as a helper for internal research projects and will therefore only be updated as needed.

Feel free to open an issue if anything unexpected happens.

Distributed under the MIT license. See ``LICENSE`` for more information.

We thank the developers of the [tree-sitter](https://tree-sitter.github.io/tree-sitter/) library. Without tree-sitter, this project would not be possible.