Commit 53ee4bd

Updated README for new release
1 parent ac550e2 commit 53ee4bd

1 file changed

+65
-13
lines changed

README.md

Lines changed: 65 additions & 13 deletions
@@ -3,44 +3,96 @@
33
</p>
44

55
------------------------------------------------
6+
> In short: Fast tokenization and structural analysis of
7+
any programming language
68

79
Programming Language Processing (PLP) brings the capabilities of modern NLP systems to the world of programming languages.
810
To achieve high-performance PLP systems, existing methods often take advantage of the fully defined nature of programming languages. In particular, the syntactic structure can be exploited to gain knowledge about programs.
911

10-
Code(dot)tokenize provides easy access to the syntactic structure of a program. The tokenizer converts a program into a sequence of program tokens ready for further end-to-end processing.
12+
**code.tokenize** provides easy access to the syntactic structure of a program. The tokenizer converts a program into a sequence of program tokens ready for further end-to-end processing.
1113
By relating each token to an AST node, it is possible to extend the program representation easily with further syntactic information.
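
To illustrate what relating tokens to AST nodes can look like, here is a minimal sketch that uses only Python's standard `tokenize` and `ast` modules, not code.tokenize or tree-sitter. The helper name `tokens_with_ast_nodes` is hypothetical and not part of this library. The sketch assumes Python 3.8+ and ASCII-only source, since `ast` reports byte offsets while `tokenize` reports character columns.

```python
import ast
import io
import tokenize

# Layout-only tokens that carry no program content of their own.
_SKIPPED = {
    tokenize.NEWLINE, tokenize.NL, tokenize.INDENT,
    tokenize.DEDENT, tokenize.ENDMARKER, tokenize.COMMENT,
}


def tokens_with_ast_nodes(source):
    """Pair each lexical token with the innermost AST node that covers it.

    Hypothetical helper for illustration; not the code.tokenize API.
    """
    tree = ast.parse(source)
    # Keep only nodes that carry full position information.
    nodes = [
        n for n in ast.walk(tree)
        if hasattr(n, "lineno") and hasattr(n, "end_lineno")
    ]

    pairs = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in _SKIPPED:
            continue
        best = None
        best_span = None
        for node in nodes:
            start = (node.lineno, node.col_offset)
            end = (node.end_lineno, node.end_col_offset)
            if not (start <= tok.start and tok.end <= end):
                continue  # node does not cover this token
            # Prefer a node whose span lies inside the current best span.
            if best is None or (start >= best_span[0] and end <= best_span[1]):
                best, best_span = node, (start, end)
        pairs.append((tok.string, type(best).__name__ if best else None))
    return pairs


if __name__ == "__main__":
    code = 'def my_func():\n    print("Hello World")\n'
    for text, node_type in tokens_with_ast_nodes(code):
        print(text, node_type)
```

Each token is paired with the name of the smallest enclosing node type, which is one simple way to attach extra syntactic information to a token sequence.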
1214

1315
## Installation
14-
The package is currently only tested under Python 3. It can be installed via:
16+
The package is tested under Python 3. It can be installed via:
1517
```
1618
pip install code-tokenize
1719
```
1820

19-
20-
## Library highlights
21-
Whether you are on the search for a fast multilingual program tokenizer or want to start your next PLP project, here are some reason why you should build upon ptokenizers:
22-
23-
* **Easy to use** All it takes to tokenize your code is to run a single line:
21+
## Usage
22+
code.tokenize can tokenize nearly any program code in a few lines of code:
2423
```
2524
import code_tokenize as ctok
2625
26+
# Python
2727
ctok.tokenize(
2828
'''
2929
def my_func():
3030
print("Hello World")
3131
''',
3232
lang = "python")
3333
34+
# Output: [def, my_func, (, ), :, #NEWLINE#, ...]
35+
36+
# Java
37+
ctok.tokenize(
38+
'''
39+
public static void main(String[] args){
40+
System.out.println("Hello World");
41+
}
42+
''',
43+
lang = "java",
44+
syntax_error = "ignore")
45+
46+
# Output: [public, static, void, main, (, String, [, ], args, ), {, System, ...]
47+
48+
# JavaScript
49+
ctok.tokenize(
50+
'''
51+
alert("Hello World");
52+
''',
53+
lang = "javascript",
54+
syntax_error = "ignore")
55+
56+
# Output: [alert, (, "Hello World", ), ;]
57+
58+
3459
```
3560

36-
* **Most programming languages supported** Since all our tokenizers are backed by [Tree-Sitter](https://tree-sitter.github.io/tree-sitter/) we support a long list of programming languages. This also includes popular languages such as Python, Java and JavaScript.
61+
## Supported languages
62+
code.tokenize employs [tree-sitter](https://tree-sitter.github.io/tree-sitter/) as a backend. Therefore, in principle, any language supported by tree-sitter is also
63+
supported by a tokenizer in code.tokenize.
64+
65+
For some languages, this library supports additional
66+
features that are not directly supported by tree-sitter.
67+
Therefore, we distinguish between three language classes
68+
and support the following language identifiers:
69+
70+
- `native`: python
71+
- `advanced`: java
72+
- `basic`: javascript, go, ruby, cpp, c, swift, rust, ...
73+
74+
Languages in the `native` class support all features
75+
of this library and are extensively tested. `advanced` languages are tested but do not support the full feature set. Languages of the `basic` class are not tested and
76+
only support the feature set of the backend. They can still be used for tokenization and AST parsing.
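
As a rough sketch of the classification described above, the three classes could be modelled as a small lookup. The names `LANGUAGE_CLASSES` and `language_class` are hypothetical and not part of code.tokenize. The sketch also assumes that any identifier not listed as `native` or `advanced` falls back to `basic`, which only holds if tree-sitter actually supports that language.

```python
# Hypothetical table built from the class list in this README.
# The `basic` class is open-ended, so it is handled as the fallback.
LANGUAGE_CLASSES = {
    "native": {"python"},
    "advanced": {"java"},
}


def language_class(lang):
    """Return the support class for a language identifier.

    Assumption: identifiers that are neither native nor advanced are
    treated as basic, i.e. served by the tree-sitter backend only.
    """
    lang = lang.lower()
    for cls, langs in LANGUAGE_CLASSES.items():
        if lang in langs:
            return cls
    return "basic"


if __name__ == "__main__":
    for name in ("python", "java", "go"):
        print(name, language_class(name))
```

A caller could use such a lookup to decide whether to rely on the additional features or only on tokenization and AST parsing.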
3777

78+
## Release history
79+
* 0.1.0
80+
* The first proper release
81+
* CHANGE: Language specific tokenizer configuration
82+
* CHANGE: Basic analyses of the program structure and token role
83+
* CHANGE: Documentation
84+
* 0.0.1
85+
* Work in progress
3886

39-
## Roadmap
40-
code(dot)tokenize is currently under active development. To enable application for various types of PLP methods, the following features are planned for future versions:
87+
## Project Info
88+
The goal of this project is to provide developers in the
89+
programming language processing community with easy
90+
access to program tokenization and AST parsing. This is currently developed as a helper library for internal research projects. Therefore, it will only be updated
91+
as needed.
4192

42-
- **Token tagging** Automatically identify certain token types including variable usages, definition and type usages.
93+
Feel free to open an issue if anything unexpected
94+
happens.
4395

44-
- **Syntactic relations** Automatically identify syntactic relations between tokens. This includes read and write relations or structural dependencies.
96+
Distributed under the MIT license. See ``LICENSE`` for more information.
4597

46-
- **Basic CFG analysis** Automatically identify statement heads which are connected via a control flow
98+
We thank the developers of the [tree-sitter](https://tree-sitter.github.io/tree-sitter/) library. Without tree-sitter, this project would not be possible.
