------------------------------------------------

> In short: Fast tokenization and structural analysis of any programming language

Programming Language Processing (PLP) brings the capabilities of modern NLP systems to the world of programming languages.
To build high-performance PLP systems, existing methods often take advantage of the fully defined nature of programming languages. In particular, the syntactic structure can be exploited to gain knowledge about programs.

**code.tokenize** provides easy access to the syntactic structure of a program. The tokenizer converts a program into a sequence of program tokens ready for further end-to-end processing.
By relating each token to an AST node, the program representation can easily be extended with further syntactic information.

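The idea of linking each token to its AST node can be sketched in plain Python. The classes below are purely illustrative and are not **code.tokenize**'s actual API; they only demonstrate why the token-to-node link is useful:

```python
from dataclasses import dataclass

# Illustrative sketch only: these classes are NOT part of code.tokenize's
# API. They merely show how a token can carry a reference to the AST node
# it was derived from.

@dataclass
class ASTNode:
    type: str          # e.g. "identifier", "call"
    children: list

@dataclass
class Token:
    text: str          # surface form, e.g. "my_func"
    ast_node: ASTNode  # syntactic context of this token

node = ASTNode(type="identifier", children=[])
token = Token(text="my_func", ast_node=node)

# The AST link lets downstream processing query syntactic information
# for every token, e.g. its node type:
print(token.ast_node.type)  # -> identifier
```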
## Installation
The package is tested under Python 3. It can be installed via:
```
pip install code-tokenize
```

## Usage
**code.tokenize** can tokenize nearly any program in just a few lines of code:
```
import code_tokenize as ctok

# Python
ctok.tokenize(
    '''
    def my_func():
        print("Hello World")
    ''',
    lang = "python")

# Output: [def, my_func, (, ), :, #NEWLINE#, ...]

# Java
ctok.tokenize(
    '''
    public static void main(String[] args){
        System.out.println("Hello World");
    }
    ''',
    lang = "java",
    syntax_error = "ignore")

# Output: [public, static, void, main, (, String, [, ], args, ), {, System, ...]

# JavaScript
ctok.tokenize(
    '''
    alert("Hello World");
    ''',
    lang = "javascript",
    syntax_error = "ignore")

# Output: [alert, (, "Hello World", ), ;]
```

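The `syntax_error` argument in the Java and JavaScript examples controls what happens when the parser hits a malformed program. A minimal sketch of such a policy switch follows; only `"ignore"` appears in the examples above, so the other mode names are assumptions, not necessarily the library's actual options:

```python
import warnings

# Hypothetical sketch of a syntax_error policy switch. Only "ignore" is
# taken from the usage examples above; "raise" and "warn" are assumed
# mode names for illustration.
def handle_syntax_error(message, syntax_error="raise"):
    if syntax_error == "raise":
        raise SyntaxError(message)   # abort tokenization
    elif syntax_error == "warn":
        warnings.warn(message)       # continue, but notify the caller
    elif syntax_error == "ignore":
        pass                         # tokenize the program as-is
    else:
        raise ValueError("Unknown syntax_error policy: %s" % syntax_error)

# With "ignore", a malformed program still yields tokens:
handle_syntax_error("missing semicolon", syntax_error="ignore")
```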
## Supported languages
**code.tokenize** employs [tree-sitter](https://tree-sitter.github.io/tree-sitter/) as a backend. Therefore, in principle, any language supported by tree-sitter is also supported by a tokenizer in **code.tokenize**.

For some languages, this library supports additional features that are not directly supported by tree-sitter.
We therefore distinguish between three language classes and support the following language identifiers:

- `native`: python
- `advanced`: java
- `basic`: javascript, go, ruby, cpp, c, swift, rust, ...

Languages in the `native` class support all features of this library and are extensively tested. `advanced` languages are tested but do not support the full feature set. Languages in the `basic` class are not tested and only support the feature set of the backend. They can still be used for tokenization and AST parsing.

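The three-way classification above can be expressed as a simple lookup table. This is only an illustrative sketch: the mapping is taken from the list above, while the dictionary and helper function are hypothetical, not part of the library:

```python
# Illustrative sketch: the class names come from the list above; this
# dictionary and helper are NOT part of code.tokenize's API.
SUPPORT_CLASS = {
    "python": "native",
    "java": "advanced",
    "javascript": "basic", "go": "basic", "ruby": "basic",
    "cpp": "basic", "c": "basic", "swift": "basic", "rust": "basic",
}

def support_class(lang):
    # Any other tree-sitter language falls back to "basic" support.
    return SUPPORT_CLASS.get(lang, "basic")

print(support_class("python"))   # -> native
print(support_class("haskell"))  # -> basic
```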
## Release history
* 0.1.0
  * The first proper release
  * CHANGE: Language-specific tokenizer configuration
  * CHANGE: Basic analyses of the program structure and token roles
  * CHANGE: Documentation
* 0.0.1
  * Work in progress

## Project Info
The goal of this project is to provide developers in the programming language processing community with easy access to program tokenization and AST parsing. The library is currently developed as a helper for internal research projects and will therefore only be updated as needed.

Feel free to open an issue if anything unexpected happens.

Distributed under the MIT license. See ``LICENSE`` for more information.

We thank the developers of the [tree-sitter](https://tree-sitter.github.io/tree-sitter/) library. Without tree-sitter, this project would not be possible.