Commit 2705388: Merge pull request #3 from cedricrupb/code_ast

code_tokenize v0.2.0: Major API redesign

2 parents: 84e1b0a + 5f5dc55

25 files changed: +376 / -538 lines

.gitignore

Lines changed: 3 additions & 0 deletions

@@ -130,3 +130,6 @@ dmypy.json
 
 # Project specific ignore
 build/
+
+data/
+.DS_Store

README.md

Lines changed: 25 additions & 2 deletions

@@ -6,8 +6,8 @@
 > Fast tokenization and structural analysis of
 any programming language in Python
 
-Programminng Language Processing (PLP) brings the capabilities of modern NLP systems to the world of programming languages.
-To achieve high performance PLP systems, existing methods often take advantage of the fully defined nature of programminng languages. Especially the syntactical structure can be exploited to gain knowledge about programs.
+Programming Language Processing (PLP) brings the capabilities of modern NLP systems to the world of programming languages.
+To achieve high performance PLP systems, existing methods often take advantage of the fully defined nature of programming languages. Especially the syntactical structure can be exploited to gain knowledge about programs.
 
 **code.tokenize** provides easy access to the syntactic structure of a program. The tokenizer converts a program into a sequence of program tokens ready for further end-to-end processing.
 By relating each token to an AST node, it is possible to extend the program representation easily with further syntactic information.

@@ -75,7 +75,20 @@ Languages in the `native` class support all features
 of this library and are extensively tested. `advanced` languages are tested but do not support the full feature set. Languages of the `basic` class are not tested and
 only support the feature set of the backend. They can still be used for tokenization and AST parsing.
 
+## How to contribute
+**Is your language not natively supported by code.tokenize, or does the tokenization seem incorrect?** Then change it!
+
+While code.tokenize is developed mainly as a helper library for internal research projects, we welcome pull requests of any sort, whether a new feature or a bug fix.
+
+**Want to help test more languages?**
+Our goal is to support as many languages as possible at a `native` level. However, languages on the `basic` level are completely untested. You can help by testing `basic` languages and reporting issues in the tokenization process!
+
 ## Release history
+* 0.2.0
+    * Major API redesign!
+    * CHANGE: AST parsing is now done by an external library: [code_ast](https://github.com/cedricrupb/code_ast)
+    * CHANGE: Visitor pattern instead of custom tokenizer
+    * CHANGE: Custom visitors for language dependent tokenization
 * 0.1.0
    * The first proper release
    * CHANGE: Language specific tokenizer configuration

@@ -95,4 +108,14 @@ happens.
 
 Distributed under the MIT license. See ``LICENSE`` for more information.
 
+This project was developed as part of our research related to:
+```bibtex
+@inproceedings{richter2022tssb,
+  title={TSSB-3M: Mining single statement bugs at massive scale},
+  author={Cedric Richter and Heike Wehrheim},
+  booktitle={MSR},
+  year={2022}
+}
+```
+
 We thank the developers of the [tree-sitter](https://tree-sitter.github.io/tree-sitter/) library. Without tree-sitter this project would not be possible.
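For readers skimming the diff, here is a minimal usage sketch of the tokenizer the README describes. The call signature mirrors the one benchmarked in benchmark/README.md below; the printed token sequence is illustrative, not verbatim output.

```python
# Minimal sketch of the code.tokenize API described in the README.
# The commented output is illustrative only.
import code_tokenize as ctok

tokens = ctok.tokenize(
    "def my_func():\n    print('Hello World')",
    lang = 'python'
)

print(tokens)  # e.g. a sequence roughly like: def, my_func, (, ), :, print, ...
```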

benchmark/README.md

Lines changed: 40 additions & 0 deletions

@@ -0,0 +1,40 @@
# Benchmarking

In the following, we benchmark the runtime of **code.tokenize** for parsing Python functions. To obtain a realistic set of Python code for PLP, we employ
the Python portion of the [CodeSearchNet](https://github.com/github/CodeSearchNet) corpus. The corpus includes more than 500K Python functions
annotated for training.

## Environment
We benchmark the following tokenizer call:
```python
import code_tokenize as ctok

ctok.tokenize(
    source_code,
    lang = 'python',
    syntax_error = 'raise'
)
```
Since syntax errors raise an exception, we skip all instances that contain syntax errors.

For benchmarking, we employ a MacBook Pro M1 with 8GB RAM.

## Results
We start by plotting the mean runtime of the tokenizer in relation
to the size of the Python function (in number of tokens). To determine the size of a program, we count the tokens in the pretokenized code. For brevity, we show results for functions below 1024 tokens (since this is the typical size of functions employed in PLP).

<p align="center">
<img height="150" src="https://github.com/cedricrupb/code_tokenize/raw/main/benchmark/runtime_raise.png" />
</p>

We observe that the time for tokenization scales linearly with the number of tokens in the Python function. Even large functions with up to 1024 tokens can be tokenized within 10ms.
Note: The plot only shows runtimes for function implementations that are parsed without an error (Python 2 functions will likely produce an error). However, functions that raise an exception also run in a similar time window.


## Complete set
Below is the uncut version of the diagram. Even for large-scale functions with
more than 25K tokens, the tokenizer does not take much longer than 100ms.

<p align="center">
<img height="150" src="https://github.com/cedricrupb/code_tokenize/raw/main/benchmark/runtime_all.png" />
</p>
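As a rough illustration of how such per-function runtimes could be collected, here is a hedged sketch. The `functions` iterable, the use of `len()` on the returned token sequence, and the timing harness itself are assumptions for illustration (the benchmark above measures size on the pretokenized code); only the tokenize call is taken from the setup above.

```python
# Hypothetical timing harness for the benchmark described above.
# `functions` is assumed to be an iterable of Python source strings
# (e.g. loaded from the CodeSearchNet corpus); it is not defined here.
import time
import code_tokenize as ctok

def time_tokenization(functions):
    """Collect (token count, runtime in seconds) pairs for a list of sources."""
    results = []
    for source_code in functions:
        try:
            start  = time.perf_counter()
            tokens = ctok.tokenize(source_code, lang = 'python', syntax_error = 'raise')
            results.append((len(tokens), time.perf_counter() - start))
        except Exception:
            continue  # instances with syntax errors are skipped, as in the benchmark
    return results
```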

benchmark/runtime_all.png

14.7 KB

benchmark/runtime_raise.png

31.6 KB

code_tokenize/__init__.py

Lines changed: 9 additions & 3 deletions

@@ -1,7 +1,8 @@
 
-from .parsers import ASTParser
-from .config import load_from_lang_config
+from code_ast.parsers import ASTParser
+
 from .tokenizer import tokenize_tree
+from .lang import load_from_lang_config
 
 import logging as logger
 

@@ -38,6 +39,11 @@ def tokenize(source_code, lang = "guess", **kwargs):
         ignore: Ignores syntax errors. Helpful for parsing code snippets.
         Default: raise
 
+    visitors : list[Visitor]
+        Optional list of visitors that should be executed during tokenization
+        Since code is tokenized by traversing the parsed AST, visitors
+        can be used to run further AST based analyses.
+
     Returns
     -------
     TokenSequence

@@ -59,7 +65,7 @@ def tokenize(source_code, lang = "guess", **kwargs):
     parser = ASTParser(config.lang)
     tree, code = parser.parse(source_code)
 
-    return tokenize_tree(config, tree.root_node, code)
+    return tokenize_tree(config, tree.root_node, code, visitors = config.visitors)
 
 
 
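To make the new `visitors` argument concrete, a heavily hedged sketch follows. The `StringCounter` class is hypothetical; the example assumes that, as in the language configs added in this commit, the list holds visitor classes, that the tokenizer instantiates them with its internal node handler (as `LeafVisitor.__init__` suggests), and that supplying the list replaces the default `[LeafVisitor]`, which is why the sketch subclasses `LeafVisitor` to keep normal token emission intact.

```python
# Sketch: forwarding a custom analysis via the new `visitors` keyword.
# StringCounter is a hypothetical example class following the pattern of the
# language-specific visitors added in this commit.
import code_tokenize as ctok
from code_tokenize.lang.base_visitors import LeafVisitor

class StringCounter(LeafVisitor):
    strings_seen = 0  # class-level counter, since instances are created internally

    def visit_string(self, node):
        StringCounter.strings_seen += 1      # side analysis on top of tokenization
        return super().visit_string(node)    # still emit the literal as one token

tokens = ctok.tokenize(
    "s = 'a' + 'b'",
    lang     = 'python',
    visitors = [StringCounter]
)
print(StringCounter.strings_seen)  # expected 2, if both literals are visited as string nodes
```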

code_tokenize/config.py

Lines changed: 6 additions & 26 deletions

@@ -1,6 +1,8 @@
-import os
+
 import json
 
+from .lang.base_visitors import LeafVisitor
+
 
 class TokenizationConfig:
     """Helper object to translate arguments of tokenize to config object"""

@@ -9,14 +11,15 @@ def __init__(self, lang, **kwargs):
         self.lang = lang
         self.syntax_error = "raise" # Options: raise, warn, ignore
 
-        self.ident_tokens = False # Whether to represent indentations and newlines (Helpful for script languages like Python)
+        self.indent_tokens = False # Whether to represent indentations and newlines (Helpful for script languages like Python)
+        self.num_whitespaces_for_indent = 4
 
         # A list of all statement node types defined in the language
         self.statement_types = [
             "*_statement", "*_definition", "*_declaration"
         ]
 
-        self.path_handler = None # A dictionary that maps path handlers to AST node types
+        self.visitors = [LeafVisitor] # Visitor classes which should be run during analysis
 
         self.update(kwargs)
 

@@ -51,26 +54,3 @@ def load_from_config(config_path, **kwargs):
 
     return TokenizationConfig(**config)
 
-
-def _get_config_path():
-    current_path = os.path.abspath(__file__)
-
-    while len(current_path) > 0 and os.path.basename(current_path) != "code_tokenize":
-        current_path = os.path.dirname(current_path)
-    parent_path = os.path.dirname(current_path)
-
-    return os.path.join(parent_path, "lang_configs")
-
-
-def load_from_lang_config(lang, **kwargs):
-    """Automatically bootstrap config from language specific config"""
-    config_path = _get_config_path()
-    config_path = os.path.join(config_path, "%s.json" % lang)
-
-    if os.path.exists(config_path):
-        kwargs["lang"] = lang
-        return load_from_config(config_path, **kwargs)
-
-    return TokenizationConfig(lang, **kwargs)
-
-
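A small sketch of how the options above can be overridden, assuming (as `__init__` and its call to `update(kwargs)` suggest) that keyword arguments map directly onto the config attributes:

```python
# Sketch: overriding the defaults defined above via keyword arguments.
# Assumes update() assigns matching attributes, as its use in __init__ suggests.
from code_tokenize.config import TokenizationConfig

config = TokenizationConfig(
    'python',
    indent_tokens = True,   # renamed from ident_tokens in this commit
    syntax_error  = 'warn'
)

print(config.indent_tokens)               # True
print(config.num_whitespaces_for_indent)  # 4 (new default added in this commit)
```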

code_tokenize/lang/__init__.py

Lines changed: 23 additions & 0 deletions

@@ -0,0 +1,23 @@

from ..config import TokenizationConfig

from .python import create_tokenization_config as pytok_config
from .java import create_tokenization_config as jvtok_config
from .go import create_tokenization_config as gotok_config
from .js import create_tokenization_config as jstok_config
from .php import create_tokenization_config as phptok_config
from .ruby import create_tokenization_config as rubytok_config


def load_from_lang_config(lang, **kwargs):

    if lang == "python" : base_config = pytok_config()
    elif lang == "java" : base_config = jvtok_config()
    elif lang == "go" : base_config = gotok_config()
    elif lang == "javascript" : base_config = jstok_config()
    elif lang == "php" : base_config = phptok_config()
    elif lang == "ruby" : base_config = rubytok_config()
    else : base_config = TokenizationConfig(lang)

    base_config.update(kwargs)
    return base_config
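A short usage sketch for the new dispatch function; keyword arguments flow through `base_config.update`, so they override the language defaults:

```python
# Sketch: language-specific defaults plus user overrides.
from code_tokenize.lang import load_from_lang_config

config = load_from_lang_config('go', syntax_error = 'ignore')

print(config.lang)             # 'go'
print(config.syntax_error)     # 'ignore' (user override)
print(config.statement_types)  # Go-specific defaults, see lang/go/__init__.py below
```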
code_tokenize/lang/base_visitors.py

Lines changed: 17 additions & 0 deletions

@@ -0,0 +1,17 @@
from code_ast import ASTVisitor

# Basic visitor -----------------------------------------------------------

class LeafVisitor(ASTVisitor):

    def __init__(self, node_handler):
        self.node_handler = node_handler

    def visit_string(self, node):
        self.node_handler(node)
        return False

    def visit(self, node):
        if node.child_count == 0:
            self.node_handler(node)
            return False

code_tokenize/lang/go/__init__.py

Lines changed: 30 additions & 0 deletions

@@ -0,0 +1,30 @@

from ...config import TokenizationConfig
from ...tokens import NewlineToken

from ..base_visitors import LeafVisitor


# Tokenization config ----------------------------------------------------------------

def create_tokenization_config():
    return TokenizationConfig(
        lang = 'go',
        statement_types = ["*_statement", "*_declaration"],
        visitors = [GoLeafVisitor],
        indent_tokens = False
    )

# Custom leaf visitor ----------------------------------------------------------------

class GoLeafVisitor(LeafVisitor):

    def visit_interpreted_string_literal(self, node):
        self.node_handler(node)
        return False

    def visit(self, node):
        if node.type == "\n":
            self.node_handler.handle_token(NewlineToken(self.node_handler.config))
            return False
        return super().visit(node)
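Finally, a hedged usage sketch for the Go configuration above; the exact token sequence, including how newline tokens are rendered, is illustrative:

```python
# Sketch: tokenizing Go code with the configuration above.
# Newline nodes are emitted as NewlineToken instances by GoLeafVisitor.
import code_tokenize as ctok

tokens = ctok.tokenize(
    "package main\n\nfunc add(a int, b int) int {\n\treturn a + b\n}\n",
    lang = 'go'
)
print(tokens)
```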
