Commit a8e7192

sjmoran authored and mallamanis committed
Create lherondelle2022topical.markdown
1 parent 5e84dd2 commit a8e7192

1 file changed: 20 additions, 0 deletions

@@ -0,0 +1,20 @@
+---
+layout: publication
+title: "Topical: Learning Repository Embeddings from Source Code using Attention"
+authors: Agathe Lherondelle, Yash Satsangi, Fran Silavong, Shaltiel Eloul, Sean Moran
+conference: Arxiv
+year: 2022
+additional_links:
+- {name: "ArXiV", url: "https://arxiv.org/pdf/2208.09495.pdf"}
+tags: ["representation", "topic modelling"]
+---
+Machine learning on source code (MLOnCode) promises to transform how software is delivered. By mining the context and relationships between software artefacts, MLOnCode
+augments the software developer’s capabilities with code autogeneration, code recommendation, code auto-tagging and other data-driven enhancements. For many of these tasks a script-level
+representation of code is sufficient; however, in many cases a repository-level representation that takes into account various dependencies and repository structure is imperative, for example,
+auto-tagging repositories with topics or auto-documentation of repository code. Existing methods for computing repository-level representations suffer from (a) reliance on natural language
+documentation of code (for example, README files) and (b) naive aggregation of method/script-level representations, for example, by concatenation or averaging. This paper introduces Topical, a
+deep neural network to generate repository-level embeddings of publicly available GitHub code repositories directly from source code. Topical incorporates an attention mechanism that projects the source code, the full dependency graph and the
+script-level textual information into a dense repository-level representation. To compute the repository-level representations, Topical is trained to predict the topics associated with a repository, on a dataset of publicly available GitHub repositories that
+were crawled along with their ground-truth topic tags. Our experiments show that the embeddings computed by Topical outperform multiple baselines, including baselines
+that naively combine the method-level representations through averaging or concatenation, at the task of repository auto-tagging. Furthermore, we show that Topical’s attention mechanism outperforms naive aggregation methods when computing repository-level representations from script-level representations generated
+by existing methods. Topical is a lightweight framework for computing repository-level representations of code repositories that scales efficiently with the number of topics and dataset size.
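
The abstract contrasts naive aggregation (averaging) with attention-based aggregation of script-level representations. The following is a minimal sketch of that contrast only, not the architecture from the paper: the function names, the toy two-dimensional embeddings, and the fixed `query` vector are all hypothetical, and a real model would learn the query (and any projections) during training on topic prediction.

```python
import math


def mean_pool(vectors):
    """Naive aggregation: element-wise average of the script embeddings."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(vectors, query):
    """Weight each script embedding by its scaled dot product with a query.

    `query` is a hypothetical stand-in for a learned parameter.
    Returns the pooled vector and the attention weights.
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * x for q, x in zip(query, v)) / scale for v in vectors]
    weights = softmax(scores)
    dim = len(vectors[0])
    pooled = [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(dim)]
    return pooled, weights


# Toy data (assumed): three script-level embeddings for one repository.
script_embeddings = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
query = [4.0, 0.0]  # assumed query that favours the first dimension

mean_vec = mean_pool(script_embeddings)
attn_vec, attn_weights = attention_pool(script_embeddings, query)
```

In this toy setup, averaging treats every script equally, while the attention weights concentrate on the script that aligns with the query, so the pooled vector can emphasise the scripts most informative for a given task.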
