Commit f80a9ad: Add CrystalBLEU (parent: c1d2092)

1 file changed: 34 additions, 0 deletions
---
layout: publication
title: "CrystalBLEU: Precisely and Efficiently Measuring the Similarity of Code"
authors: Aryaz Eghbali, Michael Pradel
conference: ASE
year: 2022
additional_links:
  - {name: "Preprint", url: "https://arxiv.org/abs/xxxx.xxxxxx"}
tags: ["evaluation"]
---
Recent years have brought a surge of work on predicting pieces of source code, e.g., for code completion, code migration, program repair, or translating natural language into code. All this work faces the challenge of evaluating the quality of a prediction w.r.t. some oracle, typically in the form of a reference solution. A common evaluation metric is the BLEU score, an n-gram-based metric originally proposed for evaluating natural language translation, but adopted in software engineering because it can be easily computed for any programming language and enables automated evaluation at scale. However, a key difference between natural and programming languages is that in the latter, completely unrelated pieces of code may have many common n-grams simply because of the syntactic verbosity and coding conventions of programming languages. We observe that these trivially shared n-grams hamper the ability of the metric to distinguish between truly similar code examples and code examples that are merely written in the same language. This paper presents CrystalBLEU, an evaluation metric based on BLEU that allows for precisely and efficiently measuring the similarity of code. Our metric preserves the desirable properties of BLEU, such as being language-agnostic, able to handle incomplete or partially incorrect code, and efficient, while reducing the noise caused by trivially shared n-grams. We evaluate CrystalBLEU on two datasets from prior work and on a new, labeled dataset of semantically equivalent programs. Our results show that CrystalBLEU can distinguish similar from dissimilar code examples 1.9–4.5 times more effectively than the original BLEU score and a previously proposed variant of BLEU for code.
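
To make the idea in the abstract concrete, here is a minimal, self-contained sketch, not the paper's reference implementation: it collects the k most frequent n-grams of a corpus as the "trivially shared" set and ignores them when computing a BLEU-style n-gram precision. The whitespace tokenization, add-one smoothing, cutoff k, and helper names (`trivially_shared`, `crystal_bleu_sketch`) are illustrative assumptions, not taken from the paper.

```python
# Illustrative sketch only: the corpus, tokenization, cutoff k, and
# smoothing below are placeholder assumptions, not the paper's method.
import math
from collections import Counter

def ngram_counts(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def trivially_shared(corpus, max_n=4, k=500):
    """The k most frequent 1..max_n-grams across a corpus of token lists."""
    counts = Counter()
    for tokens in corpus:
        for n in range(1, max_n + 1):
            counts.update(ngram_counts(tokens, n))
    return {gram for gram, _ in counts.most_common(k)}

def crystal_bleu_sketch(reference, candidate, trivial, max_n=4):
    """BLEU-style score that ignores n-grams in the `trivial` set."""
    log_precisions = []
    for n in range(1, max_n + 1):
        cand = {g: c for g, c in ngram_counts(candidate, n).items() if g not in trivial}
        ref = {g: c for g, c in ngram_counts(reference, n).items() if g not in trivial}
        overlap = sum(min(c, ref.get(g, 0)) for g, c in cand.items())
        total = sum(cand.values())
        # add-one smoothing so a zero overlap does not collapse the geometric mean
        log_precisions.append(math.log((overlap + 1) / (total + 1)))
    brevity = min(1.0, math.exp(1 - len(reference) / max(len(candidate), 1)))
    return brevity * math.exp(sum(log_precisions) / max_n)

# Hypothetical usage with whitespace tokenization; a realistic setup would use
# a language-specific tokenizer and a much larger corpus to derive `trivial`.
corpus = [line.split() for line in ["for ( int i = 0 ; i < n ; i ++ ) { }",
                                    "int i = 0 ;"]]
trivial = trivially_shared(corpus, k=5)
print(crystal_bleu_sketch("int x = y + 1 ;".split(), "int x = y + 2 ;".split(), trivial))
```

The key design choice illustrated here is that the trivially shared n-grams are derived once from a corpus of the target language, so frequent boilerplate constructs stop inflating the similarity of unrelated code while the rest of the BLEU machinery is left unchanged.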
