---
layout: publication
title: "CrystalBLEU: Precisely and Efficiently Measuring the Similarity of Code"
authors: Aryaz Eghbali, Michael Pradel
conference: ASE
year: 2022
additional_links:
  - {name: "Preprint", url: "https://arxiv.org/abs/xxxx.xxxxxx"}
tags: ["evaluation"]
---
Recent years have brought a surge of work on predicting pieces of source code, e.g., for code completion, code migration, program repair, or translating natural language into code. All this work faces the challenge of evaluating the quality of a prediction w.r.t. some oracle, typically in the form of a reference solution. A common evaluation metric is the BLEU score, an n-gram-based metric originally proposed for evaluating natural language translation, but adopted in software engineering because it can be easily computed for any programming language and enables automated evaluation at scale. However, a key difference between natural and programming languages is that in the latter, completely unrelated pieces of code may have many common n-grams simply because of the syntactic verbosity and coding conventions of programming languages. We observe that these trivially shared n-grams hamper the ability of the metric to distinguish between truly similar code examples and code examples that are merely written in the same language. This paper presents CrystalBLEU, an evaluation metric based on BLEU that precisely and efficiently measures the similarity of code. Our metric preserves the desirable properties of BLEU, such as being language-agnostic, handling incomplete or partially incorrect code, and being efficient, while reducing the noise caused by trivially shared n-grams. We evaluate CrystalBLEU on two datasets from prior work and on a new, labeled dataset of semantically equivalent programs. Our results show that CrystalBLEU distinguishes similar from dissimilar code examples 1.9–4.5 times more effectively than the original BLEU score and a previously proposed variant of BLEU for code.
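
To make the idea of trivially shared n-grams concrete, here is a small, hypothetical sketch (not the authors' implementation): it computes a plain BLEU-style score and a variant that ignores the most frequent n-grams of a toy corpus, in the spirit of CrystalBLEU as described above. The tokenizer, the example snippets, and the cutoff of 15 ignored n-grams are illustrative assumptions.

```python
import math
import re
from collections import Counter

def tokenize(code: str):
    # Naive tokenizer: words/numbers and single punctuation characters.
    # A realistic setup would use a language-specific lexer.
    return re.findall(r"\w+|[^\w\s]", code)

def ngram_counts(tokens, max_n=4):
    # Count all n-grams of order 1..max_n in a token sequence.
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

def bleu_like(reference, candidate, ignore=frozenset(), max_n=4):
    # BLEU-style score: clipped n-gram precisions (with add-one smoothing),
    # geometric mean, and brevity penalty. N-grams in `ignore` are skipped,
    # which illustrates the "trivially shared n-grams" idea from the abstract.
    ref_counts = ngram_counts(reference, max_n)
    cand_counts = ngram_counts(candidate, max_n)
    precisions = []
    for n in range(1, max_n + 1):
        matched, total = 0, 0
        for ng, c in cand_counts.items():
            if len(ng) != n or ng in ignore:
                continue
            total += c
            matched += min(c, ref_counts.get(ng, 0))
        precisions.append((matched + 1) / (total + 1))
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    brevity = min(1.0, math.exp(1 - len(reference) / max(len(candidate), 1)))
    return brevity * geo_mean

# Two snippets with related behavior vs. one unrelated snippet; all three
# share boilerplate such as "public static ... ( ... ) { ... ; }".
ref = tokenize("public static int max(int a, int b) { return a > b ? a : b; }")
similar = tokenize("public static int min(int a, int b) { return a < b ? a : b; }")
unrelated = tokenize("public static void log(String msg) { System.out.println(msg); }")

# "Trivially shared" n-grams: the most frequent n-grams over a (toy) corpus.
# A real setup would derive this set from a large corpus; 15 is an arbitrary cutoff.
corpus_counts = Counter()
for tokens in (ref, similar, unrelated):
    corpus_counts.update(ngram_counts(tokens))
trivially_shared = {ng for ng, _ in corpus_counts.most_common(15)}

for name, cand in [("similar", similar), ("unrelated", unrelated)]:
    plain = bleu_like(ref, cand)
    filtered = bleu_like(ref, cand, ignore=trivially_shared)
    print(f"{name:9s}  plain={plain:.3f}  ignoring-shared={filtered:.3f}")
```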