
Commit a191fcb

Add the stack
1 parent ed0d3a7 commit a191fcb


1 file changed: 23 additions, 0 deletions

@@ -0,0 +1,23 @@
+---
+layout: publication
+title: "The Stack: 3TB of permissively licensed source code"
+authors: Denis Kocetkov, Raymond Li, Loubna Ben Allal, Jia Li, Chenghao Mou, Carlos Muñoz Ferrandis, Sean Hughes, Thomas Wolf, Dzmitry Bahdanau, Leandro von Werra, Harm de Vries
+conference:
+year: 2022
+additional_links:
+- {name: "Preprint", url: "https://drive.google.com/file/d/17J-0KXTDzY9Esp-JqXYHIcy--i_7G5Bb/view"}
+tags: ["dataset"]
+---
+Large Language Models (LLMs) play an ever-increasing role in the field of
+Artificial Intelligence (AI)–not only for natural language processing but also
+for code understanding and generation. To stimulate open and responsible
+research on LLMs for code, we introduce The Stack, a 3.1 TB dataset
+consisting of permissively licensed source code in 30 programming languages.
+We describe how we collect the full dataset, construct a permissively licensed
+subset, and present promising results on text2code benchmarks by training 350M-parameter decoders on different Python subsets. We find that
+(1) near-deduplicating the data significantly boosts performance across all
+experiments, and (2) it is possible to match previously reported HumanEval
+and MBPP performance using only permissively licensed data. We make the
+dataset available at https://hf.co/BigCode and give developers the
+possibility to have their code removed from the dataset by following the
+instructions at https://www.bigcode-project.org/docs/about/the-stack/.
