Commit b2c533c

Fabio Salerno authored and committed: update README.md

1 parent 191ce8f · commit b2c533c

File tree

1 file changed: +52 -2 lines changed


README.md

Lines changed: 52 additions & 2 deletions
# LLM4Code-memtune

Replication package for the paper: "**How Much Do Code Language Models Remember? An Investigation on Data Extraction Attacks before and after Fine-tuning**"

For questions:
- Repository content: please use the issues board
- Paper inquiries: contact the first author via email (info DOT fabiosalern AT gmail DOT COM)

## Repository Structure

```
LLM4Code-memtune/
├── data/          # Dataset filtering and sample creation tools
├── training/      # StarCoder2 fine-tuning scripts and training stats
└── evaluation/    # Data extraction experiment code and results
```

## Requirements

### Hardware Requirements

- GPU: Nvidia A100 (80GB VRAM)
- RAM: 32GB
- CPU: 16 cores

GPU requirements by model:
- StarCoder2-3B: 2 GPUs
- StarCoder2-7B: 4 GPUs
- StarCoder2-15B: 6 GPUs

Note: Data extraction experiments can run on a single GPU.
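
The per-model GPU counts above can be captured in a small lookup helper when writing launch scripts. This is a minimal sketch; the `gpus_for` helper and its fallback of one GPU (sufficient for the extraction experiments, per the note above) are illustrative, not part of the replication package:

```python
# Recommended number of A100 GPUs for fine-tuning each StarCoder2
# variant, as listed in the hardware requirements above.
GPUS_PER_MODEL = {
    "starcoder2-3b": 2,
    "starcoder2-7b": 4,
    "starcoder2-15b": 6,
}

def gpus_for(model: str) -> int:
    """Return the recommended GPU count for fine-tuning `model`.

    Falls back to 1, which is enough for the data extraction
    experiments (see the note above).
    """
    return GPUS_PER_MODEL.get(model.lower(), 1)

print(gpus_for("StarCoder2-7B"))  # -> 4
```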

### Software Requirements

- Python 3.8
- Additional dependencies:

```bash
pip install -r requirements.txt
```

## Directories

### Data

Contains scripts and tools for dataset filtering and sample creation, organized into two main directories.

### Training

Contains:
- Fine-tuning scripts for StarCoder2
- Training statistics and metrics

### Evaluation

Contains code, data, and results for the data extraction experiments.

For detailed documentation of each directory, please refer to their respective README files.

## Ethical use

Please use the code and concepts shared here responsibly and ethically. The authors provide this code to improve the security and safety of large language models (LLMs); do not use it for malicious purposes. When disclosing data leakage, take care not to compromise individuals' privacy unnecessarily.
