Minimal, fast corpus preparation for training a tiny GPT-like model using Polars
- Data downloading: Built-in tools to download Wikipedia or FineWeb-edu datasets
- Polars-based processing: Fast, efficient DataFrame operations
- Text normalization: Lowercase, punctuation cleanup
- Readability filtering: Uses `textstat` for Flesch-Kincaid grade level filtering
- Keyword filtering: Filter by topic keywords
- Vocabulary simplification: Map synonyms to canonical forms
- CEFR-based synonym mapping: Intelligent word simplification using CEFR language levels (see CEFR_SYNONYMS.md)
- Custom annotations: Extend with your own annotation logic
- Gemini integration: Built-in support for Google Gemini API annotations
- Statistics generation: Automatic JSON stats for output datasets
- Deduplication: Remove duplicate text entries
Run the installation script:
```bash
./install.sh
```

This will:
- Install UV if needed
- Install tiny_corpus_prep with your choice of features
Or install manually:

```bash
# Install UV if you haven't already
curl -LsSf https://astral.sh/uv/install.sh | sh

# Install with all features (recommended)
uv pip install -e ".[annotators,downloaders]"

# Or install with annotation support only
uv pip install -e ".[annotators]"

# Or just core dependencies
uv pip install -e .
```

Or, using pip directly:

```bash
# With all features
pip install -e ".[annotators,downloaders]"

# Or with annotations only
pip install -e ".[annotators]"
```

See INSTALL.md for detailed installation instructions.
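To confirm the install worked, a quick import check is enough; this is a minimal sketch that assumes the package exposes `process_corpus` as shown in the usage examples below:

```python
# Minimal smoke test: the import name is taken from the usage examples in this
# README; adjust if you installed a different set of extras.
from tiny_corpus_prep import process_corpus  # noqa: F401

print("tiny_corpus_prep is importable")
```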
If you need training data, use the download script to get Wikipedia or FineWeb-edu data:
```bash
# Download Simple Wikipedia (use -u for real-time progress bars)
python -u bin/download_data.py --source wikipedia --output-dir data/raw

# Or download FineWeb-edu (HuggingFace)
python -u bin/download_data.py --source fineweb --output-dir data/raw --num-files 5
```

Tip: Use `python -u` for unbuffered output to see progress bars in real time.
See QUICKSTART.md for detailed instructions.
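Before running the pipeline, it can help to peek at what was downloaded. A minimal sketch with Polars; the file name below is illustrative, so substitute whatever `download_data.py` wrote to your output directory:

```python
import polars as pl

# Illustrative path; use the parquet file produced by download_data.py.
df = pl.read_parquet("data/raw/wikipedia.parquet")

print(df.schema)     # column names and dtypes
print(df.height, "rows")
print(df.head(3))    # first few rows
```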
Process a parquet file with basic filtering:
```bash
python bin/prepare_corpus.py \
  --input data.parquet \
  --output out.parquet \
  --max-grade 8 \
  --keywords science,math
```

With Gemini annotation:

```bash
python bin/prepare_corpus.py \
  --input data.parquet \
  --output out.parquet \
  --max-grade 8 \
  --annotate gemini \
  --api-key YOUR_GOOGLE_API_KEY
```

All CLI options:

```bash
python bin/prepare_corpus.py \
  --input data.parquet \
  --output out.parquet \
  --text-column text \
  --synonyms synonyms.json \
  --keywords science,math,physics \
  --max-grade 8.0 \
  --annotate gemini \
  --api-key YOUR_API_KEY \
  --gemini-model gemini-2.5-flash-lite
```

Simple processing:
```python
from tiny_corpus_prep import process_corpus

# Process with filtering and stats
stats = process_corpus(
    input_path="data.parquet",
    output_path="out.parquet",
    keywords=["science", "math"],
    max_grade=8.0,
    dedup=True,
)
print(f"Processed {stats['total_rows']} rows")
```

Advanced pipeline with custom annotator:
```python
import polars as pl
from tiny_corpus_prep import DataPipeline, GeminiAnnotator, read_parquet, write_parquet_with_stats

# Read input
df = read_parquet("data.parquet")

# Build pipeline
pipeline = (
    DataPipeline(text_column="text", normalize=True, dedup=True)
    .add_keyword_filter(["science", "math"])
    .add_readability_filter(max_grade=8.0)
    .add_synonym_mapper(mapping_path="synonyms.json")
    .add_annotator(GeminiAnnotator(api_key="YOUR_KEY"))
)

# Process
df_processed = pipeline.process(df)

# Save with stats
stats = write_parquet_with_stats(df_processed, "out.parquet")
```

Custom annotation function:
```python
from tiny_corpus_prep import CustomFunctionAnnotator, DataPipeline

def my_annotator(text: str) -> dict:
    """Add custom metadata."""
    return {
        "word_count": len(text.split()),
        "has_numbers": any(c.isdigit() for c in text),
    }

pipeline = DataPipeline().add_annotator(
    CustomFunctionAnnotator(my_annotator)
)
```

Readability filtering:
```python
from tiny_corpus_prep import filter_by_readability, is_middle_school_level
import polars as pl

df = pl.DataFrame({"text": ["Simple text.", "Complex terminology."]})

# Filter by grade level
df_filtered = filter_by_readability(df, max_grade=8.0)

# Or use the helper function
texts = ["Some text here", "Another text"]
readable = [t for t in texts if is_middle_school_level(t)]
```

API reference:

- `process_corpus(input_path, output_path, **options)`: Main function for processing
- `DataPipeline`: Fluent API for building custom pipelines
- `BaseAnnotator`: Base class for custom annotators
- `GeminiAnnotator`: Google Gemini API annotator for topic/education classification
- `CustomFunctionAnnotator`: Wrap a function as an annotator
- `filter_by_readability(df, max_grade, text_column)`: Filter by Flesch-Kincaid grade
- `filter_by_keywords(df, keywords, text_column)`: Filter by keyword presence
- `is_middle_school_level(text)`: Check if text is ≤ 8th grade level
- `read_parquet(path, text_column)`: Read parquet with validation
- `write_parquet(df, path)`: Write a parquet file
- `write_parquet_with_stats(df, path, text_column)`: Write parquet + JSON stats
- `generate_stats(df, text_column)`: Generate a statistics dictionary
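A short sketch combining two of the helpers above, following the signatures as listed; the sample rows are made up for illustration:

```python
import polars as pl
from tiny_corpus_prep import filter_by_keywords, generate_stats

# Keep only rows whose text mentions at least one keyword, then summarize them.
df = pl.DataFrame({"text": ["Plants use photosynthesis.", "Stock prices fell today."]})
df_science = filter_by_keywords(df, ["photosynthesis", "biology"], "text")

print(generate_stats(df_science, "text"))
```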
Input: Parquet file with a required `text` column (configurable name)
Output: Parquet file with:
- Original `text` column
- Any annotation columns added by annotators
- Filtered/processed rows
Statistics JSON (same name as output parquet):
```json
{
  "total_rows": 1000,
  "total_columns": 3,
  "columns": ["text", "topic", "education"],
  "text_stats": {
    "min_length": 10,
    "max_length": 500,
    "mean_length": 120.5,
    "median_length": 115.0,
    "total_characters": 120500,
    "empty_or_null_count": 0
  },
  "column_stats": {
    "topic": {
      "dtype": "String",
      "null_count": 5,
      "unique_values": 15,
      "top_values": [...]
    }
  }
}
```

The built-in Gemini annotator classifies text into topics and education levels:
```python
from tiny_corpus_prep import GeminiAnnotator

annotator = GeminiAnnotator(
    api_key="YOUR_KEY",
    model_name="gemini-2.5-flash-lite"
)

# Annotates with 'topic' and 'education' columns
result = annotator.annotate("The Copy Number Variation analysis...")
# {'topic': 'Life Sciences', 'education': 'PhD degree'}
```

Topics include: Arts & Humanities, Mathematics, Computer Science, Health & Medicine, and more.
Education levels: primary school, middle school, high school, university degree, PhD degree.
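One way to act on the annotation columns downstream is to filter on them with Polars after the pipeline runs. A sketch, where the column name and label values follow the lists above:

```python
import polars as pl
from tiny_corpus_prep import DataPipeline, GeminiAnnotator, read_parquet, write_parquet_with_stats

# Annotate, then keep only rows Gemini labels at or below middle school.
pipeline = DataPipeline(text_column="text").add_annotator(
    GeminiAnnotator(api_key="YOUR_KEY", model_name="gemini-2.5-flash-lite")
)
df = pipeline.process(read_parquet("data.parquet"))

df_easy = df.filter(pl.col("education").is_in(["primary school", "middle school"]))
write_parquet_with_stats(df_easy, "out.parquet")
```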
- All processing uses Polars for efficiency
- Input must be parquet with a `text` column
- Readability uses the `textstat` library (Flesch-Kincaid grade)
- Synonym mapping is word-level replacement (see the sketch after these notes)
- Deduplication is based on exact text matches
- Statistics JSON automatically generated alongside output
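If you only need a handful of substitutions, you can write the mapping by hand before reaching for the CEFR builder below. A minimal sketch, assuming the flat difficult-word-to-easy-word JSON shape implied by the word-level replacement note above:

```python
import json

# Assumed format: a flat {difficult_word: easier_word} mapping, passed via
# --synonyms on the CLI or add_synonym_mapper(mapping_path=...) in Python.
synonyms = {
    "utilize": "use",
    "commence": "begin",
    "approximately": "about",
}
with open("synonyms.json", "w", encoding="utf-8") as f:
    json.dump(synonyms, f, indent=2)
```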
Build a smarter synonym mapping that replaces difficult words with easier alternatives based on CEFR language levels:
```bash
python bin/build_synmap_from_cefr.py \
  --cefr_csv data/cefr_wordlist.csv \
  --out_dir output/synonyms \
  --easy_levels A1,A2 \
  --difficult_levels B2,C1,C2
```

This generates:

- `synonyms.json`: JSON mapping for use in the pipeline
- `synonyms.csv`: Detailed mapping with CEFR levels and sources
- `unmapped.txt`: Words that couldn't be simplified
- `build_stats.txt`: Statistics about the mapping process
See CEFR_SYNONYMS.md for complete documentation and examples.
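To plug the generated mapping into corpus prep, point the synonym mapper at it. A sketch using the pipeline API shown earlier; the mapping path assumes the `--out_dir` used in the build command above:

```python
from tiny_corpus_prep import DataPipeline, read_parquet, write_parquet_with_stats

# Apply the CEFR-derived mapping during processing.
pipeline = DataPipeline(text_column="text", normalize=True).add_synonym_mapper(
    mapping_path="output/synonyms/synonyms.json"
)
df = pipeline.process(read_parquet("data.parquet"))
write_parquet_with_stats(df, "out.parquet")
```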
The download_data.py script helps you get training data from public sources:
Download and process Simple English Wikipedia:
```bash
# Install optional dependency
uv pip install -e ".[downloaders]"

# Download and convert to parquet
python bin/download_data.py --source wikipedia --output-dir data/raw

# Specify a different date
python bin/download_data.py --source wikipedia --output-dir data/raw --date 20240101
```

This will create a parquet file with columns: `filename`, `title`, `text`, `number_of_characters`, `number_of_words`, `topic`, `text_quality`.
Downloads show a progress bar using Python's requests library.
Requirements: `bzip2` (system dependency) and `wikiextractor` (installed with the `[downloaders]` extra)
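Since the column names are known, you can trim the dump before corpus prep, for example by dropping very short articles. A sketch with Polars; the file path is illustrative and the 50-word cutoff is arbitrary:

```python
import polars as pl

# Drop very short articles and keep only the columns needed downstream.
df = pl.read_parquet("data/raw/wikipedia.parquet")  # illustrative path
df = df.filter(pl.col("number_of_words") >= 50).select(["title", "text", "topic"])
df.write_parquet("data/raw/wikipedia_filtered.parquet")
```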
Download pre-processed educational web content:
```bash
# Download 1 file
python bin/download_data.py --source fineweb --output-dir data/raw

# Download multiple files
python bin/download_data.py --source fineweb --output-dir data/raw --num-files 10 --start-index 0
```

Downloads show a progress bar for each file.
The downloaded parquet files can be directly processed with prepare_corpus.py.
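Or, from Python, feed a downloaded file straight into `process_corpus`. A sketch; the input file name is illustrative:

```python
from tiny_corpus_prep import process_corpus

# End-to-end: downloaded parquet in, filtered corpus + stats JSON out.
stats = process_corpus(
    input_path="data/raw/wikipedia.parquet",   # illustrative file name
    output_path="data/processed/corpus.parquet",
    max_grade=8.0,
    dedup=True,
)
print(f"Kept {stats['total_rows']} rows")
```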
License: MIT