End-to-end scaffolding for a Text-to-SQL copilot — Mistral-7B QLoRA fine-tuning → dataset + training pipeline → evaluation → remote inference via Hugging Face Inference → Streamlit UI ready for Streamlit Community Cloud
Built with: Python | PyTorch | Transformers | PEFT / QLoRA | Hugging Face Datasets | Hugging Face Hub | Streamlit | pytest
Current high-level layout:
.
├── app/ # Streamlit UI (remote inference via HF InferenceClient)
│ └── streamlit_app.py
├── docs/ # Documentation, design notes, evaluation reports
│ ├── dataset.md
│ ├── training.md
│ ├── evaluation.md
│ └── external_validation.md
├── notebooks/ # Jupyter/Colab notebooks for experimentation
├── scripts/ # CLI scripts (dataset, training, evaluation, utilities)
│ ├── build_dataset.py
│ ├── check_syntax.py
│ ├── smoke_load_dataset.py
│ ├── smoke_infer_endpoint.py
│ ├── train_qlora.py
│ ├── evaluate_internal.py
│ ├── evaluate_spider_external.py
│ └── publish_to_hub.py
├── src/
│ └── text2sql/ # Core Python package
│ ├── __init__.py
│ ├── data_prep.py
│ ├── infer.py
│ ├── training/
│ │ ├── __init__.py
│ │ ├── config.py
│ │ └── formatting.py
│ └── eval/
│ ├── __init__.py
│ ├── normalize.py
│ ├── schema.py
│ ├── metrics.py
│ └── spider.py
├── tests/
│ ├── fixtures/
│ │ ├── sql_create_context_sample.jsonl
│ │ ├── eval_internal_sample.jsonl
│ │ ├── spider_sample.jsonl
│ │ └── spider_schema_sample.jsonl
│ ├── test_repo_smoke.py
│ ├── test_build_dataset_offline.py
│ ├── test_data_prep.py
│ ├── test_eval_cli_args.py
│ ├── test_infer_quantization.py
│ ├── test_prompt_formatting.py
│ ├── test_normalize_sql.py
│ ├── test_schema_adherence.py
│ └── test_metrics_aggregate.py
├── .env.example # Example environment file
├── .gitignore
├── context.md # Persistent project context & decisions
├── LICENSE
├── README.md
└── requirements.txt
As the project progresses, this structure will be refined and additional modules, scripts, and documentation will be added.
This repository contains the scaffolding for an Analytics Copilot that converts natural-language questions into SQL queries over structured data (e.g., warehouse tables). The core goal is to fine-tune a Mistral-7B model using QLoRA for efficient, high-quality text-to-SQL generation, and to expose it via a Streamlit UI.
The project is currently in the initial setup phase:
- Basic Python project structure (src/ layout).
- Dataset smoke test for b-mc2/sql-create-context using Hugging Face Datasets.
- Minimal test suite using pytest.
- Persistent project context in context.md.
For the evolving high-level plan, decisions, and change history, see context.md.
This section will be expanded as the project matures.
Planned high-level components:
- Modeling
  - Base model: Mistral-7B
  - Fine-tuning: QLoRA (parameter-efficient)
  - Libraries: transformers, peft, accelerate, etc.
- Data & Training
  - Primary dataset: b-mc2/sql-create-context (Hugging Face Hub)
  - Preprocessing: prompt construction, schema serialization, and handling of CREATE TABLE context.
  - Training scripts & notebooks in scripts/ and notebooks/.
- Inference & Serving
  - Text-to-SQL generation pipeline (prompting, decoding, validation).
  - Safe SQL execution layer (read-only queries, limits); see the sketch after this list.
  - Streamlit UI under app/ for interactive usage.
- Evaluation
  - Accuracy metrics on the processed b-mc2/sql-create-context validation set and, optionally, the Spider dev split.
  - Latency measurements and quality reports.
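As a rough illustration of the planned read-only execution guard (a hypothetical sketch; nothing like this is implemented in the repository yet, and the helper name ensure_read_only is made up), such a layer could validate and cap queries with sqlglot:

# Hypothetical sketch of a read-only SQL guard for the planned safe execution layer.
import sqlglot
from sqlglot import exp

def ensure_read_only(sql: str, max_rows: int = 100) -> str:
    tree = sqlglot.parse_one(sql, read="sqlite")
    if not isinstance(tree, exp.Select):
        # Reject anything that is not a plain SELECT (INSERT, UPDATE, DDL, ...).
        raise ValueError("Only SELECT statements are allowed.")
    # Cap the number of returned rows before execution.
    return tree.limit(max_rows).sql(dialect="sqlite")

print(ensure_read_only("SELECT name FROM singer"))  # SELECT name FROM singer LIMIT 100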
- Name: b-mc2/sql-create-context
- Source: Hugging Face Datasets
- Contents: Natural language questions, the corresponding CREATE TABLE DDL context, and gold SQL query answers (well-suited for text-to-SQL with schema awareness).
This repo includes a smoke script to verify dataset access locally.
Note: The WikiSQL Hugging Face dataset (Salesforce/wikisql) is implemented as a script-based dataset (wikisql.py). Starting with datasets>=4, script-based datasets are no longer supported by default, so loading WikiSQL will fail unless you explicitly pin datasets<4.0.0. This project instead uses b-mc2/sql-create-context, which is backed by Parquet data files and compatible with datasets 4.x.
- Create and activate a virtual environment (recommended):
python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
- Install dependencies:
python -m pip install --upgrade pip
python -m pip install -r requirements.txt
- (Optional but recommended) Configure your Hugging Face credentials if required for any private resources:
  - Copy .env.example to .env and fill in values as needed.
  - Or run huggingface-cli login if you plan to use an HF Hub token.
To verify that the b-mc2/sql-create-context dataset can be loaded:
python scripts/smoke_load_dataset.py
Expected behavior:
- Logs basic information while loading the dataset.
- Prints the sizes of all available splits (e.g., train, validation, test if present).
- Shows one example from the train split.
- If an error occurs (e.g., missing datasets library, no network, or invalid dataset name), a clear error message will be logged.
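For reference, the core of this check boils down to a few lines of Hugging Face Datasets code (a minimal sketch, not the actual script, which adds logging and error handling):

# Minimal sketch: load b-mc2/sql-create-context and inspect its splits.
from datasets import load_dataset

dataset = load_dataset("b-mc2/sql-create-context")

# Print the size of each available split.
for split_name, split in dataset.items():
    print(f"{split_name}: {len(split)} rows")

# Peek at one training example (fields: question, context, answer).
print(dataset["train"][0])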
The raw b-mc2/sql-create-context dataset is converted into Alpaca-style
instruction-tuning JSONL files using the build script:
# Full preprocessing run (uses the full dataset)
python scripts/build_dataset.py
# Quick dev run on a subset of rows (e.g., 2000 examples)
python scripts/build_dataset.py --max_rows 2000
By default, the script writes:
- data/processed/train.jsonl
- data/processed/val.jsonl
The data/ directory is not tracked in version control; it is intended to
be generated locally as needed. See docs/dataset.md for
details on the raw dataset, the train/val split strategy, and the output format.
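For orientation, one Alpaca-style record built from a raw example looks roughly like the sketch below. The helper name and instruction wording here are illustrative; the actual formatting lives in scripts/build_dataset.py and src/text2sql/training/formatting.py.

# Hypothetical sketch of converting one raw example into an Alpaca-style record;
# the real instruction text and split logic are defined in the project code.
import json

def to_alpaca_record(example: dict) -> dict:
    return {
        "instruction": "Write a SQL query that answers the question, given the schema.",
        "input": f"Schema:\n{example['context']}\n\nQuestion:\n{example['question']}",
        "output": example["answer"],
    }

raw = {
    "question": "How many singers are older than 30?",
    "context": "CREATE TABLE singer (id INT, name TEXT, age INT)",
    "answer": "SELECT COUNT(*) FROM singer WHERE age > 30",
}
print(json.dumps(to_alpaca_record(raw)))  # one line of the resulting JSONL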
We provide two main paths for QLoRA fine-tuning:
- A detailed, Colab-friendly notebook:
notebooks/finetune_mistral7b_qlora_text2sql.ipynb
- A reproducible CLI script:
scripts/train_qlora.py
Basic usage:
# Dry run: load config + dataset, format a small batch, and exit
python scripts/train_qlora.py --dry_run
# Smoke run: validate dataset + config; model loading is skipped on CPU-only environments
python scripts/train_qlora.py --smoke
# Full training example (requires a GPU with sufficient VRAM)
python scripts/train_qlora.py \
--train_path data/processed/train.jsonl \
--val_path data/processed/val.jsonl \
--output_dir outputs/ \
--max_steps 500 \
--per_device_train_batch_size 1 \
--gradient_accumulation_steps 8 \
--learning_rate 2e-4 \
--max_seq_length 2048
See docs/training.md for details on hyperparameters,
QLoRA/LoRA configuration, and troubleshooting (OOM, sequence length, etc.).
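For a rough picture of what the QLoRA setup involves, the sketch below shows a typical 4-bit quantization + LoRA configuration. The exact hyperparameters used by scripts/train_qlora.py are defined in src/text2sql/training/config.py and documented in docs/training.md; the values here are common defaults, not the project's.

# Illustrative QLoRA setup (typical defaults, not the project's exact configuration).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

base_model = "mistralai/Mistral-7B-Instruct-v0.1"

# Load the base model in 4-bit NF4 so it fits on a single consumer GPU.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    base_model, quantization_config=bnb_config, device_map="auto"
)
model = prepare_model_for_kbit_training(model)

# Attach small trainable LoRA adapters to the attention projections.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
tokenizer = AutoTokenizer.from_pretrained(base_model)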
The repository includes a lightweight but robust evaluation pipeline for text-to-SQL:
- Internal evaluation on the processed b-mc2/sql-create-context val set.
- Secondary external validation on the Spider dev split.
See docs/evaluation.md for full details. Below are the
most common commands.
Mock mode (no model required, exercises metrics/reporting):
python scripts/evaluate_internal.py --mock \
--val_path data/processed/val.jsonl \
--out_dir reports/
With a trained QLoRA adapter (GPU recommended):
python scripts/evaluate_internal.py \
--val_path data/processed/val.jsonl \
--base_model mistralai/Mistral-7B-Instruct-v0.1 \
--adapter_dir /path/to/outputs/adapters \
--device auto \
--max_examples 200 \
--out_dir reports/
Quick smoke test (small subset, CPU-friendly):
python scripts/evaluate_internal.py --smoke \
--val_path data/processed/val.jsonl \
--out_dir reports/
- On GPU machines, --smoke runs a tiny subset through the full pipeline.
- On CPU-only environments, --smoke automatically falls back to --mock so that no heavy model loading is attempted while still exercising the metrics and reporting stack; see the sketch just below.
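Conceptually, the fallback is a small guard of roughly this shape (an illustrative sketch; the actual check lives in scripts/evaluate_internal.py and the function name here is hypothetical):

# Sketch of the --smoke -> --mock fallback on CPU-only machines (illustrative).
import torch

def resolve_eval_mode(smoke: bool, mock: bool) -> str:
    if mock:
        return "mock"
    if smoke and not torch.cuda.is_available():
        # No GPU available: skip model loading but still exercise metrics/reporting.
        return "mock"
    return "smoke" if smoke else "full"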
Outputs:
- reports/eval_internal.json – metrics, config, and sample predictions.
- reports/eval_internal.md – human-readable summary.
Mock mode (offline fixtures only, no internet required):
python scripts/evaluate_spider_external.py --mock \
--out_dir reports/
Full Spider dev evaluation with a trained model (requires internet + HF Datasets):
python scripts/evaluate_spider_external.py \
--base_model mistralai/Mistral-7B-Instruct-v0.1 \
--adapter_dir /path/to/outputs/adapters \
--device auto \
--spider_source xlangai/spider \
--schema_source richardr1126/spider-schema \
--spider_split validation \
--max_examples 200 \
--out_dir reports/
Outputs:
- reports/eval_spider.json – metrics, config, and sample predictions.
- reports/eval_spider.md – human-readable summary, including notes on differences from the official Spider evaluation.
After training on b-mc2/sql-create-context, we run a secondary evaluation
harness on the Spider dev set (e.g., xlangai/spider) to measure
generalization to harder, multi-table, cross-domain text-to-SQL tasks.
Spider evaluation uses a lightweight EM-style metric suite:
- Exact Match and No-values Exact Match on normalized SQL.
- SQL parse success using sqlglot.
- Schema adherence checks against serialized schemas from richardr1126/spider-schema (licensed under CC BY-SA 4.0).
Spider and its schema helper are used only for evaluation, not for training.
For details, see docs/external_validation.md.
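To give a sense of how these EM-style checks work, here is a simplified sketch of normalized exact match and parse success; the project's actual normalization rules in src/text2sql/eval/normalize.py and src/text2sql/eval/metrics.py may be more involved.

# Simplified sketch of normalized exact match and parse-success checks (illustrative).
import sqlglot
from sqlglot.errors import ParseError

def normalize_sql(sql: str) -> str:
    # Canonicalize formatting and keyword casing via a parse + regenerate round trip.
    return sqlglot.transpile(sql, read="sqlite", write="sqlite")[0].lower()

def exact_match(pred: str, gold: str) -> bool:
    try:
        return normalize_sql(pred) == normalize_sql(gold)
    except ParseError:
        # Fall back to a plain string comparison if either query fails to parse.
        return pred.strip().lower() == gold.strip().lower()

def parse_success(sql: str) -> bool:
    try:
        sqlglot.parse_one(sql, read="sqlite")
        return True
    except ParseError:
        return False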
For a quick local quality check before pushing changes, you can run:
# 0) Verify runtime configuration for HF / OpenAI providers
python scripts/check_runtime_config.py
# 1) Syntax validation across src/, scripts/, and app/
python scripts/check_syntax.py
# 2) Linting (requires ruff to be installed, e.g. `pip install ruff`)
ruff check .
# 3) Test suite (offline-friendly)
pytest -q
These commands are also wired into the CI workflow (.github/workflows/ci.yml),
with an additional python -m compileall . safety gate to catch syntax errors
early.
After training, you can publish the QLoRA adapter artifacts to the Hugging Face Hub for reuse and remote inference.
- Authenticate with Hugging Face:
huggingface-cli login --token YOUR_HF_TOKEN
or set an environment variable:
export HF_TOKEN=YOUR_HF_TOKEN
- Run the publish script, pointing it at your adapter directory and desired repo id:
python scripts/publish_to_hub.py \
--repo_id your-username/analytics-copilot-text2sql-mistral7b-qlora \
--adapter_dir outputs/adapters \
--include_metrics reports/eval_internal.json
The script will:
- Validate that a Hugging Face token is available.
- Create the model repo if it does not exist (public by default, or --private).
- Ensure a README.md model card is written into the adapter directory (with metrics if provided).
- Upload the entire adapter folder to the Hub using huggingface_hub.HfApi.
You can re-run the script safely; it will perform another commit with the specified
--commit_message.
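Under the hood, the publish flow amounts to roughly the following huggingface_hub calls (a sketch; scripts/publish_to_hub.py adds token validation and model-card generation):

# Rough sketch of publishing an adapter folder to the Hugging Face Hub (illustrative).
import os
from huggingface_hub import HfApi

repo_id = "your-username/analytics-copilot-text2sql-mistral7b-qlora"
api = HfApi(token=os.environ["HF_TOKEN"])

# Create the model repo if it does not exist yet (no-op otherwise).
api.create_repo(repo_id=repo_id, repo_type="model", private=False, exist_ok=True)

# Upload the entire adapter directory as a single commit.
api.upload_folder(
    folder_path="outputs/adapters",
    repo_id=repo_id,
    commit_message="Upload QLoRA adapter",
)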
The repository includes a lightweight Streamlit UI that talks to a remote
model via Hugging Face Inference (no local GPU required). The app lives at
app/streamlit_app.py and intentionally does not import torch or
transformers.
- The Streamlit app is UI-only; it never loads model weights locally.
- All text-to-SQL generation is performed remotely using huggingface_hub.InferenceClient.
- For small models you may be able to use Hugging Face serverless Inference, but large models like Mistral-7B often require Inference Endpoints or a dedicated provider.
- If serverless calls fail or time out, consider deploying a dedicated
Inference Endpoint or self-hosted TGI/serving stack and pointing the app at
its URL via
HF_ENDPOINT_URL or HF_INFERENCE_BASE_URL.
Adapter repos and the HF router: If you point the app at a pure LoRA adapter repository (e.g.
BrejBala/analytics-copilot-mistral7b-text2sql-adapter) using HF_MODEL_ID without an HF_ENDPOINT_URL, the request goes through the Hugging Face router and most providers will respond with model_not_supported. For adapter-based inference, use a dedicated Inference Endpoint and configure HF_ENDPOINT_URL + HF_ADAPTER_ID instead of trying to call the adapter repo directly via the router.
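From the app's point of view, a remote call against a dedicated endpoint looks roughly like the sketch below (illustrative; the real prompt construction and adapter selection are handled in app/streamlit_app.py and by the endpoint's LORA_ADAPTERS configuration):

# Sketch of remote text-to-SQL generation via a dedicated Inference Endpoint (illustrative).
import os
from huggingface_hub import InferenceClient

client = InferenceClient(
    model=os.environ["HF_ENDPOINT_URL"],  # dedicated Inference Endpoint / TGI URL
    token=os.environ["HF_TOKEN"],
)

prompt = (
    "Write a SQL query that answers the question, given the schema.\n"
    "Schema:\nCREATE TABLE singer (id INT, name TEXT, age INT)\n"
    "Question:\nHow many singers are older than 30?\nSQL:"
)

generated_sql = client.text_generation(prompt, max_new_tokens=256)
print(generated_sql)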
- Configure Streamlit secrets by creating .streamlit/secrets.toml from the example:
cp .streamlit/secrets.toml.example .streamlit/secrets.toml
Then edit .streamlit/secrets.toml (not tracked by git) and fill in either:
Preferred: dedicated endpoint + adapter
HF_TOKEN = "hf_your_access_token_here"
# Dedicated Inference Endpoint / TGI URL
HF_ENDPOINT_URL = "https://o0mmkmv1itfrikie.us-east4.gcp.endpoints.huggingface.cloud"
# Adapter identifier configured in your endpoint's LORA_ADAPTERS
HF_ADAPTER_ID = "BrejBala/analytics-copilot-mistral7b-text2sql-adapter"
Fallback: provider/router-based merged model (no adapters)
HF_TOKEN = "hf_your_access_token_here"
HF_MODEL_ID = "your-username/your-merged-text2sql-model"
HF_PROVIDER = "auto"  # optional provider hint
HF_INFERENCE_BASE_URL is also supported as an alias for HF_ENDPOINT_URL. The app will always prefer secrets over environment variables when both are set (see the sketch after this list).
- Start the Streamlit app:
streamlit run app/streamlit_app.py
- In the UI:
- Paste your database schema (DDL) into the Schema text area.
- Enter a natural-language question.
- Click Generate SQL to call the remote model.
- View the generated SQL in a code block (with a copy button).
- Optionally open the Show prompt expander to inspect the exact prompt sent to the model (useful for debugging and prompt engineering).
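The secrets-over-environment preference mentioned in the configuration step boils down to a lookup of roughly this shape (a sketch; the helper name is hypothetical and the exact logic lives in app/streamlit_app.py):

# Sketch of resolving a setting, preferring Streamlit secrets over environment variables.
import os
import streamlit as st

def get_setting(name: str, default: str | None = None) -> str | None:
    value = st.secrets.get(name)  # st.secrets behaves like a read-only dict
    if value:
        return value
    return os.environ.get(name, default)

# HF_INFERENCE_BASE_URL is accepted as an alias for HF_ENDPOINT_URL.
endpoint_url = get_setting("HF_ENDPOINT_URL") or get_setting("HF_INFERENCE_BASE_URL")
token = get_setting("HF_TOKEN")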
When deploying to Streamlit Cloud:
- Add HF_TOKEN, HF_ENDPOINT_URL, and HF_ADAPTER_ID (or HF_MODEL_ID / HF_PROVIDER for the router fallback) to the app's Secrets in the Streamlit Cloud UI.
- Optionally configure OPENAI_API_KEY (and OPENAI_FALLBACK_MODEL, which defaults to gpt-5-nano) to enable the OpenAI fallback path when HF inference fails.
- The app will automatically construct an InferenceClient from those values and use the dedicated endpoint when HF_ENDPOINT_URL is set.
- No GPU is required on the Streamlit side; all heavy lifting is done by the remote Hugging Face Inference backend.
- For a step-by-step deployment walkthrough (including screenshots and
details on secrets), see
docs/deploy_streamlit_cloud.md.
This project is licensed under the MIT License. See the LICENSE file for details.
- Live App: analytics-copilot-text2sql.streamlit.app/
- For issues/feature requests: open a GitHub Issue in this repository.
- For questions or feedback, connect with me on LinkedIn