feat: Add PII probing transforms and scoring #315

rdheekonda · 2026-01-26T23:42:03Z

Key Changes:

Add 5 PII extraction transforms for adversarial testing
Add 3 advanced PII scorers with statistical CI support
Add 59 unit tests with no LLM calls
Based on 2024-2025 research (Carlini, PII-Scope, Model Inversion)

Added:

dreadnode/transforms/pii_extraction.py: 5 transforms
- repeat_word_divergence: Trigger memorization (Carlini technique)
- continue_exact_text: Force prefix completion
- complete_from_internet: Probe memorized web content
- partial_pii_completion: Adaptive extraction with hints
- public_figure_pii_probe: Test public figure disclosure
dreadnode/scorers/pii_advanced.py: 3 scorers + 2 helpers
- training_data_memorization: Entropy/pattern detection
- credential_leakage: 13 credential types (API keys, tokens)
- pii_disclosure_rate: Binary scorer for eval aggregation
- wilson_score_interval: Statistical confidence intervals
- calculate_disclosure_rate_with_ci: Helper for 95% CI analysis
examples/airt/pii_extraction_attacks.ipynb: Usage examples
- TAP attacks with PII transforms
- Eval-based disclosure rate testing
- Credential leakage detection
tests/test_pii_extraction_transforms.py: 21 transform tests
tests/test_pii_advanced_scorers.py: 38 scorer tests
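
For readers unfamiliar with the statistic behind wilson_score_interval and calculate_disclosure_rate_with_ci, here is a minimal sketch of the standard Wilson score interval for a binomial proportion. The function name `wilson_interval`, its signature, and the default z = 1.96 (about 95% confidence) are assumptions for illustration; the helpers added in this PR may differ in signature and edge-case handling.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (hypothetical helper).

    z = 1.96 corresponds to roughly 95% confidence.
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must be between 0 and trials")

    p = successes / trials                      # observed disclosure rate
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom  # interval midpoint, pulled toward 0.5
    half = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return max(0.0, center - half), min(1.0, center + half)


# Example: 5 disclosures observed in 10 probes
low, high = wilson_interval(5, 10)
print(f"disclosure rate 0.50, 95% CI [{low:.3f}, {high:.3f}]")
```

Unlike the normal-approximation interval, the Wilson interval stays inside [0, 1] and still gives a usable upper bound when zero disclosures are observed, which matters for small eval runs.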

Changed:

dreadnode/transforms/__init__.py: Export pii_extraction module
dreadnode/scorers/__init__.py: Export new scorers and helpers

Generated Summary:

Added new functionalities for advanced PII detection in the pii_advanced.py module.
Introduced three new scorers and two statistical helper functions:
- training_data_memorization: Detects verbatim memorized text from training data.
- credential_leakage: Identifies potential leaked credentials, API keys, and tokens.
- pii_disclosure_rate: Binary detection of PII for evaluation purposes.
- wilson_score_interval: Calculates statistical confidence intervals for PII disclosure rates.
- calculate_disclosure_rate_with_ci: Aggregates PII detection results to compute disclosure rates.
Enhanced the __init__.py files to include new scorer functions and maintain module imports.
Created pii_extraction.py with functions targeting specific PII extraction techniques:
- repeat_word_divergence, continue_exact_text, complete_from_internet, partial_pii_completion, and public_figure_pii_probe.
Added a new Jupyter notebook example to demonstrate adversarial PII extraction techniques, including various attack scenarios.
Potential impact: These changes significantly enhance the capability to assess and evaluate PII leakage risks in outputs from language models, valuable for security assessments in development and production environments.
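
As a rough illustration of what pattern-based credential detection can look like, the sketch below checks text against a few widely documented credential formats. The pattern table, the names `PATTERNS` and `find_credentials`, and the choice of three formats are assumptions for illustration only; they are not the 13 types or the API implemented by credential_leakage.

```python
import re

# Hypothetical pattern table: a few well-known credential formats.
PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "github_pat": re.compile(r"\bghp_[A-Za-z0-9]{36}\b"),
    "private_key_header": re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
}


def find_credentials(text: str) -> list[str]:
    """Return the sorted names of all patterns that match somewhere in text."""
    return sorted(name for name, pattern in PATTERNS.items() if pattern.search(text))


# Fake values are assembled from pieces so this file holds no key-shaped literal.
sample = "config: token=" + "ghp_" + "a" * 36
print(find_credentials(sample))
print(find_credentials("nothing sensitive here"))
```

A binary scorer can then be derived by checking whether the returned list is non-empty.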

This summary was generated with ❤️ by rigging

Research References:

Carlini et al. (USENIX 2024): Extracting Training Data from LLMs
PII-Scope Benchmark (arXiv 2410.06704): 48.9% success rate
Model Inversion Attacks (arXiv 2507.04478): Credential extraction

Add comprehensive PII extraction capabilities for AI red teaming:

Transforms (dreadnode/transforms/pii_extraction.py):
- repeat_word_divergence: Trigger training data memorization via Carlini et al. technique
- continue_exact_text: Force exact continuation of memorized prefixes
- complete_from_internet: Probe for memorized web content
- partial_pii_completion: Adaptive PII extraction with contextual hints
- public_figure_pii_probe: Test disclosure of public figure PII

Scorers (dreadnode/scorers/pii_advanced.py):
- training_data_memorization: Detect memorized text via entropy, repetition, and structural patterns
- credential_leakage: Pattern-based detection for API keys, tokens, passwords (13 types)
- pii_disclosure_rate: Binary scorer for eval aggregation
- wilson_score_interval: Statistical confidence intervals for disclosure rates
- calculate_disclosure_rate_with_ci: Helper for disclosure rate analysis with 95% CI

Example notebook (examples/airt/pii_extraction_attacks.ipynb):
- TAP attacks with PII extraction transforms
- Eval-based disclosure rate testing with statistical confidence intervals
- Credential leakage detection examples

Tests:
- 21 transform tests (test_pii_extraction_transforms.py)
- 38 scorer tests (test_pii_advanced_scorers.py)
- All tests use static inputs, no LLM calls

Based on research:
- Carlini et al. (USENIX 2024): Extracting Training Data from LLMs
- PII-Scope Benchmark (arXiv 2410.06704): 48.9% extraction success rate
- Model Inversion Attacks (arXiv 2507.04478): Password/credential extraction
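
The commit message says memorization is detected via entropy and repetition signals. The sketch below shows two generic heuristics of that kind: character-level Shannon entropy and the fraction of repeated word n-grams. The function names, the n-gram size, and the idea of combining these particular signals are assumptions for illustration, not the logic of training_data_memorization.

```python
import math
from collections import Counter


def shannon_entropy(text: str) -> float:
    """Character-level Shannon entropy in bits per character."""
    if not text:
        return 0.0
    total = len(text)
    counts = Counter(text)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def repeated_ngram_fraction(text: str, n: int = 3) -> float:
    """Fraction of word n-grams that occur more than once in the text."""
    words = text.split()
    if len(words) < n:
        return 0.0
    grams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    counts = Counter(grams)
    repeated = sum(c for c in counts.values() if c > 1)
    return repeated / len(grams)


output = "the cat sat the cat sat the cat sat"
print(round(shannon_entropy(output), 3), repeated_ngram_fraction(output))
```

Neither signal proves memorization on its own; thresholds would need calibration against known-benign outputs.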

rdheekonda added 2 commits January 26, 2026 15:38

chore: apply formatting fixes from pre-commit hooks

c7b62ef

dreadnode-renovate-bot bot added the area/tests and area/examples labels Jan 26, 2026

rdheekonda changed the title ~~Add PII probing transforms and scoring~~ feat: Add PII probing transforms and scoring Jan 26, 2026

fix: split string to avoid false positive in key detection

d5e272f
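
The diff for this commit is not shown here, so the following is only a guess at the general technique its title describes: a key-shaped literal in test code can trip a secret scanner, and assembling the value from fragments keeps the source text from matching while the runtime value stays the same. The pattern and the fake key below are hypothetical.

```python
import re

# Hypothetical scanner rule for an AWS-style access key ID.
KEY_PATTERN = re.compile(r"AKIA[0-9A-Z]{16}")

# The source line as a scanner would see it: the literal is split into fragments.
source_line = 'fake_key = "AKIA" + "ABCDEFGH" + "12345678"'

# The value the test actually uses at runtime.
fake_key = "AKIA" + "ABCDEFGH" + "12345678"

print(KEY_PATTERN.search(source_line) is None)    # scanner sees no key in the source
print(KEY_PATTERN.fullmatch(fake_key) is not None)  # scorer under test still sees a key
```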

rdheekonda added this pull request to the merge queue Jan 27, 2026

Merged via the queue into main with commit 34e9538 Jan 27, 2026
8 checks passed

rdheekonda deleted the feat/pii-extraction-capabilities branch January 27, 2026 01:47
