Skip to content

Conversation

@rdheekonda
Copy link
Contributor

@rdheekonda rdheekonda commented Jan 26, 2026

Key Changes:

  • Add 5 PII extraction transforms for adversarial testing
  • Add 3 advanced PII scorers with statistical CI support
  • Add 59 unit tests with no LLM calls
  • Based on 2024-2025 research (Carlini, PII-Scope, Model Inversion)

Added:

  • dreadnode/transforms/pii_extraction.py: 5 transforms
    • repeat_word_divergence: Trigger memorization (Carlini technique)
    • continue_exact_text: Force prefix completion
    • complete_from_internet: Probe memorized web content
    • partial_pii_completion: Adaptive extraction with hints
    • public_figure_pii_probe: Test public figure disclosure
  • dreadnode/scorers/pii_advanced.py: 3 scorers + 2 helpers
    • training_data_memorization: Entropy/pattern detection
    • credential_leakage: 13 credential types (API keys, tokens)
    • pii_disclosure_rate: Binary scorer for eval aggregation
    • wilson_score_interval: Statistical confidence intervals
    • calculate_disclosure_rate_with_ci: Helper for 95% CI analysis
  • examples/airt/pii_extraction_attacks.ipynb: Usage examples
    • TAP attacks with PII transforms
    • Eval-based disclosure rate testing
    • Credential leakage detection
  • tests/test_pii_extraction_transforms.py: 21 transform tests
  • tests/test_pii_advanced_scorers.py: 38 scorer tests

Changed:

  • dreadnode/transforms/__init__.py: Export pii_extraction module
  • dreadnode/scorers/__init__.py: Export new scorers and helpers

Generated Summary:

  • Added new functionalities for advanced PII detection in the pii_advanced.py module.
  • Introduced five new scoring functions:
    • training_data_memorization: Detects verbatim memorized text from training data.
    • credential_leakage: Identifies potential leaked credentials, API keys, and tokens.
    • pii_disclosure_rate: Binary detection of PII for evaluation purposes.
    • wilson_score_interval: Calculates statistical confidence intervals for PII disclosure rates.
    • calculate_disclosure_rate_with_ci: Aggregates PII detection results to compute disclosure rates.
  • Enhanced the __init__.py files to include new scorer functions and maintain module imports.
  • Created pii_extraction.py with functions targeting specific PII extraction techniques:
    • repeat_word_divergence, continue_exact_text, complete_from_internet, partial_pii_completion, and public_figure_pii_probe.
  • Added a new Jupyter notebook example to demonstrate adversarial PII extraction techniques, including various attack scenarios.
  • Potential impact: These changes significantly enhance the capability to assess and evaluate PII leakage risks in outputs from language models, valuable for security assessments in development and production environments.

This summary was generated with ❤️ by rigging

Research References:

  • Carlini et al. (USENIX 2024): Extracting Training Data from LLMs
  • PII-Scope Benchmark (arXiv 2410.06704): 48.9% success rate
  • Model Inversion Attacks (arXiv 2507.04478): Credential extraction

Add comprehensive PII extraction capabilities for AI red teaming:

Transforms (dreadnode/transforms/pii_extraction.py):
- repeat_word_divergence: Trigger training data memorization via Carlini et al. technique
- continue_exact_text: Force exact continuation of memorized prefixes
- complete_from_internet: Probe for memorized web content
- partial_pii_completion: Adaptive PII extraction with contextual hints
- public_figure_pii_probe: Test disclosure of public figure PII

Scorers (dreadnode/scorers/pii_advanced.py):
- training_data_memorization: Detect memorized text via entropy, repetition, and structural patterns
- credential_leakage: Pattern-based detection for API keys, tokens, passwords (13 types)
- pii_disclosure_rate: Binary scorer for eval aggregation
- wilson_score_interval: Statistical confidence intervals for disclosure rates
- calculate_disclosure_rate_with_ci: Helper for disclosure rate analysis with 95% CI

Example notebook (examples/airt/pii_extraction_attacks.ipynb):
- TAP attacks with PII extraction transforms
- Eval-based disclosure rate testing with statistical confidence intervals
- Credential leakage detection examples

Tests:
- 21 transform tests (test_pii_extraction_transforms.py)
- 38 scorer tests (test_pii_advanced_scorers.py)
- All tests use static inputs, no LLM calls

Based on research:
- Carlini et al. (USENIX 2024): Extracting Training Data from LLMs
- PII-Scope Benchmark (arXiv 2410.06704): 48.9% extraction success rate
- Model Inversion Attacks (arXiv 2507.04478): Password/credential extraction
@dreadnode-renovate-bot dreadnode-renovate-bot bot added area/tests Changes to test files and testing infrastructure area/examples Changes to example code and demonstrations labels Jan 26, 2026
@rdheekonda rdheekonda changed the title Add PII probing transforms and scoring feat: Add PII probing transforms and scoring Jan 26, 2026
@rdheekonda rdheekonda added this pull request to the merge queue Jan 27, 2026
Merged via the queue into main with commit 34e9538 Jan 27, 2026
8 checks passed
@rdheekonda rdheekonda deleted the feat/pii-extraction-capabilities branch January 27, 2026 01:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/examples Changes to example code and demonstrations area/tests Changes to test files and testing infrastructure

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants