A secure data engineering pipeline implemented in Python using the Titanic dataset. It demonstrates best practices for handling sensitive information (PII) by applying AES-128 encryption (Fernet) before storage in a SQLite database, ensuring data privacy while maintaining analytical capabilities through secure decryption methods.

abdelfatah-chaib/secure-data-ingestion-python

Secure Data Pipeline

A secure data engineering pipeline implemented in Python, using the Titanic dataset as an example. This project demonstrates best practices for handling sensitive information (PII) by applying AES encryption before storage in a SQLite database, ensuring data privacy while maintaining analytical capabilities.

Overview

This project simulates a real-world scenario where data privacy is paramount. Instead of storing sensitive passenger information in plain text, this system encrypts specific columns before insertion into the database. It allows for:

  1. Secure Storage: PII is unreadable without the specific encryption key.
  2. Safe Analysis: General statistics can be computed on non-sensitive data without needing decryption.
  3. Authorized Access: Specific records can be retrieved and decrypted on-demand for authorized users.

Key Features

  • AES Encryption: Uses the Fernet symmetric encryption recipe (AES-128 in CBC mode, authenticated with HMAC-SHA256) to protect sensitive columns.
  • SQL Injection Protection: All database interactions use parameterized SQL queries to prevent injection attacks.
  • Hybrid Data Model: Stores analytical data in clear text while keeping PII encrypted.
  • Metadata Tracking: Maintains a separate table for encryption metadata (algorithm used, key creation date).
  • Object-Oriented Design: Encapsulates logic in DataEncryption and Database classes for better maintainability.
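
The hybrid data model, metadata table, and parameterized queries described above can be sketched with the standard library alone. This is a minimal illustration, not the project's actual schema: the table names, column names, and helper functions are assumptions, and the opaque byte strings stand in for Fernet tokens.

```python
import sqlite3
from datetime import datetime, timezone


def create_schema(conn):
    """Create a hypothetical hybrid schema: clear analytical columns plus one encrypted column."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS passengers ("
        "passenger_id INTEGER PRIMARY KEY, "
        "survived INTEGER, pclass INTEGER, age REAL, "
        "name_encrypted BLOB)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS encryption_metadata ("
        "id INTEGER PRIMARY KEY, algorithm TEXT, key_created_at TEXT)"
    )


def insert_passenger(conn, passenger_id, survived, pclass, age, name_token):
    """Insert one row; the ? placeholders keep values out of the SQL text."""
    conn.execute(
        "INSERT INTO passengers VALUES (?, ?, ?, ?, ?)",
        (passenger_id, survived, pclass, age, name_token),
    )


def survival_rate_by_class(conn):
    """Aggregate over clear-text columns only; no decryption is involved."""
    rows = conn.execute(
        "SELECT pclass, AVG(survived) FROM passengers "
        "GROUP BY pclass ORDER BY pclass"
    ).fetchall()
    return dict(rows)


conn = sqlite3.connect(":memory:")
create_schema(conn)
conn.execute(
    "INSERT INTO encryption_metadata (algorithm, key_created_at) VALUES (?, ?)",
    ("Fernet (AES-128-CBC + HMAC-SHA256)", datetime.now(timezone.utc).isoformat()),
)

# Placeholder tokens: in a real run these would be Fernet ciphertexts.
insert_passenger(conn, 1, 0, 3, 22.0, b"opaque-token-1")
insert_passenger(conn, 2, 1, 1, 38.0, b"opaque-token-2")
insert_passenger(conn, 3, 1, 3, 26.0, b"opaque-token-3")

rates = survival_rate_by_class(conn)
```

Because every value is bound through a placeholder, a string that looks like SQL is stored as data rather than executed.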

Tech Stack

  • Language: Python 3.x
  • Data Manipulation: pandas, numpy
  • Security/Cryptography: cryptography
  • Database: sqlite3 (Standard Library)

Installation

  1. Clone the repository:

    git clone https://github.com/abdelfatah-chaib/secure-data-ingestion-python.git
  2. Install dependencies:

    pip install pandas cryptography
  3. Prepare the data: Ensure the Titanic dataset CSV file is present in the repository root directory.

Usage

The project is designed to be run sequentially, either in the provided Jupyter Notebook or by converting the notebook to a script.

  1. Initialize the System: Running the pipeline first checks for an encryption_key.key file. If the file is missing, a new high-entropy key is generated.

    ⚠️ CRITICAL WARNING: Do not lose encryption_key.key. If it is deleted, all encrypted data in the database becomes permanently irretrievable.

  2. Run the Pipeline: Execute the notebook cells to:

    • Load and clean the raw CSV data.
    • Encrypt sensitive columns.
    • Insert records into the database.
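
The key initialization and column encryption steps above might look roughly like the following. This is a sketch under stated assumptions rather than the notebook's code: the helper names, the choice of `Name` as the sensitive column, and the sample record are all hypothetical. Only `Fernet.generate_key`, `Fernet.encrypt`, and `Fernet.decrypt` come from the cryptography library.

```python
import tempfile
from pathlib import Path

from cryptography.fernet import Fernet


def load_or_create_key(path):
    """Return the key stored at path, generating and saving a new one if absent."""
    key_file = Path(path)
    if key_file.exists():
        return key_file.read_bytes()
    key = Fernet.generate_key()  # 32 random bytes, URL-safe base64 encoded
    key_file.write_bytes(key)
    return key


def encrypt_columns(record, sensitive, fernet):
    """Return a copy of record with the listed columns replaced by Fernet tokens."""
    out = dict(record)
    for col in sensitive:
        out[col] = fernet.encrypt(str(record[col]).encode("utf-8"))
    return out


def decrypt_value(token, fernet):
    """Decrypt one token back to text; requires the same key used to encrypt."""
    return fernet.decrypt(token).decode("utf-8")


# Demo in a temporary directory so no real key file is overwritten.
with tempfile.TemporaryDirectory() as tmp:
    key_path = Path(tmp) / "encryption_key.key"
    key = load_or_create_key(key_path)       # first call creates the file
    same_key = load_or_create_key(key_path)  # second call reuses it

    fernet = Fernet(key)
    row = {"PassengerId": 1, "Name": "Example Passenger", "Age": 22.0}  # made-up record
    stored = encrypt_columns(row, ["Name"], fernet)
    recovered = decrypt_value(stored["Name"], fernet)
```

In this sketch the non-sensitive columns pass through unchanged, so they remain available for analysis without the key.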

Security Specifications

  • Algorithm: AES-128-CBC (Advanced Encryption Standard).
  • Authentication: HMAC-SHA256 (Hash-based Message Authentication Code) to prevent tampering.
  • Padding: PKCS7 padding scheme.
  • Implementation: Powered by the cryptography library's Fernet recipe, which guarantees that a message cannot be manipulated or read without the key.
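
The properties listed above can be observed directly on a Fernet token. The sketch below decodes a token according to the published Fernet layout (version byte, 8-byte timestamp, 16-byte IV, ciphertext, 32-byte HMAC) and then checks that a modified token or a different key is rejected. The helper `can_decrypt` and the sample plaintext are illustrative assumptions.

```python
import base64

from cryptography.fernet import Fernet, InvalidToken

key = Fernet.generate_key()
fernet = Fernet(key)
token = fernet.encrypt(b"sensitive value")

# Token layout: version (1) | timestamp (8) | IV (16) | ciphertext | HMAC (32)
raw = base64.urlsafe_b64decode(token)
version = raw[0]
iv = raw[9:25]
ciphertext = raw[25:-32]
mac = raw[-32:]


def can_decrypt(f, tok):
    """Return True if the token authenticates and decrypts under f, else False."""
    try:
        f.decrypt(tok)
        return True
    except InvalidToken:
        return False


# Flip one bit inside the ciphertext region; the HMAC check should then fail.
tampered_raw = bytearray(raw)
tampered_raw[30] ^= 0x01
tampered = base64.urlsafe_b64encode(bytes(tampered_raw))

# A freshly generated, unrelated key should not authenticate the token either.
wrong_key = Fernet(Fernet.generate_key())

results = {
    "original": can_decrypt(fernet, token),
    "tampered": can_decrypt(fernet, tampered),
    "wrong_key": can_decrypt(wrong_key, token),
}
```

Since the HMAC is verified before decryption, a failed check raises `InvalidToken` without exposing any plaintext.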

This project is for educational purposes demonstrating secure data handling practices.
