A secure data engineering pipeline implemented in Python, using the Titanic dataset as an example. This project demonstrates best practices for handling sensitive information (PII) by applying AES encryption before storage in a SQLite database, ensuring data privacy while maintaining analytical capabilities.
This project simulates a real-world scenario where data privacy is paramount. Instead of storing sensitive passenger information in plain text, this system encrypts specific columns before insertion into the database. It allows for:
- Secure Storage: PII is unreadable without the specific encryption key.
- Safe Analysis: General statistics can be computed on non-sensitive data without needing decryption.
- Authorized Access: Specific records can be retrieved and decrypted on-demand for authorized users.
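The three capabilities above can be sketched in a few lines. This is a minimal illustration, not the project's actual schema: the `passengers` table, the choice of `name` as the PII column, the sample rows, and the freshly generated key are all assumptions made for the example.

```python
import sqlite3
from cryptography.fernet import Fernet

# Illustrative key only; the project loads its key from encryption_key.key.
fernet = Fernet(Fernet.generate_key())

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE passengers ("
    "id INTEGER PRIMARY KEY, name_enc BLOB, age REAL, survived INTEGER)"
)

# Hypothetical sample rows (not taken from the real dataset).
rows = [("Alice Example", 30.0, 1), ("Bob Example", 40.0, 0)]
for name, age, survived in rows:
    # Secure storage: only the Fernet token of the name reaches the database.
    conn.execute(
        "INSERT INTO passengers (name_enc, age, survived) VALUES (?, ?, ?)",
        (fernet.encrypt(name.encode("utf-8")), age, survived),
    )

# Safe analysis: aggregate over clear-text columns, no key required.
avg_age, survival_rate = conn.execute(
    "SELECT AVG(age), AVG(survived) FROM passengers"
).fetchone()

# Authorized access: fetch one record and decrypt its PII on demand.
(token,) = conn.execute(
    "SELECT name_enc FROM passengers WHERE id = ?", (1,)
).fetchone()
decrypted_name = fernet.decrypt(token).decode("utf-8")
```

Only the last step needs the key, so analysts working with `age` or `survived` never have to handle it.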
- AES Encryption: Uses the Fernet symmetric encryption recipe (AES-128 in CBC mode, authenticated with HMAC-SHA256) to protect sensitive columns.
- SQL Injection Protection: All database interactions use parameterized SQL queries to prevent injection attacks.
- Hybrid Data Model: Stores analytical data in clear text while keeping PII encrypted.
- Metadata Tracking: Maintains a separate table for encryption metadata (algorithm used, key creation date).
- Object-Oriented Design: Encapsulates logic in DataEncryption and Database classes for better maintainability.
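The parameterized-query and metadata-tracking features can be illustrated with the standard library alone. In this sketch the table names, column names, and the hostile input are hypothetical, and a placeholder byte string stands in for a real Fernet token.

```python
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE passengers (id INTEGER PRIMARY KEY, ticket TEXT, name_enc BLOB)"
)
conn.execute(
    "CREATE TABLE encryption_metadata (algorithm TEXT, key_created_at TEXT)"
)

# Metadata tracking: record the algorithm and the key creation date.
conn.execute(
    "INSERT INTO encryption_metadata (algorithm, key_created_at) VALUES (?, ?)",
    ("Fernet (AES-128-CBC, HMAC-SHA256)", datetime.now(timezone.utc).isoformat()),
)

# Hybrid model: clear-text ticket next to an encrypted name (placeholder bytes).
conn.execute(
    "INSERT INTO passengers (ticket, name_enc) VALUES (?, ?)",
    ("T-001", b"<ciphertext>"),
)

# A hostile value bound through "?" is treated as data, never as SQL.
hostile = "T-001' OR '1'='1"
matches = conn.execute(
    "SELECT id FROM passengers WHERE ticket = ?", (hostile,)
).fetchall()
remaining = conn.execute("SELECT COUNT(*) FROM passengers").fetchone()[0]
metadata_rows = conn.execute("SELECT algorithm FROM encryption_metadata").fetchall()
```

Had the same value been concatenated into the SQL string, the `OR '1'='1'` clause would have matched every row; with a bound parameter it matches none.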
- Language: Python 3.x
- Data Manipulation: pandas, numpy
- Security/Cryptography: cryptography
- Database: sqlite3 (Standard Library)
- Clone the repository:
  git clone https://github.com/abdelfatah-chaib/secure-data-ingestion-python.git
- Install dependencies:
  pip install pandas cryptography
- Prepare the Data: Ensure the Titanic CSV data file is present in the repository's root directory.
The project is designed to be run sequentially, either in the provided Jupyter Notebook or after converting the notebook to a script.
- Initialize the System: Running the script first checks for encryption_key.key. If the file is missing, the script generates a new high-entropy key.
  ⚠️ CRITICAL WARNING: Do not lose encryption_key.key. If it is deleted, all encrypted data in the database will be permanently irretrievable.
- Run the Pipeline: Execute the notebook cells to:
  - Load and clean the raw CSV data.
  - Encrypt sensitive columns.
  - Insert records into the database.
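The key-initialization step and the warning about losing the key can be sketched as follows. The helper name `load_or_create_key` is hypothetical, and a temporary directory stands in for the project root; the project's own code may be organized differently.

```python
import tempfile
from pathlib import Path
from cryptography.fernet import Fernet, InvalidToken

def load_or_create_key(path: Path) -> bytes:
    """Return the key stored at `path`, generating and saving one if absent."""
    if path.exists():
        return path.read_bytes()
    key = Fernet.generate_key()  # 32 random bytes, URL-safe base64-encoded
    path.write_bytes(key)
    return key

# A temporary directory stands in for the project root in this sketch.
with tempfile.TemporaryDirectory() as workdir:
    key_path = Path(workdir) / "encryption_key.key"
    first_key = load_or_create_key(key_path)   # file missing: key is generated
    second_key = load_or_create_key(key_path)  # file present: same key is reused

    # Encrypt a hypothetical sensitive value, then decrypt with the reloaded key.
    token = Fernet(first_key).encrypt(b"Example Passenger")
    recovered = Fernet(second_key).decrypt(token)

    # Simulate a lost key: a different key cannot decrypt the stored token.
    try:
        Fernet(Fernet.generate_key()).decrypt(token)
        readable_without_key = True
    except InvalidToken:
        readable_without_key = False
```

Because a replacement key cannot read existing tokens, backing up the key file separately from the database is worth considering.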
- Algorithm: AES-128-CBC (Advanced Encryption Standard).
- Authentication: HMAC-SHA256 (Hash-based Message Authentication Code) to prevent tampering.
- Padding: PKCS7 padding scheme.
- Implementation: Powered by the cryptography library's Fernet recipe, which guarantees that a message cannot be manipulated or read without the key.
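To make these properties concrete, the sketch below splits a Fernet token into its parts according to the published Fernet token layout (version byte, timestamp, IV, ciphertext, HMAC) and shows that flipping a single bit causes decryption to be rejected. The plaintext and key are illustrative only.

```python
import base64
from cryptography.fernet import Fernet, InvalidToken

fernet = Fernet(Fernet.generate_key())
token = fernet.encrypt(b"sensitive value")  # 15 bytes, padded to one AES block

# Token layout: version (1) | timestamp (8) | IV (16) | ciphertext | HMAC (32)
raw = base64.urlsafe_b64decode(token)
version = raw[0]          # 0x80 in the current Fernet specification
iv = raw[9:25]            # random 16-byte initialization vector
ciphertext = raw[25:-32]  # AES-128-CBC output, PKCS7-padded to 16-byte blocks
hmac_tag = raw[-32:]      # HMAC-SHA256 over everything before it

# Flip one bit inside the ciphertext and re-encode the token.
tampered_raw = bytearray(raw)
tampered_raw[30] ^= 0x01
tampered = base64.urlsafe_b64encode(bytes(tampered_raw))

try:
    fernet.decrypt(tampered)
    tamper_detected = False
except InvalidToken:
    tamper_detected = True  # HMAC check fails before any decryption happens

original = fernet.decrypt(token)  # the untouched token still decrypts
```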
This project is for educational purposes and demonstrates secure data handling practices.