
AFRO POLIO LABORATORY DATA MANAGEMENT ETL WORKFLOW

Introduction

This repository hosts scripts to ease the ingestion, cleaning, loading, analysis, and dissemination of the AFRO polio laboratory ES and AFP databases. The general workflow for this data pipeline is shown in the diagram below.

(Workflow diagram)

Workflow Summary

Both the AFP and ES laboratory databases are shared as zipped mdb files by the 17 polio labs on a weekly basis: 15 files for ES and 17 for AFP, making a total of 32 mdb files. Once these files are downloaded and placed in their respective folders, the following steps are applied to generate the output results.

STEP 1: Bulk unzipping of zipped mdb files

The ES and AFP zipped files are placed in two different folders. An R function is used to automatically unzip the files from each folder and place them in two separate unzipped folders.
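
For illustration, a minimal base-R sketch of such a bulk-unzip helper is shown below; the repository's actual prep_unzip_files() (used later in this README) may be implemented differently. Note that base R's unzip() handles .zip archives only, so .rar files would need an external tool.

unzip_all <- function(input_folder, output_folder, del_existing = FALSE) {
  # Optionally clear stale extractions from a previous run
  if (del_existing && dir.exists(output_folder)) {
    unlink(output_folder, recursive = TRUE)
  }
  dir.create(output_folder, recursive = TRUE, showWarnings = FALSE)

  # Extract every .zip in the input folder into the output folder
  zips <- list.files(input_folder, pattern = "\\.zip$",
                     full.names = TRUE, ignore.case = TRUE)
  for (z in zips) {
    unzip(z, exdir = output_folder)
  }
  invisible(zips)
}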

STEP 2: Renaming mdb tables

Within both the ES and AFP unzipped mdb files, table naming is not consistent. We use an R function to loop through each file and rename its table so that a single table name is used across all files: EnvironmentalSampling for ES and Poliolab for the AFP lab. This naming further eases the data merging process.
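
As an illustration of what happens per file, the sketch below renames a table inside a single .mdb from R via RODBC (Windows with the Access ODBC driver). Access SQL has no table-rename statement, so the sketch copies the table and drops the original; the repository's rename_multi_mdb_tables() may work differently.

library(RODBC)

rename_mdb_table <- function(mdb_path, old_name, new_name) {
  con <- odbcConnectAccess2007(mdb_path)
  on.exit(odbcClose(con))
  # Copy the table under the new name, then drop the old one
  sqlQuery(con, sprintf("SELECT * INTO [%s] FROM [%s]", new_name, old_name))
  sqlQuery(con, sprintf("DROP TABLE [%s]", old_name))
}

# e.g. (hypothetical file and old table name):
# rename_mdb_table("lab1_es.mdb", "EnvSamples", "EnvironmentalSampling")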

STEP 3: Merging

Once the tables have been renamed, we merge them into one ES mdb file and one AFP lab mdb file. The merging could be done from R, but experience shows that writing to an mdb table from R is less efficient than merging directly in mdb. The AFRO Regional Office has an mdb script for both ES and AFP; these scripts are used at this stage to perform the merging quickly. After the merge, the merged data is brought back into R for a first-level data quality check and correction of obvious errors such as misspellings. Once this is done, the cleaned data is exported into a final mdb to be shared.
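
A hedged sketch of this first-level cleaning pass is shown below, assuming a hypothetical merged-file path and a hypothetical CountryName column; the actual checks in the repository may differ.

library(RODBC)
library(dplyr)

# Read the merged table back into R (path and column layout are assumptions)
con <- odbcConnectAccess2007("merged/ES_merged.mdb")
es <- sqlFetch(con, "EnvironmentalSampling")
odbcClose(con)

# Correct obvious errors such as stray whitespace and misspellings
es_clean <- es %>%
  mutate(
    CountryName = trimws(CountryName),
    CountryName = recode(CountryName,
                         "Cameroun" = "Cameroon",
                         "NIGERIA"  = "Nigeria")
  )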

STEP 4: Data Quality Check

The ES and AFP lab databases contain errors that cannot be resolved at the regional level and require feedback to the labs and countries. These errors include, but are not limited to, date inconsistencies, missing values, duplicates, and inconsistent results. In this step, using the merged database from the previous step, we generate error reports with a line list of the records that have issues and share them with the labs and countries for action. Resolution may be immediate or may take a long time.
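
For illustration, the sketch below builds such an error line list, assuming hypothetical column names (SampleID, DateCollected, DateReceived, FinalResult) and file paths.

library(RODBC)
library(dplyr)

con <- odbcConnectAccess2007("merged/ES_merged.mdb")
es <- sqlFetch(con, "EnvironmentalSampling")
odbcClose(con)

# Flag common issue types, then keep only the offending records
errors <- es %>%
  mutate(
    date_inconsistent = DateReceived < DateCollected,
    missing_result    = is.na(FinalResult)
  ) %>%
  group_by(SampleID) %>%
  mutate(duplicate = n() > 1) %>%
  ungroup() %>%
  filter(date_inconsistent | missing_result | duplicate)

# Export the line list for feedback to labs and countries
write.csv(errors, "ES_error_linelist.csv", row.names = FALSE)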

STEP 5: Analyses and Dissemination

After the mdbs are merged, they are used to produce performance indicators for both surveillance and lab processes. This data also feeds into the regional M&E dashboards.
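
As a toy example of one such indicator, assuming the es_clean data frame from the STEP 3 sketch and hypothetical DateCollected and DateResult columns, one could compute the share of samples with a result within 14 days of collection:

library(dplyr)

# Timeliness: % of samples with a lab result within 14 days of collection
timeliness <- es_clean %>%
  summarise(
    n_samples   = n(),
    pct_on_time = 100 * mean(
      as.numeric(DateResult - DateCollected) <= 14, na.rm = TRUE
    )
  )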

STEP 6: Data Sharing

Once everything is finalised, the data is shared across the different GPEI partners via email.

How to use the workflow

This section walks you through the details of how to use the workflow successfully. The workflow is split into five main sections: Extraction - Transformation - Loading - Data Quality Check - Analysis.

Data Extraction

This consists of manually downloading the files from emails, storing them in the appropriate folders, and unzipping them. The detailed steps are as follows.

1- Download all the zip files and place them in the input folder that corresponds to the database. For example, a zipped AFP database should be placed in the "1-input\zipped\AFP" folder, and an ES database in the "1-input\zipped\ES" folder. These folders should contain only compressed files, regardless of the format (.rar, .zip, etc.). The table below gives detailed information on where to get and download the data.

(Table: data sources and download locations)

2- Use the AFP_main or ES_main script, depending on which database you are processing, and run the following section of code within the file.

# Folders holding the zipped inputs and the unzipped outputs
input_folder <- "1-input/zipped/ES"
output_folder <- "1-input/unzipped/ES"

# Unzip ES files into the output folder, deleting any existing contents first
prep_unzip_files(input_folder, output_folder, del_existing = TRUE)

3- Rename all the tables in the mdbs to a single common name. Because the tables in the different mdbs have different names, making them difficult to track, you can either rename them manually or use the following line of code to interactively select and rename the tables within R. The example below renames the tables in the ES databases to EnvironmentalSampling.

# Rename tables
rename_multi_mdb_tables(output_folder, "EnvironmentalSampling")

4- After renaming the tables in the databases to a common name, we can merge all the tables into one using the following lines of code in the main scripts (ES or AFP).

# Extract and merge the tables into one
poliprep::prep_mdb_table_extract(output_folder, "EnvironmentalSampling") # this function is still under development
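
Since prep_mdb_table_extract() is still under development, the sketch below shows one plausible reading of the behaviour described above (fetch the renamed table from every unzipped .mdb and stack the rows); treat it as an assumption, not poliprep's actual API.

library(RODBC)
library(dplyr)

merge_mdb_tables <- function(folder, table_name) {
  mdbs <- list.files(folder, pattern = "\\.mdb$", full.names = TRUE)
  tables <- lapply(mdbs, function(f) {
    con <- odbcConnectAccess2007(f)
    on.exit(odbcClose(con))
    sqlFetch(con, table_name, stringsAsFactors = FALSE)
  })
  bind_rows(tables)  # requires compatible column types across files
}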

Data Transformation

This section addresses cleaning the data and flagging data quality issues.

Data Loading and Dissemination

This section addresses loading the data into the mdb database, and the generation and sharing of data quality reports.
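
A hedged sketch of the loading half of this step, assuming RODBC, a pre-created empty ES_final.mdb (hypothetical path), and the es_clean data frame from the earlier sketches:

library(RODBC)

# Write the cleaned data into the final .mdb to be shared
con <- odbcConnectAccess2007("ES_final.mdb")
sqlSave(con, es_clean, tablename = "EnvironmentalSampling",
        rownames = FALSE, safer = FALSE)  # safer = FALSE replaces an existing table
odbcClose(con)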

Development Roadmap

Credit

Owner : WHO AFRO PEP - Data & Information Management (DIM) Team

Focal Point : Derrick Demeveng

Contact : cheford@who.int / demeveng@gmail.com
