-
Notifications
You must be signed in to change notification settings - Fork 63
Processing data using SQL and local database
In this example we will use the Versatile Data Kit to develop a processing Data Job. This job will read data from a local SQLite database, process it and write the result to the same database.
Before you continue, make sure you are familiar with the Getting Started section of the wiki.
The relevant Data Job code is available here.
You can follow along and run this example Data Job on your machine to get first-hand experience working with Data Jobs; alternatively, you can use the available code as a template and extend it to make a Data Job that fits your use case more closely.
We will be using the chinook SQLite database. It is available here. Users of Unix-based operating systems can download it using the following commands, provided you have the curl and unzip tools installed:
curl https://www.sqlitetutorial.net/wp-content/uploads/2018/03/chinook.zip >> chinook.zip
unzip chinook.zip
rm -r chinook.zipYou should now find a chinook.db file in the same directory that you downloaded the original zip file in.
Alternatively, Windows users can download the zip file through the provided link, and extract it in a convenient location.
NB: After downloading and extracting the chinook.db file, place it in your local directory (the Data Job parent directory), as otherwise the sqlite3 module will create a new file named chinook.db and display an error stating that it cannot find the necessary table in it.
You can install Versatile Data Kit and the plugins required for this example by running the following commands from a terminal:
pip install quickstart-vdkNote that Versatile Data Kit requires Python 3.7+. See the Installation page for more details.
Your instance of Versatile Data Kit is pre-configured to connect to a SQLite database. We must, however, declare the path of the database we just downloaded as an environment variable.
export VDK_SQLITE_FILE=chinook.dbThe structure of our Data Job is as follows:
sqlite-example-job/
├── 10_drop_table.sql
├── 20_create_table.sql
├── 30_do_processing.sql
Note that the name of a Data Job is defined by the name of the directory in which its steps are contained.
The purpose of our Data Job sqlite-example-job is to extract the EmployeeId and names of employees who work with customers, and the number of customers they work with, and insert them into a newly-created table called customer_count_per_employee.
Our Data Job consists of three SQL steps. The reason the step names are prefixed by numbers is that steps are executed in alphabetical order, so it is a good practice to prefix the steps with numbers, which makes their order clear both to Versatile Data Kit and to other users who might read through the Data Job.
Click on each filename to expand the contents of the file.
10_drop_table.sql
DROP TABLE IF EXISTS customer_count_per_employee;
20_create_table.sql
CREATE TABLE customer_count_per_employee (EmployeeId, EmployeeFirstName, EmployeeLastName, CustomerCount);
30_do_processing.sql
INSERT INTO customer_count_per_employee SELECT SupportRepId, employees.FirstName, employees.LastName, COUNT(CustomerId) FROM (customers INNER JOIN employees ON customers.SupportRepId = employees.EmployeeId) GROUP BY SupportRepId;
Each SQL step is a separate query:
- The first step deletes the new table if it exists. This query only serves to make the Data Job repeatable;
- The second step creates the table we will be inserting data into;
- The third step performs the described processing and inserts the new data into the
customer_count_per_employeetable.
All three of these files must be located in the same /sqlite-example-job/ directory.
To run the Data Job, we navigate to the parent directory of the Job, and run the following command from a terminal:
vdk run sqlite-example-job/NB: If your current directory is not the parent directory of the Job, some command and path tweaks might be needed for the Job to complete successfully.
Upon successful completion of the Data Job, we should see a log similar to this:
Result logs
2021-08-30 09:34:42,143=1630305282[VDK] sqlite-example-job [INFO ] vdk.internal.builtin_plugins.run cli_run.py:66 run_job [OpId:1630305281-1bffad-30ea28]- Data Job execution summary: {
"data_job_name": "sqlite-example-job",
"execution_id": "1630305281-1bffad",
"start_time": "2021-08-30T06:34:42.131133",
"end_time": "2021-08-30T06:34:42.142670",
"status": "success",
"steps_list": [
{
"name": "10_drop_table.sql",
"type": "sql",
"start_time": "2021-08-30T06:34:42.131158",
"end_time": "2021-08-30T06:34:42.138567",
"status": "success",
"details": null,
"exception": null
},
{
"name": "20_create_table.sql",
"type": "sql",
"start_time": "2021-08-30T06:34:42.138622",
"end_time": "2021-08-30T06:34:42.140758",
"status": "success",
"details": null,
"exception": null
},
{
"name": "30_do_processing.sql",
"type": "sql",
"start_time": "2021-08-30T06:34:42.140801",
"end_time": "2021-08-30T06:34:42.142646",
"status": "success",
"details": null,
"exception": null
}
],
"exception": null
}
After running the Data Job, we can check whether the new table was populated correctly by using the sqlite-query command afforded to us by the vdk-sqlite plugin, which we can use to execute queries against the configured SQLite database without having to set up a Data Job:
vdk sqlite-query -q 'SELECT * FROM customer_count_per_employee'
We should see the following output:
- -------- ------- --
3 Jane Peacock 21
4 Margaret Park 20
5 Steve Johnson 18
- -------- ------- --
You can find a list of all Versatile Data Kit examples here.
SDK - Develop Data Jobs
SDK Key Concepts
Control Service - Deploy Data Jobs
Control Service Key Concepts
- Scheduling a Data Job for automatic execution
- Deployment
- Execution
- Production
- Properties and Secrets
Operations UI
Community
Contacts