Hedera ETL populates a BigQuery dataset with transactions and records generated by the Hedera Mainnet (or Testnet, if so configured).
- Extract: A stream of transactions (and records) is ingested from a Google Cloud Storage bucket
- Transform: Filters for important fields, formats data types, etc.
- Load: Streaming insert into the BigQuery dataset
- Record files contain Protobuf-serialized Hedera transactions published by the Hedera Mirror Node. More details can be found here.
- An Apache Beam pipeline pulls record files from Google Cloud Storage and inserts them into BigQuery. GCP Dataflow is used as the runner for the pipeline (a minimal sketch of such a pipeline follows this list).
- Hedera ETL can be run in streaming or batch mode, as shown below.
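As a rough illustration of the pipeline's shape, the sketch below reads files from a GCS bucket, maps them to rows, and streams them into BigQuery. It is a minimal sketch, not the project's actual code: the transform, bucket path, and table name are placeholders, and the real pipeline parses Protobuf record files rather than text lines.

```java
import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;

public class EtlPipelineSketch {

  /** Placeholder transform: wraps each input line in a single-column BigQuery row. */
  static class ToTableRowFn extends DoFn<String, TableRow> {
    @ProcessElement
    public void processElement(@Element String line, OutputReceiver<TableRow> out) {
      out.output(new TableRow().set("raw", line));
    }
  }

  public static void main(String[] args) {
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().create();
    Pipeline pipeline = Pipeline.create(options);

    pipeline
        // Extract: read files staged in a GCS bucket (placeholder path; the real
        // pipeline parses Protobuf record files rather than text lines).
        .apply("ReadFromGcs", TextIO.read().from("gs://bucket-with-rcd-files/*"))
        // Transform: keep the fields of interest and convert them to table rows.
        .apply("ToTableRows", ParDo.of(new ToTableRowFn()))
        // Load: streaming insert into a BigQuery table (placeholder table name).
        .apply("WriteToBigQuery",
            BigQueryIO.writeTableRows()
                .to("my-project:open_dataset.transactions")
                .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
                .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
                .withMethod(BigQueryIO.Write.Method.STREAMING_INSERTS));

    pipeline.run();
  }
}
```

Submitted with --runner=DataflowRunner, the same graph runs on GCP Dataflow, as in the commands further below.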
The schema of the BigQuery tables is currently managed by Apache Beam; it will be managed by Terraform in the near future.
The Terraform CLI will be needed to create the tables. See the Terraform docs for how to set it up.
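For reference, a Beam-managed schema is just a TableSchema built in code and handed to BigQueryIO. The snippet below is only a hypothetical illustration of that mechanism, with made-up column names; consult the pipeline source for the actual table definitions.

```java
import com.google.api.services.bigquery.model.TableFieldSchema;
import com.google.api.services.bigquery.model.TableSchema;
import java.util.Arrays;

public class TransactionSchemaSketch {

  /** Hypothetical subset of columns; the real tables define many more fields. */
  static TableSchema transactionSchema() {
    return new TableSchema().setFields(Arrays.asList(
        new TableFieldSchema().setName("consensusTimestamp").setType("INTEGER").setMode("REQUIRED"),
        new TableFieldSchema().setName("transactionID").setType("STRING").setMode("REQUIRED"),
        new TableFieldSchema().setName("transactionBytes").setType("BYTES").setMode("NULLABLE")));
  }
}
```

Passing such a schema to BigQueryIO with .withSchema(...) and CreateDisposition.CREATE_IF_NEEDED is the usual way a Beam pipeline creates its own tables; once Terraform owns the schema, the tables would instead be provisioned ahead of time and the pipeline would write with CREATE_NEVER.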
The cloud infrastructure for the ETL should exist before deployment.
For the requirements to deploy on GCP Dataflow, refer to the deployment section.
Configure the GCP project ID and cloud infrastructure.
To log in to your GCP project, use:

gcloud auth application-default login

To set up the infrastructure, see the Terraform docs.
Streaming mode:
./gradlew run --args=" \
--rewindToTimestamp=2025-04-08T15:10:00 \
--startingTimestamp=2025-04-02T00:00:00 \
--mode=STREAMING \
--lastValidHash=bff780e2d9b659bed609a5385f7544b3600af09fcb0ec1c4c4499af270f0a16607f19194e465a4a662eb5d56fe46cc21 \
--ingestionDate=2025-04-02 \
--inputNodes=0.0.3,0.0.4,0.0.5 \
--inputBucket=bucket-with-rcd-files \
--openAccessDataset=open_dataset \
--restrictedAccessDataset=restricted_dataset \
--runner=DataflowRunner \
--project=hedera-tests \
--region=us-central1 \
--tempLocation=gs://dataflow-temp/tmp \
--numWorkers=1 \
--maxNumWorkers=4 \
--workerMachineType=e2-standard-2 \
--usePublicIps=false \
--streaming \
--experiments=enable_streaming_engine \
--jdkAddOpenModules=java.base/java.util.concurrent.atomic=ALL-UNNAMED \
"Batch mode:
./gradlew run --args=" \
--mode=BATCH \
--lastValidHash=bff780e2d9b659bed609a5385f7544b3600af09fcb0ec1c4c4499af270f0a16607f19194e465a4a662eb5d56fe46cc21 \
--ingestionDate=2025-04-02 \
--inputNodes=0.0.3,0.0.4,0.0.5 \
--inputBucket=bucket-with-rcd-files \
--openAccessDataset=open_dataset \
--restrictedAccessDataset=restricted_dataset \
--runner=DataflowRunner \
--project=hedera-tests \
--region=us-central1 \
--tempLocation=gs://dataflow-temp/tmp \
--numWorkers=1 \
--maxNumWorkers=4 \
--workerMachineType=e2-standard-2 \
--usePublicIps=false \
--jdkAddOpenModules=java.base/java.util.concurrent.atomic=ALL-UNNAMED \
"
Extra arguments:
- --startAbove=<uri pattern>: forces the pipeline to start reading from this file onward, even if it is not the first file that matches your starting point.
- --disableMergeHistoryInput=true: some records require tracking their state; if you don't need that, this option disables it.
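Pipeline-specific flags like the ones used above are typically declared on a custom Beam options interface so that Beam can parse and validate them. The sketch below is a hypothetical reconstruction covering only a subset of the flags, not the project's actual interface; the names mirror the CLI flags, but the types, annotations, and defaults are assumptions.

```java
import org.apache.beam.sdk.options.Default;
import org.apache.beam.sdk.options.Description;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.Validation;

/** Hypothetical options interface; names mirror the CLI flags, types and defaults are guesses. */
public interface EtlPipelineOptionsSketch extends PipelineOptions {

  @Description("STREAMING or BATCH")
  @Validation.Required
  String getMode();
  void setMode(String value);

  @Description("GCS bucket containing the record files")
  @Validation.Required
  String getInputBucket();
  void setInputBucket(String value);

  @Description("Comma-separated node account IDs, e.g. 0.0.3,0.0.4,0.0.5")
  String getInputNodes();
  void setInputNodes(String value);

  @Description("BigQuery dataset for openly accessible tables")
  String getOpenAccessDataset();
  void setOpenAccessDataset(String value);

  @Description("BigQuery dataset for restricted tables")
  String getRestrictedAccessDataset();
  void setRestrictedAccessDataset(String value);

  @Description("Force the pipeline to start reading from this URI pattern onward")
  String getStartAbove();
  void setStartAbove(String value);

  @Description("Skip state tracking for records that would otherwise merge history")
  @Default.Boolean(false)
  Boolean getDisableMergeHistoryInput();
  void setDisableMergeHistoryInput(Boolean value);
}
```

A main() would typically bind these with PipelineOptionsFactory.fromArgs(args).withValidation().as(EtlPipelineOptionsSketch.class); the Dataflow-specific flags (--runner, --project, --region, --numWorkers, and so on) are handled by Beam's own DataflowPipelineOptions.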
- Set up a GCS bucket (used for staging, templates, and the temp location) and a Docker repository.
DOCKER_IMAGE=... # Set your docker name
TEMPLATE_FILE=gs://... # Set your template file name

- Build and upload template to GCS bucket
./gradlew deployFlexTemplate \
-Pflex.dockerImage=$DOCKER_IMAGE \
-Pflex.templateFile=$TEMPLATE_FILE

- Start Dataflow job using the template
gcloud dataflow flex-template run \
etl-bigquery-`date +"%Y%m%d-%H%M%S%z"` \
--template-file-gcs-location=$TEMPLATE_FILE \
--region=europe-west1 \
--max-workers=5 \
--parameters=ingestionDate=`date +"%Y-%m-%d"`,inputBucket="<bucket with your data>",inputNodes="0.0.3",...

The controller service account can be configured by adding --service-account-email=my-service-account-name@<project-id>.iam.gserviceaccount.com. See Controller service account for more details.
This project is governed by the Contributor Covenant Code of Conduct. By participating, you are expected to uphold this code of conduct. Please report unacceptable behavior to oss@hedera.com.
