Hedera ETL populates a BigQuery dataset with transactions and records generated by the Hedera Mainnet (or Testnet, if so configured).
- Extract: A stream of transactions (and records) is ingested from a Google Cloud Storage bucket
- Transform: Filters for important fields, formats data types, etc.
- Load: Streaming insert into the BigQuery dataset
- Record files contain Protobuf-serialized Hedera transactions published by the Hedera Mirror Node. More details can be found here.
- An Apache Beam pipeline pulls record files from Google Cloud Storage and inserts them into BigQuery. GCP Dataflow is used as the runner for the pipeline (a minimal sketch of such a pipeline follows this list).
- Hedera ETL can be run in streaming or batch mode, as shown below.
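As a rough illustration of the pipeline's shape, the sketch below reads files from a GCS bucket, maps them to rows, and streams them into BigQuery. It is a minimal sketch, not the project's actual code: the transform, bucket path, and table name are placeholders, and the real pipeline parses Protobuf record files rather than text lines.

```java
import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;

public class EtlPipelineSketch {

  /** Placeholder transform: wraps each input line in a single-column BigQuery row. */
  static class ToTableRowFn extends DoFn<String, TableRow> {
    @ProcessElement
    public void processElement(@Element String line, OutputReceiver<TableRow> out) {
      out.output(new TableRow().set("raw", line));
    }
  }

  public static void main(String[] args) {
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().create();
    Pipeline pipeline = Pipeline.create(options);

    pipeline
        // Extract: read files staged in a GCS bucket (placeholder path; the real
        // pipeline parses Protobuf record files rather than text lines).
        .apply("ReadFromGcs", TextIO.read().from("gs://bucket-with-rcd-files/*"))
        // Transform: keep the fields of interest and convert them to table rows.
        .apply("ToTableRows", ParDo.of(new ToTableRowFn()))
        // Load: streaming insert into a BigQuery table (placeholder table name).
        .apply("WriteToBigQuery",
            BigQueryIO.writeTableRows()
                .to("my-project:open_dataset.transactions")
                .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
                .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
                .withMethod(BigQueryIO.Write.Method.STREAMING_INSERTS));

    pipeline.run();
  }
}
```

Submitted with --runner=DataflowRunner, the same graph runs on GCP Dataflow, as in the commands further below.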
The schema of the BigQuery tables is currently managed by Apache Beam; it will be managed by Terraform in the near future.
The Terraform CLI will be needed to create the tables. See the Terraform docs for how to set it up.
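For reference, a Beam-managed schema is just a TableSchema built in code and handed to BigQueryIO. The snippet below is only a hypothetical illustration of that mechanism, with made-up column names; consult the pipeline source for the actual table definitions.

```java
import com.google.api.services.bigquery.model.TableFieldSchema;
import com.google.api.services.bigquery.model.TableSchema;
import java.util.Arrays;

public class TransactionSchemaSketch {

  /** Hypothetical subset of columns; the real tables define many more fields. */
  static TableSchema transactionSchema() {
    return new TableSchema().setFields(Arrays.asList(
        new TableFieldSchema().setName("consensusTimestamp").setType("INTEGER").setMode("REQUIRED"),
        new TableFieldSchema().setName("transactionID").setType("STRING").setMode("REQUIRED"),
        new TableFieldSchema().setName("transactionBytes").setType("BYTES").setMode("NULLABLE")));
  }
}
```

Passing such a schema to BigQueryIO with .withSchema(...) and CreateDisposition.CREATE_IF_NEEDED is the usual way a Beam pipeline creates its own tables; once Terraform owns the schema, the tables would instead be provisioned ahead of time and the pipeline would write with CREATE_NEVER.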
The cloud infrastructure for the ETL should exist before deployment.
For the requirements to deploy on GCP Dataflow, refer to the deployment section.
Configure the GCP project ID and cloud infrastructure.
To log in to your GCP project, use:

gcloud auth application-default login

To set up the infrastructure, see the Terraform docs.
Streaming mode:
./gradlew run --args=" \
--rewindToTimestamp=2025-04-08T15:10:00 \
--startingTimestamp=2025-04-02T00:00:00 \
--mode=STREAMING \
--lastValidHash=bff780e2d9b659bed609a5385f7544b3600af09fcb0ec1c4c4499af270f0a16607f19194e465a4a662eb5d56fe46cc21 \
--ingestionDate=2025-04-02 \
--inputNodes=0.0.3,0.0.4,0.0.5 \
--inputBucket=bucket-with-rcd-files \
--openAccessDataset=open_dataset \
--restrictedAccessDataset=restricted_dataset \
--runner=DataflowRunner \
--project=hedera-tests \
--region=us-central1 \
--tempLocation=gs://dataflow-temp/tmp \
--numWorkers=1 \
--maxNumWorkers=4 \
--workerMachineType=e2-standard-2 \
--usePublicIps=false \
--streaming \
--experiments=enable_streaming_engine \
--jdkAddOpenModules=java.base/java.util.concurrent.atomic=ALL-UNNAMED \
"Batch mode:
./gradlew run --args=" \
--mode=BATCH \
--lastValidHash=bff780e2d9b659bed609a5385f7544b3600af09fcb0ec1c4c4499af270f0a16607f19194e465a4a662eb5d56fe46cc21 \
--ingestionDate=2025-04-02 \
--inputNodes=0.0.3,0.0.4,0.0.5 \
--inputBucket=bucket-with-rcd-files \
--openAccessDataset=open_dataset \
--restrictedAccessDataset=restricted_dataset \
--runner=DataflowRunner \
--project=hedera-tests \
--region=us-central1 \
--tempLocation=gs://dataflow-temp/tmp \
--numWorkers=1 \
--maxNumWorkers=4 \
--workerMachineType=e2-standard-2 \
--usePublicIps=false \
--jdkAddOpenModules=java.base/java.util.concurrent.atomic=ALL-UNNAMED \
"
Extra arguments:
- --startAbove=<uri pattern>: forces the pipeline to start reading from this file onward, even if it is not the first file that matches your starting point.
- --disableMergeHistoryInput=true: some records require tracking their state; if you don't need that, this option disables it.
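Pipeline-specific flags like the ones used above are typically declared on a custom Beam options interface so that Beam can parse and validate them. The sketch below is a hypothetical reconstruction covering only a subset of the flags, not the project's actual interface; the names mirror the CLI flags, but the types, annotations, and defaults are assumptions.

```java
import org.apache.beam.sdk.options.Default;
import org.apache.beam.sdk.options.Description;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.Validation;

/** Hypothetical options interface; names mirror the CLI flags, types and defaults are guesses. */
public interface EtlPipelineOptionsSketch extends PipelineOptions {

  @Description("STREAMING or BATCH")
  @Validation.Required
  String getMode();
  void setMode(String value);

  @Description("GCS bucket containing the record files")
  @Validation.Required
  String getInputBucket();
  void setInputBucket(String value);

  @Description("Comma-separated node account IDs, e.g. 0.0.3,0.0.4,0.0.5")
  String getInputNodes();
  void setInputNodes(String value);

  @Description("BigQuery dataset for openly accessible tables")
  String getOpenAccessDataset();
  void setOpenAccessDataset(String value);

  @Description("BigQuery dataset for restricted tables")
  String getRestrictedAccessDataset();
  void setRestrictedAccessDataset(String value);

  @Description("Force the pipeline to start reading from this URI pattern onward")
  String getStartAbove();
  void setStartAbove(String value);

  @Description("Skip state tracking for records that would otherwise merge history")
  @Default.Boolean(false)
  Boolean getDisableMergeHistoryInput();
  void setDisableMergeHistoryInput(Boolean value);
}
```

A main() would typically bind these with PipelineOptionsFactory.fromArgs(args).withValidation().as(EtlPipelineOptionsSketch.class); the Dataflow-specific flags (--runner, --project, --region, --numWorkers, and so on) are handled by Beam's own DataflowPipelineOptions.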
- Set up a GCS bucket (used for staging, templates, and the temp location) and a Docker repository.
DOCKER_IMAGE=... # Set your docker name
TEMPLATE_FILE=gs://... # Set your template file name

- Build and upload template to GCS bucket
./gradlew deployFlexTemplate \
-Pflex.dockerImage=$DOCKER_IMAGE \
-Pflex.templateFile=$TEMPLATE_FILE

- Start Dataflow job using the template
gcloud dataflow flex-template run \
etl-bigquery-`date +"%Y%m%d-%H%M%S%z"` \
--template-file-gcs-location=$TEMPLATE_FILE \
--region=europe-west1 \
--max-workers=5 \
--parameters=ingestionDate=`date +"%Y-%m-%d"`,inputBucket="<bucket with your data>",inputNodes="0.0.3",...

The controller service account can be configured by adding --service-account-email=my-service-account-name@<project-id>.iam.gserviceaccount.com. See Controller service account for more details.
This project is governed by the Contributor Covenant Code of Conduct. By participating, you are expected to uphold this code of conduct. Please report unacceptable behavior to oss@hedera.com.
