
Hedera ETL


Hedera ETL populates a BigQuery dataset with transactions and records generated by the Hedera Mainnet (or Testnet, if so configured).

  • Extract: A stream of transactions (and records) is ingested from a Google Cloud Storage bucket
  • Transform: Filters for important fields, formats data types, etc.
  • Load: Streaming insert into the BigQuery dataset (see the query sketch below)
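
Once loaded, the data can be inspected directly with the BigQuery CLI. A minimal sketch, assuming the pipeline writes to a table named transactions in the open-access dataset configured later in this README (the dataset and table names here are assumptions for illustration):

# Inspect recently ingested rows; substitute the dataset and table
# your pipeline actually writes to
bq query --use_legacy_sql=false 'SELECT * FROM `open_dataset.transactions` LIMIT 10'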

Overview

  • Record files contain Protobuf-serialized Hedera transactions published by the Hedera Mirror Node. More details can be found here.
  • An Apache Beam pipeline pulls record files from Google Cloud Storage and inserts them into BigQuery. GCP Dataflow is used as the runner for the pipeline.
  • Hedera ETL can be run either from a local machine or on GCP Dataflow; both are described below.

Setup

BigQuery

The schema of the BigQuery tables is for now managed by Apache Beam. It will be managed by Terraform in the near future.

Creating tables

The terraform CLI is needed to create the tables. See the terraform docs for how to set it up.
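
A minimal sketch of the table-creation step, assuming the table definitions live in a terraform module in this repository (the terraform/bigquery path is hypothetical):

# Initialize and apply the terraform module that defines the BigQuery tables;
# "terraform/bigquery" is an assumed module path
cd terraform/bigquery
terraform init
terraform plan   # review the tables that will be created
terraform apply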

ETL to BigQuery

Requirements

The cloud infrastructure for the ETL should exist before deployment.

For the requirements to deploy on GCP Dataflow, refer to Deployment.

Common parameters

Configure the GCP project id and the cloud infrastructure.

To log in to your GCP project, use:

gcloud auth application-default login
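
The project id can then be set as the default for subsequent gcloud commands; hedera-tests is just the example project id used later in this README:

# Set the default GCP project for subsequent gcloud commands
gcloud config set project hedera-tests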

To set up the infrastructure, see the terraform docs.

Running from local machine

Streaming mode:

./gradlew run --args=" \
--rewindToTimestamp=2025-04-08T15:10:00 \
--startingTimestamp=2025-04-02T00:00:00 \
--mode=STREAMING \
--lastValidHash=bff780e2d9b659bed609a5385f7544b3600af09fcb0ec1c4c4499af270f0a16607f19194e465a4a662eb5d56fe46cc21 \
--ingestionDate=2025-04-02 \
--inputNodes=0.0.3,0.0.4,0.0.5 \
--inputBucket=bucket-with-rcd-files \
--openAccessDataset=open_dataset \
--restrictedAccessDataset=restricted_dataset \
--runner=DataflowRunner \
--project=hedera-tests \
--region=us-central1 \
--tempLocation=gs://dataflow-temp/tmp \
--numWorkers=1 \
--maxNumWorkers=4 \
--workerMachineType=e2-standard-2 \
--usePublicIps=false \
--streaming \
--experiments=enable_streaming_engine \
--jdkAddOpenModules=java.base/java.util.concurrent.atomic=ALL-UNNAMED \
"

Batch mode:

./gradlew run --args=" \
--mode=BATCH \
--lastValidHash=bff780e2d9b659bed609a5385f7544b3600af09fcb0ec1c4c4499af270f0a16607f19194e465a4a662eb5d56fe46cc21 \
--ingestionDate=2025-04-02 \
--inputNodes=0.0.3,0.0.4,0.0.5 \
--inputBucket=bucket-with-rcd-files \
--openAccessDataset=open_dataset \
--restrictedAccessDataset=restricted_dataset \
--runner=DataflowRunner \
--project=hedera-tests \
--region=us-central1 \
--tempLocation=gs://dataflow-temp/tmp \
--numWorkers=1 \
--maxNumWorkers=4 \
--workerMachineType=e2-standard-2 \
--usePublicIps=false \
--jdkAddOpenModules=java.base/java.util.concurrent.atomic=ALL-UNNAMED \
"

Extra arguments:

  • --startAbove=<uri pattern>: forces the pipeline to start reading above this file, even if it is not the first file that matches your starting point (see the usage sketch after this list)
  • --disableMergeHistoryInput=true: some records require tracking their state; if you do not need that, this option disables it
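
A minimal usage sketch: the extra arguments are appended to the --args string of the run commands shown above (the URI pattern here is hypothetical):

# Batch run with the extra arguments appended (abridged; keep the remaining
# parameters from the full batch-mode command above)
./gradlew run --args=" \
--mode=BATCH \
--startAbove=gs://bucket-with-rcd-files/2025-04-02T00_00_00Z.rcd \
--disableMergeHistoryInput=true \
... \
"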

Running on GCP Dataflow

  1. Set up the GCS bucket and Docker repository used for staging, templates, and the temp location.

     DOCKER_IMAGE=... # Set your docker image name
     TEMPLATE_FILE=gs://... # Set your template file name

  2. Build and upload the template to the GCS bucket.

     ./gradlew deployFlexTemplate \
      -Pflex.dockerImage=$DOCKER_IMAGE \
      -Pflex.templateFile=$TEMPLATE_FILE

  3. Start the Dataflow job using the template.

     gcloud dataflow flex-template run \
      etl-bigquery-`date +"%Y%m%d-%H%M%S%z"` \
      --template-file-gcs-location=$TEMPLATE_FILE \
      --region=europe-west1 \
      --max-workers=5 \
      --parameters=ingestionDate=`date +"%Y-%m-%d"`,inputBucket="<bucket with your data>",inputNodes="0.0.3",...
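
After submission, the job's status can be confirmed from the CLI; this is a standard gcloud command, not specific to this repository:

# List active Dataflow jobs in the region to confirm the job started
gcloud dataflow jobs list --region=europe-west1 --status=active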

The controller service account can be configured by adding --service-account-email=my-service-account-name@<project-id>.iam.gserviceaccount.com. See Controller service account for more details.
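
For illustration, the same flex-template run with the flag in place (the service account name is hypothetical; the elided parameters are unchanged):

gcloud dataflow flex-template run \
 etl-bigquery-`date +"%Y%m%d-%H%M%S%z"` \
 --template-file-gcs-location=$TEMPLATE_FILE \
 --region=europe-west1 \
 --service-account-email=my-service-account-name@<project-id>.iam.gserviceaccount.com \
 ...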

More documentation

  • Deployment
  • Configurations

Code of Conduct

This project is governed by the Contributor Covenant Code of Conduct. By participating, you are expected to uphold this code of conduct. Please report unacceptable behavior to oss@hedera.com.

License

Apache License 2.0