@netroms (Collaborator) commented Jan 18, 2026

Add Distributed Tracing with Grafana Tempo and OpenTelemetry

Summary

  • Add Grafana Tempo for distributed trace storage and visualization
  • Instrument DHIS2 with OpenTelemetry Java Agent for automatic tracing of HTTP requests and database queries
  • Enable seamless correlation between traces, logs, and metrics in Grafana
  • Add the baosystems PostGIS image as an option, because it supports the Arm64 architecture
  • Add update-admin-password to depends_on in the Traefik service, so DHIS2 is not served publicly if setting the admin password fails

Architecture

┌────────────────────────────────────────────────────────────┐
│                     DHIS2 Container                        │
│  ┌─────────────────────────────────────────────────────┐   │
│  │  DHIS2 Application                                  │   │
│  │         │                                           │   │
│  │  OpenTelemetry Java Agent (auto-instrumentation)    │   │
│  └─────────────────────────────────────────────────────┘   │
│              │ traces (OTLP)       │ logs (with trace_id)  │
└──────────────┼─────────────────────┼───────────────────────┘
               ▼                     ▼
         ┌──────────┐          ┌──────────┐
         │  Tempo   │◄────────►│   Loki   │
         └──────────┘          └──────────┘
               │                     │
               └──────────┬──────────┘
                          ▼
                    ┌──────────┐
                    │ Grafana  │
                    └──────────┘

Changes

New Services:

  • tempo - Grafana Tempo for trace storage (OTLP receiver on port 4318)
  • tempo-init - Volume permission initialization
  • otel-init - Downloads OpenTelemetry Java Agent JAR

DHIS2 Instrumentation:

  • OpenTelemetry Java Agent added via JAVA_TOOL_OPTIONS
  • Auto-instruments: HTTP requests, JDBC/PostgreSQL, Hibernate, Spring MVC
  • Log pattern updated to include trace_id and span_id for correlation
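
As a rough illustration of the JAVA_TOOL_OPTIONS wiring, the Python sketch below appends a -javaagent flag to whatever options are already set instead of replacing them, since an environment variable set in an overlay typically replaces the earlier value wholesale. The agent path, the service name dhis2, and the endpoint http://tempo:4318 are assumptions for illustration, not values taken from this PR. The OTEL_* names are the standard OpenTelemetry SDK environment variables.

```python
import shlex

# Hypothetical location where otel-init might place the agent JAR.
AGENT_JAR = "/opt/otel/opentelemetry-javaagent.jar"


def with_otel_agent(existing_options: str, agent_jar: str = AGENT_JAR) -> str:
    """Return JAVA_TOOL_OPTIONS with the -javaagent flag appended exactly once."""
    flag = f"-javaagent:{agent_jar}"
    options = shlex.split(existing_options)
    if flag not in options:
        options.append(flag)
    return shlex.join(options)


def otel_environment(existing_options: str = "") -> dict:
    """Build the environment a DHIS2 container might receive (assumed values)."""
    return {
        "JAVA_TOOL_OPTIONS": with_otel_agent(existing_options),
        "OTEL_SERVICE_NAME": "dhis2",  # assumed service name
        "OTEL_TRACES_EXPORTER": "otlp",
        "OTEL_EXPORTER_OTLP_PROTOCOL": "http/protobuf",  # OTLP over HTTP
        "OTEL_EXPORTER_OTLP_ENDPOINT": "http://tempo:4318",  # assumed hostname
    }
```

Calling with_otel_agent on a value that already contains the flag leaves it unchanged, so the helper is safe to apply more than once.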

What This Enables

  1. View traces in Grafana Tempo: see distributed traces with spans for HTTP requests and SQL queries
  2. Jump from logs to traces: click a log line in Loki to see the full trace
  3. Jump from traces to logs: click a trace span to see all related logs
  4. See SQL query timing: the OTel agent captures JDBC calls as spans with query duration
  5. Debug slow requests: follow a request from Traefik → DHIS2 → Database
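
Log-to-trace correlation depends on the trace ID being recoverable from each log line. The sketch below shows that extraction in Python, assuming a layout of the form trace_id=<32 hex chars> span_id=<16 hex chars>. The actual pattern in this PR's log4j2.xml is not shown here, so the regular expression and the sample line are hypothetical.

```python
import re
from typing import Optional, Tuple

# Assumed log layout: "... trace_id=<32 hex> span_id=<16 hex> ..."
TRACE_CONTEXT = re.compile(
    r"trace_id=(?P<trace_id>[0-9a-f]{32})\s+span_id=(?P<span_id>[0-9a-f]{16})"
)


def extract_trace_context(log_line: str) -> Optional[Tuple[str, str]]:
    """Return (trace_id, span_id) if the line carries trace context, else None."""
    match = TRACE_CONTEXT.search(log_line)
    if match is None:
        return None
    return match.group("trace_id"), match.group("span_id")


# Hypothetical log line for illustration only.
SAMPLE = (
    "2026-01-18T10:15:00 INFO trace_id=4bf92f3577b34da6a3ce929d0e0e4736 "
    "span_id=00f067aa0ba902b7 Request completed"
)
```

Lines logged outside a traced request would have empty IDs and yield None, so they simply get no trace link.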

Test Plan

  • Verify Tempo starts and is healthy
  • Make HTTP requests to DHIS2 and verify traces appear in Tempo
  • Verify OpenTelemetry agent attaches to DHIS2 (check logs for agent version)
  • Verify SQL queries appear as spans within traces
  • Test log-to-trace correlation in Grafana (click TraceID in Loki)
  • Test trace-to-log correlation (click "Logs" button in Tempo trace view)
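
The first two checks could be scripted along these lines. This is a minimal sketch, assuming Tempo's HTTP API is reachable on localhost port 3200 (the port used in Tempo's example configurations; the port this overlay exposes may differ). It relies on Tempo's /ready and /api/traces/<traceID> endpoints.

```python
import urllib.error
import urllib.request

# Assumed address of Tempo's HTTP API; adjust to the deployment.
TEMPO_URL = "http://localhost:3200"


def trace_url(trace_id: str, base_url: str = TEMPO_URL) -> str:
    """Build the URL for fetching a single trace by ID."""
    return f"{base_url.rstrip('/')}/api/traces/{trace_id}"


def tempo_is_ready(base_url: str = TEMPO_URL, timeout: float = 5.0) -> bool:
    """Return True if the /ready endpoint answers with HTTP 200."""
    url = f"{base_url.rstrip('/')}/ready"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        return False


def trace_exists(trace_id: str, base_url: str = TEMPO_URL, timeout: float = 5.0) -> bool:
    """Return True if Tempo has stored the trace, False on HTTP 404."""
    try:
        with urllib.request.urlopen(trace_url(trace_id, base_url), timeout=timeout) as response:
            return response.status == 200
    except urllib.error.HTTPError as error:
        if error.code == 404:
            return False
        raise
```

A trace ID copied from a DHIS2 log line can be passed to trace_exists after a short delay, since traces may take a few seconds to become queryable.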

Add profiling overlay with Grafana Tempo and OpenTelemetry Java Agent
for distributed tracing of DHIS2 requests. Changes include:

- Add Tempo service for trace storage and querying
- Add OpenTelemetry Java Agent download with SHA256 verification
- Configure trace context in log4j2.xml for log correlation
- Add httpMethod: POST to Prometheus datasource for long queries
- Add Tempo datasource with trace-to-log and trace-to-metric links
- Create detailed README for the profiling overlay
- Document JAVA_TOOL_OPTIONS override behavior

Signed-off-by: Morten Svanaes <msvanaes@dhis2.org>

The SHA256 checksum for opentelemetry-javaagent.jar v2.11.0 was
incorrect, causing the otel-init container to fail verification.

Correct hash: 4cff4ab46179260a61fc0d884f3f170cfbd9d2962dd260be2cff31262d0c7618

Signed-off-by: Morten Svanaes <msvanaes@dhis2.org>
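
For readers unfamiliar with the verification step, the following Python sketch shows the general idea: hash the downloaded JAR and refuse to continue on a mismatch. It is not the otel-init implementation, which is not shown in this conversation. The function names are hypothetical, and the expected digest is the one quoted in the commit message above.

```python
import hashlib
from pathlib import Path

# Checksum quoted in the commit message for opentelemetry-javaagent.jar v2.11.0.
EXPECTED_SHA256 = "4cff4ab46179260a61fc0d884f3f170cfbd9d2962dd260be2cff31262d0c7618"


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large JARs are not loaded into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path: Path, expected: str = EXPECTED_SHA256) -> None:
    """Raise ValueError if the file's SHA256 digest differs from the expected one."""
    actual = sha256_of(path)
    if actual != expected.lower():
        raise ValueError(
            f"SHA256 mismatch for {path}: expected {expected}, got {actual}"
        )
```

Failing loudly here is what surfaced the incorrect checksum: the init step stops before an unverified agent can be attached to the JVM.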
@radnov (Contributor) left a comment

Looks good! Minor comment added.

@bobjolliffe (Collaborator) left a comment

A few small comments on the Tempo config.

docker compose run --rm compose-docs > docs/environment-variables.md

-COMPOSE_CMD = docker compose -f docker-compose.yml -f overlays/traefik-dashboard/docker-compose.yml -f overlays/monitoring/docker-compose.yml -f overlays/glowroot/docker-compose.yml
+COMPOSE_CMD = docker compose -f docker-compose.yml -f overlays/traefik-dashboard/docker-compose.yml -f overlays/monitoring/docker-compose.yml -f overlays/profiling/docker-compose.yml -f overlays/glowroot/docker-compose.yml

Collaborator

For profiling you will probably not want to have both tempo and glowroot. Better to pick one (or none)?

Collaborator, Author

Maybe it's better to have a separate make command for this, like "make profiling"?
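
One way to picture the trade-off discussed here is a small helper that assembles the compose command from a list of overlays and rejects combining the two profilers. This is only a sketch of the idea in Python, not the Makefile change itself. The exclusivity rule and the function name are hypothetical; the overlay paths follow the COMPOSE_CMD line above.

```python
from typing import Iterable, List

BASE_FILE = "docker-compose.yml"
# Overlays treated as alternatives, following the review comment (an assumption).
EXCLUSIVE_PROFILERS = {"profiling", "glowroot"}


def compose_cmd(overlays: Iterable[str]) -> List[str]:
    """Build the docker compose argument list for the base file plus overlays."""
    names = list(overlays)
    chosen = EXCLUSIVE_PROFILERS.intersection(names)
    if len(chosen) > 1:
        raise ValueError(f"pick only one of: {', '.join(sorted(chosen))}")
    cmd = ["docker", "compose", "-f", BASE_FILE]
    for name in names:
        cmd += ["-f", f"overlays/{name}/docker-compose.yml"]
    return cmd
```

A dedicated "make profiling" target would correspond to calling compose_cmd with the profiling overlay and without glowroot.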

Added details on Tempo default overrides for DHIS2's high trace volume, including settings and their purposes.
Updated README to include error message for high trace volume.