Fix ~5x fetch regression by caching TelemetryCollector in DatabricksResultSet #1223

Merged

gopalldb merged 3 commits into main from gopal.lal/fix-telemetry-per-row-overhead on Feb 21, 2026
Conversation

@gopalldb (Collaborator) commented Feb 21, 2026

Summary

Problem

PR #1163 refactored telemetry collection from a singleton pattern to per-connection, but inadvertently introduced per-row overhead in DatabricksResultSet.next(). On every row iteration (~6.5M times for a 1025MB query), the code now traverses:

parentStatement.getStatement().getConnection()
  → cast to DatabricksConnection
  → getConnectionContext()
  → TelemetryCollectorManager.getInstance().getOrCreateCollector(context)
  → ConcurrentHashMap.computeIfAbsent()

This adds ~2.3 microseconds per row, compounding to ~15 seconds of overhead over 6.5M rows.
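The per-call traversal is easy to demonstrate in isolation. A minimal, self-contained sketch (stand-in names, not the driver's real classes) counting how often a manager-style `computeIfAbsent` lookup runs when resolved per row versus once up front:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

public class LookupCountDemo {
  // Stand-in for TelemetryCollectorManager's per-connection collector map.
  static final ConcurrentHashMap<String, Object> collectors = new ConcurrentHashMap<>();
  static final AtomicInteger lookups = new AtomicInteger();

  static Object getOrCreateCollector(String connectionId) {
    lookups.incrementAndGet(); // the traversal + hash + map probe happen on every call
    return collectors.computeIfAbsent(connectionId, k -> new Object());
  }

  public static void main(String[] args) {
    int rows = 1_000_000;

    // Unfixed path: resolve the collector on every row of next().
    for (int i = 0; i < rows; i++) {
      getOrCreateCollector("conn-1");
    }
    System.out.println("per-row lookups = " + lookups.get()); // 1000000

    // Fixed path: resolve once at construction, then reuse the reference.
    lookups.set(0);
    Object cached = getOrCreateCollector("conn-1");
    for (int i = 0; i < rows; i++) {
      // plain field read on the hot path; no map traversal
      if (cached == null) throw new AssertionError();
    }
    System.out.println("cached lookups = " + lookups.get()); // 1
  }
}
```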

Fix

Resolve the TelemetryCollector once in each constructor via a resolveTelemetryCollector() helper and store it in a final field. The next() method calls the cached collector directly, bypassing the entire per-row resolution chain. The connection-to-collector mapping is stable for the lifetime of a result set, so caching at construction time is safe.

Before:

public boolean next() throws SQLException {
  checkIfClosed();
  boolean hasNext = this.executionResult.next();
  TelemetryHelper.recordResultSetIteration(
      parentStatement, statementId, resultSetMetaData.getChunkCount(), hasNext);
  return hasNext;
}

After:

public boolean next() throws SQLException {
  checkIfClosed();
  boolean hasNext = this.executionResult.next();
  if (cachedTelemetryCollector != null) {
    cachedTelemetryCollector.recordResultSetIteration(
        statementId.toSQLExecStatementId(), resultSetMetaData.getChunkCount(), hasNext);
  }
  return hasNext;
}
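A minimal sketch of the construction-time pattern described above, with the driver's types replaced by stubs (the field and helper names follow the PR; everything else is assumed for illustration):

```java
// Sketch only: stand-ins for the driver's real classes. In the actual driver,
// resolveTelemetryCollector walks statement -> connection -> context -> manager.
public class CachedCollectorSketch {

  static final class TelemetryCollector {
    int events = 0;
    void recordResultSetIteration(String stmtId, long chunkCount, boolean hasNext) {
      events++; // the real collector buffers a telemetry event here
    }
  }

  static final class ResultSetStub {
    private final TelemetryCollector cachedTelemetryCollector; // final: resolved once

    ResultSetStub(Object parentStatement) {
      this.cachedTelemetryCollector = resolveTelemetryCollector(parentStatement);
    }

    private static TelemetryCollector resolveTelemetryCollector(Object parentStatement) {
      if (parentStatement == null) {
        return null; // e.g. a result set with no backing statement: skip telemetry
      }
      try {
        return new TelemetryCollector(); // stand-in for the manager lookup
      } catch (RuntimeException e) {
        return null; // telemetry setup must never break result-set construction
      }
    }

    boolean next(boolean hasNext) {
      if (cachedTelemetryCollector != null) { // plain field read on the hot path
        cachedTelemetryCollector.recordResultSetIteration("stmt-1", 0L, hasNext);
      }
      return hasNext;
    }
  }

  public static void main(String[] args) {
    // Null parent statement: next() still works, no NPE, no telemetry.
    ResultSetStub noTelemetry = new ResultSetStub(null);
    System.out.println("null-parent next ok = " + noTelemetry.next(true));

    ResultSetStub withTelemetry = new ResultSetStub(new Object());
    for (int i = 0; i < 3; i++) withTelemetry.next(true);
    System.out.println("events = " + withTelemetry.cachedTelemetryCollector.events);
  }
}
```

Keeping the field nullable and swallowing resolution failures preserves the old behavior for result sets that have no usable connection context.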

Benchmark Results

Regression Bisection (1025MB query, ~6.5M rows, Arrow cloud-fetch, Thrift mode, warm iteration)

| Build | Commit | Fetch (ms) | Rows/s | Per-row cost | Status |
|---|---|---|---|---|---|
| v3.1.1 (baseline) | 0e8b1ad | 5,037 | 1.29M | 0.78 us | ✅ Baseline |
| Pre-#1163 | 9c9ffe2 | 4,720 | 1.38M | 0.73 us | ✅ Matches baseline |
| Post-#1163 | 8a2f333 | 19,727 | 329K | 3.04 us | ❌ 3.9x regression |
| v3.1.2 HEAD (unfixed) | 47235a1 | 20,382 | 318K | 3.14 us | ❌ 4.0x regression |
| v3.1.2 (this fix) | eaa3f64 | 4,447 | 1.46M | 0.68 us | 12% faster than baseline |

Post-Fix Segment Latency (500K rows per segment, warm iteration)

The fix restores the expected bursty pattern (fast in-memory iteration alternating with chunk-download waits):

 500K rows:   73ms    (in-memory, fast)
   1M rows:  393ms    (chunk wait)
 1.5M rows:  147ms    (in-memory)
   2M rows:  317ms    (chunk wait)
 2.5M rows:  409ms    (chunk wait)
   3M rows:  282ms    (mixed)
 3.5M rows:  250ms    (mixed)
   4M rows:  314ms    (chunk wait)
 4.5M rows:  197ms    (in-memory)
   5M rows:  370ms    (chunk wait)
 5.5M rows:  361ms    (chunk wait)
   6M rows:  146ms    (in-memory)

vs. the unfixed v3.1.2, which showed a flat ~1,450ms per segment (dominated by the uniform per-row overhead).

Comprehensive Comparison (1025MB, warm iteration)

| Driver | Mode | Arrow | Fetch (ms) | Rows/s |
|---|---|---|---|---|
| Simba 2.7.6 | Thrift | N/A | 109,994 | 59.0K |
| OSS v3.1.1 | Thrift | Yes | 5,037 | 1.29M |
| OSS v3.1.1 | SEA | Yes | 4,202 | 1.54M |
| OSS v3.1.1 | Thrift | No | 163,051 | 39.8K |
| OSS v3.1.2 (unfixed) | Thrift | Yes | 20,382 | 318K |
| OSS v3.1.2 (unfixed) | SEA | Yes | 19,882 | 326K |
| OSS v3.1.2 (unfixed) | Thrift | No | 145,131 | 44.7K |
| OSS v3.1.2 (this fix) | Thrift | Yes | 4,447 | 1.46M |

NoArrow mode is unaffected by the regression (~145-163s for both versions) because the per-row telemetry overhead (~2.3us) is negligible compared to the ~25us/row cost of non-Arrow iteration.
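The reported numbers are internally consistent; a quick arithmetic check (all inputs taken from the tables and text above):

```java
import java.util.Locale;

public class OverheadMath {
  public static void main(String[] args) {
    double rows = 6_500_000;
    double perRowOverheadUs = 2.3;  // added telemetry cost per row (from the bisection)
    double arrowRowUs = 0.73;       // pre-#1163 Arrow per-row cost
    double noArrowRowUs = 25.0;     // approximate non-Arrow per-row cost

    // Total added wall time over the full result set: roughly 15 seconds.
    System.out.printf(Locale.ROOT, "total overhead = %.2f s%n",
        rows * perRowOverheadUs / 1e6);

    // In Arrow mode the overhead is ~3x the baseline per-row cost, so the
    // predicted slowdown lands near the measured 3.9-4.0x.
    System.out.printf(Locale.ROOT, "predicted arrow slowdown = %.1fx%n",
        (arrowRowUs + perRowOverheadUs) / arrowRowUs);

    // In NoArrow mode the same overhead is under 10% of each row's cost,
    // which is why that mode shows no regression.
    System.out.printf(Locale.ROOT, "noarrow overhead fraction = %.0f%%%n",
        100 * perRowOverheadUs / noArrowRowUs);
  }
}
```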

Test plan

  • mvn test -Dtest=DatabricksResultSetTest — all 48 tests pass (43 existing + 5 new)
  • New tests cover: next() with cached telemetry, null parentStatement (no NPE), exception handling in resolution, null connection context, collector resolved once at construction
  • Benchmark verified: fetch time drops from 20,382ms → 4,447ms (4.6x improvement)
  • CI pipeline

🤖 Generated with Claude Code

NO_CHANGELOG=true
OVERRIDE_FREEZE=true

gopalldb and others added 2 commits February 21, 2026 03:37
…esultSet

PR #1163 refactored telemetry from singleton to per-connection, but
introduced per-row overhead in DatabricksResultSet.next() by traversing
the parentStatement → connection → context → ConcurrentHashMap chain
on every row (~6.5M times for a 1025MB query). This added ~2.3us/row,
compounding to ~15s of overhead.

Fix: resolve the TelemetryCollector once at construction time and store
it in a final field. The next() method calls the cached collector
directly, eliminating per-row object traversal, type casts, UUID
hashing, and ConcurrentHashMap lookups.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- **Enhanced `enableMultipleCatalogSupport` behavior**: When this parameter is disabled (`enableMultipleCatalogSupport=0`), metadata operations (such as `getSchemas()`, `getTables()`, `getColumns()`, etc.) now return results only when the catalog parameter is either `null` or matches the current catalog. For any other catalog name, an empty result set is returned. This ensures metadata queries are restricted to the current catalog context. When enabled (`enableMultipleCatalogSupport=1`), metadata operations continue to work across all accessible catalogs.

### Fixed
- Optimized telemetry collection during result set iteration by caching the TelemetryCollector at construction time, eliminating per-row lookup overhead for large result sets.
Collaborator

as this was not released, do we need a changelog? wondering if that would confuse readers

Collaborator Author

fair point, will remove

@samikshya-db (Collaborator) left a comment

Thanks for fixing this!

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
gopalldb merged commit 1af2dec into main on Feb 21, 2026
12 of 15 checks passed
gopalldb deleted the gopal.lal/fix-telemetry-per-row-overhead branch on February 21, 2026 05:23
    boolean hasNext = this.executionResult.next();
-   TelemetryHelper.recordResultSetIteration(
-       parentStatement, statementId, resultSetMetaData.getChunkCount(), hasNext);
+   if (cachedTelemetryCollector != null) {
Collaborator

why do we have to do this with every next? this is super wasteful and anti-pattern with no benefit. Please fix this to have it on fetch request boundary.

@vikrantpuppala (Collaborator) commented Feb 22, 2026

+1, also why we needed #1210
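One way the fetch-boundary recording suggested above could look — a hypothetical sketch, not code from this PR: accumulate a row count and emit a single telemetry event when the chunk changes or the result set is exhausted:

```java
import java.util.ArrayList;
import java.util.List;

public class BoundaryDemo {
  interface Collector { void recordChunkIterated(long chunkIndex, long rowsInChunk); }

  static final class BoundaryRecordingResultSet {
    private final Collector collector;
    private long currentChunk = -1;
    private long rowsInCurrentChunk = 0;

    BoundaryRecordingResultSet(Collector collector) { this.collector = collector; }

    // hasNext/chunkOfRow are fed in by the harness below; the real driver
    // would read them from its execution result.
    boolean next(boolean hasNext, long chunkOfRow) {
      if (!hasNext || chunkOfRow != currentChunk) {
        flush(); // emit at the chunk boundary, not once per row
        currentChunk = chunkOfRow;
      }
      if (hasNext) rowsInCurrentChunk++;
      return hasNext;
    }

    private void flush() {
      if (rowsInCurrentChunk > 0 && collector != null) {
        collector.recordChunkIterated(currentChunk, rowsInCurrentChunk);
        rowsInCurrentChunk = 0;
      }
    }
  }

  public static void main(String[] args) {
    List<String> events = new ArrayList<>();
    BoundaryRecordingResultSet rs = new BoundaryRecordingResultSet(
        (chunk, rows) -> events.add("chunk " + chunk + ": " + rows + " rows"));

    long[] chunkOfRow = {0, 0, 0, 1, 1, 2, 2, 2, 2}; // 9 rows across 3 chunks
    for (long chunk : chunkOfRow) rs.next(true, chunk);
    rs.next(false, -1); // exhausted: flush the trailing chunk

    System.out.println("events = " + events.size()); // 3 chunks -> 3 events
    events.forEach(System.out::println);
  }
}
```

This keeps the per-row path telemetry-free entirely, at the cost of coarser (per-chunk) iteration data.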
