---
sidebar_position: 2
---

# Real-Time Blockchain Indexer: Build Reliable Indexers with Kafka Streams Instead of Archive Nodes, gRPC, or Webhook Services

Building a real-time blockchain indexer is one of the most challenging infrastructure tasks, whether you're tracking trades, monitoring token transfers, parsing internal calls, or building a comprehensive on-chain analytics platform. If you decide to run your own archive node, you need to plan for large SSD storage (multiple terabytes), fast disks with high IOPS, and robust backup and monitoring. If not, you're likely looking for reliable, low-latency access to blockchain data at scale.

We know about popular approaches like running your own archive nodes, setting up gRPC indexers with Geyser plugins (for Solana), using webhook-based services like Helius, relying on third-party RPC providers, or using graph-based indexers.

**So why do we need another option?**

## The Challenge of Blockchain Indexing

Blockchain indexing requires processing massive volumes of data in real time. Whether you're building a custom indexer for transaction traces, internal transactions, or on-chain data extraction, consider these requirements:

- **High Throughput**: Ethereum processes hundreds of transactions per block, while Solana handles thousands of transactions per second
- **Zero Data Loss**: Missing a single transaction can break your indexer's consistency
- **Low Latency**: For trading bots, MEV applications, and real-time dashboards, every millisecond counts
- **Data Completeness**: You need both raw blockchain data and enriched, decoded information—including internal transactions that don't emit events
- **Reliability**: Your indexer must handle network issues, node failures, and data gaps gracefully
- **Historical Backfilling**: You need to process historical blocks while maintaining a live subscription to new blocks

Traditional indexing approaches struggle with these requirements:

### The Archive Node Management Problem

**What is an Archive Node?**

An archive node retains all historical state data, enabled with flags such as (OpenEthereum-style):

- `--pruning=archive`: Maintains all states in the state trie (not just recent blocks)
- `--fat-db=on`: Roughly doubles storage by storing additional information to enumerate all accounts and storage keys
- `--tracing=on`: Enables transaction tracing for EVM traces

This trades massive disk space for expensive computation—essentially a full node with a "super heavy cache" enabled.

**The Infrastructure Reality**

- Massive storage growth: Running an archive node often requires many terabytes (for Ethereum, often more than 10 TB), and the footprint grows over time.
- Disk performance / IOPS bottlenecks: As chain history grows, read/write performance becomes critical; archive nodes are much slower unless backed by powerful SSDs or optimized storage.
- Long, resource-intensive synchronization: Bootstrapping a full sync can take days or weeks; replaying chain history is compute-heavy.
- Maintenance overhead and cost: Archive nodes require dedicated hardware, monitoring, and careful storage planning, making them costly and hard to manage for small teams or projects.
- Operational complexity and configuration risk: Proper configuration (e.g. pruning/`gcmode=archive`, snapshot management, backups, disk planning) is necessary; misconfiguration can lead to data loss or an unusable node.

**The Bottom Line**

Running an archive node is not a matter of hours or days—it's a matter of **weeks** even with enterprise hardware. The infrastructure requirements are substantial:

- **Storage**: Multiple terabytes of fast SSD storage (and growing)
- **Time**: Weeks of continuous syncing
- **Performance**: Degrades significantly as the database grows
- **Maintenance**: Constant monitoring and intervention required

Bitquery Kafka streams eliminate all of these challenges by providing pre-synced, maintained archive node data through a managed streaming service.

### Limitations of gRPC Indexers and Webhook-Based Services

Relying on gRPC indexers (like Solana's Geyser plugin approach), webhook-based services (like Helius), or third-party RPC providers introduces different problems that Bitquery Kafka streams solve:

- **gRPC Complexity**: Setting up gRPC indexers requires running validators with plugins (like Geyser for Solana), which is resource-intensive and complex—Bitquery Kafka eliminates this need
- **Webhook Reliability**: Webhook-based services can miss events during downtime, suffer delivery failures, and lack replay capabilities—Bitquery Kafka's retention solves this
- **Rate Limiting**: Most providers enforce strict rate limits that can throttle your indexing speed—Bitquery Kafka has no rate limits
- **Bandwidth Costs**: Many providers charge based on data transfer, making high-volume indexing expensive—Bitquery Kafka offers predictable pricing without bandwidth charges
- **Reliability Issues**: RPC endpoints, gRPC streams, and webhooks can go down, rate-limit you, or return inconsistent data—Bitquery Kafka provides enterprise-grade reliability
- **Data Gaps**: If your indexer crashes or loses its connection, you may miss transactions with no way to replay them—Bitquery Kafka's 24-hour retention allows you to replay missed data
- **Transaction Trace Limitations**: Many services don't provide full transaction traces or internal transaction data—Bitquery Kafka includes comprehensive transaction data

## Why Bitquery Kafka Streams Excel for Blockchain Indexing

Bitquery's Kafka streams are designed as an alternative to running your own archive nodes, setting up gRPC indexers, using webhook-based services, or relying on RPC providers for blockchain indexing. Unlike traditional approaches (self-hosted indexers, archive node-based indexing, gRPC indexers with Geyser plugins, or webhook services), Bitquery's Kafka streams provide several critical advantages:

### 1. Built-in Data Retention and Replay

**Bitquery Kafka streams' retention mechanism is a game-changer for blockchain indexing.**

Unlike RPC providers or WebSocket subscriptions that lose data on disconnect, Bitquery's Kafka streams retain messages for 24 hours. This means:

- **No Data Loss**: If your indexer crashes or needs to restart, you can resume from where you left off
- **Gap Recovery**: You can replay messages from any point within the retention window (see the replay sketch below)
- **Testing and Debugging**: You can reprocess historical data to test your indexing logic
- **Checkpoint Management**: Bitquery's Kafka consumer groups track your position, ensuring you never miss a message
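
To make replay concrete, here is a minimal sketch using the `confluent-kafka` Python client: it rewinds every partition of a topic to the offsets closest to a timestamp one hour in the past, which works for any point inside the 24-hour retention window. The broker address and topic name are placeholders, not Bitquery's actual endpoints.

```python
import time

from confluent_kafka import Consumer, TopicPartition

# Broker address, group id, and topic below are placeholders.
consumer = Consumer({
    "bootstrap.servers": "kafka.example.com:9093",
    "group.id": "my-indexer",
    "enable.auto.commit": False,
})

topic = "eth.dextrades.proto"  # example topic name
one_hour_ago_ms = int((time.time() - 3600) * 1000)

# Resolve, for every partition, the earliest offset at or after the timestamp.
metadata = consumer.list_topics(topic, timeout=10)
query = [TopicPartition(topic, p, one_hour_ago_ms)
         for p in metadata.topics[topic].partitions]
start_offsets = [tp for tp in consumer.offsets_for_times(query, timeout=10)
                 if tp.offset >= 0]  # offset is -1 if nothing is that recent

# Assign directly at those offsets and consume forward from there.
consumer.assign(start_offsets)
while True:
    msg = consumer.poll(1.0)
    if msg is None:
        continue
    if msg.error():
        raise RuntimeError(msg.error())
    print(msg.topic(), msg.partition(), msg.offset(), len(msg.value()))
```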

This retention capability is especially valuable when:

- Your indexer needs maintenance or updates
- You discover a bug and need to reprocess data
- Network issues cause temporary disconnections
- You want to test new indexing logic against recent historical data

### 2. More Data Than Raw Nodes or Archive Nodes

**Bitquery's Kafka streams provide more enriched data than raw blockchain nodes or archive nodes.**

Raw nodes and archive nodes give you basic transaction data, but Bitquery's streams include:

- **Decoded Data**: Smart contract calls are already decoded using ABI information
- **Transaction Traces**: Full transaction traces and internal transaction data without needing `debug_traceBlockByNumber`
- **Enriched Metadata**: Token names, symbols, decimals, and USD values are included
- **Protocol-Specific Parsing**: DEX trades, real-time balances, liquidity pool changes, and protocol events are pre-parsed
- **Internal Transactions**: Native ETH transfers and internal calls that don't emit events are included
- **Cross-Chain Consistency**: Same data structure across all supported blockchains
- **Both Raw and Decoded**: Access to both raw blockchain data and enriched, structured formats

This means you spend less time on block parsing, transaction trace extraction, and data processing, and more time building features. Instead of:

1. Fetching raw transaction data from archive nodes
2. Decoding function calls and parsing calldata
3. Parsing event logs
4. Extracting internal transactions that don't emit events
5. Looking up token metadata
6. Calculating USD values
7. Building historical backfilling pipelines

You receive all of this pre-processed and ready to use in a unified data feed.

### 3. Zero Infrastructure Management

**You don't need to manage any nodes or infrastructure.**

With Bitquery's Kafka streams:

- **No Archive Node Setup**: No need to sync, maintain, or upgrade archive nodes (which require significantly more resources than full nodes)
- **No gRPC Indexer Configuration**: No need to set up Geyser plugins or validator-level indexing for Solana
- **No Webhook Infrastructure**: No need to build webhook endpoints or handle webhook delivery failures
- **No Bandwidth Management**: All data transfer happens through Bitquery Kafka, with no per-request bandwidth limits
- **No Scaling Headaches**: Bitquery Kafka handles the scaling automatically
- **No Maintenance Windows**: Bitquery manages uptime, redundancy, and failover

This is particularly important for indexing because:

- **Bandwidth Efficiency**: Traditional RPC-based indexing can consume massive bandwidth. With Bitquery Kafka, you consume data once and process it efficiently
- **Cost Predictability**: Direct Kafka access pricing means no surprise bandwidth bills
- **Focus on Logic**: Spend your time building indexing logic, not managing infrastructure

### 4. Enterprise-Grade Reliability

**Bitquery's Kafka infrastructure is built for mission-critical blockchain indexing.**

Unlike RPC providers that can go down or rate-limit you, Bitquery's Kafka streams provide:

- **At-Least-Once Delivery**: Guarantees that every message is delivered at least once
- **Automatic Failover**: If one broker fails, others take over seamlessly
- **Consumer Groups**: Multiple consumers can share the load, with automatic rebalancing
- **Partitioning**: Data is distributed across partitions for parallel processing

For blockchain indexing, this means:

- **No Lost Transactions**: Even if your consumer crashes, messages are retained and can be replayed
- **Horizontal Scaling**: Add more consumer instances to process data faster (see the sketch below)
- **Fault Tolerance**: Your indexing system can survive individual component failures
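
As a brief illustration of how consumer groups distribute work (standard Kafka behavior, not Bitquery-specific): every instance of your indexer subscribes with the same `group.id`, and Kafka splits the topic's partitions across instances, rebalancing as they join or leave. Broker and topic names below are placeholders.

```python
from confluent_kafka import Consumer

def on_assign(consumer, partitions):
    # Runs after each rebalance; shows which partitions this instance now owns.
    print("assigned:", [p.partition for p in partitions])

# Run this same program on N machines with the same group.id:
# each instance receives a disjoint subset of partitions.
consumer = Consumer({
    "bootstrap.servers": "kafka.example.com:9093",  # placeholder
    "group.id": "dex-indexer",                      # shared across instances
    "enable.auto.commit": False,
})
consumer.subscribe(["eth.dextrades.proto"], on_assign=on_assign)
```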

### 5. Direct Access, No Limitations

**Bitquery's Kafka pricing model is designed for high-volume indexing.**

- **No Bandwidth Limits**: Consume as much data as you need without worrying about rate limits
- **No Request Limits**: Unlike RPC providers, there are no per-second request caps
- **Predictable Pricing**: Direct Kafka access means predictable costs, not variable bandwidth charges
- **High Throughput**: Process millions of transactions per day without throttling

This is crucial for indexing because:

- **Full Chain Coverage**: Index every transaction, not just a sample
- **Real-Time Processing**: Keep up with blockchain transaction rates
- **No Throttling**: Process data at your own pace without artificial limits

### 6. Both Raw and Decoded Data

**Access to both raw blockchain data and enriched, decoded formats.**

Bitquery provides:

- **Raw Data Streams**: Access to raw blocks, transactions, and logs for custom processing
- **Decoded Data Streams**: Pre-parsed transactions with decoded function calls and events
- **Protocol-Specific Topics**: Separate topics for DEX trades, token transfers, and transactions
- **Flexible Consumption**: Choose the level of data processing that fits your needs

This flexibility allows you to:

- **Start Simple**: Use decoded data for quick prototyping
- **Go Deep**: Switch to raw data when you need custom parsing
- **Mix and Match**: Use different topics for different parts of your indexer

Build robust indexing pipelines that:

- Never lose data (thanks to retention)
- Can replay and reprocess (for data quality)
- Handle historical backfilling while maintaining a live subscription
- Scale horizontally (with consumer groups)
- Extract on-chain data without running archive nodes

## Getting Started with Bitquery Kafka-Based Indexing

Building a real-time indexer with Bitquery's Kafka streams is straightforward:

1. **Get Kafka Access**: Contact Bitquery sales by filling out the [form on the website](https://bitquery.io/forms/api) for Kafka credentials
2. **Choose Your Topics**: Select the topics that match your indexing needs. The list is available [here](https://docs.bitquery.io/docs/streams/kafka-streaming-concepts/#complete-list-of-topics)
3. **Set Up Consumers**: Create Kafka consumers with proper offset management (see the sketch below)
4. **Process Messages**: Parse protobuf messages and update your index
5. **Handle Failures**: Use Bitquery Kafka's 24-hour retention to recover from crashes
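
A minimal consumer sketch tying steps 3–5 together, using the `confluent-kafka` Python client. The connection settings are placeholders (Bitquery supplies the real brokers and credentials), and `decode_and_index` stands in for your own protobuf parsing and storage logic.

```python
from confluent_kafka import Consumer

def decode_and_index(payload: bytes) -> None:
    # Placeholder: parse the protobuf payload with Bitquery's published
    # schemas and write the decoded records to your own store.
    ...

consumer = Consumer({
    # Placeholders -- use the brokers and credentials Bitquery issues to you.
    "bootstrap.servers": "kafka.example.com:9093",
    "security.protocol": "SASL_SSL",
    "sasl.mechanisms": "SCRAM-SHA-512",
    "sasl.username": "<username>",
    "sasl.password": "<password>",
    "group.id": "my-indexer",
    "auto.offset.reset": "latest",
    "enable.auto.commit": False,  # commit manually, only after indexing succeeds
})
consumer.subscribe(["eth.transactions.proto"])  # example topic

try:
    while True:
        msg = consumer.poll(timeout=1.0)
        if msg is None:
            continue
        if msg.error():
            print("consumer error:", msg.error())
            continue
        decode_and_index(msg.value())
        # Committing after processing gives at-least-once semantics:
        # a crash between processing and commit replays the message.
        consumer.commit(message=msg, asynchronous=False)
finally:
    consumer.close()
```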

## Tutorial Tidbits: Building Real-Time Indexers with Bitquery Kafka

### Why Bitquery Kafka is Ideal for Blockchain Indexing

Bitquery Kafka streams provide several advantages over archive node-based indexing, gRPC indexers, webhook-based services, or RPC-based indexing that make them perfect for building real-time blockchain indexers:

**1. Data Retention for Reliability**

- Bitquery Kafka streams retain messages for **24 hours**, allowing you to recover from crashes or restarts
- If your indexer goes down, you can resume from the last processed offset
- Unlike RPC providers, you don't need to worry about missing transactions during downtime

**2. More Data Than Raw Nodes or Archive Nodes**

- Bitquery's Kafka streams include **decoded smart contract calls** and **enriched metadata**
- **Transaction traces and internal transactions** are included without needing `debug_traceBlockByNumber`
- Pre-parsed DEX trades, token transfers, and protocol events
- Both **raw and decoded data** available in separate topics
- USD values and token metadata included automatically
- Native ETH transfers and internal calls that don't emit events are captured

**3. Zero Infrastructure Management**

- **No archive node setup required**: Bitquery manages all blockchain nodes (including archive nodes)
- **No gRPC indexer configuration**: No need for Geyser plugins or validator-level indexing
- **No webhook infrastructure**: No webhook endpoints or delivery handling needed
- **No bandwidth limits**: Direct Kafka access means no per-request throttling
- **No scaling headaches**: Kafka handles horizontal scaling automatically
- Focus on your indexing logic and on-chain data extraction, not infrastructure maintenance

**4. Cost-Effective for High-Volume Indexing**

- **Predictable pricing**: Direct Kafka access, not variable bandwidth charges
- **No rate limits**: Process millions of transactions per day without throttling
- **Efficient consumption**: Consume data once and process it multiple times if needed

**5. Enterprise-Grade Reliability**

- **At-least-once delivery**: Guarantees no message loss
- **Automatic failover**: Seamless handling of broker failures
- **Consumer groups**: Share load across multiple indexer instances

### Quick Tips for Indexer Development

**Handling Duplicates:**

- Messages may have duplicates in Kafka topics
- Implement idempotent processing: track processed transaction hashes
- Use a fast lookup store to check whether a transaction has already been processed (see the sketch below)
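
A minimal in-process sketch of such idempotent processing: a bounded, insertion-ordered set of recently seen transaction hashes. In production you would more likely use Redis or a unique constraint in your database; `extract_tx_hash` is a hypothetical helper.

```python
from collections import OrderedDict

class DedupCache:
    """Bounded record of recently seen transaction hashes (illustrative)."""

    def __init__(self, max_size: int = 1_000_000) -> None:
        self._seen: OrderedDict[str, None] = OrderedDict()
        self._max_size = max_size

    def seen_before(self, tx_hash: str) -> bool:
        if tx_hash in self._seen:
            return True
        self._seen[tx_hash] = None
        if len(self._seen) > self._max_size:
            self._seen.popitem(last=False)  # evict the oldest hash
        return False

cache = DedupCache()
# Inside the consume loop (extract_tx_hash is hypothetical):
#     if not cache.seen_before(extract_tx_hash(msg.value())):
#         decode_and_index(msg.value())
```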

**Recovery After Crashes:**

- Bitquery Kafka's retention window (24 hours) allows you to replay recent data
- Store your processing state (message offset and partition details) as you go
- On restart, you can seek to a specific offset if needed (see the sketch below)
- This is a major advantage over RPC providers, gRPC indexers, and webhook-based services, which don't offer replay capabilities
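
A sketch of offset-based recovery, assuming you checkpoint the partition and offset of each fully indexed message in your own store (all names and values below are placeholders):

```python
from confluent_kafka import Consumer, TopicPartition

# Hypothetical checkpoint loaded from your own database: the last
# message that was fully indexed before the crash.
checkpoint = {"topic": "eth.transactions.proto", "partition": 0, "offset": 123456}

consumer = Consumer({
    "bootstrap.servers": "kafka.example.com:9093",  # placeholder
    "group.id": "my-indexer",
    "enable.auto.commit": False,
})

# Resume one message past the checkpoint. Any offset still inside the
# 24-hour retention window can be replayed this way.
consumer.assign([TopicPartition(
    checkpoint["topic"],
    checkpoint["partition"],
    checkpoint["offset"] + 1,
)])
```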

**Choosing the Right Topic:**

- Use `*.transactions.proto` for comprehensive transaction indexing
- Use `*.dextrades.proto` for DEX-specific indexing (faster, less data)
- Use `*.tokens.proto` for token transfer indexing
- Use `*.broadcasted.*` topics for mempool-level data (lower latency; see the subscription example below)
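
For instance, a single consumer can subscribe to several concrete topics at once. The `eth.*` names below are examples following the wildcard patterns above; consult the topic list linked earlier for the exact names available:

```python
from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "kafka.example.com:9093",  # placeholder
    "group.id": "my-indexer",
    "enable.auto.commit": False,
})

# Example topic names following the patterns above; check the topic
# list linked earlier for the exact names available on your plan.
consumer.subscribe([
    "eth.dextrades.proto",  # decoded DEX trades
    "eth.tokens.proto",     # token transfers
])
```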