---
sidebar_position: 2
---

# Real-Time Blockchain Indexer: Build Reliable Indexers with Kafka Streams Instead of Archive Nodes, gRPC, or Webhook Services

Building a real-time blockchain indexer is one of the most challenging infrastructure tasks you face when tracking trades, monitoring token transfers, parsing internal calls, or building a comprehensive on-chain analytics platform. If you decide to run your own archive node, you need to plan for large SSD storage (multiple terabytes), fast disks with high IOPS, and robust backup and monitoring. If not, you are probably looking for reliable, low-latency access to blockchain data at scale.

The popular approaches are well known: running your own archive nodes, setting up gRPC indexers with Geyser plugins (for Solana), using webhook-based services like Helius, relying on third-party RPC providers, or using graph-based indexers.

**So why do we need another option?**

## The Challenge of Blockchain Indexing

Blockchain indexing requires processing massive volumes of data in real time. Whether you're building a custom indexer for transaction traces, internal transactions, or on-chain data extraction, consider these requirements:

- **High Throughput**: Ethereum packs hundreds of transactions into every ~12-second block, while Solana routinely processes thousands of transactions per second
- **Zero Data Loss**: Missing a single transaction can break your indexer's consistency
- **Low Latency**: For trading bots, MEV applications, and real-time dashboards, every millisecond counts
- **Data Completeness**: You need both raw blockchain data and enriched, decoded information, including internal transactions that don't emit events
- **Reliability**: Your indexer must handle network issues, node failures, and data gaps gracefully
- **Historical Backfilling**: You need to process historical blocks while maintaining a live subscription to new blocks

Traditional indexing approaches struggle with these requirements:

### The Archive Node Management Problem

**What is an Archive Node?**

An archive node is configured to retain all historical state data, for example:

- `--pruning=archive`: Maintains all states in the state-trie (not just recent blocks)
- `--fat-db=on`: Roughly doubles storage by storing additional information to enumerate all accounts and storage keys
- `--tracing=on`: Enables transaction tracing by default for EVM traces

This spends massive disk space to avoid expensive re-computation of historical state: essentially a full node with a "super heavy cache" enabled.

**The Infrastructure Reality**

- **Storage growth is massive**: Running an archive node often requires many terabytes (for Ethereum, often > 10 TB), and it grows over time.
- **Disk performance / IOPS bottlenecks**: As chain history grows, read/write performance becomes critical; archive nodes tend to be much slower unless powerful SSDs or optimized storage are used.
- **Synchronization is slow and resource-intensive**: Bootstrapping (a full sync) can take days or weeks; replaying chain history is compute-heavy.
- **Maintenance overhead and cost**: Archive nodes often require dedicated hardware, monitoring, and careful storage planning. This makes them costly and hard to manage for small teams or projects.
- **Operational complexity / configuration risk**: Proper configuration (e.g. pruning/`gcmode=archive`, snapshot management, backups, disk planning) is necessary; a misconfiguration can lead to data loss or an unusable node.

**The Bottom Line**

Running an archive node is not a matter of hours or days—it's a matter of **weeks** even with enterprise hardware. The infrastructure requirements are substantial:

- **Storage**: Multiple terabytes of fast SSD storage (often more than 10 TB for Ethereum), and growing
- **Time**: Weeks of continuous syncing
- **Performance**: Degrades significantly as the database grows
- **Maintenance**: Constant monitoring and intervention required

Bitquery Kafka streams eliminate all of these challenges by providing pre-synced, maintained archive node data through a managed streaming service.

### Limitations of gRPC Indexers and Webhook-Based Services

Relying on gRPC indexers (like Solana's Geyser plugin approach), webhook-based services (like Helius), or third-party RPC providers introduces different problems that Bitquery Kafka streams solve:

- **gRPC Complexity**: Setting up gRPC indexers requires running validators with plugins (like Geyser for Solana), which is resource-intensive and complex—Bitquery Kafka eliminates this need
- **Webhook Reliability**: Webhook-based services can miss events during downtime, have delivery failures, and lack replay capabilities—Bitquery Kafka's retention solves this
- **Rate Limiting**: Most providers enforce strict rate limits that can throttle your indexing speed—Bitquery Kafka has no rate limits
- **Bandwidth Costs**: Many providers charge based on data transfer, making high-volume indexing expensive—Bitquery Kafka offers predictable pricing without bandwidth charges
- **Reliability Issues**: RPC endpoints, gRPC streams, and webhooks can go down, rate-limit you, or provide inconsistent data—Bitquery Kafka provides enterprise-grade reliability
- **Data Gaps**: If your indexer crashes or loses connection, you may miss transactions with no way to replay—Bitquery Kafka's 24-hour retention allows you to replay missed data
- **Transaction Trace Limitations**: Many services don't provide full transaction traces or internal transaction data—Bitquery Kafka includes comprehensive transaction data

## Why Bitquery Kafka Streams Excel for Blockchain Indexing

Bitquery's Kafka streams are designed as an alternative to running your own archive nodes, setting up gRPC indexers with Geyser plugins, using webhook-based services, or relying on RPC providers for blockchain indexing. Compared with these traditional approaches, they provide several critical advantages:

### 1. Built-in Data Retention and Replay

**Bitquery Kafka streams' retention mechanism is a game-changer for blockchain indexing.**

Unlike RPC providers or WebSocket subscriptions that lose data on disconnect, Bitquery's Kafka streams retain messages for 24 hours. This means:

- **No Data Loss**: If your indexer crashes or needs to restart, you can resume from where you left off
- **Gap Recovery**: You can replay messages from any point within the retention window
- **Testing and Debugging**: You can reprocess historical data to test your indexing logic
- **Checkpoint Management**: Bitquery's Kafka consumer groups track your position, ensuring you never miss a message

This retention capability is especially valuable when:

- Your indexer needs maintenance or updates
- You discover a bug and need to reprocess data
- Network issues cause temporary disconnections
- You want to test new indexing logic against recent historical data

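Here is a minimal sketch of what checkpointed consumption can look like with the Python `confluent-kafka` client. The broker address, topic name, and `index_message` function are placeholders rather than Bitquery's actual endpoints; use the connection details and topic list Bitquery provides.

```python
from confluent_kafka import Consumer

def index_message(payload: bytes) -> None:
    """Hypothetical indexing step: decode the protobuf payload and write to your store."""
    print(f"indexed {len(payload)} bytes")

# Placeholder connection settings; substitute the credentials Bitquery issues to you.
consumer = Consumer({
    "bootstrap.servers": "kafka.example.com:9093",  # hypothetical broker address
    "group.id": "my-indexer",                       # the consumer group that tracks your offsets
    "auto.offset.reset": "earliest",                # on first run, start at the oldest retained message
    "enable.auto.commit": False,                    # commit only after a message is fully indexed
})

consumer.subscribe(["eth.dextrades.proto"])         # illustrative topic name

try:
    while True:
        msg = consumer.poll(1.0)
        if msg is None:
            continue
        if msg.error():
            print(f"Kafka error: {msg.error()}")
            continue

        index_message(msg.value())
        consumer.commit(message=msg, asynchronous=False)  # checkpoint: a restart resumes after this offset
finally:
    consumer.close()
```

Because offsets are committed per consumer group, restarting the same `group.id` within the 24-hour retention window resumes right after the last committed message, with no manual bookkeeping required.
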
### 2. More Data Than Raw Nodes or Archive Nodes

**Bitquery's Kafka streams provide more enriched data than raw blockchain nodes or archive nodes.**

Raw nodes and archive nodes give you basic transaction data, but Bitquery's streams include:

- **Decoded Data**: Smart contract calls are already decoded using ABI information
- **Transaction Traces**: Full transaction traces and internal transaction data without needing `debug_traceBlockByNumber`
- **Enriched Metadata**: Token names, symbols, decimals, and USD values are included
- **Protocol-Specific Parsing**: DEX trades, real-time balances, liquidity pool changes, and protocol events are pre-parsed
- **Internal Transactions**: Native ETH transfers and internal calls that don't emit events are included
- **Cross-Chain Consistency**: Same data structure across all supported blockchains
- **Both Raw and Decoded**: Access to both raw blockchain data and enriched, structured formats

This means you spend less time on block parsing, transaction trace extraction, and data processing, and more time building features. Instead of:

1. Fetching raw transaction data from archive nodes
2. Decoding function calls and parsing calldata
3. Parsing event logs
4. Extracting internal transactions that don't emit events
5. Looking up token metadata
6. Calculating USD values
7. Building historical backfilling pipelines

You receive all of this pre-processed and ready to use in a unified data feed.

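To make the contrast concrete, here is a rough sketch of the kind of record a decoded stream hands you, so the indexer itself reduces to a single write. The field names below are illustrative only and are not Bitquery's actual protobuf schema; the real message layout comes from the .proto files Bitquery publishes.

```python
from dataclasses import dataclass

# Illustrative shape only; the real fields are defined by Bitquery's published .proto schemas.
@dataclass
class DecodedDexTrade:
    transaction_hash: str
    dex_protocol: str         # protocol/router that executed the trade, already identified
    sell_token_symbol: str    # token metadata already resolved, no extra RPC lookups
    sell_amount: float
    buy_token_symbol: str
    buy_amount: float
    price_usd: float          # USD value already calculated upstream
    from_internal_call: bool  # trades surfaced from internal calls, not just event logs

def index_trade(trade: DecodedDexTrade) -> None:
    # With decoding, metadata lookup, and pricing done upstream,
    # the indexer reduces to a single write into your own store.
    print(f"{trade.sell_amount} {trade.sell_token_symbol} -> "
          f"{trade.buy_amount} {trade.buy_token_symbol} (~${trade.price_usd:,.2f})")
```
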
### 3. Zero Infrastructure Management

**You don't need to manage any nodes or infrastructure.**

With Bitquery's Kafka streams:

- **No Archive Node Setup**: No need to sync, maintain, or upgrade archive nodes (which require significantly more resources than full nodes)
- **No gRPC Indexer Configuration**: No need to set up Geyser plugins or validator-level indexing for Solana
- **No Webhook Infrastructure**: No need to build webhook endpoints or handle webhook delivery failures
- **No Bandwidth Management**: All data transfer happens through Bitquery Kafka, with no per-request bandwidth limits
- **No Scaling Headaches**: Bitquery Kafka handles the scaling automatically
- **No Maintenance Windows**: Bitquery manages uptime, redundancy, and failover

This is particularly important for indexing because:

- **Bandwidth Efficiency**: Traditional RPC-based indexing can consume massive bandwidth. With Bitquery Kafka, you consume data once and process it efficiently
- **Cost Predictability**: Direct Kafka access pricing means no surprise bandwidth bills
- **Focus on Logic**: Spend your time building indexing logic, not managing infrastructure

### 4. Enterprise-Grade Reliability

**Bitquery's Kafka infrastructure is built for mission-critical blockchain indexing.**

Unlike RPC providers that can go down or rate-limit you, Bitquery's Kafka streams provide:

- **At-Least-Once Delivery**: Guarantees that every message is delivered at least once
- **Automatic Failover**: If one broker fails, others take over seamlessly
- **Consumer Groups**: Multiple consumers can share the load, with automatic rebalancing
- **Partitioning**: Data is distributed across partitions for parallel processing

For blockchain indexing, this means:

- **No Lost Transactions**: Even if your consumer crashes, messages are retained and can be replayed
- **Horizontal Scaling**: Add more consumer instances to process data faster
- **Fault Tolerance**: Your indexing system can survive individual component failures

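As a sketch of how consumer groups translate into horizontal scaling in practice (again with a placeholder broker address and an illustrative topic name): run several copies of the same process with an identical `group.id`, and Kafka rebalances partitions across them automatically.

```python
from confluent_kafka import Consumer

def log_assignment(consumer, partitions):
    # Called on every rebalance; shows which partitions this instance now owns.
    print("Assigned partitions:", [p.partition for p in partitions])

def process(payload: bytes) -> None:
    """Hypothetical per-message processing."""
    print(f"processing {len(payload)} bytes")

consumer = Consumer({
    "bootstrap.servers": "kafka.example.com:9093",  # hypothetical broker address
    "group.id": "trade-indexer",                    # use the same group.id on every instance
    "auto.offset.reset": "earliest",
})

consumer.subscribe(["eth.dextrades.proto"], on_assign=log_assignment)  # illustrative topic name

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    # Each instance handles only its assigned partitions,
    # so adding instances raises total throughput.
    process(msg.value())
```
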
### 5. Direct Access, No Limitations

**Bitquery's Kafka pricing model is designed for high-volume indexing.**

- **No Bandwidth Limits**: Consume as much data as you need without worrying about rate limits
- **No Request Limits**: Unlike RPC providers, there are no per-second request caps
- **Predictable Pricing**: Direct Kafka access means predictable costs, not variable bandwidth charges
- **High Throughput**: Process millions of transactions per day without throttling

This is crucial for indexing because:

- **Full Chain Coverage**: Index every transaction, not just a sample
- **Real-Time Processing**: Keep up with blockchain transaction rates
- **No Throttling**: Process data at your own pace without artificial limits

### 6. Both Raw and Decoded Data

**Access to both raw blockchain data and enriched, decoded formats.**

Bitquery provides:

- **Raw Data Streams**: Access to raw blocks, transactions, and logs for custom processing
- **Decoded Data Streams**: Pre-parsed transactions with decoded function calls and events
- **Protocol-Specific Topics**: Separate topics for DEX trades, token transfers, and transactions
- **Flexible Consumption**: Choose the level of data processing that fits your needs

This flexibility allows you to:

- **Start Simple**: Use decoded data for quick prototyping
- **Go Deep**: Switch to raw data when you need custom parsing
- **Mix and Match**: Use different topics for different parts of your indexer

Together, these capabilities let you build robust indexing pipelines that:

- Never lose data (thanks to retention)
- Can replay and reprocess (for data quality)
- Handle historical backfilling while maintaining a live subscription
- Scale horizontally (with consumer groups)
- Extract on-chain data without running archive nodes

## Getting Started with Bitquery Kafka-Based Indexing

Building a real-time indexer with Bitquery's Kafka streams is straightforward:

1. **Get Kafka Access**: Contact Bitquery sales by filling out the [form on the website](https://bitquery.io/forms/api) to obtain Kafka credentials
2. **Choose Your Topics**: Select the topics that match your indexing needs. The full list is available [here](https://docs.bitquery.io/docs/streams/kafka-streaming-concepts/#complete-list-of-topics)
3. **Set Up Consumers**: Create Kafka consumers with proper offset management
4. **Process Messages**: Parse protobuf messages and update your index
5. **Handle Failures**: Use Bitquery Kafka's 24-hour retention to recover from crashes (a sketch of steps 3-5 follows below)

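Putting those steps together, here is a minimal end-to-end sketch in Python with `confluent-kafka`. The broker address, the security settings, the topic name, and the protobuf module are all placeholders: use the credentials Bitquery issues and compile the message classes from the .proto schemas they publish.

```python
from confluent_kafka import Consumer

# Steps 1-3: connect with the credentials from Bitquery and pick your topics.
# Every connection value below is a placeholder, not a real Bitquery endpoint.
conf = {
    "bootstrap.servers": "kafka.example.com:9093",
    "security.protocol": "SASL_SSL",        # assumption: use whatever mechanism Bitquery specifies
    "sasl.mechanism": "SCRAM-SHA-512",
    "sasl.username": "YOUR_USERNAME",
    "sasl.password": "YOUR_PASSWORD",
    "group.id": "my-indexer",
    "auto.offset.reset": "latest",          # or "earliest" to replay the retention window
    "enable.auto.commit": False,
}

consumer = Consumer(conf)
consumer.subscribe(["eth.transactions.proto"])   # illustrative topic name

# Step 4: parse protobuf messages and update your index.
# `transactions_pb2` stands in for a module generated by protoc from Bitquery's schemas.
# import transactions_pb2

try:
    while True:
        msg = consumer.poll(1.0)
        if msg is None:
            continue
        if msg.error():
            print(f"Kafka error: {msg.error()}")
            continue

        payload = msg.value()
        # block = transactions_pb2.TransactionsMessage()   # hypothetical message class
        # block.ParseFromString(payload)
        # ... write the decoded rows into your database ...

        # Step 5: commit only after the write succeeds, so a crash replays this message.
        consumer.commit(message=msg, asynchronous=False)
finally:
    consumer.close()
```
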
## Tutorial Tidbits: Building Real-Time Indexers with Bitquery Kafka

### Why Bitquery Kafka is Ideal for Blockchain Indexing

Bitquery Kafka streams provide several advantages over archive node-based indexing, gRPC indexers, webhook-based services, or RPC-based indexing that make them perfect for building real-time blockchain indexers:

**1. Data Retention for Reliability**

- Bitquery Kafka streams retain messages for **24 hours**, allowing you to recover from crashes or restarts
- If your indexer goes down, you can resume from the last processed offset
- Unlike RPC providers, you don't need to worry about missing transactions during downtime

**2. More Data Than Raw Nodes or Archive Nodes**

- Bitquery's Kafka streams include **decoded smart contract calls** and **enriched metadata**
- **Transaction traces and internal transactions** are included without needing `debug_traceBlockByNumber`
- Pre-parsed DEX trades, token transfers, and protocol events
- Both **raw and decoded data** available in separate topics
- USD values and token metadata included automatically
- Native ETH transfers and internal calls that don't emit events are captured

**3. Zero Infrastructure Management**

- **No archive node setup required**: Bitquery manages all blockchain nodes (including archive nodes)
- **No gRPC indexer configuration**: No need for Geyser plugins or validator-level indexing
- **No webhook infrastructure**: No webhook endpoints or delivery handling needed
- **No bandwidth limits**: Direct Kafka access means no per-request throttling
- **No scaling headaches**: Kafka handles horizontal scaling automatically
- Focus on your indexing logic and on-chain data extraction, not infrastructure maintenance

**4. Cost-Effective for High-Volume Indexing**

- **Predictable pricing**: Direct Kafka access, not variable bandwidth charges
- **No rate limits**: Process millions of transactions per day without throttling
- **Efficient consumption**: Consume data once and process it multiple times if needed

**5. Enterprise-Grade Reliability**

- **At-least-once delivery**: Guarantees no message loss
- **Automatic failover**: Seamless handling of broker failures
- **Consumer groups**: Share load across multiple indexer instances

### Quick Tips for Indexer Development

**Handling Duplicates:**

- Messages may be duplicated in Kafka topics
- Implement idempotent processing: track processed transaction hashes
- Use a fast lookup store to check whether a transaction has already been processed (see the sketch below)

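A small sketch of the idempotent-processing pattern described above. The in-memory set is only for illustration; in production you would typically use Redis or a unique constraint in your database, and the hash extraction depends on the decoded message.

```python
def write_to_index(payload: bytes) -> None:
    """Hypothetical stand-in for the real database write."""
    print(f"stored {len(payload)} bytes")

processed_hashes: set[str] = set()   # illustration only; use Redis or a DB unique constraint in production

def handle(tx_hash: str, payload: bytes) -> None:
    # Duplicates become harmless no-ops: skip anything already indexed.
    if tx_hash in processed_hashes:
        return
    write_to_index(payload)
    processed_hashes.add(tx_hash)    # mark as processed only after the write succeeds
```
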
**Recovery After Crashes:**

- Bitquery Kafka's retention window (24 hours) allows you to replay recent data
- Store your processing state, i.e. the message offset and partition details
- On restart, you can seek to a specific offset if needed (see the sketch below)
- This is a major advantage over RPC providers, gRPC indexers, and webhook-based services, which don't offer replay capabilities

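If you prefer explicit replay rather than relying on committed group offsets, a sketch like the following works, assuming you persisted the last processed partition/offset pairs alongside your indexed data. The broker address, topic name, and offset values are placeholders.

```python
from confluent_kafka import Consumer, TopicPartition

consumer = Consumer({
    "bootstrap.servers": "kafka.example.com:9093",  # hypothetical broker address
    "group.id": "my-indexer-replay",
    "enable.auto.commit": False,
})

# Partition -> last processed offset, persisted with your index before the crash (illustrative values).
last_processed = {0: 1_204_567, 1: 1_198_220}

topic = "eth.transactions.proto"                    # illustrative topic name
assignment = [
    TopicPartition(topic, partition, offset + 1)    # resume just after the last processed message
    for partition, offset in last_processed.items()
]
consumer.assign(assignment)                         # explicit assignment bypasses group rebalancing

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    print(f"replaying partition {msg.partition()} offset {msg.offset()}")  # reprocess as needed
```
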
**Choosing the Right Topic:**

- Use `*.transactions.proto` for comprehensive transaction indexing
- Use `*.dextrades.proto` for DEX-specific indexing (faster, less data)
- Use `*.tokens.proto` for token transfer indexing
- Use `*.broadcasted.*` topics for mempool-level data (lower latency)

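If a single indexer needs several of these feeds, one consumer can subscribe to multiple topics and branch on the topic name. The topic names below follow the patterns above but are illustrative; take the exact names from Bitquery's topic list.

```python
from confluent_kafka import Consumer

def handle_trade(payload: bytes) -> None:
    """Hypothetical DEX trade handler."""
    print(f"trade message: {len(payload)} bytes")

def handle_transfer(payload: bytes) -> None:
    """Hypothetical token transfer handler."""
    print(f"transfer message: {len(payload)} bytes")

consumer = Consumer({
    "bootstrap.servers": "kafka.example.com:9093",  # hypothetical broker address
    "group.id": "multi-topic-indexer",
    "auto.offset.reset": "latest",
})

# Illustrative topic names; check Bitquery's topic list for the chains available to your plan.
consumer.subscribe(["eth.dextrades.proto", "eth.tokens.proto"])

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    if msg.topic().endswith("dextrades.proto"):
        handle_trade(msg.value())
    else:
        handle_transfer(msg.value())
```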