Analyze This: Solana RPC Observability And Its Averages

A single second of Solana RPC lag puts your application two to three slots behind the current state of the network. You are no longer reading the chain—you are reading its recent memory. And here’s the troubling part: it usually looks fine. Responses come back with 200 OK. Latency dashboards are green. The data is just wrong.


Observability on Solana is not about catching crashes. It’s about catching drift—the gradual, silent separation between what your RPC node believes and what the network actually is.

How to instrument observability

Slot freshness tracking

The foundation of any Solana RPC observability stack is continuous monitoring of slot freshness. Poll getSlot() on your RPC endpoint every 200ms and compare the result against a reference source—a direct validator connection, or at minimum two separate paid providers.
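As a sketch, lag measurement reduces to a subtraction plus a polling loop. The function names and the caller-supplied fetchers below are hypothetical—any JSON-RPC client that calls getSlot() on both endpoints will do:

```python
import time

def compute_slot_lag(reference_slot: int, node_slot: int) -> int:
    """Slots the monitored node trails the reference.
    A negative value means the node is ahead of the reference
    (clock skew, or the reference itself is lagging)."""
    return reference_slot - node_slot

def poll_slot_lag(fetch_reference_slot, fetch_node_slot, interval_s: float = 0.2):
    """Generator yielding (timestamp, lag) on each poll tick.
    fetch_reference_slot / fetch_node_slot are caller-supplied callables
    that issue getSlot() against each endpoint (hypothetical wrappers)."""
    while True:
        yield time.time(), compute_slot_lag(fetch_reference_slot(), fetch_node_slot())
        time.sleep(interval_s)
```

Injecting the fetchers keeps the comparison logic testable without a live endpoint; in production they would be thin HTTP wrappers around each provider's getSlot().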

Prometheus metrics for RPC nodes

For self-hosted or managed RPC nodes, Prometheus is the standard collection layer. Key metrics to collect:

| Metric | Type | Alert condition |
|---|---|---|
| solana_rpc_slot_lag | Gauge | > 2 for more than 30s |
| solana_rpc_request_latency_p99 | Histogram | > 1000ms sustained |
| solana_rpc_requests_total{status="error"} | Counter | Error rate > 1% of total |
| solana_rpc_tx_dropped_total | Counter | Any non-zero value during normal operation |
| solana_rpc_tx_landed_rate | Gauge | < 95% landing rate over 5 min window |
| solana_ledger_replay_lag_ms | Gauge | > 400ms (one slot's worth) |
| solana_geyser_stream_delay_ms | Histogram | > 50ms p99 for latency-sensitive subscribers |
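As an illustration, the slot-lag row above can be expressed as a Prometheus alerting rule. The metric name is carried over from the table; the group and alert names are placeholders:

```yaml
groups:
  - name: solana_rpc
    rules:
      - alert: SolanaRpcSlotLag
        # Fires when the node trails the reference by more than 2 slots
        # continuously for 30 seconds, matching the table's alert condition.
        expr: solana_rpc_slot_lag > 2
        for: 30s
        labels:
          severity: critical
        annotations:
          summary: "RPC node {{ $labels.instance }} is more than 2 slots behind the reference"
```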

Grafana dashboard structure

A production Grafana dashboard should be organized into four panels:

| Panel | Metrics displayed | Time range |
|---|---|---|
| Slot health | Slot lag (gauge), slot height vs. reference, fork alignment | Last 15 min, 1s resolution |
| Latency distribution | p50/p90/p99 heatmap per RPC method, tail spike frequency | Last 1 hour, rolling |
| TX pipeline | sendTransaction rate, landing rate %, drop count, 429 error rate | Last 30 min |
| Geyser stream | Stream delay histogram, subscription count, reconnect events | Last 15 min, 200ms resolution |
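The panels might be backed by PromQL queries along these lines—a sketch that assumes the metric names from the earlier table, plus conventional `_bucket` histogram series, which your exporter may name differently:

```promql
# Slot health: current lag against the reference
solana_rpc_slot_lag

# Latency distribution: p99 per RPC method (assumes per-method histogram buckets)
histogram_quantile(0.99, sum by (le, method) (rate(solana_rpc_request_latency_bucket[1m])))

# TX pipeline: landing rate % over the 5 min window (assumes a 0-1 gauge)
avg_over_time(solana_rpc_tx_landed_rate[5m]) * 100

# Geyser stream: p99 stream delay
histogram_quantile(0.99, sum by (le) (rate(solana_geyser_stream_delay_ms_bucket[5m])))
```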

Four failure modes you need to instrument

Before building dashboards, understand exactly how Solana RPC degrades. There are four distinct failure modes, each requiring its own instrumentation.

1. Slot lag

Slot lag is the difference between your RPC node’s current slot and the true tip of the network. A lag of 0–1 slots is acceptable under normal conditions. Consistent lag of 2+ slots means the node is overloaded, poorly peered, or falling behind on ledger replay.

| Slot lag | Meaning | Impact |
|---|---|---|
| 0–1 slots | Normal—within network jitter | No impact |
| 2–3 slots | Node under load or peer delay | Stale reads, potential blockhash issues |
| 4–5 slots | Serious—likely overloaded or forked | TX failures, simulation errors |
| >5 slots | Node effectively unusable | All time-sensitive operations fail |
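A minimal classifier mirroring these bands keeps alerting logic consistent with the table; the function and label names are illustrative:

```python
def classify_slot_lag(lag: int) -> str:
    """Map a slot-lag reading to the severity bands in the table above."""
    if lag <= 1:
        return "normal"     # within network jitter
    if lag <= 3:
        return "degraded"   # stale reads, potential blockhash issues
    if lag <= 5:
        return "serious"    # tx failures, simulation errors
    return "unusable"       # all time-sensitive operations fail
```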

The critical detail: a lagging RPC node can respond to getSlot() with a stale slot number while returning a 200 HTTP response. HTTP latency and data freshness are entirely separate metrics. A heavily cached, overloaded node can reply in 30ms—with data that is 800ms stale.

2. Tail latency (p99)

Averages lie. If your average RPC latency is 50ms but your p99 is 2 seconds, you will fail precisely at the moments when it matters most—during high-volatility periods when the market is moving and every millisecond counts.

A p99 latency of 2 seconds on a bot that fires during volatility events means the slowest 1% of requests—the ones that arrive exactly when the price is moving—miss their window entirely.

| Percentile | What it reveals | Alert threshold |
|---|---|---|
| p50 | Median request latency—baseline health | > 150ms for latency-sensitive apps |
| p90 | 90th percentile—common load behavior | > 300ms sustained over 5 minutes |
| p99 | Worst 1%—behavior under peak load | > 1000ms—indicates infrastructure problem |
| p99.9 | Extreme tail—critical path failures | > 3000ms—immediate investigation |
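The "averages lie" point is easy to demonstrate with a nearest-rank percentile over a hypothetical sample set:

```python
import math

def percentile(samples, q):
    """Nearest-rank percentile: the value at rank ceil(q/100 * n) of the sorted samples."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]

# Hypothetical latencies: 98 fast requests and 2 two-second stragglers.
latencies_ms = [50] * 98 + [2000] * 2
mean_ms = sum(latencies_ms) / len(latencies_ms)
print(mean_ms)                       # 89.0 -- looks healthy on a dashboard
print(percentile(latencies_ms, 99))  # 2000 -- the requests that actually matter
```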

3. Transaction drop rate

sendTransaction() returns a TX signature immediately, before the transaction has been forwarded to the leader, let alone confirmed. A successful HTTP response tells you the RPC node received the transaction. It says nothing about whether it reached the validator.

TXs are dropped silently in several scenarios:

  • The RPC node’s rebroadcast queue is full
  • The node is lagging and forwards the TX to the wrong leader slot
  • The blockhash was fetched from one node in an RPC pool and submitted to a lagging node in the same pool—the blockhash appears unrecognized
  • A temporary network fork causes the TX to reference a blockhash on a minority fork that is later abandoned

Measuring actual TX landing rate requires end-to-end tracking: send a memo transaction, poll getSignatureStatuses() every 400ms, and record whether it lands within 3 slots (~1.2 seconds). Run this test against high-congestion windows (typically 14:00–18:00 UTC) to stress the propagation path.
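The tracking loop can be sketched as follows. Here send_tx and get_status are hypothetical wrappers around the sendTransaction and getSignatureStatuses RPC methods, injected so the timing logic stays testable without a live cluster:

```python
import time

SLOT_MS = 400  # nominal Solana slot time

def track_landing(send_tx, get_status, max_slots: int = 3, poll_ms: int = 400) -> bool:
    """Submit one probe transaction and poll until it lands or the window expires.
    Returns True if a signature status appears within max_slots worth of time."""
    sig = send_tx()
    deadline = time.monotonic() + (max_slots * SLOT_MS) / 1000
    while time.monotonic() < deadline:
        if get_status(sig) is not None:  # any confirmation status counts as landed
            return True
        time.sleep(poll_ms / 1000)
    return False

def landing_rate(results) -> float:
    """Fraction of probe transactions that landed within the window."""
    results = list(results)
    return sum(results) / len(results) if results else 0.0
```

In production, send_tx would submit a memo transaction and get_status would call getSignatureStatuses; the landing rate over a batch of probes feeds the solana_rpc_tx_landed_rate gauge.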

4. Geyser stream drift

For bots and applications consuming Yellowstone gRPC, the relevant metric is not HTTP latency—it is the delay between an account state change on the validator and when your subscriber receives the update. This is Geyser stream drift.

Geyser drift compounds with slot lag: if your node is 2 slots behind and your Geyser subscription has 40ms stream latency, you are effectively 800ms + 40ms behind the tip. For arbitrage bots, that margin is the difference between landing and missing.
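The arithmetic is worth making explicit. At a nominal 400ms slot time (the function name here is illustrative):

```python
SLOT_MS = 400  # nominal Solana slot time

def effective_staleness_ms(slot_lag: int, stream_delay_ms: float) -> float:
    """Total time behind the tip: slot lag converted to ms, plus Geyser stream delay."""
    return slot_lag * SLOT_MS + stream_delay_ms

# The example from the text: 2 slots behind with 40ms stream latency.
print(effective_staleness_ms(2, 40))  # 840
```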

| Data path | Typical latency | Notes |
|---|---|---|
| Geyser gRPC (local, tuned) | < 10ms | Sub-slot freshness; requires a dedicated node |
| Yellowstone via provider | 10–50ms | Depends on provider peering and node load |
| WebSocket subscription | 100–300ms | Filtered but slower; degraded under congestion |
| HTTP polling (getAccountInfo) | 100–500ms+ | Worst option; never use for time-sensitive data |

What most teams get wrong


Three patterns appear repeatedly when teams build Solana RPC observability and get it wrong:

Measuring HTTP latency instead of data freshness

A getSlot() call that returns in 30ms is useless if the slot number is stale. HTTP response time and data freshness are orthogonal. A heavily cached RPC node will respond instantly with data from three slots ago. The only way to detect this is to compare the returned slot against a reference source—not against your own response time baseline.

Trusting sendTransaction() success responses

The sendTransaction() method returns a transaction signature immediately upon receipt by the RPC node—not upon forwarding to the leader, not upon inclusion in a block. Most teams treat a 200 response as confirmation that the TX is in flight. It is not. The only valid measure of transaction health is landing rate: how many TXs submitted actually appear on-chain within 3 slots.

Testing only during off-peak hours

Any RPC provider looks good at 03:00 UTC. The metrics that determine whether your infrastructure is viable are the ones collected during peak congestion—typically during high-volume trading sessions, NFT launches, or major protocol events. Build your baseline during normal conditions, then specifically run stress tests during known high-activity windows.


About the Author

Michael Kahn

Founder & Editor
