Monitoring SLAs of Market Data Vendors: What to Track and How to Report Outages
Practical SLA metrics and a template to monitor vendor latency, completeness, and freshness — plus steps to correlate outages with market anomalies.
When your market data vendor misses a tick, your app — and your P&L — feels it
If you run trading systems, risk engines, or analytics that depend on vendor feeds, one thing keeps you up at night: not knowing whether a price move was real or caused by a vendor outage or delayed feed. In 2026, teams expect machine-readable provenance, signed timestamps, and OpenTelemetry traces for feeds. This guide gives you a practical SLA template, the exact metrics to instrument (latency, completeness, freshness, availability), and step-by-step playbooks to detect, report, and correlate vendor outages with market events.
Executive summary
Most important: Track latency percentiles (p50/p95/p99), message completeness (sequence gaps, snapshot coverage), freshness (watermark lag), and availability. Instrument both synthetic probes and passive listeners. Store raw feed timelines, compute metrics in real time, and keep a forensic cold store. For outage reports, use a structured template (timeline, symptoms, impact, correlation evidence, root-cause hypotheses, remediation, list of affected instruments) and always include correlated market metrics (volatility, bid/ask spread, trade count).
Why 2026 is different: trends you must design for
- Observability-first SLAs: Buyers now demand machine-readable provenance, signed timestamps, and OpenTelemetry traces for feeds.
- Regulatory and vendor transparency: Following high-profile cloud and platform outages in late 2025 and January 2026 (e.g., spikes in Cloudflare/AWS/X outage reports), exchanges and vendors are under pressure to offer clearer outage metadata. See guidance on hardening CDN configurations and protecting delivery paths.
- Higher granularity expectations: For many applications, millisecond guarantees are no longer good enough — microsecond and deterministic latency profiles are required for market-making and liquidity-sensitive apps. The evolution of cloud-native and edge hosting is part of this shift.
- Hybrid measurement approaches: Teams combine active synthetic probes with passive mirror captures and cross-vendor reconciliation to detect subtle data-quality issues.
Core SLA metrics to include (and how to measure them)
Below are the metrics you should put in contracts and operationalize. Each metric includes a recommended measurement technique and typical SLA targets, with sample alert rules where they add value.
1) Latency (end-to-end)
Definition: Time delta between the exchange event timestamp and the message ingestion time at your consumer boundary.
- How to measure: Ensure vendors publish exchange-event timestamps (or use exchange-provided sequence timestamps). Capture the ingress timestamp on a trusted clock (PTP/NTP-synced). Compute event_time -> ingestion_time deltas; a minimal sketch follows this list.
- Key percentiles: p50, p95, p99, p99.9 (HFT requires p99.9 visibility).
- Typical SLA targets:
- HFT/Market making: p99 < 100 microseconds
- Low-latency execution: p99 < 10 ms
- Reference analytics: p99 < 200 ms
- Sample alert: p99 latency > target for 30s & message rate > baseline -> Severity 2.
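A minimal Python sketch of the delta computation and the alert rule above — not a production streaming job. The field names event_time and ingestion_time (epoch seconds on a PTP/NTP-synced clock) are assumptions; map them to your schema.

import math

def latency_percentiles(messages, pcts=(0.50, 0.95, 0.99, 0.999)):
    """Compute event->ingestion latency percentiles (ms) for one window of messages."""
    lat_ms = sorted((m["ingestion_time"] - m["event_time"]) * 1000.0 for m in messages)
    if not lat_ms:
        return {}
    result = {}
    for p in pcts:
        # Nearest-rank percentile; use a streaming sketch (t-digest / HDR histogram) in production.
        idx = min(len(lat_ms) - 1, math.ceil(p * len(lat_ms)) - 1)
        result[f"p{p * 100:g}"] = lat_ms[idx]
    return result

def should_page(p99_ms, target_ms, msg_rate, baseline_rate, breach_seconds):
    # Mirrors the sample alert: sustained p99 breach while traffic is at or above baseline.
    return p99_ms > target_ms and msg_rate > baseline_rate and breach_seconds >= 30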
2) Freshness / Watermark lag
Definition: Difference between the current wall-clock time and the latest event timestamp observed for a given instrument or feed partition.
- How to measure: Maintain per-partition watermarks; compute max(event_time) per instrument every second (see the sketch after this list).
- Why it matters: A feed can be available but stale (e.g., snapshot not updated), which causes incorrect state in real-time calculations.
- Typical SLA targets: Freshness < 1s for real-time pricing, < 10s for delayed products.
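A sketch of the per-partition watermark bookkeeping described above. Keying by (feed_id, instrument) and epoch-second timestamps are assumptions, not a prescribed layout.

import time
from collections import defaultdict

class WatermarkTracker:
    """Tracks max(event_time) per (feed, instrument) and reports freshness lag."""

    def __init__(self):
        self._watermarks = defaultdict(float)  # key -> latest event_time (epoch seconds)

    def observe(self, feed_id, instrument, event_time):
        key = (feed_id, instrument)
        if event_time > self._watermarks[key]:
            self._watermarks[key] = event_time

    def lag_seconds(self, now=None):
        # Freshness = wall clock minus latest observed event time, per key.
        now = now or time.time()
        return {key: now - wm for key, wm in self._watermarks.items()}

    def stale(self, threshold_s=1.0, now=None):
        # Instruments breaching the freshness SLA (e.g., 1s for real-time pricing).
        return [key for key, lag in self.lag_seconds(now).items() if lag > threshold_s]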
3) Completeness (sequence continuity and snapshot coverage)
Definition: Degree to which the vendor provides an unbroken ordered stream and periodic snapshots without missing sequence numbers.
- How to measure: Track sequence numbers per partition. Compute the gap rate (missing_seq_count / total_messages). Validate snapshot coverage (are snapshot messages present at the advertised cadence?); a coverage-check sketch follows this list.
- Typical SLA targets: Missing sequence rate < 1e-6; snapshot delivery success > 99.99%.
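The sequence-gap computation appears in the code section later in this article; here is a companion snapshot-coverage sketch. It assumes epoch-second snapshot timestamps and a fixed advertised cadence; compare the result against the contracted delivery success (e.g., > 99.99%).

def snapshot_coverage(snapshot_times, window_start, window_end, cadence_s):
    """Fraction of expected snapshot slots in [window_start, window_end) that received a snapshot."""
    expected = int((window_end - window_start) // cadence_s)
    if expected <= 0:
        return 1.0
    covered = set()
    for ts in snapshot_times:
        if window_start <= ts < window_end:
            # Bucket each snapshot into its cadence slot.
            covered.add(int((ts - window_start) // cadence_s))
    return len(covered) / expected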
4) Availability
Definition: Percentage of time the feed meets minimal delivery and authentication requirements (connectivity + auth + heartbeat).
- How to measure: Use synthetic probes (connect, auth, subscribe, receive heartbeat) from multiple geographic points; combine with passive telemetry. An availability roll-up sketch follows this list.
- Typical SLA targets: 99.99% monthly for critical feeds; include SLO windows for maintenance and planned downtime.
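A minimal roll-up of synthetic-probe results into an availability figure, with contracted maintenance windows excluded from the SLO. The (timestamp, ok) probe tuple shape is an assumption.

def monthly_availability(probe_results, maintenance_windows=()):
    """probe_results: iterable of (epoch_ts, ok_bool) from probes (connect+auth+subscribe+heartbeat).

    maintenance_windows: iterable of (start_ts, end_ts) excluded from the SLO per contract.
    Returns availability as the fraction of in-scope probes that succeeded.
    """
    def in_maintenance(ts):
        return any(start <= ts < end for start, end in maintenance_windows)

    in_scope = [(ts, ok) for ts, ok in probe_results if not in_maintenance(ts)]
    if not in_scope:
        return 1.0
    good = sum(1 for _, ok in in_scope if ok)
    return good / len(in_scope)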
5) Data quality (value constraints)
Examples: Negative prices, out-of-range sizes, zero spread, duplicated trade IDs. Compute value-constraint error rate.
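A sketch of per-message value-constraint checks; the field names (price, size, bid, ask, trade_id) are illustrative and must be mapped to your vendor schema. Divide the violation count by total messages to get the value-constraint error rate.

def value_constraint_errors(msg, seen_trade_ids):
    """Return a list of constraint violations for one trade/quote message."""
    errors = []
    if msg.get("price") is not None and msg["price"] <= 0:
        errors.append("non_positive_price")
    if msg.get("size") is not None and not (0 < msg["size"] < 1e9):
        errors.append("size_out_of_range")  # upper bound is an arbitrary sanity limit
    if msg.get("bid") is not None and msg.get("ask") is not None and msg["ask"] <= msg["bid"]:
        errors.append("crossed_or_zero_spread")
    tid = msg.get("trade_id")
    if tid is not None:
        if tid in seen_trade_ids:
            errors.append("duplicate_trade_id")
        seen_trade_ids.add(tid)
    return errors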
6) Throughput & Capacity
Definition: Sustained messages per second and peak messages per second capacity guarantees.
Instrumentation architecture — what to build
Design for both real-time detection and historical forensic analysis. Use three layers:
- Real-time telemetry (hot): Ingest raw feed and emit metrics to a TSDB (Prometheus) and an event stream (Kafka, Pulsar). Compute rolling latency percentiles with streaming frameworks (Flink, ksqlDB).
- Forensic cold store: Archive raw messages (compressed, columnar) in an immutable object store (S3/Blob) with retention aligned to SLA claims (e.g., 90/365/999 days for audits). If you operate object stores, operational security and integrity checks are important—see lessons from running a cloud storage bug-bounty program.
- Correlation & analytics layer: Use a time-series DB for computed metrics and a data warehouse (Snowflake/BigQuery/Athena) for cross-vendor reconciliation and machine-learning anomaly detection.
Sample pipeline (components)
- Ingress: Vendor feed -> Mirror port / logical tap -> Kafka topic per feed/partition
- Processing: Flink job computes latency percentiles and sequence-gap detection, writes metrics to Prometheus and aggregated metrics to Kafka
- Storage: Raw messages to S3 (Parquet), metrics to TimescaleDB/Prometheus, analytics to Snowflake
- Visualization & Alerting: Grafana dashboards + PagerDuty + Slack runbooks
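The processing layer above is typically a Flink or ksqlDB job. As a minimal stand-in for experimentation, here is a Python consumer (kafka-python plus prometheus_client) that records latency observations and counts sequence gaps. The topic name, metrics port, and the JSON fields event_time (epoch seconds) and seq are assumptions.

import json, time
from kafka import KafkaConsumer              # pip install kafka-python
from prometheus_client import Histogram, Counter, start_http_server

LATENCY = Histogram("feed_latency_ms", "event->ingestion latency (ms)",
                    buckets=(0.1, 1, 5, 10, 50, 200, 1000))
GAPS = Counter("feed_sequence_gaps_total", "missing sequence numbers detected")

def run(topic="vendorA-book", brokers="localhost:9092", metrics_port=9100):
    start_http_server(metrics_port)          # Prometheus scrapes this endpoint
    consumer = KafkaConsumer(topic, bootstrap_servers=brokers,
                             value_deserializer=lambda b: json.loads(b))
    last_seq = None
    for record in consumer:
        msg = record.value
        LATENCY.observe((time.time() - msg["event_time"]) * 1000.0)
        seq = msg.get("seq")
        if seq is not None and last_seq is not None and seq > last_seq + 1:
            GAPS.inc(seq - last_seq - 1)     # count every skipped sequence number
        last_seq = seq if seq is not None else last_seq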
How to compute the key metrics — code examples
Below are minimal examples you can drop into your observability stack. These are examples; adapt to your message schema and clocks.
SQL: p99 latency per minute (assumes ingestion_time and event_time columns)
-- Trino/Athena-style syntax; adapt function names to your warehouse.
SELECT
  date_trunc('minute', ingestion_time) AS minute,
  approx_percentile(date_diff('millisecond', event_time, ingestion_time), 0.99) AS p99_ms
FROM raw_feed_messages
WHERE feed_id = 'vendorA-book'
  AND ingestion_time > current_timestamp - interval '1' hour
GROUP BY 1
ORDER BY 1 DESC
LIMIT 60;
Python: detect sequence gaps and compute the missing-sequence rate
def compute_missing_rate(messages):
    """Return the missing-sequence rate for an ordered iterator of (seq_no, ...) tuples."""
    missing = 0
    total = 0
    prev = None
    for seq, *_ in messages:
        total += 1
        if prev is not None and seq > prev + 1:
            # Count every sequence number skipped since the previous message.
            missing += seq - (prev + 1)
        prev = seq
    return missing / total if total else 0.0
JS (Node): simple synthetic probe for availability and freshness
// Requires the 'ws' package (npm install ws); vendorUrl and the message schema are vendor-specific.
const WebSocket = require('ws');

const ws = new WebSocket(vendorUrl);
const start = Date.now();
ws.on('open', () => console.log(`connected in ${Date.now() - start} ms`));
ws.on('message', raw => {
  const obj = JSON.parse(raw);
  const eventTs = new Date(obj.event_time).getTime();
  const lagMs = Date.now() - eventTs;
  if (lagMs >= 1000) {
    // Freshness breach: emit a metric/alert to your monitoring pipeline.
  }
});
ws.on('error', err => console.error('probe failed:', err.message));
Correlating vendor outages with market anomalies
Detecting an outage is only step one. The business impact is felt when missing or delayed feed data changes inferred market state. Here’s how to correlate vendor incidents with market anomalies and produce an actionable outage report.
1) Define market anomaly signals to monitor
- Volatility spike: Realized volatility (1m/5m) increases beyond baseline.
- Orderbook deterioration: Spread widens dramatically; top-of-book depth drops.
- Trade count deviations: Sudden drop to zero trades or burst above 10x baseline.
- Price discontinuities: One-sided ticks away from the time-weighted average price (TWAP), or a cross-vendor price delta > threshold.
2) Correlation strategy
- Align timelines: Use vendor event timestamps and your ingress timestamps. Normalize to a single monotonic clock (PTP preferred).
- Compute cross-vendor deltas: For an instrument, compute the difference between vendorA.price and vendorB.price and the number of instruments with gaps at the same time.
- Measure causality windows: Look at anomalies that begin within a short window (e.g., 0–10s) after a vendor-latency spike or missing-sequence event.
- Use statistical tests: Compute conditional probability P(anomaly | vendor_issue) versus P(anomaly | no_issue). If the relative risk > 2 and p-value < 0.05, escalate for root-cause analysis.
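A sketch of the conditional-probability comparison in the last item, operating on time buckets that have already been flagged for vendor issues and market anomalies; the bucket dict shape is an assumption.

def relative_risk(buckets):
    """buckets: iterable of dicts with boolean flags `vendor_issue` and `anomaly` per aligned time bucket.

    Returns (relative_risk, p_anomaly_given_issue, p_anomaly_given_no_issue), or None if one group is empty.
    """
    with_issue = [b for b in buckets if b["vendor_issue"]]
    without_issue = [b for b in buckets if not b["vendor_issue"]]
    if not with_issue or not without_issue:
        return None
    p_issue = sum(b["anomaly"] for b in with_issue) / len(with_issue)
    p_clean = sum(b["anomaly"] for b in without_issue) / len(without_issue)
    if p_clean == 0:
        return float("inf"), p_issue, p_clean
    return p_issue / p_clean, p_issue, p_clean

For the significance test, pair the ratio with a chi-square or Fisher exact test on the underlying 2x2 counts before escalating.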
3) Example correlation query (conceptual)
-- Count volatility spikes that occur within 10s after any vendor gap
WITH vendor_gaps AS (
  SELECT time_bucket('10s', ingestion_time) AS bucket, count(*) AS gap_count
  FROM gaps
  WHERE feed_id = 'vendorA'
  GROUP BY 1
), vol_spikes AS (
  SELECT time_bucket('10s', ts) AS bucket, count(*) AS spike_count
  FROM vol_signals
  WHERE value > threshold
  GROUP BY 1
)
SELECT v.bucket, v.gap_count, COALESCE(s.spike_count, 0) AS spike_count
FROM vendor_gaps v
LEFT JOIN vol_spikes s ON v.bucket = s.bucket
ORDER BY v.bucket DESC
LIMIT 100;
Outage reporting template — what your post-mortem must include
When you report vendor outages (internally or to clients/regulators), use a structured template so stakeholders can act quickly; a machine-readable skeleton follows the list.
- Header: Vendor name, feed name, instruments affected, reporting owner, incident ID.
- Timeline: precise timestamps (UTC) for first detection, confirmation, remediation start, and remediation end. Include detection method (synthetic probe, customer ticket, exchange alert).
- Symptoms: latency spikes, sequence gaps, stale snapshots, auth failures.
- Impact matrix: list affected systems (executions, risk, analytics) and business impact (estimated P&L exposure, failed trades, delayed reports). Quantify: number of instruments, duration, percent of clients impacted.
- Metrics at incident: p50/p99 latency, missing sequence rate, freshness lag, message rate change, cross-vendor price delta count.
- Correlation evidence: graphs showing vendor metric(s) overlaid with market anomaly metrics (volatility, spread, trade count). Attach CSVs/Parquet slices for auditor review.
- Root cause analysis: hypothesis, vendor confirmation (if available), third-party events (cloud outage), or internal issues (misconfiguration).
- Remediation & timeline: what was done, short-term mitigations (fallback feeds, replay), and long-term fixes.
- Lessons learned & action plan: changes to SLAs, probes, redundancy, runbook updates, and SLA credit if appropriate.
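A sketch of the report skeleton as a serializable Python dataclass. The field names mirror the template above; the shape is a suggestion, not a standard.

from dataclasses import dataclass, asdict
from typing import Dict, List, Optional

@dataclass
class OutageReport:
    incident_id: str
    vendor: str
    feed: str
    reporting_owner: str
    instruments_affected: List[str]
    detected_at_utc: str                  # ISO8601
    confirmed_at_utc: str
    remediation_start_utc: str
    remediation_end_utc: Optional[str]
    detection_method: str                 # synthetic probe / customer ticket / exchange alert
    symptoms: List[str]
    impact: Dict[str, str]                # affected system -> business impact
    metrics_at_incident: Dict[str, float]
    correlation_evidence: List[str]       # links to plots / CSV / Parquet slices
    root_cause: str
    remediation: str
    actions: List[str]

# json.dumps(asdict(report), indent=2) yields the machine-readable record to attach to the incident.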
Sample SLA clauses to negotiate (practical language)
Copy-paste friendly snippets to include in RFPs and contracts.
- Latency: Vendor guarantees p99 end-to-end latency < 10 ms for primary orderbook messages measured at buyer ingress when synchronized to vendor timestamps; measured monthly. Failure to meet this target for more than 0.01% of minutes per month triggers service credits.
- Completeness: Maximum allowed missing sequence rate is 1 in 1,000,000 messages. Any missing-sequence event that impacts > 10 instruments for > 1 minute must be reported via the vendor status API within 5 minutes.
- Freshness: Watermark lag not to exceed 1s for real-time instruments. Staleness beyond threshold must be flagged in the vendor status stream with ISO8601 timestamps.
- Provenance & forensic access: Vendor must retain original message payloads and sequence logs for 90 days and provide machine-readable exports (Parquet / JSONL) within 24 hours upon request.
- Outage metadata: Vendor to publish structured outage events (status API / webhooks) with fields: incident_id, affected_feeds, start_time, estimated_end_time, root_cause_code.
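A small validation sketch for the structured outage events named in the last clause; the webhook transport and ISO8601 timestamp encoding are assumptions.

import json

REQUIRED_FIELDS = {"incident_id", "affected_feeds", "start_time",
                   "estimated_end_time", "root_cause_code"}

def validate_outage_event(payload: bytes):
    """Parse a vendor status webhook body and report any contracted fields that are missing."""
    event = json.loads(payload)
    missing = sorted(REQUIRED_FIELDS - event.keys())
    return event, missing

# Illustrative payload shape:
# {"incident_id": "VEND-2026-017", "affected_feeds": ["book", "trades"],
#  "start_time": "2026-01-16T09:30:00Z", "estimated_end_time": "2026-01-16T10:00:00Z",
#  "root_cause_code": "UPSTREAM_CDN"}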
Redundancy & fallback strategies
Even with perfect SLAs, plan for graceful degradation.
- Multi-vendor redundancy: Subscribe to at least two independent vendors for critical instruments; implement fast cross-vendor reconciliation to switch quoting sources. Field reviews of edge message brokers highlight tradeoffs in offline sync and resilience that you should weigh when placing distributed probes.
- Store-and-replay: Keep a warm cache of recent snapshots to serve while vendor feed recovers; replay gap fills into risk calculators.
- Graceful degradation: Configure algo behavior to widen spreads, reduce aggressiveness, or pause execution when freshness > threshold.
- Automated failover: Automate failover rules with canary checks (e.g., require 3 consecutive good snapshots from fallback before switching live quoting).
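A sketch of the canary gate from the last bullet; the snapshot health-record shape (a complete flag plus freshness_s) is an assumption.

def ready_to_failover(recent_snapshots, required_good=3, freshness_threshold_s=1.0):
    """Gate live-quoting failover on N consecutive healthy snapshots from the fallback feed.

    recent_snapshots: list of dicts (newest last) with `complete` and `freshness_s` fields.
    """
    if len(recent_snapshots) < required_good:
        return False
    tail = recent_snapshots[-required_good:]
    return all(s["complete"] and s["freshness_s"] <= freshness_threshold_s for s in tail)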
Real-world example: correlating an AWS/Cloudflare outage (Jan 2026) with spread anomalies
In January 2026, public incident reports noted spikes in outage reports for several CDN and cloud platforms. For teams using cloud-based vendor gateways, the same incident can manifest as a combination of connectivity interruptions and delayed updates. For a focused checklist on monitoring provider failures and detecting outages faster, see our network observability for cloud outages guide.
Actionable steps to investigate:
- Pull vendor heartbeat logs and synthetic probe failures during the window identified by public outage trackers.
- Compute the number of instruments with freshness > target and the simultaneous increase in quoted spreads and TWAP divergence.
- Run conditional probability to estimate the chance that the market anomaly would have occurred absent the vendor issue (control period analysis pre/post incident).
Correlating vendor status with market state reduces false positives: sometimes markets do move, and distinguishing real market moves from data artifacts saves capital.
Operational playbooks — runbook snippets
Build short runbooks that operations and SREs use during incidents. Two minimal examples:
Detection playbook (automated)
- Alert triggers: p99 latency breach or missing_seq_rate > threshold or probe failures > 2 across regions.
- Automated checks: validate vendor status API, attempt reconnect from a secondary POP, perform cross-vendor price delta check.
- Initial action: mute automated trading strategy for affected instruments if freshness > safe threshold.
- Notify stakeholders: PagerDuty + Slack incident channel + send outage report skeleton.
Forensic playbook (post-detection)
- Archive raw messages for the incident window and a buffer (e.g., -10m/+60m).
- Run correlation queries (see examples) and attach plots to the incident record.
- Contact vendor with sequence logs and request annotated forensic export.
- Prepare customer-facing notice with measured impact and mitigation steps.
KPIs to track on your SLA dashboard (operational & business)
- Operational: p50/p95/p99 latency, missing sequence rate per minute, freshness lag distribution, synthetic probe success rate per region.
- Business: # of failed trades due to feed problems, estimated intraday P&L impact, number of clients affected, time-to-detect (MTTD), time-to-repair (MTTR).
Actionable takeaways
- Contractually require machine-readable outage metadata and forensic exports for audits; don’t accept opaque post-incident emails.
- Instrument sequence continuity and watermarks — these capture the two most dangerous failure modes: missing data and stale data.
- Combine active probes and passive capture to detect both reachability and payload integrity.
- Automate correlation between vendor metrics and market signals to quickly separate real market moves from data artifacts.
- Keep raw messages in an immutable store for forensic analysis and regulatory compliance.
Final thoughts (2026 outlook)
In 2026, SLAs for market data are evolving from simple uptime numbers to richer, machine-readable guarantees around latency distributions, completeness, and provenance. Cloud and platform outages remain common-enough risks — the right instrumentation and contractual terms let you detect, mitigate, and attribute incidents quickly. Buyers that treat SLAs as both legal and operational frameworks will win: they reduce outage impact, improve auditability, and extract faster value from data-driven features.
Call to action
Need a ready-to-use SLA metrics dashboard and outage report template tailored to your architecture? Download our SLA monitoring kit (Prometheus dashboards, Flink jobs, SQL queries, and an incident report template) or schedule a short workshop with our engineers to map this template to your feeds. Contact the worlddata.cloud team to start a free pilot and bring vendor SLA monitoring into your CI/CD and observability pipelines. For guidance on building developer-facing observability and CI/CD flows, see our piece on building a developer experience platform.
Related Reading
- Network Observability for Cloud Outages: What To Monitor to Detect Provider Failures Faster
- How to Harden CDN Configurations to Avoid Cascading Failures
- Field Review: Edge Message Brokers for Distributed Teams
- The Evolution of Cloud-Native Hosting in 2026
- Edge+Cloud Telemetry: Integrating RISC-V NVLink-enabled Devices