Observability in financial systems is not just an engineering convenience. It is a regulatory necessity. When a trade fails to settle, when a risk limit breach goes undetected for even a few minutes, or when a compliance report contains unexplained data gaps, the consequences range from client losses to formal regulatory censure. Building systems that process hundreds of thousands of events per second is hard enough. Building those same systems so that engineers and compliance teams can fully understand their behavior in real time, and reconstruct that behavior historically, is an entirely different and often underestimated challenge.
Over the course of working on trade processing infrastructure at scale, I have found that observability must be treated as a first-class architectural concern, not something added after the fact. The choice of tooling, the structure of log data, and the way traces are propagated across service boundaries all have downstream consequences that are very difficult to unwind once a system is in production.
Structuring Logs for Financial Audit Requirements
The first thing that separates financial system logging from general application logging is the concept of an immutable audit trail. Regulations such as MiFID II in Europe and SEC Rule 17a-4 in the United States require that certain records be retained in a non-rewritable, non-erasable format for defined periods, typically between three and seven years depending on the record type and jurisdiction. This means log pipelines cannot simply write to a rotating file or a standard Elasticsearch index that allows document updates and deletes.
In practice, we separate logs into two categories. Operational logs are used for real-time debugging, alerting, and performance monitoring. These live in Elasticsearch or Splunk with relatively short retention windows and full indexing for fast search. Compliance logs capture the business-meaningful events: order submissions, trade executions, cancellations, and risk decisions. These are written in append-only fashion to object storage such as S3 with Object Lock enabled, which provides WORM (Write Once Read Many) semantics, and a secondary index is maintained in Splunk or Elasticsearch purely for search and retrieval. The authoritative record always lives in the immutable store.
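The split between the two categories can be sketched in a few lines of Python. This is a minimal illustration, not our production pipeline: the event type strings, the event_type / event_id / correlation_id field names, and the AppendOnlyStore class (an in-memory stand-in for an object store with write-once semantics) are all hypothetical.

```python
import json

# Event types treated as compliance-relevant. The categories come from the
# text above; the exact string values are hypothetical.
COMPLIANCE_EVENT_TYPES = {
    "order_submitted",
    "trade_executed",
    "order_cancelled",
    "risk_decision",
}


class AppendOnlyStore:
    """In-memory stand-in for an immutable (write-once) object store."""

    def __init__(self):
        self._objects = {}

    def put(self, key, body):
        # Refuse to overwrite: an existing object can never be rewritten.
        if key in self._objects:
            raise PermissionError(f"object {key!r} already exists")
        self._objects[key] = body

    def get(self, key):
        return self._objects[key]


def route_event(event, store, search_index, operational_sink):
    """Send an event to the compliance store or to the operational sink."""
    # Canonical serialization so the stored bytes are reproducible.
    line = json.dumps(event, sort_keys=True, separators=(",", ":"))
    if event["event_type"] in COMPLIANCE_EVENT_TYPES:
        key = f"{event['event_type']}/{event['event_id']}.json"
        store.put(key, line)  # authoritative, immutable record
        # The secondary index only points at the record; it is not the record.
        search_index.append(
            {"key": key, "correlation_id": event.get("correlation_id")}
        )
        return "compliance"
    operational_sink.append(line)  # short-retention, fully indexed logs
    return "operational"
```

The point of the design is that the search index holds only a pointer and a few searchable fields, so it can be rebuilt or dropped without affecting the authoritative copy.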
Splunk vs. ELK: Choosing the Right Tool for Each Job
A question I encounter frequently is whether to standardize on Splunk or the ELK stack (Elasticsearch, Logstash, Kibana). In my experience, this is a false choice in larger financial organizations. The two tools serve different audiences and different use cases well enough that running both in parallel is often justified.
Splunk excels at ad-hoc investigation by non-engineering users. Compliance officers, risk managers, and operations staff can write SPL queries against Splunk without understanding the underlying data schema intimately. Splunk’s alerting framework is mature and integrates with ticketing and incident management systems that financial firms already have in place. The licensing cost is significant, which is why most teams limit Splunk ingestion to high-value data: trade events, system alerts, authentication logs, and anything touching client data.
ELK, particularly when managed via Elastic Cloud or a self-hosted deployment with the Elastic Cloud on Kubernetes (ECK) operator, handles much higher ingestion volumes at lower cost. We route high-frequency operational logs, application metrics, and infrastructure telemetry into Elasticsearch. Engineers use Kibana for dashboards and Discover queries during incident response. The trade-off is that ELK requires more engineering investment to operate reliably at scale, including index lifecycle management, shard sizing, and rollover policies tuned to query patterns.
Distributed Tracing Across a Multi-Service Trade Pipeline
Logs and metrics answer what happened and how often. Distributed tracing answers why a specific request was slow and exactly which services were involved. For a trade that traverses an order management system, a pre-trade risk engine, an exchange gateway, and a post-trade allocations service, a single trace can show the full journey with per-service latency broken down at the span level.
We instrument services using the OpenTelemetry SDK, which provides a vendor-neutral API for emitting traces, metrics, and logs. Trace context is propagated through Kafka message headers using the W3C Trace Context standard, so a trace that begins when an order enters the system continues seamlessly as events flow through Kafka topics and are consumed by downstream services. This detail is easy to overlook when first adopting distributed tracing: if context is not explicitly propagated through message headers, each consumer starts a new, disconnected trace and the end-to-end picture is lost.
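To make the propagation step concrete, here is a hand-rolled sketch of carrying a W3C traceparent value through message headers. It is for illustration only: in a real service the OpenTelemetry propagators would do this work, and the function names below are hypothetical. Headers are modeled as a list of (key, bytes) tuples, which is an assumption about the client library's header shape.

```python
import re
import secrets

# traceparent layout: version "00", 32-hex trace-id, 16-hex parent-id,
# 2-hex trace-flags, separated by dashes.
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent(sampled=True):
    """Start a new trace at the point where an order enters the system."""
    flags = "01" if sampled else "00"
    return f"00-{secrets.token_hex(16)}-{secrets.token_hex(8)}-{flags}"


def inject(headers, traceparent):
    """Return a copy of the headers carrying exactly one traceparent entry."""
    kept = [(k, v) for k, v in headers if k != "traceparent"]
    kept.append(("traceparent", traceparent.encode("ascii")))
    return kept


def extract(headers):
    """Return (trace_id, parent_span_id, flags), or None if absent or invalid."""
    for key, value in headers:
        if key != "traceparent":
            continue
        match = TRACEPARENT_RE.match(value.decode("ascii", errors="replace"))
        # All-zero trace or span ids are invalid.
        if match and set(match.group(1)) != {"0"} and set(match.group(2)) != {"0"}:
            return match.groups()
    return None


def continue_trace(headers):
    """Build the consumer's traceparent: same trace id, fresh span id."""
    parent = extract(headers)
    if parent is None:
        # No usable context: the consumer starts a new, disconnected trace.
        return new_traceparent()
    trace_id, _parent_span_id, flags = parent
    return f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"
```

The fallback branch in continue_trace is exactly the failure mode described above: when the producer does not inject the header, nothing breaks visibly, but the consumer's spans end up in a separate trace.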
Traces are exported to a Jaeger or Tempo backend, with Grafana used as the query and visualization layer. For high-volume systems, sampling is essential. Recording every trace at full fidelity is prohibitively expensive, so we use a tail-based sampling strategy that guarantees all error traces and all traces exceeding a latency threshold are retained, while sampling normal successful traces at a lower rate. This ensures that the traces most useful for debugging are always available.
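The sampling rule described above can be expressed as a small decision function that runs once a trace is complete. This is a sketch under stated assumptions: the Span fields, the 500 ms threshold, and the 5 percent baseline rate are hypothetical values chosen for illustration, and in practice this logic would typically live in a collector rather than in application code.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class Span:
    service: str
    start_ms: float
    end_ms: float
    is_error: bool = False


def keep_trace(trace_id, spans, latency_threshold_ms=500.0, baseline_rate=0.05):
    """Decide whether to retain a completed trace (tail-based sampling)."""
    if not spans:
        return False
    # Rule 1: always keep traces that contain an error.
    if any(span.is_error for span in spans):
        return True
    # Rule 2: always keep traces whose end-to-end duration exceeds the threshold.
    duration_ms = max(s.end_ms for s in spans) - min(s.start_ms for s in spans)
    if duration_ms > latency_threshold_ms:
        return True
    # Rule 3: keep a deterministic fraction of normal traces. Hashing the
    # trace id means every collector instance makes the same decision.
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return bucket < int(baseline_rate * 10_000)
```

Because the decision needs the whole trace, all spans for a given trace id have to reach the same decision point before the rule is applied, which is the main operational cost of tail-based sampling compared with head-based sampling.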
Correlating Across Systems: The Trade Correlation ID Pattern
The single most impactful observability practice I have implemented in financial systems is enforcing a universal correlation ID that follows a trade from inception through settlement. This ID is generated at the point of order entry and written into every log line, every Kafka message header, every database record, and every outbound API call related to that trade. When something goes wrong, an engineer or compliance analyst can search any system (Splunk, Elasticsearch, or a Jaeger trace) using the correlation ID and immediately see the complete picture of what happened and in what order.
Without this pattern, incident response in a multi-service architecture becomes an exercise in forensic reconstruction, manually joining log lines across systems by timestamp and hoping the clocks are synchronized closely enough. With it, a root cause that previously took hours to identify can often be found in minutes. In regulated environments where regulators may request a full audit trail of a specific trade on short notice, this capability shifts from a nice-to-have to a practical necessity.
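One way to get the correlation ID into every log line without threading it through every function call is to hold it in a context variable and attach it with a logging filter. The sketch below uses only the Python standard library; the logger name, the JSON field names, and the handle_order function are hypothetical.

```python
import contextvars
import json
import logging
import uuid

# Holds the correlation id for whatever trade is currently being processed.
correlation_id_var = contextvars.ContextVar("correlation_id", default="-")


class CorrelationIdFilter(logging.Filter):
    """Copy the current correlation id onto every log record."""

    def filter(self, record):
        record.correlation_id = correlation_id_var.get()
        return True


class JsonFormatter(logging.Formatter):
    """Emit one JSON object per line so the id is a searchable field."""

    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            "correlation_id": record.correlation_id,
        })


def build_logger(stream):
    logger = logging.getLogger("trade_pipeline")
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.addFilter(CorrelationIdFilter())
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]
    return logger


def handle_order(logger, order):
    """Process one order with its correlation id bound for the duration."""
    # Reuse the id assigned at order entry; generate one only if it is missing.
    cid = order.get("correlation_id") or str(uuid.uuid4())
    token = correlation_id_var.set(cid)
    try:
        logger.info("order accepted")
        return cid
    finally:
        # Restore the previous value so ids never leak between orders.
        correlation_id_var.reset(token)
```

The same value would also be copied into outgoing message headers and database rows; the logging piece is shown here because it is the part most often left to individual developers and therefore the part most often missed.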
Alerting with Intent, Not Just Thresholds
The final principle worth emphasizing is that alerting in financial systems needs to reflect business intent, not just technical thresholds. An alert that fires because CPU utilization crossed 80 percent is not inherently useful. An alert that fires because trade confirmation latency exceeded 500 milliseconds for more than 30 consecutive seconds is actionable and has a clear business impact.
We define SLOs (Service Level Objectives) for each stage of the trade pipeline (for example, 99.9 percent of orders should be validated within 100 milliseconds) and build alerts from error budget burn rates rather than raw metric thresholds. This approach produces far fewer false positive pages and ensures that on-call engineers are alerted to conditions that genuinely threaten business outcomes rather than transient infrastructure noise. Combined with structured logs, immutable audit storage, and end-to-end distributed tracing, it forms an observability foundation capable of meeting both the engineering demands and the regulatory expectations of modern financial services.
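As a worked illustration of burn-rate alerting: with a 99.9 percent SLO the error budget is 0.1 percent of events, and the burn rate is the observed bad-event ratio divided by that budget. A burn rate of 1 spends the budget exactly over the SLO period; a burn rate of 5 spends it five times too fast. The sketch below pages only when both a long and a short window are burning fast, which filters out brief spikes. The function names and the default threshold of 14.4 are assumptions for illustration (14.4 corresponds to spending 2 percent of a 30-day budget in one hour), not values taken from our configuration.

```python
def burn_rate(bad_events, total_events, slo_target):
    """Observed bad-event ratio divided by the error budget (1 - SLO target)."""
    if total_events == 0:
        return 0.0
    error_budget = 1.0 - slo_target
    return (bad_events / total_events) / error_budget


def should_page(long_window, short_window, slo_target=0.999, threshold=14.4):
    """Page only if both windows exceed the burn-rate threshold.

    Each window is a (bad_events, total_events) pair, for example the
    number of orders validated slower than 100 ms and the total number
    of orders seen in that window.
    """
    long_burn = burn_rate(*long_window, slo_target)
    short_burn = burn_rate(*short_window, slo_target)
    return long_burn >= threshold and short_burn >= threshold
```

For example, if 5 of the last 1,000 orders missed the 100 millisecond target, the bad-event ratio is 0.5 percent and the burn rate is roughly 5, which is well above sustainable but below the paging threshold assumed here.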




