
Security Observability: Principles & Best Practices at Scale

Learn the essential principles of security observability and explore the best practices for implementing observability techniques on a large scale.

Onum
2025-02-06 · 21 min read

Distributed systems require an active approach to security monitoring and threat detection. The discipline of security observability provides the insights and framework needed to protect complex applications and infrastructure. 

This article explores the essential principles of security observability and provides practical guidance for implementing observability techniques on a large scale.

Summary of key security observability concepts

Security observability architecture

Understand how a layered observability architecture enables real-time edge collection, detection, filtering, and analysis while still supporting central processing and analysis in a data warehouse or data lake.

Security observability signals

Observability signals are critical early indicators driving proactive prevention and response. Metrics, Events, Logs, and Traces (MELT) are collected from various data sources as telemetry data.

Security observability data sources

Different types of telemetry data are contained in distributed data sources across applications, networks, systems, and databases. This data must be processed, parsed, and analyzed to create signals and alerts.

Security observability telemetry

Observability telemetry means collecting signals from all corners of your infrastructure and efficiently delivering them for analysis. Techniques like edge collection and smart data reduction manage cost and efficiency by pre-processing telemetry data.

Security operations center (SOC) tools

SOC teams integrate SIEM (Security Information and Event Management) and SOAR (Security Orchestration, Automation, and Response) tools into their processes and observability stack.

Security observability techniques

Observability spans the lifecycle from development through production, threat response, and ongoing compliance. All teams, from cyber and operations to platform and development, must identify where to introduce checkpoints, procedures, and automation pipelines to enable observability.

Security observability best practices

Best practices include filtering and processing at the edge, leveraging AI analysis tools, handling alerts wisely, and optimizing pipelines.

Security observability architecture

A modern security observability architecture designed to process a high volume of real-time data consists of three layers: the edge collection layer, the telemetry transport layer, and the central analysis layer.

The edge collection layer deploys specialized collectors for different data sources. For example, application collectors gather live metrics from running services, Kubernetes collectors capture container and cluster logs, cloud-focused collectors track traces across distributed services, and network collectors monitor security events. This distributed collection approach ensures complete coverage of the security footprint while operating at the edge to minimize data transport overhead.

The telemetry transport layer uses a secure transport pipeline, often based on the OpenTelemetry Protocol (OTLP) and TLS encryption, to move the data you wish to observe reliably and securely. This layer is a unified highway for all telemetry data, ensuring secure and efficient transport from collection points to analysis systems while maintaining data integrity and confidentiality.
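
To make the transport layer concrete, here is a minimal sketch of a service exporting spans over OTLP/gRPC with TLS using the OpenTelemetry Python SDK; the collector endpoint, CA certificate path, and span attributes are illustrative assumptions, not a prescribed configuration:

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from grpc import ssl_channel_credentials

# Verify the collector's TLS identity with a trusted CA certificate
with open("/etc/ssl/certs/collector-ca.pem", "rb") as ca_file:
    credentials = ssl_channel_credentials(root_certificates=ca_file.read())

# Batch spans locally and ship them to a hypothetical regional collector
exporter = OTLPSpanExporter(endpoint="collector.internal:4317", credentials=credentials)
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(exporter))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("auth-service")
with tracer.start_as_current_span("login_attempt") as span:
    span.set_attribute("user.id", "bob")
    span.set_attribute("source.ip", "192.168.1.100")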

The central analysis layer contains four key components that process the collected data:

  • SIEM integration processes and correlates security events.

  • Threat detection analyzes patterns to identify potential security incidents.

  • Alert management handles notification and escalation workflows.

  • Compliance monitoring ensures that regulatory requirements are met.

It’s worth noting that analysis in the most advanced systems will not always be centralized and can be pushed directly into the collection layer. This layered architecture enables real-time (and near-real-time) threat detection while supporting compliance requirements through comprehensive data audit trails. Each layer is designed to scale horizontally, allowing the system to handle growing data volumes while maintaining consistent performance. The clear separation of concerns between collection, transport, and analysis makes this architecture both maintainable and extensible. Adopting such an architecture is not optional but necessary when data collection is measured in milliseconds, as is the case for real-time applications with a global scope.


Security observability signals

Security observability signals are telemetry data points that help detect and investigate security issues. There are many types of signals, which typically fall into four main categories, often abbreviated as MELT:

  • Metrics provide quantitative measurements taken over time. When monitoring security, you'll often track key metrics like the number of failed login attempts per minute, the number of requests from unique IP addresses, or unusual CPU and memory usage patterns (see the sketch after this list). 

  • Events represent discrete activities that change the state of your system. These include user permission changes, container lifecycle changes, resource creation or deletion in cloud environments, and security group modifications. Both successful and unsuccessful authentication attempts are significant security events to track.

  • Logs offer detailed records of system and application activity. Web server access logs show HTTP request patterns, while Kubernetes audit logs track cluster operations. Cloud provider logs like AWS CloudTrail or Azure Activity logs record infrastructure changes. Application security logs capture authentication attempts, and system logs record process execution. Combined, these detailed records become crucial for security forensics.

  • Traces track end-to-end request flows through distributed systems. They show how requests like API calls move through microservices, how user sessions span multiple services, and they illustrate patterns in database queries. Traces help security teams understand service-to-service communication flows and track authentication request paths.
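
As a small illustration of the metrics signal mentioned above, the sketch below derives a failed-logins-per-minute counter from a stream of authentication events; the event fields and values are illustrative:

from collections import Counter
from datetime import datetime

events = [
    {"ts": "2025-01-01T10:15:22+00:00", "type": "login", "result": "failure", "ip": "212.250.113.45"},
    {"ts": "2025-01-01T10:15:40+00:00", "type": "login", "result": "failure", "ip": "212.250.113.45"},
    {"ts": "2025-01-01T10:16:05+00:00", "type": "login", "result": "success", "ip": "192.168.1.100"},
]

# Count failed logins per minute, a common early indicator of brute-force activity
failed_per_minute = Counter()
for event in events:
    if event["type"] == "login" and event["result"] == "failure":
        minute = datetime.fromisoformat(event["ts"]).strftime("%Y-%m-%d %H:%M")
        failed_per_minute[minute] += 1

print(failed_per_minute)  # Counter({'2025-01-01 10:15': 2})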

Security observability data sources

Effective security observability requires collecting data from diverse sources across the technology stack. 

Typical data sources and what they monitor:

  • Firewalls, servers, and workstations: networks, hosts, and user activity

  • Databases, search indexes, and data lakes: data access and integrity

  • Serverless functions (e.g., AWS Lambda): cloud service access, logs, and errors

  • Kubernetes: service availability, deployment success, performance, and logs

  • Application messaging and event streams (e.g., Kafka): application events and data transport

  • Identity and access management (IAM) systems (e.g., Microsoft Active Directory): identity and authorization

  • Web servers: HTTP requests

  • Monitoring and log aggregation tools: system and application events

Here's a detailed look at some examples of these key sources and their telemetry:

Network firewalls are often the first line of defense and generate rich security event streams. A typical firewall log entry might look like this:

timestamp="2025-01-01 10:15:22" src_ip="192.168.1.100" dest_ip="212.250.113.45" src_port="80" dest_port="443" protocol="TCP" action="BLOCK" rule_id="FW_RULE_123" reason="SUSPICIOUS_PATTERN"

Server and workstation telemetry provides critical system-level insights. For example, a Linux host security event might appear if a user attempts to access restricted account files:

Jan 01 10:15:23 webserver sudo: user=bob command=/usr/bin/vi /etc/passwd terminal=pts/0 address=192.168.1.50 res=FAILED
Jan 01 10:15:24 webserver sshd[12345]: Failed password for invalid user admin from 212.250.113.45 port 22 ssh2

Database monitoring captures access patterns and potential threats. A typical PostgreSQL audit log entry might be created if a user attempts to access a table they do not have privileges for:

2025-01-03 15:15:25.123 UTC [12345] user=app_user database=customers ERROR: permission denied for table customer_data
STATEMENT: SELECT * FROM customer_data WHERE credit_card IS NOT NULL;

Kubernetes audit logs track cluster operations that are critical to securing your cluster's admin interfaces. An attempt by an application service account to create a secret in the cluster may look like this:

{
  "kind": "Event",
  "apiVersion": "audit.k8s.io/v1",
  "level": "RequestResponse",
  "stage": "ResponseComplete",
  "requestReceivedTimestamp": "2025-01-02T17:10:22Z",
  "requestURI": "/api/v1/namespaces/default/secrets",
  "verb": "create",
  "user": {"username": "system:serviceaccount:default:app-sa"},
  "sourceIPs": ["212.250.113.45"]
}
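
A collector might screen these audit events for sensitive operations as they arrive. The sketch below flags Secret access by actors outside an allow-list; the allow-list and the severity field are assumptions, not part of the Kubernetes audit API:

import json

ALLOWED_SECRET_ACTORS = {"system:serviceaccount:kube-system:secrets-operator"}  # hypothetical allow-list

def flag_secret_access(raw_event: str):
    event = json.loads(raw_event)
    touches_secrets = "/secrets" in event.get("requestURI", "")
    if touches_secrets and event.get("verb") in {"create", "get", "list", "delete"}:
        actor = event.get("user", {}).get("username", "unknown")
        if actor not in ALLOWED_SECRET_ACTORS:
            return {"severity": "high", "actor": actor, "source_ips": event.get("sourceIPs", [])}
    return None  # nothing suspicious in this event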

AWS Lambda function logs demonstrate serverless security events; with bespoke code, however, the detail available depends on how much logging your developers have included in the function:

Dynamo authentication failed for user: unauthorized_user
REPORT RequestId: 45thd-ee68h0-Jy678 Duration: 800.55 ms Memory Size: 250 MB


S3 bucket access logs capture data access failures, such as a malicious user trying to download a restricted PDF file:

awsexamplebucket [02/Jan/2025:03:44:28 +0000] 192.250.113.41 arn:aws:iam::12345678:user/backup-user 34234511532 "GET /protected-file.pdf" 403 AccessDenied

Identity management systems produce authentication logs. These can be observed, for example, when checking attempts to log in to privileged user accounts:

{
  "timestamp": "2025-01-02T11:01:49Z",
  "event_type": "login_attempt",
  "user_id": "ceo@mycompany.com",
  "source_ip": "192.168.1.120",
  "auth_method": "MFA",
  "result": "success",
  "location": "Moscow, Russia",
  "device_id": "Windows-Laptop"
}
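
Even a successful login like the one above may deserve review when it involves a privileged account and an unusual location. Here is a minimal sketch of such a check; the account list and expected countries are assumptions:

PRIVILEGED_ACCOUNTS = {"ceo@mycompany.com", "cfo@mycompany.com"}   # hypothetical watch list
EXPECTED_COUNTRIES = {"Spain", "United States"}                    # hypothetical baseline

def review_privileged_login(event: dict):
    if event["user_id"] in PRIVILEGED_ACCOUNTS and event["result"] == "success":
        country = event.get("location", "").split(",")[-1].strip()
        if country not in EXPECTED_COUNTRIES:
            return f"review: privileged login for {event['user_id']} from {event['location']}"
    return None

print(review_privileged_login({"user_id": "ceo@mycompany.com", "result": "success", "location": "Moscow, Russia"}))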

Likewise, message queue security events reveal data flow patterns. For example, order and shipping events may be important to correlate with fraudulent or reversed payment attempts: 

{
  "timestamp": "2025-01-01T10:15:30Z",
  "queue": "payment_processing",
  "action": "subscribe",
  "client_id": "payment-processor-1",
  "auth_status": "rejected",
  "reason": "invalid_credentials"
}

Tracing and correlation IDs enable distributed service observability, allowing teams to follow security events across microservices. A trace represents a single user action as it flows through multiple services, while correlation IDs (like trace_id and session_id identifiers) stitch together the related events. Here's a simplified example showing how tracing captures a login attempt as it moves through authentication, cart, and MFA services:

// Key correlation identifiers:
// - trace_id: links the entire transaction
// - session_id: links events across services
{
  "trace_id": "tx789",
  "spans": [
    {
      "service": "auth",
      "event": "login",
      "session_id": "s123",    // Initial session created
      "ip": "192.168.1.100"
    },
    {
      "service": "cart",
      "event": "check",
      "session_id": "s123",    // Same session tracked in cart service
      "score": "high"
    },
    {
      "service": "mfa",
      "event": "notify",
      "session_id": "s123"     // Session ID traced through to MFA service
    }
  ]
}
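
Here is a minimal sketch of how those correlation identifiers can be used to stitch the per-service events back into a single storyline; the field names follow the example above:

from collections import defaultdict

spans = [
    {"service": "auth", "event": "login", "session_id": "s123", "ip": "192.168.1.100"},
    {"service": "cart", "event": "check", "session_id": "s123", "score": "high"},
    {"service": "mfa", "event": "notify", "session_id": "s123"},
]

# Group spans by session so analysts see one narrative instead of three isolated events
by_session = defaultdict(list)
for span in spans:
    by_session[span["session_id"]].append(f"{span['service']}:{span['event']}")

for session_id, path in by_session.items():
    print(session_id, "->", " -> ".join(path))   # s123 -> auth:login -> cart:check -> mfa:notify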

Each of these data sources requires specific collection and filtering strategies. Firewalls typically support the Syslog format or NetFlow, while cloud services often provide API-based collection. Advanced observability platforms like Onum can aggregate diverse formats into a unified data model for analysis. This standardization can be a game changer for quickly finding the right data sources and ensuring that key data is retained.

Security observability telemetry 

Managing security telemetry at scale requires sophisticated data-handling strategies that balance coverage with operational efficiency. Consider a large e-commerce platform processing millions of transactions daily; its edge preprocessing implementation filters out routine health checks and successful CDN cache hits while enriching critical events like payment processing attempts with customer context, geolocation data, and device fingerprints.

Smart sampling may be applied using different strategies based on event criticality. For example, authentication attempts from new IP addresses might be captured at 100%, while routine successful authentications from known corporate networks are sampled at 10%. Database query logs could implement adaptive sampling, where collection rates automatically increase when unusual patterns emerge. This dynamic approach can significantly reduce storage and processing costs while ensuring that suspicious activity is fully captured.
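
A minimal sketch of such a tiered sampling policy follows: keep every authentication attempt from a previously unseen or external address, and sample routine internal ones at 10%. The rates and the corporate-network check are assumptions:

import random

KNOWN_CORPORATE_PREFIXES = ("10.", "192.168.")   # illustrative internal ranges
seen_ips = set()

def should_keep(event: dict) -> bool:
    ip = event["source_ip"]
    first_sighting = ip not in seen_ips
    seen_ips.add(ip)
    if first_sighting or not ip.startswith(KNOWN_CORPORATE_PREFIXES):
        return True                        # always keep new or external sources
    return random.random() < 0.10          # 10% sample of routine internal authentications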

Efficient transport relies on optimized protocols like OTLP with TLS encryption. Organizations can deploy regional collectors that batch and compress telemetry before transmission, potentially reducing network overhead through efficient data encoding and compression. Implementing store-and-forward mechanisms at edge collectors helps handle network interruptions so no security data loss occurs during connectivity issues.
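
Here is a minimal sketch of a store-and-forward edge shipper: batch events, gzip-compress them, and spool the batch to local disk if the upstream endpoint is unreachable. The endpoint URL and spool path are placeholders, and error handling is deliberately simplified:

import gzip
import json
import pathlib
import urllib.request

ENDPOINT = "https://collector.internal/v1/logs"           # hypothetical regional collector
SPOOL_DIR = pathlib.Path("/var/spool/edge-collector")     # local buffer for outages

def ship(batch):
    payload = gzip.compress(json.dumps(batch).encode("utf-8"))
    request = urllib.request.Request(
        ENDPOINT,
        data=payload,
        headers={"Content-Type": "application/json", "Content-Encoding": "gzip"},
    )
    try:
        urllib.request.urlopen(request, timeout=5)
    except OSError:
        # Network or collector failure: keep the compressed batch for later replay
        SPOOL_DIR.mkdir(parents=True, exist_ok=True)
        spool_file = SPOOL_DIR / f"batch-{len(list(SPOOL_DIR.glob('*.gz')))}.gz"
        spool_file.write_bytes(payload)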

A healthcare provider's data lifecycle management system automatically identifies protected health information (PHI) in security logs and applies appropriate retention policies. Its system maintains detailed access logs for seven years, as HIPAA requires, while infrastructure logs unrelated to patient data are archived after six months. The system uses pattern matching to identify and preserve security events related to privileged account access, regardless of age. When working with sensitive personal data, your observability solution must support automated redaction for compliance.
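
A minimal sketch of this kind of policy-driven handling: classify each log line, redact an example PHI-like identifier, and look up the retention period. The patterns and retention values are illustrative only, not compliance guidance:

import re

RETENTION_DAYS = {
    "phi_access": 7 * 365,        # HIPAA-style long retention for PHI access logs
    "privileged_access": None,    # None = preserve regardless of age
    "infrastructure": 180,        # archive after six months
}
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")   # example identifier pattern

def classify_and_redact(log_line: str):
    if "sudo" in log_line or "privileged" in log_line:
        category = "privileged_access"
    elif SSN_PATTERN.search(log_line) or "patient_id" in log_line:
        category = "phi_access"
    else:
        category = "infrastructure"
    redacted = SSN_PATTERN.sub("[REDACTED]", log_line)
    return category, redacted, RETENTION_DAYS[category]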

Security operations center (SOC) tools

Security operations centers must integrate observability into their incident management and response workflows. Let's examine how threat detection, investigation, and response (TDIR) processes handle common security scenarios through the SIEM process (used for event detection) and the SOAR process (used for event response). SIEM identifies suspicious patterns in security data through correlation and analysis, while SOAR notifies responders and takes automated actions based on predefined playbooks to respond to the detected threats.

Consider a potential data exfiltration attempt. The SIEM correlates multiple signals, since the suspicious log entries might be spread across several data sources:

09:04:31 UTC - Database query retrieves 50,000 customer records
09:04:32 UTC - Unusual outbound traffic spike from application server
09:05:33 UTC - S3 bucket access from unrecognized IP address
09:05:34 UTC - Multiple file downloads from internal file share


In an ideal scenario, the SOAR system automatically triggers a response playbook to do the following:

  1. Quarantine the suspected compromised server.

  2. Revoke active session tokens for users who are creating suspicious traffic.

  3. Enable enhanced logging for the affected database.

  4. Create an incident ticket and alert SOC analysts.

For more basic brute-force authentication attacks, the SIEM detection chain might look like this:

10:20:15 UTC - Failed login: user=admin src=212.0.113.100
10:20:16 UTC - Failed login: user=admin src=212.0.113.100
10:20:17 UTC - Failed login: user=admin src=212.0.113.100
[...100 more similar events in 60 seconds...]
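
The underlying detection logic is a sliding-window count of failures per user and source address. A minimal sketch follows; the 60-second window and threshold of 10 are assumptions:

from collections import defaultdict, deque

WINDOW_SECONDS = 60
THRESHOLD = 10
recent_failures = defaultdict(deque)   # (user, ip) -> timestamps of recent failed logins

def on_failed_login(user: str, ip: str, ts: float):
    window = recent_failures[(user, ip)]
    window.append(ts)
    while window and ts - window[0] > WINDOW_SECONDS:
        window.popleft()               # discard failures outside the window
    if len(window) >= THRESHOLD:
        return {"incident": "brute_force", "user": user, "source_ip": ip, "failures": len(window)}
    return None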

The SOAR automated response sequence might be:

  1. Block the source IP address at the firewall.

  2. Enable additional authentication factors for targeted accounts.

  3. Scan for similar patterns.

  4. Generate a security incident report.

More subtle attacks like ransomware threats may involve correlating multiple indirect behavioral signals:

10:30:45 UTC - Multiple file extensions changed rapidly
10:30:46 UTC - High disk I/O from unexpected process
10:30:47 UTC - Network connection to known exploit-linked servers
10:30:48 UTC - Mass file deletion attempts

Here, SOAR can initiate containment procedures:

  1. Isolate affected systems from the network.

  2. Suspend compromised credentials.

  3. Block detected malware signatures.

  4. Trigger backup verification processes.

A good SIEM platform enriches and standardizes raw security data with context. For example, a suspicious login attempt record might be enriched into something like the following:

{
  "timestamp": "2025-01-01T10:45:00Z",
  "event_type": "authentication_failure",
  "user": {
    "id": "cto-user<masked>",
    "department": "Engineering",
    "access_level": "Admin",
    "last_successful_login": "2025-01-01T09:00:00Z"
  },
  "source": {
    "ip": "212.0.113.100",
    "geo_location": "Unknown",
    "known_vpn": false,
    "threat_intel": "Location previously associated with attacks"
  },
  "risk_score": 85,
  "recommended_actions": [
    "Block source IP",
    "Enable step-up authentication",
    "Review user permissions"
  ]
}

This enriched data enables SOAR playbook response rules like this:

IF risk_score > 80 AND source.known_vpn == false THEN
  TRIGGER incident_response_playbook_high_risk
  SET incident_priority = "P1"
  ALERT soc_team_lead
END
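
Expressed in code, the same rule might look like the sketch below; the shape of the returned decision is an assumption rather than a real SOAR API:

def evaluate_login_event(event: dict):
    # Mirrors the rule above: high risk score plus unknown VPN triggers the P1 playbook
    if event["risk_score"] > 80 and not event["source"]["known_vpn"]:
        return {
            "playbook": "incident_response_playbook_high_risk",
            "priority": "P1",
            "notify": "soc_team_lead",
        }
    return None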

SOC teams leverage these integrated systems to implement TDIR in several key ways. They continuously conduct ongoing threat hunting using enriched telemetry data to proactively identify potential compromises. They spot emerging attack trends and refine detection rules by analyzing patterns across historical incidents. Automated response validation and tuning help them ensure that playbooks remain effective as threats evolve. Regular compliance status monitoring and reporting keeps the organization aligned with regulatory requirements. Through careful incident post-mortem analysis, the team continuously learns from past events to strengthen future detection and response capabilities.

Security observability techniques 

Development (code security)

Security starts in development, and you must aim to catch security issues early by scanning code as you write it and identifying vulnerable dependencies in your software supply chain. Automate security testing during builds and deployments and enforce compliance requirements through policy as code and scans. Shift left by reviewing the log output from all development security tools (such as Trivy) before the code reaches production. Ensure that architecture and design teams build in instrumentation and log standards consistently across your applications.

Operations (runtime security)

Monitor applications in production for active threats and analyze system and user behavior patterns to detect compromises. This means ensuring real-time monitoring and alerting are in place (using tools such as Datadog or Prometheus). It’s also important to ensure that logging is configured correctly and is not so noisy that it drives up costs. Real-time security dashboards can be a powerful tool to help focus your team's attention and resources.

Data protection (data security)

Process sensitive information at the edge by redacting it, masking credentials, and filtering out sensitive data with tools like Onum. Onum is an excellent data protection tool because it processes sensitive information directly at the edge, reducing exposure to vulnerabilities. More specifically, to protect data, you must:

  1. Redact data: Automatically remove or obscure information, ensuring confidential data is not exposed during processing or transmission.

  2. Mask credentials: Protect sensitive user credentials by replacing them with masked versions, safeguarding authentication data.

  3. Filter sensitive data: Identify and eliminate sensitive information from data streams before it reaches centralized systems or analytics platforms.

These steps protect sensitive data in real time, minimizing the risks of breaches, ensuring compliance with data protection regulations, and enhancing overall data security.
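
Here is a minimal sketch of these masking and filtering steps applied as a pipeline stage; the patterns are illustrative, and a production pipeline would rely on managed, audited rules:

import re

API_KEY_PATTERN = re.compile(r"(api_key=)\S+")
PASSWORD_PATTERN = re.compile(r"(password=)\S+")
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def protect(record: str) -> str:
    record = API_KEY_PATTERN.sub(r"\1***", record)       # mask credentials
    record = PASSWORD_PATTERN.sub(r"\1***", record)
    return EMAIL_PATTERN.sub("[EMAIL]", record)           # redact personal identifiers

print(protect("login password=hunter2 api_key=AKIA123 user=bob@example.com"))
# login password=*** api_key=*** user=[EMAIL]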

Security observability best practices 

Filter at the edge

It is necessary to filter and deduplicate intelligently at the edge at collection time to avoid drowning in data and exponential costs. This means processing and filtering security data at collection points before transport, dropping irrelevant events, and enriching important ones.
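
A minimal sketch of this drop/deduplicate/enrich step at the collector; the drop list, hashing approach, and enrichment fields are assumptions:

import hashlib
import json

DROP_EVENT_TYPES = {"health_check", "cdn_cache_hit"}   # routine noise not worth shipping
seen_digests = set()

def preprocess(event: dict):
    if event.get("type") in DROP_EVENT_TYPES:
        return None                                    # drop irrelevant events at the edge
    digest = hashlib.sha256(json.dumps(event, sort_keys=True).encode("utf-8")).hexdigest()
    if digest in seen_digests:
        return None                                    # drop exact duplicates
    seen_digests.add(digest)
    if event.get("type") == "payment_attempt":
        # Enrich high-value events before transport (lookups are placeholders)
        event["enrichment"] = {"geo": "pending-lookup", "device_fingerprint": "pending-lookup"}
    return event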

A good example would be a bank whose ATM cameras take pictures of users. The bank can take a live picture or video feed and run the image through an AI model to estimate the sex and approximate age of the person. The result can then be compared with account details within a pipeline to detect potential fraud within milliseconds. If you had to wait until the image landed centrally in a data lake before comparison, the analysis lag might stretch to several minutes, at which point the transaction would already be completed.

Real-time edge filtering tools like Onum are perfect for any high-risk security event that needs to be acted on immediately, such as unauthorized access to a server in a highly restricted network segment or the targeting of a C-suite executive's login through the company's SSO system.

Optimize your pipeline

Deploy collectors strategically close to your sources to minimize latency and reduce network overhead. Use purpose-built protocols that efficiently handle high-volume telemetry data. Implement aggressive compression to manage bandwidth; most security telemetry compresses exceptionally well due to its repetitive nature. For example, a large enterprise environment generating 10 TB of raw security logs daily can often reduce this by 50% through proper compression and deduplication. 

Place particular focus on high-volume sources like authentication logs, network flows, and endpoint telemetry, where data volumes can quickly overwhelm collection systems. Stress test and know your collection limits to ensure you have a scalable solution. This is particularly important for protecting against volume-based attacks like distributed denial-of-service (DDoS) attacks.

Handle alerts wisely

Define severity levels based on business impact and threat context, not just technical indicators. Correlate related alerts to build complete attack narratives rather than investigating isolated events. Combat alert fatigue through automation—start with simple aggregation and escalate to ML-based alert grouping as volumes grow.

For instance, rather than generating separate alerts for failed logins, suspicious process execution, and network connections, correlate these into a single incident showing potential lateral movement. Be sure not to log personal data like email addresses, as this can create complex privacy and retention/removal concerns. Carefully sanitize logged data to remove PII and sensitive information while preserving investigative value.
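
A minimal sketch of that correlation step: group related alerts by host and user and open a single incident when the chain looks like lateral movement. The alert fields and grouping key are assumptions:

from collections import defaultdict

alerts = [
    {"host": "web-01", "user": "svc-app", "type": "failed_login"},
    {"host": "web-01", "user": "svc-app", "type": "suspicious_process"},
    {"host": "web-01", "user": "svc-app", "type": "outbound_connection"},
]

# One incident per (host, user) instead of one page per alert
incidents = defaultdict(list)
for alert in alerts:
    incidents[(alert["host"], alert["user"])].append(alert["type"])

for (host, user), chain in incidents.items():
    if len(chain) >= 3:
        print(f"incident: possible lateral movement on {host} by {user}: {' -> '.join(chain)}")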

Use AI-enhanced tools

Artificial intelligence can be leveraged to enhance both threat detection and response. AI models can analyze log patterns to establish behavioral baselines and flag anomalies, such as unusual login times or data access patterns. Machine learning algorithms can process metrics data to detect subtle deviations that might indicate security threats, e.g., identifying when an application's memory usage pattern suggests potential crypto mining malware. 
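
As a much-simplified illustration of a behavioral baseline, the sketch below flags a metric sample whose deviation from its rolling history exceeds a threshold; production AI-based detection is far more sophisticated than this z-score check:

import statistics

def is_anomalous(history, sample, threshold=3.0):
    mean = statistics.mean(history)
    stdev = statistics.pstdev(history) or 1.0        # guard against zero variance
    return abs(sample - mean) / stdev > threshold

logins_per_hour = [4, 6, 5, 7, 5, 6, 4, 5]            # illustrative baseline
print(is_anomalous(logins_per_hour, 42))              # True: unusual login volume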

In telemetry correlation, AI helps connect seemingly unrelated events across different services and infrastructure components, revealing attack patterns that traditional rule-based systems might miss. For example, AI can correlate a slight increase in DNS queries with changes in network traffic patterns and system calls to identify malicious command-and-control communications.

Stay operationally sharp

Automate common response actions through playbooks and workflows to ensure consistent incident handling. Maintain detailed runbooks that capture tribal knowledge and enable junior team members to handle routine issues. Watch your monitoring pipelines as carefully as production systems because loss of visibility creates dangerous blind spots.

Without robust tooling to handle these operational challenges, security teams often struggle with the following:

  • Manual correlation across disconnected data sources

  • Inconsistent incident response procedures

  • Alert backlogs and missed detections

  • Configuration drift in detection rules

  • Difficulty measuring program effectiveness

Building effective security observability requires significant engineering investment in collection, processing, storage, and analysis capabilities. Modern security observability tools can dramatically reduce this complexity by providing integrated platforms for visibility and response.

Conclusion

Implementing security observability requires careful planning and tool selection. Start with edge filtering to control costs and data volumes. Automate data collection to maintain consistency and reduce workload. Keep all involved teams and systems aware of the critical need to supply the full range of MELT data you need.

Remember that active security observability makes threats visible before they become incidents. Comprehensive visibility pays significant dividends through improved security posture and faster incident response.

Design and build to scale from day one. Filter noise early by identifying the most important types of data. Automate responses where possible and protect your valuable data at the edge.

The true value of security observability emerges when it becomes part of your security team's DNA. Your teams move from reactive firefighting to proactive threat hunting. Engineers build security telemetry into new systems from day one. Leadership gets clear metrics on security progress, and threats become visible before they become incidents or breaches.
