The Onum Blog
2025-03-25 · 17 min read

Telemetry Cybersecurity: Tutorial & Best Practices

Learn how to implement scalable security telemetry pipelines that efficiently process massive volumes of data for real-time threat detection and prevention

Onum

Picture this: A major financial institution's cybersecurity team receives an alert about suspicious login attempts across its global infrastructure. Within seconds, their telemetry pipeline correlates authentication logs for thousands of users, network traffic patterns across hundreds of subnets, and user behavior data to reveal a coordinated credential-stuffing attack targeting multiple regions.  

This is the power of well-implemented cybersecurity telemetry. It requires a sophisticated pipeline to handle massive data volumes while enabling instant analysis. As shown in the diagram below, telemetry data moves through distinct stages: Raw data enters the pipeline from sources like logs, metrics, traces, and events, is filtered, transformed, enriched, and partially analyzed at the edge, and then securely transported into destination applications—ranging from analytics engines that spot global attack patterns to action systems that automate responses.

High-level overview of telemetry cybersecurity

While most security teams collect telemetry data, the challenge lies in building pipelines that can process massive volumes efficiently while enabling real-time threat detection. This article provides patterns for implementing scalable telemetry pipelines that deliver actionable cybersecurity intelligence at an enterprise scale.

Summary of key best practices for effective telemetry in cybersecurity

| Aspect of telemetry | Best practices |
|---|---|
| Security telemetry and source mapping | Methodically organize and prioritize data sources based on the value they provide to security management systems. |
| Telemetry collection | Implement a telemetry pipeline architecture that avoids single points of failure, scales to support peak traffic, and intelligently processes data at the edge. |
| Typical telemetry pipeline issues | Apply resilience strategies for challenges such as network issues (backoff and buffering), SIEM downtime (circuit breakers), attack volume spikes (throttling and prioritization), and data integrity concerns (cryptographic verification). |
| Edge analytics for real-time monitoring and threat detection | Deploy edge analytics to reduce latency, filter and enrich data, and perform real-time threat detection. |
| Advanced correlation, enrichment, and contextualization | Correlate enriched data across regions and data sources to detect company-wide attack patterns. |

Cybersecurity telemetry and source mapping

While traditional IT telemetry tracks app performance, user behavior, and system health, security telemetry focuses specifically on threat detection and prevention. The processed data enables security teams to identify attack patterns and vulnerabilities in real-time before attackers have time to exploit them.

Cybersecurity telemetry transforms raw infrastructure and application data into actionable security insights through a four-stage pipeline:

  • Collection: The pipeline gathers telemetry data from diverse sources across the environment.

  • Parsing and Normalization: Raw data is standardized into consistent formats, with initial filtering to reduce noise and unneeded information.

  • Enrichment: The pipeline adds crucial context by correlating information across sources and augmenting with metadata (IP geolocation, user details, threat intelligence).

  • Data Routing: The enriched telemetry is directed to appropriate analysis systems based on security use cases and priority levels.

Modern data observability pipeline

How cybersecurity telemetry enables root-cause analysis

Before exploring telemetry pipeline design concepts, let's review examples of data types and how they are used in security analysis.

Traffic Pattern Analysis

Organizations analyze telemetry data from network traffic, such as failed login attempts or API calls, within a time frame to define metrics that help identify and quantify unusual patterns. For example:

  • Login failure rate = number of failed login attempts ÷ length of the time interval

  • API request rate = number of API requests ÷ length of the time interval
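
A minimal Python sketch of how these rates might be computed from parsed telemetry events; the event fields and the one-minute window are illustrative assumptions:

```python
# Minimal sketch: count events of a given type per fixed, epoch-aligned window.
from datetime import datetime

def rate_per_window(events, event_type, window_seconds=60):
    """Return a dict of {window_start_epoch: event_count} for one event type."""
    buckets = {}
    for ev in events:
        if ev["type"] != event_type:
            continue
        ts = datetime.fromisoformat(ev["timestamp"]).timestamp()
        bucket = int(ts // window_seconds) * window_seconds
        buckets[bucket] = buckets.get(bucket, 0) + 1
    return buckets

events = [
    {"type": "login_failure", "timestamp": "2025-03-25T04:00:12", "user": "alice"},
    {"type": "login_failure", "timestamp": "2025-03-25T04:00:45", "user": "alice"},
    {"type": "api_request", "timestamp": "2025-03-25T04:00:50", "path": "/checkout"},
]
print(rate_per_window(events, "login_failure"))  # {window_start_epoch: 2}
```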

API Traffic (Sample)

Teams can track metrics like these to detect traffic spikes and patterns indicating potential attacks, such as a sudden increase in authentication attempts or unusual patterns in API endpoint access from a client.

Authentication Analysis

Authentication logs capture failed login attempts from various IP addresses along with the timestamps. Security teams can monitor these patterns for geographic anomalies in authentication attempts, rapid succession of login failures, or successful logins shortly following multiple failures.  
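
For example, here is a hedged sketch of one such check, flagging a successful login that follows several failures for the same account within a short window; the field names and thresholds are assumptions:

```python
# Minimal sketch: flag a success preceded by repeated failures for the same user.
from datetime import datetime, timedelta

def suspicious_successes(auth_events, max_failures=5, window=timedelta(minutes=10)):
    """Yield successful logins preceded by >= max_failures failures in `window`."""
    failures = {}  # user -> timestamps of recent failures
    for ev in sorted(auth_events, key=lambda e: e["timestamp"]):
        ts = datetime.fromisoformat(ev["timestamp"])
        user = ev["user"]
        if ev["outcome"] == "failure":
            failures.setdefault(user, []).append(ts)
        elif ev["outcome"] == "success":
            recent = [t for t in failures.get(user, []) if ts - t <= window]
            if len(recent) >= max_failures:
                yield ev  # candidate brute-force or credential-stuffing success
```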

Log snippet (sample)

Transaction Flow Analysis

Transaction traces help reveal unusual behavior, such as bots attempting to bypass transaction steps or exhibiting abnormally fast checkout times, which indicate that automation generated the transaction instead of a real user.

Trace snippet (sample)

Security Event Analysis

Security systems generate events based on anomalies detected, as shown in the example below. These events can be a data source to help correlate logs, metrics, and traces to confirm the attack.

Event snippet (sample)

Root-cause analysis depends on layered telemetry correlation—first identifying anomalies, then enriching them with relevant context, and finally evaluating them against historical behavioral patterns. Instead of responding reactively to scattered alerts, organizations can trace security events back to their origin, distinguishing between a misconfiguration, an insider threat, or an external attack. The deeper the telemetry pipeline integrates across an infrastructure, the faster and more accurately teams can isolate threats, minimize response time, and prevent future incidents.

With this foundation in place, the next step is ensuring that security teams prioritize the right telemetry sources. Not all data is equally valuable, and the ability to map and categorize cybersecurity telemetry efficiently determines how quickly and effectively an organization can detect and respond to threats. This is where source mapping frameworks become critical.

Source mapping frameworks

Source mapping is a structured, methodical approach to determining which telemetry sources are vital for real-time detection and response (critical or “must have”) and which are “nice to have.” The sources in the second tier of importance add context or assist with post-incident analysis but are optional sources for data collection. This exercise helps prioritize the engineering resources so that data collection from the most important sources is not delayed.

For example, a telemetry source classification might look like the following:

| Source | Description | Classification |
|---|---|---|
| Network traffic | Bandwidth, latency, packet loss, and unusual behaviors | Critical |
| Log management | Logs for operational and security issues | Critical |
| Intrusion detection and threat intelligence | Unauthorized access and malware indicators | Critical |
| Authentication and access control | Failed logins, MFA, role-based access | Critical |
| System performance | CPU, memory, disk usage | Nice to have |
| IoT devices | Sensory data, device status, and security events | Critical |

Implementing a source mapping framework ensures efficient prioritization of critical telemetry, improving real-time threat detection and reducing response times. 

Best practices for cybersecurity telemetry collection

Security telemetry collection requires careful optimization to balance real-time detection capabilities with system performance and infrastructure cost. Understanding how to build efficient collection architectures can make the difference between catching threats early and drowning in noise.

Avoid bottlenecks and single points of failure

Traditional centralized collection approaches create single points of failure and processing bottlenecks. Modern architectures distribute collection closer to data sources, enabling local processing and reducing network overhead.

Consider a financial services organization processing millions of authentication events daily. With a centralized collection, a surge in failed login attempts could overwhelm the central collector, potentially missing critical attack indicators. Distributed collection allows each regional data center to process authentication events locally, identifying attack patterns before they impact the broader infrastructure.

The two diagrams below compare two scenarios: one where data is collected and analyzed centrally, and one where collectors are distributed and data is analyzed at the collection points. The traditional centralized model results in higher data transfer latencies and scalability challenges.

Centralized vs distributed data collection

Design a distributed scalable collection architecture

Your collection architecture needs three foundational elements: strategic placement, resilient data flows, and intelligent processing.

Start with collector placement. Map your network topology and security boundaries, then position collectors to minimize data movement while maintaining security isolation. For organizations handling regulated data, this means implementing strict zone-based isolation to prevent sensitive telemetry from crossing unauthorized boundaries.

Next, design your data flows for resilience. Use load balancing to distribute telemetry data across collectors. Scaling collection, however, requires infrastructure such as Kubernetes or AWS Auto Scaling.

The collection code runs as a containerized microservice whose job is to submit the collected data to a message bus such as Apache Kafka for transport. On a Kubernetes container orchestration platform, the Horizontal Pod Autoscaler (HPA) can replicate that microservice as needed to support an increasing workload. When a potential attack generates a surge of security events, the data collection infrastructure expands automatically to handle the increased load.
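
A minimal sketch of such a collector, assuming the kafka-python client, an environment-provided broker address, and an illustrative topic name:

```python
# Minimal sketch: a collector service that normalizes events and publishes them
# to Kafka for transport. Broker address and topic name are assumptions.
import json
import os

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers=os.environ.get("KAFKA_BROKERS", "kafka:9092"),
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def forward_event(raw_event: dict) -> None:
    """Normalize a raw event and publish it for downstream processing."""
    normalized = {
        "source": raw_event.get("source", "unknown"),
        "type": raw_event.get("event_type"),
        "timestamp": raw_event.get("timestamp"),
        "attributes": raw_event.get("attributes", {}),
    }
    producer.send("security-telemetry", value=normalized)

# A Kubernetes HorizontalPodAutoscaler can scale replicas of this service on CPU
# or throughput metrics; that configuration lives outside this code.
```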

The final piece is intelligent processing close to the source. Rather than forwarding raw data, collectors should normalize formats, enrich events with additional metadata context, and filter unnecessary fields at the edge, close to the data collection point. This approach reduces data transfer costs and latency.

Distinguish between data types 

Effective telemetry collection must scale across multiple dimensions: data volume, infrastructure growth, and processing complexity. 

The latter (processing complexity) increases when processing multiple data types (traces, metrics, and logs) without recognizing their unique attributes, like volume changes, collection formats, and analytics requirements.

Dedicate collector pools to specific telemetry types or security use cases. This allows optimized processing paths while simplifying scaling decisions. Authentication events, for instance, might route to collectors optimized for high-throughput pattern matching.

Rather than scaling collectors vertically (adding more CPU and memory to an existing node), implement horizontal scaling with load-balanced collector pools (adding more nodes). Each collector handles a portion of your telemetry stream, automatically scaling based on volume metrics. Set scaling thresholds based on collector performance profiles to trigger new collector deployment before existing resources reach their processing limits.

Infrastructure scaling requires careful collector topology management. As your infrastructure grows, deploy new collectors following established patterns:

  • Regional expansion needs new collector clusters per region

  • Cloud expansion demands cloud-native collection points

  • New applications and infrastructure might require specialized collectors

Design resilient collection systems

For cybersecurity telemetry collection, resilience means more than handling high volumes. Your collection infrastructure must maintain security visibility even during partial failures or network issues. Implement active-active collector pairs with real-time state synchronization so that all collection nodes remain operational simultaneously, eliminating single points of failure and preventing gaps in security coverage during failovers.

Consider how your collection infrastructure responds to different types of stress:

  • During network partitions, collectors should buffer telemetry locally until connectivity resumes.

  • When processing bottlenecks occur, prioritize security-critical events while queuing routine telemetry.

  • If storage systems slow down, implement smart retention policies that preserve security-relevant data.

  • If a service degrades, use circuit breakers to prevent cascading failures while gracefully reducing functionality rather than failing completely.

You can use chaos engineering techniques to create a telemetry resilience test plan, leveraging tools like Gremlin or AWS FIS. These tools can simulate many failure scenarios to identify how your observability pipeline behaves under stress and reacts to various failure modes.
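
As a minimal illustration of the circuit-breaker and local-buffering behavior described above, here is a hedged Python sketch; the failure threshold, cooldown, and send function are assumptions:

```python
# Minimal sketch: buffer telemetry locally and open a circuit breaker after
# repeated send failures, retrying after a cooldown.
import time
from collections import deque

class ForwarderWithCircuitBreaker:
    def __init__(self, send_fn, max_failures=5, cooldown_seconds=30):
        self.send_fn = send_fn
        self.max_failures = max_failures
        self.cooldown = cooldown_seconds
        self.failures = 0
        self.opened_at = None
        self.buffer = deque(maxlen=100_000)  # bounded local buffer

    def _circuit_open(self) -> bool:
        if self.opened_at is None:
            return False
        if time.time() - self.opened_at >= self.cooldown:
            self.opened_at, self.failures = None, 0  # half-open: try again
            return False
        return True

    def forward(self, event) -> None:
        if self._circuit_open():
            self.buffer.append(event)  # keep telemetry until the destination recovers
            return
        try:
            self.send_fn(event)
            self.failures = 0
        except Exception:
            self.failures += 1
            self.buffer.append(event)
            if self.failures >= self.max_failures:
                self.opened_at = time.time()  # open the circuit
```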

Filter out the noise

Effective stream processing in cybersecurity telemetry requires sophisticated filtering and sampling strategies that preserve security insights while managing data volumes. 

Begin with baseline filtering rules that identify high-value security events. For example, unusual authentication patterns, privilege escalations, and network anomalies should always flow through—but don't stop there. Implement dynamic thresholds that automatically adjust based on your environment's behavior patterns. For instance, during a suspected security incident, your pipeline should automatically increase sampling rates to capture more detailed telemetry from affected systems while maintaining normal sampling elsewhere.

In high-volume systems, you can incorporate the following commonly used techniques to prioritize data collection and reduce the volume of data automatically: 

  • Threshold-based filtering: Collect data only when security thresholds are exceeded.

  • Frequency-based sampling: Collect data at fixed intervals to manage volume.

  • Log-level filtering: Collect only high-severity security events.

  • Duplicate detection: Filter out duplicate alerts or logs.

  • Adaptive sampling: Adjust collection rates based on traffic or event severity.

One key takeaway is to filter according to your security use cases. Rather than applying blanket sampling rates, implement context-aware filtering that considers the security importance of different data sources. Use dynamic sampling rates that adjust based on system state—sample more heavily during suspicious activity and reduce sampling during normal operations. This approach not only improves data quality but also significantly reduces infrastructure spending.
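
A minimal sketch of context-aware filtering with adaptive sampling; the event fields, severity labels, and sampling rates are illustrative assumptions:

```python
# Minimal sketch: always keep high-value security events, sample the rest,
# and sample more heavily while an incident is suspected.
import random

ALWAYS_KEEP = {"privilege_escalation", "auth_anomaly", "network_anomaly"}

def should_keep(event: dict, incident_mode: bool = False) -> bool:
    """Decide whether an event enters the pipeline."""
    if event.get("type") in ALWAYS_KEEP or event.get("severity") == "high":
        return True
    # Routine telemetry is sampled; the rate rises during a suspected incident.
    sample_rate = 0.50 if incident_mode else 0.05
    return random.random() < sample_rate
```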

While filtering is essential, it also presents significant challenges. One major hurdle is writing and managing the complex regex patterns required to parse raw data effectively.

A regex pattern for parsing firewall logs, illustrating the complexity involved (sample)
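
For illustration, here is a hypothetical pattern of the kind such parsing can require; the log format shown is an assumption rather than any specific vendor's:

```python
# Illustrative example of regex complexity when parsing a firewall log line.
import re

FIREWALL_PATTERN = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\S*)\s+"
    r"(?P<action>ALLOW|DENY)\s+"
    r"(?P<proto>TCP|UDP|ICMP)\s+"
    r"(?P<src_ip>\d{1,3}(?:\.\d{1,3}){3}):(?P<src_port>\d+)\s+->\s+"
    r"(?P<dst_ip>\d{1,3}(?:\.\d{1,3}){3}):(?P<dst_port>\d+)\s+"
    r"bytes=(?P<bytes>\d+)$"
)

line = "2025-03-25T04:00:12Z DENY TCP 203.0.113.7:51544 -> 10.0.0.5:443 bytes=1420"
match = FIREWALL_PATTERN.match(line)
if match:
    print(match.groupdict())  # structured fields extracted from the raw line
```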

Modern platforms like Onum employ distributed intelligent collectors to tackle this challenge by providing UI-based, drag-and-drop capabilities and leveraging advanced techniques like machine learning-driven parsing, schema validation, and structured data formats (e.g., JSON), reducing the need for manual regex management.

Optimize data volume 

With collection and filtering optimized, turn your attention to processing efficiency. Every byte of telemetry should justify its resource cost through security value.

Strip unnecessary fields at collection time – standardize timestamps, normalize event schemas, and drop fields irrelevant to security analysis. For high-volume data sources like network flows, implement progressive enhancement where initial processing captures security-critical fields while maintaining links to full records when needed for investigation.

Apply compression strategically. Use content-aware compression optimized for cybersecurity telemetry types. Time series data benefits from differential encoding (e.g., data aggregation using min, max, 95th percentile, and average), while structured logs might use dictionary-based compression. Monitor compression effectiveness to identify opportunities for schema optimization.
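
A minimal sketch of both ideas, dropping non-security fields and summarizing a numeric series with min, max, average, and 95th percentile; the field names are assumptions:

```python
# Minimal sketch: strip fields at collection time and aggregate a numeric series.
import statistics

SECURITY_FIELDS = {"timestamp", "src_ip", "dst_ip", "action", "user", "severity"}

def strip_fields(event: dict) -> dict:
    """Keep only fields needed for security analysis."""
    return {k: v for k, v in event.items() if k in SECURITY_FIELDS}

def summarize(values: list) -> dict:
    """Replace raw samples with a compact statistical summary."""
    ordered = sorted(values)
    p95_index = max(0, int(round(0.95 * (len(ordered) - 1))))
    return {
        "min": ordered[0],
        "max": ordered[-1],
        "avg": statistics.mean(ordered),
        "p95": ordered[p95_index],
    }
```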

By carefully following these practices, you'll build a telemetry collection infrastructure that scales with your security needs while maintaining detection effectiveness. 

Measure effectiveness 

A pipeline's effectiveness depends not just on speed and consistency, but on delivering complete, valuable data to the right destinations. Even the fastest pipeline fails if it overly filters critical fields needed by downstream analytics tools. Effectiveness is ultimately about understanding a pipeline's ability to detect security breaches reliably. Are correlation rules catching real threats? Are processing bottlenecks creating security blind spots? Are new threats detected?

Use the mean time to detection (MTTD) and mean time to respond (MTTR) metrics to measure the effectiveness of your telemetry systems.

For mean time to respond (MTTR), instrument your response workflows to track time spans between detection, triage, and containment actions. The logs produced by the tools that generate alerts, open tickets, and trigger automated responses can be used to measure elapsed time during incident response and identify bottlenecks.
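
A minimal sketch of deriving MTTD and MTTR from incident records, assuming the detection, containment, and first-malicious-event timestamps have already been joined from those logs:

```python
# Minimal sketch: compute MTTD and MTTR (in minutes) from incident records.
from datetime import datetime
from statistics import mean

def mean_minutes(incidents, start_field, end_field):
    deltas = [
        (datetime.fromisoformat(i[end_field]) - datetime.fromisoformat(i[start_field])).total_seconds() / 60
        for i in incidents
    ]
    return mean(deltas) if deltas else 0.0

def mttd(incidents):  # detection lag: malicious activity -> detection
    return mean_minutes(incidents, "first_malicious_event", "detected_at")

def mttr(incidents):  # response lag: detection -> containment
    return mean_minutes(incidents, "detected_at", "contained_at")
```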

If a new attack vector emerges, you must adjust your telemetry collection strategy, add new correlation rules, or enhance edge processing to help maintain MTTD and MTTR goals.

Next-generation telemetry data pipelines for logs, metrics, and traces built with leading-edge technologies

  1. Optimize, enrich & route your real-time telemetry data at the ingestion point

  2. Drag-and-drop data pipeline creation & maintenance without the need for regex

  3. Rely on a future-proof, simple, efficient & flexible architecture built for hyperscale

Typical cybersecurity telemetry pipeline issues

Failures, outages, and data corruption in the supporting systems challenge the resilience of cybersecurity telemetry pipelines, even if they are designed to be redundant and distributed. 

To ground our understanding of real-world challenges, let's review specific examples of how data loss, traffic surge, or erroneous telemetry cause problems and the measures cybersecurity telemetry architects take to overcome them.

Use case: Network anomalies and packet loss
Error scenario: Temporary loss or degradation affecting critical cybersecurity telemetry
Recommended strategies:

  • Use exponential backoff: Automatically retry failed requests with increasing delay intervals to avoid overwhelming destination systems (e.g., using the AWS SDK).
  • Enable local buffering: Temporarily store telemetry data during transient issues using tools like Apache Kafka or AWS Kinesis.
  • Combine exponential backoff and local buffering: Retry data transmission and buffer data until the network recovers, preventing data loss.

Use case: SIEM and threat detection downtime
Error scenario: Unavailability of threat detection platforms
Recommended strategies:

  • Use circuit breakers: Stop making requests to failed destinations.
  • Use dead letter queues (DLQs): Temporarily store failed messages in a DLQ for later processing.
  • Combine circuit breakers and DLQs: Halt data transmission and store failed data in a DLQ until the destination is available again.

Use case: Overwhelming attack volumes
Error scenario: A high volume of security events during large attacks
Recommended strategies:

  • Use backpressure: Throttle incoming data to prevent the pipeline from becoming overwhelmed.
  • Use event prioritization: Make prioritization dynamic, using risk-based scoring instead of static severity labels. By incorporating historical trends, behavioral baselines, and correlation across multiple data sources, security teams can prioritize alerts based on real-world risk, not just predefined event categories.
  • Combine backpressure and event prioritization: Manage high volumes by slowing the data flow while processing critical events first.

Use case: Corrupted or tampered data
Error scenario: Malformed or tampered telemetry
Recommended strategies:

  • Use cryptographic signatures: Ensure data integrity by applying cryptographic signatures (e.g., HMAC or RSA).
  • Perform integrity checks: Use encryption and digital signatures to verify data authenticity.
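
A minimal sketch of the cryptographic-signature approach from the last scenario, signing payloads with HMAC-SHA256 and verifying them at the destination using only the Python standard library; shared-key management is out of scope and assumed here:

```python
# Minimal sketch: sign telemetry payloads with HMAC-SHA256 and verify on receipt.
import hashlib
import hmac
import json

def sign(payload: dict, key: bytes) -> str:
    """Produce a hex HMAC over a canonical JSON encoding of the payload."""
    message = json.dumps(payload, sort_keys=True).encode("utf-8")
    return hmac.new(key, message, hashlib.sha256).hexdigest()

def verify(payload: dict, signature: str, key: bytes) -> bool:
    """Recompute the signature and compare in constant time."""
    expected = sign(payload, key)
    return hmac.compare_digest(expected, signature)
```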

Edge analytics for real-time cybersecurity monitoring and threat detection

Edge analytics stops threats at the source by analyzing cybersecurity telemetry before data reaches centralized systems. Instead of waiting for logs to be processed in a Security Information and Event Management system (SIEM), threats like credential stuffing, API abuse, and insider attacks can be detected immediately at firewalls, authentication servers, and API gateways.

Consider an e-commerce site: Attackers run thousands of login attempts per minute using stolen credentials. A SIEM may take minutes to detect the pattern, allowing multiple accounts to be breached. With edge analytics, the attack is blocked instantly at the login gateway—long before it reaches backend authentication services.

To make this work, edge analytics nodes must make real-time security decisions. The key challenge is ensuring that each node has just enough intelligence to recognize threats without being overwhelmed by excessive data.

How to design an effective edge security layer

Making decisions locally

To be effective, edge systems must quickly recognize which events are threats and which are normal activities. A single failed login attempt is routine, but thousands from the same IP in minutes signal an automated attack. Instead of forwarding all login data to a SIEM for analysis, edge processors immediately identify patterns and block malicious activity, reducing the strain on backend systems.
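
A minimal sketch of such a local decision, blocking an IP that exceeds a failed-login threshold within a sliding window; the threshold and window size are assumptions:

```python
# Minimal sketch: per-IP sliding-window failure counter for an edge login gateway.
import time
from collections import defaultdict, deque

class EdgeLoginGuard:
    def __init__(self, max_failures=100, window_seconds=60):
        self.max_failures = max_failures
        self.window = window_seconds
        self.failures = defaultdict(deque)  # ip -> timestamps of recent failures

    def record_failure(self, ip: str) -> bool:
        """Record a failed login; return True if the IP should be blocked locally."""
        now = time.time()
        recent = self.failures[ip]
        recent.append(now)
        while recent and now - recent[0] > self.window:
            recent.popleft()  # drop failures outside the sliding window
        return len(recent) >= self.max_failures
```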

Filtering and prioritizing data at the edge

Security-critical events—privilege escalations, failed logins from unusual locations, and rapid API calls—should always be processed first. Less urgent data, like routine system health logs, can be filtered, sampled, or delayed to keep processing efficient. This ensures that real threats don’t get buried in low-priority noise.

Automated response at the edge

Once a threat is detected, edge systems don’t just generate alerts—they take action. A botnet running credential stuffing attacks may trigger an immediate IP block at the login gateway. A user exhibiting suspicious privilege escalation might have their session flagged or temporarily suspended. These decisions happen in milliseconds, preventing further damage before the attack spreads.

Staying in sync with central security intelligence

While edge systems handle local detection and response, they must also communicate with centralized security intelligence. If an attack pattern is detected in one region, the information is shared across all edge nodes, adapting defenses in real time. This prevents attackers from evading detection by switching locations, IP addresses, or attack methods.

Evolving with threats

Attackers constantly refine their techniques, so static detection rules become outdated quickly. To stay ahead, edge analytics systems must receive continuous updates, pulling in new threat intelligence, machine learning models, and correlation rules. Automated syncing ensures that all edge nodes operate with the latest detection capabilities, keeping security agile.

By stopping threats early, filtering unnecessary data, and acting instantly, edge analytics turns cybersecurity telemetry into real-time protection, reducing SIEM load and improving response times.

Advanced correlation, enrichment, and contextualization

While edge analytics excels at immediate threat detection, complex attack patterns often become visible only through centralized correlation and enrichment. Some decisions need immediate local context, like tracking authentication attempts per user. Others require broader historical patterns, like establishing baseline user behavior profiles.

Let’s consider an advanced persistent threat actor moving laterally through your infrastructure—at the edge, you might see isolated authentication attempts that appear legitimate. You could identify the broader attack pattern by centralizing and correlating this telemetry across your infrastructure. For example, a user authenticating to development systems in Asia, followed by database access in Europe, and finally connecting to production systems in North America creates a suspicious pattern that no single edge collector could detect. 

Correlation, enrichment, and contextualization transform raw telemetry into actionable intelligence. Correlation identifies relationships between events, linking seemingly separate login attempts, access requests, or privilege escalations to uncover patterns of compromise. Without it, security teams would be buried in isolated alerts, unable to distinguish real threats from normal activity. 

Enrichment adds detail to correlated data, incorporating external intelligence, user roles, and system classifications to clarify intent. A simple file download alert carries more weight when enriched with information showing that the file contains sensitive financial data and that the user accessing it has no prior history of interacting with such documents. Contextualization further refines detection by aligning activity with expected behavior. A database query at midnight may not be inherently suspicious, but if it comes from an employee who typically works daytime shifts in a region they’ve never logged in from and involves an unusually large data transfer, the risk level increases.

Centralizing this intelligence allows security teams to track slow-moving threats that would otherwise go undetected. While real-time monitoring might overlook an attacker who spaces actions over weeks to avoid detection, a well-architected telemetry system can link gradual privilege escalation, sporadic system access, and subtle configuration changes into a recognizable attack sequence. Without correlation, security data remains fragmented; without enrichment, it lacks clarity; without contextualization, it fails to distinguish normal from abnormal. Together, these processes transform raw telemetry from a flood of disconnected signals into a precise security insight.

How correlation reduces mean time to detection

Reducing mean time to detection (MTTD) demands an intelligent system to link related events, enrich them with additional intelligence, and assign risk based on context. A firewall alert about an external IP scanning a corporate network may not seem urgent on its own. Similarly, a user logging into their workstation from an unfamiliar location could be flagged as unusual but not immediately dangerous. However, when these events are correlated—showing that the same user account is not only logging in from an unusual location but is also querying a known malicious domain and accessing sensitive files—it signals a likely security breach that requires immediate action.
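
A minimal sketch of how such a correlation rule might be expressed, where the event type names and the one-hour window are illustrative assumptions:

```python
# Minimal sketch: correlate distinct risky signals for the same user within a window.
from datetime import datetime, timedelta

RISKY_TYPES = {"login_unusual_location", "dns_malicious_domain", "sensitive_file_access"}

def correlate_user_events(events, window=timedelta(hours=1)):
    """Return findings for users whose events cover all risky signal types in the window."""
    findings = []
    by_user = {}
    for ev in sorted(events, key=lambda e: e["timestamp"]):
        if ev["type"] not in RISKY_TYPES:
            continue
        ts = datetime.fromisoformat(ev["timestamp"])
        history = by_user.setdefault(ev["user"], [])
        history.append((ts, ev["type"]))
        recent = [(t, et) for t, et in history if ts - t <= window]
        by_user[ev["user"]] = recent
        if {et for _, et in recent} == RISKY_TYPES:
            findings.append({"user": ev["user"], "at": ev["timestamp"],
                             "signals": sorted(RISKY_TYPES)})
    return findings
```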

This correlation directly impacts MTTD by eliminating the analysis gap between seeing individual events and recognizing their collective significance. What might take an analyst hours to piece together manually happens in seconds with proper correlation, giving security teams the critical time advantage needed to contain threats before lateral movement or data exfiltration can occur.

This multi-source correlation approach drastically reduces the time security teams spend investigating false alarms. Instead of handling each event separately, the system pre-processes and cross-references data before generating an alert. Enrichment adds critical context, determining whether the user has ever logged in from this location before, whether the accessed domain appears in threat intelligence feeds, and whether the queried files contain classified data. With correlation and enrichment, security teams receive a single, high-confidence alert immediately pointing to a likely compromise.

The diagram below illustrates how these techniques reduce MTTD by combining firewall logs, endpoint telemetry, and DNS queries into a unified security signal. Instead of treating these data points as independent alerts, the system establishes relationships between them, quickly identifying patterns that indicate an attack. By automating this process, correlation helps detect intrusions in minutes rather than hours—often before an attacker can fully establish control.

Correlation and enrichment pipeline for faster threat detection

Recommendations for enabling effective correlation, enrichment, and contextualization

Here are some recommendations to keep in mind:

  • Centralize telemetry aggregation: Consolidate firewall logs, endpoint telemetry, DNS logs, and other data sources into a unified SIEM or similar platform. Centralizing these data sources allows you to track user login attempts and activities across the system, and this advanced correlation lets you connect the dots to identify suspicious activities and potential threats that might otherwise go unnoticed. Observability pipelines can facilitate this by integrating multiple data sources and ensuring smooth real-time ingestion of telemetry data.

  • Enrich with contextual data: Contextual data refers to additional information that enhances raw telemetry data. You can enrich your data with user roles, device details, geolocation, and more, providing additional context. For example, with extended context, you can automatically identify whether a user is logging into the system from an unauthorized device. This helps you enhance the accuracy and prioritization of threats. An observability pipeline can effectively enrich raw data with context during pipeline processing.

  • Define clear correlation rules: Correlation rules can be defined and tailored to specific threat scenarios based on a sequence of events. For example, you can correlate a user logging in from two different distant geolocations within a short period to detect potential malicious activity. Observability pipelines can implement real-time correlation within the pipeline, identifying specific patterns before data is sent to analytics systems, thereby building powerful rules to achieve better results.

  • Use AI-powered tools for correlation: You can leverage AI models to identify patterns and anomalies that typical static rules might miss. AI is capable of advanced correlation on top of unstructured data. Observability pipelines can be enhanced by integrating AI/ML tools for advanced correlation. The pipeline can be further enhanced to extract relevant features from raw data and feed them into AI models, improving accuracy.

  • Employ continuous tuning and feedback loops: Build continuous tuning and feedback loops to monitor and improve your correlation engine constantly. Regular evaluation and refinement of rules and AI models will help achieve higher success rates and reduce false positives.

Final words

By adopting best practices like source mapping, distributed collection, edge analytics, and real-time processing, organizations can enhance security, ensure compliance, and improve efficiency, enabling faster and more effective threat mitigation. Modern cybersecurity telemetry platforms like Onum are designed based on the principles described in this article to help save months of engineering implementation. Visit https://onum.com/platform/ to learn more.

Want the latest from Onum?

  • Subscribe to our LinkedIn newsletter to stay up to date on technical best practices for building resilient and scalable observability solutions and telemetry pipelines.
