In every IT environment, logs record individual events across multiple resources. Every login attempt, file access, network connection, and process is captured in logs. Logs answer questions about who performed an action, when it occurred, and which systems were involved.
Production environments often span hundreds of microservices and interconnected infrastructure components. These systems generate thousands of log events per minute. Managing data at this scale presents three main technical challenges: Volume, Velocity, and Variety.
Volume continues to grow as teams deploy new services, containers, IoT devices, and infrastructure components. Velocity matters because attacks can be completed in minutes, but traditional log systems take significantly longer to process events. Variety adds complexity since different systems output logs in various formats. Some use structured JSON, others dump plain text, and each format needs its own parsing approach before you can analyze anything.
These challenges require a different approach to log analytics. In this guide, we will cover best practices for implementing real-time log processing, with practical examples and insights from specialized platforms.
Summary of best practices for log analysis
| Best practice | Description |
| --- | --- |
| Perform in-transit data processing | Filter, enrich, and route log data while it moves through your network instead of after storage. This approach cuts detection times from minutes to milliseconds while reducing storage costs through filtering and content-based routing. |
| Implement intelligent data reduction and filtering | Use smart sampling to keep only valuable log data, mask sensitive information for compliance, and filter out noise that wastes storage space. Focus on preserving security-relevant events while discarding routine operations. |
| Optimize routing and storage | Send different log types to appropriate destinations and storage tiers based on how often you need to access them. Use efficient formats like Parquet for long-term storage and set up automatic policies that move older logs to cheaper storage options. |
| Emphasize infrastructure resiliency and scalability | Run multiple log collectors with automatic failover, use circuit breakers to prevent system failures from spreading, and set up load balancing that scales with your data volume without manual intervention. |
| Set up real-time alerting and detection pipelines | Connect events from different systems to spot attack patterns, use dynamic thresholds that adapt to normal behavior, and route alerts to the right teams through multiple channels to ensure fast response times. |
Limitations of traditional log analytics
Most log analytics systems work by storing all data initially and then analyzing it later. Your log collectors send raw data to central storage, where it sits in queues waiting for batch processing engines to analyze it. This approach creates delays because each log entry has to be written to disk before any analysis can occur. The constant disk writes create bottlenecks that slow down data ingestion and drive up infrastructure costs.
Store-then-process architecture constraints
Network and storage bottlenecks multiply as log volumes grow. During peak periods, high-traffic systems generate thousands of events per second, which flood the network connections between collectors and storage. Storage systems struggle to write all this incoming data, especially when they're also trying to run analytical queries simultaneously.
SIEM platforms frequently lag during busy periods because storage can't process data fast enough. This causes event accumulation in queues, creating longer gaps between when incidents occur and when you notice them.
Processing latency and its impact on detection capabilities
Processing delays hinder your ability to identify threats promptly. For example, privilege escalation attacks can happen within minutes, and lateral movement attacks often complete in under an hour. If your log system takes several minutes to process events, attackers have plenty of time to complete their work before you notice them.
Batch processing worsens this issue because systems remain idle, waiting to gather enough data before starting any analysis. During off-hours when fewer people are monitoring, these delays can stretch detection times to several hours.
Cost implications of full-fidelity storage approaches
Storing every single log gets expensive as data volumes grow into petabytes. Teams incur storage costs for verbose logs that don't contribute to security analysis, such as successful login records from normal users or routine system status checks. These low-value logs make up 40-50% of what most teams store.
Many teams shorten log retention periods to save costs, which often means deleting data that could be useful for later investigations. The result is data gaps when old logs are needed to trace the progression of an attack or identify long-running compromises.
Best practices for log analytics implementation
The following practices help you handle large volumes of log data without breaking your budget or missing security threats. Instead of storing everything first and analyzing later, you can process data as it moves through your network.
Perform in-transit data processing
Real-time filtering and field reduction at collection points
Filter out unnecessary data as logs arrive, before they consume network bandwidth and storage space. Set up filtering rules to drop verbose events, such as successful DNS lookups or routine system health checks.
You can also remove bloated fields that waste space without adding value. Strip out verbose debug information, overly detailed timestamps, or lengthy HTTP headers during log collection. This simple step cuts log sizes significantly while preserving security-relevant data.
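As a minimal sketch of this kind of collection-point filtering, the Logstash filter below drops two classes of routine events and strips a few verbose fields. The field names (event_type, response_code, url, debug_trace, raw_headers, internal_timing) are assumptions standing in for whatever your upstream parser actually produces:
filter {
  # Drop successful DNS lookups (assumed event_type/response_code fields)
  if [event_type] == "dns_query" and [response_code] == "NOERROR" {
    drop {}
  }
  # Drop routine health-check requests (assumed url field)
  if [url] =~ /^\/health/ {
    drop {}
  }
  # Strip verbose fields that add little analytical value
  mutate {
    remove_field => ["debug_trace", "raw_headers", "internal_timing"]
  }
}
Which events and fields are safe to drop depends on your environment; the point is to make these decisions at the collector rather than paying to ship and store data you will never query.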
Context enrichment through lookups during transmission
Add useful context to your logs as they move through your network before they reach your central analytics systems. Set up lookup tables at your collection points with data like IP reputation lists, user role information, and which systems are most important to protect.
When authentication logs are recorded, add details such as the user's role, access level, and typical login patterns from your identity systems. For network logs, add geolocation data and check IPs against threat feeds to spot connections to known bad actors right away.
Here is a Logstash configuration that checks client IPs against a threat intelligence feed and adds location information:
filter {
  # Only process logs that have a client_ip field
  if [client_ip] {
    # Look up IP reputation from a threat intelligence file
    translate {
      # Source field to check
      field => "client_ip"
      # New field to store the result
      destination => "threat_status"
      # Path to the threat intel file
      dictionary_path => "/etc/logstash/ip_reputation.yml"
      # Default value if the IP is not found
      fallback => "unknown"
    }
    # Add geographic location data for the IP
    geoip {
      source => "client_ip"
      # Store location data here
      target => "geoip"
      # Add country code for risk assessment
      add_field => { "country_risk" => "%{[geoip][country_code2]}" }
    }
  }
}
This configuration takes basic HTTP logs and adds threat status plus geographic details before sending them to your security tools.
Onum handles this processing automatically during data transmission. Instead of storing raw logs first, their system filters, enriches, and routes data while it flows through your network. This approach sends only relevant, enriched logs to your SIEM while keeping complete copies in cheaper storage for compliance.
Implement intelligent data reduction and filtering
Data sampling methodologies for high-volume streams
When applications generate massive log volumes, you don't need to keep every event. Use sampling to reduce storage requirements while maintaining security coverage. Keep all errors and authentication events, but sample only a percentage of successful operations.
For high-traffic applications, you might keep only 5% of successful requests but preserve 100% of errors and authentication attempts. You can also use techniques like keeping the first and last events in bursts of activity, which maintains context about peak activity periods.
Here is a Fluent Bit configuration that keeps all HTTP errors and authentication requests but samples only 1% of successful requests:
[FILTER]
    # Use the Lua scripting filter
    Name    lua
    # Apply to all logs with the "api" prefix
    Match   api.*
    # Path to the Lua script file
    script  sampling.lua
    # Function name to execute
    call    sample_logs

-- sampling.lua - Smart sampling script for API logs
function sample_logs(tag, timestamp, record)
    -- Keep all errors (4xx/5xx) and authentication requests (100%)
    if (record.status or 0) >= 400 or string.match(record.path or "", "/auth") then
        return 1, timestamp, record
    -- For successful requests, randomly keep a 1% sample
    elseif math.random(100) <= 1 then
        return 1, timestamp, record
    else
        -- Drop this event (discard the other 99% of successes)
        return -1, 0, 0
    end
end
Compliance-focused data masking
Protect sensitive data in your logs while maintaining their usefulness for security analysis. Instead of storing credit card numbers, social security numbers, email addresses, and other sensitive information in plain text, mask or hash this information.
Replace sensitive fields with consistent hashes that preserve correlation patterns without exposing data.
For credit card numbers, replace 1234-3567-5678-9012 with hashes like CARD:a1b2c3d4.
Email addresses work similarly, converting user@example.com into a trackable identifier.
For database fields containing sensitive information, use tokenization instead of storing the real values.
The Logstash implementation below finds credit card numbers and email addresses in log messages and replaces them with consistent hashes:
filter {
  # Find credit card numbers in the message and replace them with
  # consistent truncated MD5 hashes (e.g., CARD:a1b2c3d4)
  ruby {
    code => "
      require 'digest'
      msg = event.get('message')
      if msg
        masked = msg.gsub(/\b\d{4}[- ]?\d{4}[- ]?\d{4}[- ]?\d{4}\b/) do |card|
          'CARD:' + Digest::MD5.hexdigest(card)[0, 8]
        end
        event.set('message', masked)
      end
    "
  }
  # Process email addresses if present
  if [user_email] {
    # Create a consistent hash of the email address
    fingerprint {
      # Source field containing the email
      source => "user_email"
      # New field to store the hash
      target => "user_hash"
      # Use SHA256 for stronger hashing
      method => "SHA256"
      # Secret key for consistent hashing (change this!)
      key => "your-secret-key"
    }
    # Remove the original email field for privacy
    mutate { remove_field => ["user_email"] }
  }
}
Onum automates much of this filtering work. Their system uses machine learning to figure out which logs contain real security signals and which are just noise, cutting storage costs.
Optimize routing and storage
Storage tier optimization strategies
Different logs have different access requirements, so match your storage choices accordingly. Keep recent security events in fast, high-IOPS storage where your SIEM can query them immediately. This typically covers 30-90 days of high-priority events.
For logs you access weekly or monthly, like system and debug logs, use standard storage options that balance cost and retrieval time. Logs kept only for compliance or archival purposes can go to cold storage tiers that offer substantial cost savings.
Here is an example S3 lifecycle policy that shows how this tiering works. It moves security logs from standard storage to infrequent-access storage after 30 days, to Glacier after 90 days, and finally to deep archive for long-term compliance after roughly seven years (2,555 days):
{
  "Rules": [
    {
      "ID": "SecurityLogLifecycle",
      "Status": "Enabled",
      "Filter": { "Prefix": "security-logs/" },
      "Transitions": [
        { "Days": 30,   "StorageClass": "STANDARD_IA" },
        { "Days": 90,   "StorageClass": "GLACIER" },
        { "Days": 2555, "StorageClass": "DEEP_ARCHIVE" }
      ]
    }
  ]
}
Format transformation
Converting logs to compressed, columnar formats such as Parquet can save significant storage space in long-term archives. For compliance-only logs, use compression algorithms like LZ4 or Snappy that compress well but decompress quickly when needed.
This Logstash configuration sends high-priority events to your SIEM in JSON format while writing a complete copy of all events to S3 as compressed Parquet for long-term storage:
output {
  # Send high-priority events to the SIEM for immediate analysis
  if [event_priority] == "high" {
    elasticsearch {
      # SIEM cluster endpoints
      hosts => ["siem-cluster:9200"]
      # Daily indices for better performance and management
      index => "security-events-%{+YYYY.MM.dd}"
    }
  }
  # Send all events to S3 for long-term storage and compliance
  s3 {
    # AWS region for the S3 bucket
    region => "us-east-1"
    # Destination bucket for log archives
    bucket => "long-term-logs"
    # Use Parquet format for efficient storage and analytics
    codec => "parquet"
    # Compression level (1-9, higher = more compression)
    compression_level => 6
    # Create a new file every 15 minutes
    time_file => 15
  }
}
Onum's routing capabilities intelligently direct data to appropriate destinations. Their system looks at what's in your logs and decides where each piece of data should go. Important security events get sent to your SIEM, while complete copies go to cheaper storage for compliance.
Emphasize infrastructure resiliency and scalability
Active-active collector deployments
Don't rely on just one log collector and risk losing data when it fails. Run multiple collectors simultaneously so that if one goes down, the others continue working without missing any logs.
Set up your collectors in pairs or small groups where each one can handle all the log traffic from your most important systems. Use a load balancer to distribute the work between them and check if they're healthy. When one collector stops responding, the load balancer automatically redirects all data to the working collectors.
Here is an HAProxy configuration that load-balances across three log collectors with health checks every 5 seconds. If any collector fails its health check 3 times in a row, HAProxy stops sending traffic to it:
backend log_collectors
    # Distribute requests evenly across all healthy servers
    balance roundrobin
    # Check server health via the /health endpoint
    option httpchk GET /health
    # Track which collector each client IP is using (100k entries, 30-minute expiry)
    stick-table type ip size 100k expire 30m
    # Keep the same client IP on the same backend server
    stick on src
    # First log collector server
    server collector1 10.0.1.10:8080 check inter 5s fall 3 rise 2
    # Second log collector server with the same health check settings
    server collector2 10.0.1.11:8080 check inter 5s fall 3 rise 2
    # Third log collector server with the same health check settings
    server collector3 10.0.1.12:8080 check inter 5s fall 3 rise 2

frontend log_ingress
    # Listen on port 80 on all network interfaces
    bind *:80
    # Route traffic to the log_collectors backend
    default_backend log_collectors
The configuration also keeps track of which collector each IP address was using, so related log events stay together even during failovers.
Circuit breaker patterns
When one of your downstream systems crashes or gets overloaded, you don't want it to bring down your entire log processing pipeline. Circuit breakers work like electrical fuses. They automatically disconnect from failing systems to protect the rest of the network.
Set up circuit breakers on connections to your SIEM, threat intelligence feeds, and enrichment databases. These monitor how often requests fail and how long they take to complete. When failures cross a configured threshold, the circuit breaker moves between three states:
Closed (normal): Requests go through normally
Open (blocking): The downstream system is failing, so block all requests to protect it
Half-open (testing): Try a few requests to see if the system recovered
The following Resilience4j configuration sets up circuit breakers for a threat intelligence service and a SIEM connector. The threat intelligence breaker opens when 50% of recent requests fail, or when half of them take longer than two seconds, and waits 30 seconds before testing recovery. The SIEM breaker tolerates a higher failure rate and automatically transitions from open to half-open once its wait period expires.
resilience4j.circuitbreaker:
  instances:
    # Circuit breaker for threat intelligence service lookups
    threatIntelService:
      # Monitor the last 100 requests to determine circuit state
      slidingWindowSize: 100
      # Open the circuit if 50% of requests fail
      failureRateThreshold: 50
      # Wait 30 seconds before testing whether the service recovered
      waitDurationInOpenState: 30s
      # Treat calls slower than 2 seconds as slow; open if 50% of calls are slow
      slowCallRateThreshold: 50
      slowCallDurationThreshold: 2s
    # Circuit breaker for the SIEM connector
    siemConnector:
      # Monitor the last 50 requests
      slidingWindowSize: 50
      # Open the circuit if 60% of requests fail
      failureRateThreshold: 60
      # Wait 60 seconds before testing recovery
      waitDurationInOpenState: 60s
      # Automatically transition from open to half-open after the wait period
      automaticTransitionFromOpenToHalfOpenEnabled: true
Onum's architecture handles hyper-scale environments through distributed processing and intelligent load distribution. It automatically scales up or down based on the incoming data volume, without requiring manual adjustments.
Set up real-time alerting and detection pipelines
Anomaly detection thresholding techniques
Instead of setting fixed alert thresholds that cause false alarms during busy periods, use dynamic thresholds that adjust based on observed patterns. This approach compares current activity against what is typically normal for that time of day, day of the week, or specific user.
For example, if a user normally logs in 2-3 times per day, 20 failed login attempts in 10 minutes is clearly suspicious. However, if your login page usually receives 1,000 attempts per hour during business hours, you'd set different thresholds than during overnight hours when legitimate traffic is low.
Here is a Logstash filter that tracks failed login attempts from each IP address. When any IP hits 10 failed attempts, it immediately flags that IP as suspicious and marks it for blocking:
filter {
  # Process only failed authentication attempts on login/auth endpoints
  if [status] == 401 and [url] =~ /login|auth/ {
    # Track failed login attempts per IP address
    # Maintains an in-memory hash counter that persists across events
    ruby {
      init => "@login_attempts = {}"
      code => "
        # Extract the client IP and increment its failed login counter
        ip = event.get('client_ip')
        @login_attempts[ip] ||= 0
        @login_attempts[ip] += 1
        # Flag as a brute force attack if the threshold is exceeded
        # Adds alert fields for downstream processing and response
        if @login_attempts[ip] >= 10
          event.set('alert_type', 'brute_force')
          event.set('severity', 'high')
          event.set('action_required', 'block_ip')
        end
      "
    }
  }
}
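The filter above uses a fixed threshold of 10 attempts. If you want the limit itself to adapt to the time of day, one simple variant, sketched here using the collector's local clock and assumed cutoff values, chooses the threshold inside the same Ruby filter:
filter {
  if [status] == 401 and [url] =~ /login|auth/ {
    ruby {
      init => "@login_attempts = Hash.new(0)"
      code => "
        ip = event.get('client_ip')
        @login_attempts[ip] += 1

        # Assumed cutoffs: tolerate more failures during business hours,
        # far fewer overnight when legitimate traffic is low
        hour = Time.now.hour
        threshold = (hour >= 8 && hour < 18) ? 50 : 10

        if @login_attempts[ip] >= threshold
          event.set('alert_type', 'brute_force')
          event.set('severity', 'high')
          event.set('action_required', 'block_ip')
        end
      "
    }
  }
}
Note that this in-memory counter is per pipeline worker and resets on restart; a production deployment would typically back it with a shared store or compute baselines in the analytics tier.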
Notification workflow integration
Once threats are detected, you need to get alerts to the right people through the proper channels. Don't just send everything to everyone. This creates alert fatigue and slows down response times.
Set up multiple notification channels for critical alerts. Send high-priority threats through email, SMS, Slack, and your ticketing system so the message gets through even if one channel fails. Route different types of alerts to the right teams. Send database security issues to your DBA team, infrastructure problems to your ops team, and web application attacks to your security team.
For example, this Alertmanager configuration routes alerts based on severity and attack type:
route:
  group_by: ['alertname', 'client_ip']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 12h
  receiver: 'default-team'
  # Critical attacks get faster delivery while warnings use the standard workflow
  routes:
    - match:
        severity: critical
      match_re:
        alert_type: sql_injection|brute_force
      receiver: 'security-critical'
      group_interval: 1m
    - match:
        severity: warning
        alert_type: suspicious_activity
      receiver: 'security-standard'

# Notification channels; the critical receiver uses multiple channels to ensure delivery during incidents
receivers:
  - name: 'security-critical'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/XXX'
        channel: '#security-alerts'
        title: 'CRITICAL: {{ .GroupLabels.alertname }}'
    email_configs:
      - to: 'security-oncall@company.com'
        headers:
          Subject: 'CRITICAL: Web App Attack Detected'
    pagerduty_configs:
      - service_key: 'YOUR_PD_KEY'
        severity: 'critical'
  # Placeholder receivers referenced above; add channel configs for your environment
  - name: 'security-standard'
  - name: 'default-team'
The configuration above routes critical attacks such as SQL injection and brute-force attempts through multiple channels (Slack, email, and PagerDuty), while lower-priority activity uses single-channel delivery to reduce alert noise.
Conclusion
The challenges of managing massive log volumes, processing speeds, and diverse data formats don't have to break your budget or leave you blind to threats. By filtering, enriching, and routing data as it moves through your network, you can reduce detection times to milliseconds.
Start with the basics that deliver quick wins. Set up intelligent filtering to discard noisy logs that add no security value. Add context enrichment to your alerts, providing the information analysts need to act quickly. Then, build resilience features like active-active collectors, circuit breakers, and intelligent routing that keep everything running when systems fail.
The goal is to process data during transmission rather than after storage. This shift from store-then-process to process-in-transit is what separates teams that struggle with scale from those that handle it efficiently.
Want the latest from Onum?
Subscribe to our LinkedIn newsletter to stay up to date on technical best practices for building resilient and scalable observability solutions and telemetry pipelines.