
Log Analysis: A Best Practices Guide

Learn about the challenges and best practices for implementing real-time log processing in IT environments with high volume, velocity, and variety of log events.

Onum · August 21, 2025 · 15 min read

In every IT environment, logs record individual events across multiple resources. Every login attempt, file access, network connection, and process is captured in logs. Logs answer questions about who performed which actions, when events occurred, and which systems were involved.

Production environments often span hundreds of microservices and interconnected infrastructure components. These systems generate thousands of log events per minute. Managing data at this scale presents three main technical challenges: Volume, Velocity, and Variety.

Volume continues to grow as teams deploy new services, containers, IoT devices, and infrastructure components. Velocity matters because attacks can be completed in minutes, but traditional log systems take significantly longer to process events. Variety adds complexity since different systems output logs in various formats. Some use structured JSON, others dump plain text, and each format needs its own parsing approach before you can analyze anything. 

These challenges require a different approach to log analytics. In this guide, we will cover best practices for implementing real-time log processing, with practical examples and insights from specialized platforms.

Summary of best practices for log analysis

  • Perform in-transit data processing: Filter, enrich, and route log data while it moves through your network instead of after storage. This approach cuts detection times from minutes to milliseconds while reducing storage costs through filtering and content-based routing.

  • Implement intelligent data reduction and filtering: Use smart sampling to keep only valuable log data, mask sensitive information for compliance, and filter out noise that wastes storage space. Focus on preserving security-relevant events while discarding routine operations.

  • Optimize routing and storage: Send different log types to appropriate destinations and storage tiers based on how often you need to access them. Use efficient formats like Parquet for long-term storage and set up automatic policies that move older logs to cheaper storage options.

  • Emphasize infrastructure resiliency and scalability: Run multiple log collectors with automatic failover, use circuit breakers to prevent system failures from spreading, and set up load balancing that scales with your data volume without manual intervention.

  • Set up real-time alerting and detection pipelines: Connect events from different systems to spot attack patterns, use dynamic thresholds that adapt to normal behavior, and route alerts to the right teams through multiple channels to ensure fast response times.

Limitations of traditional log analytics

Most log analytics systems work by storing all data initially and then analyzing it later. Your log collectors send raw data to central storage, where it sits in queues waiting for batch processing engines to analyze it. This approach creates delays because each log entry has to be written to disk before any analysis can occur. The constant disk writes create bottlenecks that slow down data ingestion and drive up infrastructure costs.

Store-then-process architecture constraints

Network and storage bottlenecks multiply as log volumes grow. During peak periods, high-traffic systems generate thousands of events per second, which flood the network connections between collectors and storage. Storage systems struggle to write all this incoming data, especially when they're also trying to run analytical queries simultaneously.

SIEM platforms frequently lag during busy periods because storage can't process data fast enough. This causes event accumulation in queues, creating longer gaps between when incidents occur and when you notice them.

fig. Traditional store-then-process architecture
fig. In-transit processing architecture

Processing latency and its impact on detection capabilities

Processing delays hinder your ability to identify threats promptly. For example, privilege escalation attacks can happen within minutes, and lateral movement attacks often complete in under an hour. If your log system takes several minutes to process events, attackers have plenty of time to complete their work before you notice them.

Batch processing worsens this issue because systems remain idle, waiting to gather enough data before starting any analysis. During off-hours when fewer people are monitoring, these delays can stretch detection times to several hours.

Cost implications of full-fidelity storage approaches

Storing every single log gets expensive as data volumes grow into petabytes. Teams incur storage costs for verbose logs that don't contribute to security analysis, such as successful login records from normal users or routine system status checks. These low-value logs make up 40-50% of what most teams store.

Many teams shorten log retention to save costs, which often means deleting data that might be useful later for investigations. This creates gaps when old logs are needed to trace the progression of an attack or identify long-running compromises.

Best practices for log analytics implementation

The following practices help you handle large volumes of log data without breaking your budget or missing security threats. Instead of storing everything first and analyzing later, you can process data as it moves through your network.

Perform in-transit data processing

Real-time filtering and field reduction at collection points

Filter out unnecessary data as logs arrive, before it consumes network bandwidth and storage space. Set up filtering rules to drop verbose events, such as successful DNS lookups or routine system health checks.

You can also remove bloated fields that waste space without adding value. Strip out verbose debug information, overly detailed timestamps, or lengthy HTTP headers during log collection. This simple step cuts log sizes significantly while preserving security-relevant data.
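As a minimal sketch (assuming Logstash runs at the collection point, with hypothetical event_type values and field names), a filter like the following could drop routine events and strip verbose fields before anything is forwarded:

filter {
  # Drop routine events that rarely add security value (assumed event_type values)
  if [event_type] == "dns_success" or [event_type] == "health_check" {
    drop { }
  }

  # Strip verbose fields that inflate log size without helping analysis
  mutate {
    remove_field => ["debug_info", "raw_http_headers", "internal_trace_id"]
  }
}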

Context enrichment through lookups during transmission

Add useful context to your logs as they move through your network before they reach your central analytics systems. Set up lookup tables at your collection points with data like IP reputation lists, user role information, and which systems are most important to protect.

When authentication logs are recorded, add details such as the user's identity, access level, and login patterns from your identity systems. For network logs, add geolocation data and check IPs against threat feeds to spot connections to known bad actors right away.

Here is a Logstash configuration that checks client IPs against threat intelligence feeds and adds location information:

filter {
  # Only process logs that have a client_ip field
  if [client_ip] {
    # Look up IP reputation from threat intelligence file
    translate {
      # Source field to check
      field => "client_ip"
      # New field to store result
      destination => "threat_status"
      # Path to threat intel file
      dictionary_path => "/etc/logstash/ip_reputation.yml"
      # Default if IP not found
      fallback => "unknown"
    }
    
    # Add geographic location data for the IP
    geoip {
      source => "client_ip"   
      # Store location data here                               
      target => "geoip"             
      # Add country code for risk assessment                         
      add_field => { "country_risk" => "%{[geoip][country_code2]}" } 
    }
  }
}

This configuration takes basic HTTP logs and adds threat status plus geographic details before sending them to your security tools.

Onum handles this processing automatically during data transmission. Instead of storing raw logs first, their system filters, enriches, and routes data while it flows through your network. This approach sends only relevant, enriched logs to your SIEM while keeping complete copies in cheaper storage for compliance.

Implement intelligent data reduction and filtering

Data sampling methodologies for high-volume streams

When applications generate massive log volumes, you don't need to keep every event. Use sampling to reduce storage requirements while maintaining security coverage. Keep all errors and authentication events, but sample only a percentage of successful operations.

For high-traffic applications, you might keep only 5% of successful requests but preserve 100% of errors and authentication attempts. You can also use techniques like keeping the first and last events in bursts of activity, which maintains context about peak activity periods; a simplified sketch of this burst-based approach follows the Fluent Bit example below.

Here is a Fluent Bit configuration that keeps all HTTP errors and authentication requests but samples only 1% of successful requests:

[FILTER]
    # Use Lua scripting filter
    Name    lua
    # Apply to all logs with "api" prefix
    Match   api.*
    # Path to the Lua script file
    script  sampling.lua
    # Function name to execute
    call    sample_logs

-- sampling.lua - Smart sampling script for API logs
function sample_logs(tag, timestamp, record)
    -- Treat missing fields safely so the comparisons never error
    local status = tonumber(record["status"]) or 0
    local path = record["path"] or ""
    -- Keep all errors (4xx/5xx) and authentication events (100%)
    if status >= 400 or string.match(path, "/auth") then
        return 1, timestamp, record
    -- For successful requests, randomly keep only 1%
    elseif math.random(100) <= 1 then
        return 1, timestamp, record
    else
        -- Drop this event (discard the other 99% of successes)
        return -1, 0, 0
    end
end
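As a complement to the rate-based sampling above, here is a minimal Logstash sketch of the burst-handling idea, assuming a client_ip field and an illustrative limit of five events per source per one-minute window; preserving the last event of a burst would additionally require a scheduled flush, which is omitted here:

filter {
  ruby {
    # In-memory counters per source and per one-minute window (sketch only; not thread-safe across pipeline workers)
    init => "@burst_counts = {}"
    code => "
      ip = event.get('client_ip') || 'unknown'
      window = Time.now.to_i / 60

      # Drop counters from previous windows so memory does not grow unbounded
      @burst_counts.delete_if { |k, _| !k.end_with?(':' + window.to_s) }

      key = ip.to_s + ':' + window.to_s
      @burst_counts[key] = (@burst_counts[key] || 0) + 1

      # Keep the first 5 events of each burst, cancel the rest
      event.cancel if @burst_counts[key] > 5
    "
  }
}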

Compliance-focused data masking 

Protect sensitive data in your logs while maintaining their usefulness for security analysis. Instead of storing credit card numbers, social security numbers, email addresses, and other sensitive information in plain text, mask or hash this information.

Replace sensitive fields with consistent hashes that preserve correlation patterns without exposing data.

  • For credit card numbers, replace 1234-3567-5678-9012 with hashes like CARD:a1b2c3d4.

  • Email addresses work similarly, converting user@example.com into a trackable identifier.

  • For database fields containing sensitive information, use tokenization instead of storing the real values.

The Logstash filter below replaces credit card numbers in log messages with consistent hash tokens and converts email addresses into keyed hashes.

filter {
  # Find credit card numbers and replace them with consistent hash tokens
  # (a ruby filter is used because the replacement embeds a hash of the matched number)
  ruby {
    code => "
      require 'digest'
      msg = event.get('message')
      if msg
        masked = msg.gsub(/[0-9]{4}[- ]?[0-9]{4}[- ]?[0-9]{4}[- ]?[0-9]{4}/) do |card|
          '[CARD:' + Digest::MD5.hexdigest(card)[0, 8] + ']'
        end
        event.set('message', masked)
      end
    "
  }
  
  # Process email addresses if present
  if [user_email] {
    # Create a consistent hash of the email address
    fingerprint {
      # Source field containing the email
      source => "user_email"
      # New field to store the hash
      target => "user_hash"
      # Use SHA256 for stronger hashing
      method => "SHA256"
      # Secret key for consistent hashing (change this!)
      key => "your-secret-key"
    }
    # Remove the original email field for privacy
    mutate { remove_field => ["user_email"] }
  }
}

Onum automates much of this filtering work. Their system uses machine learning to figure out which logs contain real security signals and which are just noise, cutting storage costs.

Optimize routing and storage

Storage tier optimization strategies

Different logs have different access requirements, so match your storage choices accordingly. Keep recent security events in fast, high-IOPS storage where your SIEM can query them immediately; this typically covers 30-90 days of high-priority events.

For logs you access weekly or monthly, like system and debug logs, use standard storage options that balance cost and retrieval time. Logs for compliance and archival reasons can be stored in cold storage tiers that offer cost savings.

Here is an example S3 lifecycle policy that shows how this tiering works. It keeps security logs in standard storage for the first 30 days, moves them to infrequent-access storage after 30 days, archives them to Glacier after 90 days, and finally moves them to deep archive after about 7 years (2,555 days) for long-term compliance:

{
  "Rules": [{
    "ID": "SecurityLogLifecycle",
    "Status": "Enabled",
    "Filter": {"Prefix": "security-logs/"},
    "Transitions": [
      {
        "Days": 30,
        "StorageClass": "STANDARD_IA"
      },
      {
        "Days": 90, 
        "StorageClass": "GLACIER"
      },
      {
        "Days": 2555,
        "StorageClass": "DEEP_ARCHIVE"
      }
    ]
  }]
}

Format transformation

Converting logs to compressed formats can save significant storage space for long-term archival. Convert JSON logs to columnar formats like Parquet. For compliance-only logs, use compression algorithms like LZ4 or Snappy that compress well but decompress quickly when needed.

This Logstash configuration sends high-priority events to your SIEM and writes compressed copies of all events to S3 for long-term storage. The standard S3 output writes gzip-compressed JSON Lines; converting those archives to Parquet is typically handled by a downstream batch job:

output {
  # Send high-priority events to SIEM for immediate analysis
  if [event_priority] == "high" {
    elasticsearch {
      # SIEM cluster endpoints
      hosts => ["siem-cluster:9200"]
      # Daily indices for better performance and management
      index => "security-events-%{+YYYY.MM.dd}"
    }
  }
  
  # Send all events to S3 for long-term storage and compliance
  s3 {
    # AWS region for the S3 bucket
    region => "us-east-1"
    # Destination bucket for log archives
    bucket => "long-term-logs"
    # Write newline-delimited JSON that downstream jobs can convert to Parquet
    codec => "json_lines"
    # Gzip-compress files before upload to cut storage costs
    encoding => "gzip"
    # Create a new file every 15 minutes
    time_file => 15
  }
}

Onum's routing capabilities intelligently direct data to appropriate destinations. Their system looks at what's in your logs and decides where each piece of data should go. Important security events get sent to your SIEM, while complete copies go to cheaper storage for compliance.

Emphasize infrastructure resiliency and scalability

Active-active collector deployments

Don't rely on just one log collector and risk losing data when it fails. Run multiple collectors simultaneously so that if one goes down, the others continue working without missing any logs.

Set up your collectors in pairs or small groups where each one can handle all the log traffic from your most important systems. Use a load balancer to distribute the work between them and check if they're healthy. When one collector stops responding, the load balancer automatically redirects all data to the working collectors.

Here is an HAProxy configuration that load-balances three log collectors with health checks every 5 seconds. If a collector fails its health check 3 times in a row, HAProxy stops sending traffic to it:

backend log_collectors
    # Distribute requests evenly across all healthy servers
    balance roundrobin
    # Check server health via the /health endpoint
    option httpchk GET /health
    # Track client IPs so related events stick to the same collector (100k entries, 30-min expiry)
    stick-table type ip size 100k expire 30m
    # Send the same client IP to the same backend server
    stick on src

    # First log collector server
    server collector1 10.0.1.10:8080 check inter 5s fall 3 rise 2
    # Second log collector server with the same health check settings
    server collector2 10.0.1.11:8080 check inter 5s fall 3 rise 2
    # Third log collector server with the same health check settings
    server collector3 10.0.1.12:8080 check inter 5s fall 3 rise 2

frontend log_ingress
    # Listen on port 80 on all network interfaces
    bind *:80
    # Route traffic to the log_collectors backend
    default_backend log_collectors

The configuration also keeps track of which collector each IP address was using, so related log events stay together even during failovers.

Circuit breaker patterns

When one of your downstream systems crashes or gets overloaded, you don't want it to bring down your entire log processing pipeline. Circuit breakers work like electrical fuses. They automatically disconnect from failing systems to protect the rest of the network.

Set up circuit breakers on connections to your SIEM, threat intelligence feeds, and enrichment databases. These monitor how often requests fail and how long they take to complete. When failure rates or response times cross a threshold, the circuit breaker moves between three states:

  • Closed (normal): Requests go through normally

  • Open (blocking): The downstream system is failing, so block all requests to protect it

  • Half-open (testing): Try a few requests to see if the system recovered


The following Resilience4j configuration sets up circuit breakers for threat intelligence lookups and the SIEM connector. The threat intelligence circuit opens when 50% of requests fail (or when half of them run slow) and waits 30 seconds before testing recovery; the SIEM circuit opens at a 60% failure rate and automatically transitions from open to half-open after a 60-second wait.

resilience4j.circuitbreaker:
  instances:
    # Circuit breaker for threat intelligence service lookups
    threatIntelService:
      # Monitor last 100 requests to determine circuit state
      slidingWindowSize: 100
      # Open circuit if 50% of requests fail
      failureRateThreshold: 50
      # Wait 30 seconds before testing whether the service recovered
      waitDurationInOpenState: 30s
      # Count calls slower than 2 seconds as slow; open the circuit if 50% of calls are slow
      slowCallRateThreshold: 50
      slowCallDurationThreshold: 2s

    # Circuit breaker for SIEM connector
    siemConnector:
      # Monitor last 50 requests
      slidingWindowSize: 50
      # Open circuit if 60% of requests fail
      failureRateThreshold: 60
      # Wait 60 seconds before testing recovery
      waitDurationInOpenState: 60s
      # Automatically move from open to half-open after the wait period
      automaticTransitionFromOpenToHalfOpenEnabled: true

Onum's architecture handles hyper-scale environments through distributed processing and intelligent load distribution. It automatically scales up or down based on the incoming data volume, without requiring manual adjustments.

fig. Onum’s distributor and workers flow

Set up real-time alerting and detection pipelines

Anomaly detection thresholding techniques

Instead of setting fixed alert thresholds that cause false alarms during busy periods, use dynamic thresholds that adjust to normal baseline patterns. This approach compares current activity against what is typical for that time of day, day of the week, or specific user.

For example, if a user normally logs in 2-3 times per day, 20 failed login attempts in 10 minutes is suspicious. However, if your login page usually receives 1,000 attempts per hour during business hours, you'd set different thresholds than during overnight hours when legitimate traffic is low.

Here is a Logstash filter that tracks failed login attempts from each IP address. When any IP hits 10 failed attempts, it immediately flags that IP as suspicious and marks it for blocking:

filter {
  # Process only failed authentication attempts on login/auth endpoints
  if [status] == 401 and [url] =~ /login|auth/ {
    
    # Track failed login attempts per IP address
    # Maintains an in-memory hash counter that persists across events
    ruby {
      init => "@login_attempts = {}"
      code => "
        # Extract client IP and increment failed login counter
        ip = event.get('client_ip')
        @login_attempts[ip] ||= 0
        @login_attempts[ip] += 1
        
        # Flag as brute force attack if threshold exceeded
        # Adds alert fields for downstream processing and response
        if @login_attempts[ip] >= 10
          event.set('alert_type', 'brute_force')
          event.set('severity', 'high')
          event.set('action_required', 'block_ip')
        end
      "
    }
  }
}
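To move from this fixed limit toward the dynamic thresholds described above, the same counter can be compared against a limit that varies by time of day. This is a minimal sketch with assumed business-hours values and the same client_ip field; a production version would also reset the counters periodically or keep them in an external store such as Redis:

filter {
  if [status] == 401 and [url] =~ /login|auth/ {
    ruby {
      init => "@failed_by_ip = {}"
      code => "
        # Assumed baseline: tolerate more failures during business hours than overnight
        hour = Time.now.hour
        threshold = (hour >= 8 && hour < 19) ? 30 : 10

        ip = event.get('client_ip')
        @failed_by_ip[ip] = (@failed_by_ip[ip] || 0) + 1

        # Flag the source once it crosses the hour-dependent threshold
        if @failed_by_ip[ip] >= threshold
          event.set('alert_type', 'brute_force')
          event.set('severity', 'high')
        end
      "
    }
  }
}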

Notification workflow integration

Once threats are detected, you need to get alerts to the right people through the proper channels. Don't just send everything to everyone. This creates alert fatigue and slows down response times.

Set up multiple notification channels for critical alerts. Send high-priority threats through email, SMS, Slack, and your ticketing system so the message gets through even if one channel fails. Route different types of alerts to the right teams. Send database security issues to your DBA team, infrastructure problems to your ops team, and web application attacks to your security team.

For example, this Alertmanager configuration routes alerts based on severity and attack type:

route:
  group_by: ['alertname', 'client_ip']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 12h
  receiver: 'default-team'
  
  # Critical attacks get faster delivery while warnings use standard workflow
  routes:
  - match:
      severity: critical
    # Regex values require match_re rather than match
    match_re:
      alert_type: sql_injection|brute_force
    receiver: 'security-critical'
    group_interval: 1m
  - match:
      severity: warning
      alert_type: suspicious_activity
    receiver: 'security-standard'

# Notification channels, critical receiver uses multiple channels to ensure delivery during incidents
receivers:
- name: 'security-critical'
  slack_configs:
  - api_url: 'https://hooks.slack.com/services/XXX'
    channel: '#security-alerts'
    title: 'CRITICAL: {{.GroupLabels.alertname}}'
  email_configs:
  - to: 'security-oncall@company.com'
    subject: 'CRITICAL: Web App Attack Detected'
  pagerduty_configs:
  - service_key: 'YOUR_PD_KEY'
    severity: 'critical'

This configuration routes critical attacks like SQL injection and brute force attempts to the security team through multiple channels, while lower-priority activity uses single-channel delivery to reduce alert noise.

Conclusion

The challenges of managing massive log volumes, processing speeds, and diverse data formats don't have to break your budget or leave you blind to threats. By filtering, enriching, and routing data as it moves through your network, you can reduce detection times to milliseconds.

Start with the basics that deliver quick wins. Set up intelligent filtering to discard noisy logs that add no security value. Add context enrichment to your alerts, providing the information analysts need to act quickly. Then, build resilience features like active-active collectors, circuit breakers, and intelligent routing that keep everything running when systems fail.

The goal is to process data during transmission rather than after storage. This shift from store-then-process to process-in-transit is what separates teams that struggle with scale from those that handle it efficiently.

Want the latest from Onum?

  • Subscribe to our LinkedIn newsletter to stay up to date on technical best practices for building resilient and scalable observability solutions and telemetry pipelines.
