2025-07-21 · 13 min read

Log Retention: A Best Practices Guide

Learn best practices for effectively managing log retention and organization, including classifying logs by purpose, setting retention policies, and implementing access controls to prevent breaches and reduce noise in production logs.

Onum

Logs are essential to understanding what’s happening inside your systems, whether you're troubleshooting an outage, monitoring performance or meeting audit requirements. However, without the right retention strategy, they can become hard to manage and expensive.

This article shares best practices for managing log retention. These recommendations apply to production and staging environments where logs are critical for operational visibility and security analysis. By the time you reach the end, you’ll have a checklist of retention tactics you can use right away, a feel for how a dedicated platform can make the whole job easier, and a clear picture of how to dodge the usual pitfalls that appear when log files start to snowball.

Summary of key log retention best practices

  • Classify logs by purpose: Don’t treat all logs the same. Tag them based on environment and on whether they’re for auditing, debugging, or metrics.

  • Set retention policies by log type: Keep logs only for as long as they’re useful, and align retention with any compliance requirements that apply to your organization. As a rule of thumb, audit logs need to be kept for months or years, while debug logs rarely need that long.

  • Use lifecycle management: Automate cleanup. Set policies that move, archive, or delete logs over time, so you don’t have to.

  • Compress and archive long-term logs: Compress infrequently accessed logs with an efficient algorithm—Gzip or Zstd work well—and let policy-based lifecycle rules move those archives to a cold-tier object store.

  • Implement access controls: Limit who can view or delete logs to prevent mistakes or breaches.

  • Review settings regularly: Schedule time every few months to make sure your retention policies still make sense and stay aligned with compliance and legal requirements.

  • Alert on volume surges: Set up alerts that shout when something’s wrong. A sudden flood of log messages usually means one of three things: a hidden bug that just woke up, an attack pounding your app for weaknesses, or chatty code spewing more detail than anyone needs.

  • Reduce noise in production logs: Before you deploy, turn the log level down to info or warn and gate any deep-dive debug output behind a feature flag you can temporarily flip on while troubleshooting. That way, you keep production observability crisp, costs under control, and your future self grateful.

  • Standardize time across logs: Sync every server’s clock (via NTP or Chrony) and log in a single zone such as UTC so that timestamps align across services; without this, events appear out of order, distributed traces break, and investigations drag on.

  • Encrypt logs in transit and at rest: To safeguard the sensitive data often buried in application logs, apply encryption end to end: Enforce TLS on every network hop to prevent interception in transit, and enable disk- or object-level encryption (for example, AES-256 managed by your cloud provider’s KMS) as soon as logs are written to storage. This protects against man-in-the-middle attacks and positions you to meet regulatory obligations.

  • Hybrid edge-central logging: Pull filtered and analyzed logs from every edge agent into one unified repository so you can manage your entire logging footprint without hopping between silos.

Classify logs by purpose

Effective log management begins with recognizing that logs aren’t one-size-fits-all. It’s easy to start hoarding everything “just in case,” but that leads to spiraling costs and isn’t an effective way to approach observability. Instead, classify and organize your data by purpose—such as audit logs, application logs, debug traces, system metrics, and so forth—so that you can apply different retention policies to each.

Start with a brief planning session to establish your categories and naming conventions. With that foundation in place, work with stakeholders in legal, security, and compliance as well as business system owners to identify potential gaps and ensure that you are collecting logs that are useful to your organization.

Tagging logs by purpose has several benefits. One is improved filtering: You can search on specific tags, which makes pinpointing issues much more efficient. Another is automated monitoring: Monitoring systems can generate alerts based on those same tags.
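
To make this concrete, here is a minimal Python sketch of tagging log events by purpose and environment using structured (JSON) output. The class name, tag names, and messages are illustrative and not tied to any particular logging backend.

import json
import logging

class TaggedJsonFormatter(logging.Formatter):
    """Render each record as JSON, carrying purpose and environment tags."""

    def format(self, record):
        payload = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "message": record.getMessage(),
            # Tags that downstream retention and alerting rules can key on.
            "log_class": getattr(record, "log_class", "application"),
            "environment": getattr(record, "environment", "production"),
        }
        return json.dumps(payload)

logger = logging.getLogger("payments")
handler = logging.StreamHandler()
handler.setFormatter(TaggedJsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.DEBUG)

# An audit-class event: a retention policy can keep these for a year or more.
logger.info("role changed for user 42",
            extra={"log_class": "audit", "environment": "production"})

# A debug-class event: safe to expire after a couple of days.
logger.debug("cache miss for key user:42",
             extra={"log_class": "debug", "environment": "staging"})

Example sketch of purpose-tagged, structured log output in Python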

Set retention policies by log type

Once you’ve classified your logs by type, the next step is to decide exactly how long each category should live.

Audit logs often capture critical security events, user access trails, and configuration changes that regulators and auditors may request at any time. Because of this, it’s not uncommon to see audit logs retained for a year or even longer, depending on your organization's regulatory and compliance mandates. 

In many environments, it makes sense to hold onto debug logs for just a few days, which is enough time to diagnose problems without letting them pile up. Each organization operates differently, so be sure to tailor your retention strategy with efficiency in mind.

Crafting retention schedules for each system requires careful planning and engagement from the relevant stakeholders. Without the right tools, enforcing these policies across hundreds or thousands of systems can be very challenging and time-consuming. Retention management tools streamline and automate the process: Automating your policies in a central tool lets you deploy rules such as “delete all debug logs after 48 hours” or “archive audit logs monthly and purge anything older than 18 months” quickly and consistently. It also eliminates manual clean-up tasks and guards against human error. Your compliance posture stays consistent, and you won’t waste time running scripts or fielding frantic last-minute requests from auditors.
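
For illustration, here is what a minimal, home-grown enforcement script might look like in Python before you move this logic into a retention management tool. The directory layout and retention numbers are hypothetical; tune them to your own compliance requirements.

import time
from pathlib import Path

# Hypothetical retention windows, in days, keyed by log class.
RETENTION_DAYS = {
    "audit": 548,        # roughly 18 months
    "application": 90,
    "debug": 2,          # "delete all debug logs after 48 hours"
}

def purge_expired(root: Path = Path("/var/log/archive")) -> None:
    """Delete files older than their class's retention window.

    Assumes logs are laid out as <root>/<log_class>/<file>; adjust the
    layout, or archive instead of delete, to match your own policy.
    """
    now = time.time()
    for log_class, days in RETENTION_DAYS.items():
        cutoff = now - days * 86_400
        class_dir = root / log_class
        if not class_dir.is_dir():
            continue
        for log_file in class_dir.glob("*.log*"):
            if log_file.stat().st_mtime < cutoff:
                log_file.unlink()

if __name__ == "__main__":
    purge_expired()

Example sketch of class-based retention enforcement, assuming a per-class directory layout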

The longer you hang onto low-value log data, the more you pay for disk space, I/O operations, and backups. The breakdown below provides a high-level example of how to start defining retention policies by log type.

Audit / Security
  • Common sources: Identity and access management (IAM), database audit logs, firewalls and IDS/IPS products
  • Log type examples: User logins, role changes, sensitive data access, and blocked intrusion attempts
  • Use cases: Security monitoring, compliance auditing
  • Applicable standards: SOC 2 (security, privacy, confidentiality), HIPAA (security rule, audit controls), PCI DSS (Requirement 10)
  • Retention periods: SOC 2: 1 year; HIPAA: 6 years; PCI DSS: at least 1 year

Application
  • Common sources: API gateway, business logic, error logs
  • Log type examples: Request and response logs, transaction processing (e.g., payments, orders), and unhandled exceptions
  • Use cases: Troubleshooting, performance analysis, debugging
  • Applicable standards: SOC 2 (processing integrity, availability), GDPR (understanding data processing), HIPAA (PHI handling)
  • Retention periods: SOC 2: 1 year; GDPR: only as long as necessary; HIPAA: 6 years if logs contain PHI

Infrastructure
  • Common sources: Kubernetes, cloud provider (AWS CloudTrail, Azure Monitor)
  • Log type examples: Pod crashes, node failures, resource provisioning, and event logs
  • Use cases: System health, performance monitoring, capacity planning
  • Applicable standards: SOC 2 (availability, security), GDPR (system resilience), FedRAMP (system and services acquisition)
  • Retention periods: SOC 2: 1 year; GDPR: only as long as necessary; HIPAA: 6 years if logs contain PHI

Use lifecycle management

Before we dive into lifecycle management, let’s take a moment to understand the different types of data storage tiers: hot, warm, and cold. These tiers reflect how frequently data is accessed and how quickly it needs to be retrieved:

  • Hot storage is optimized for real-time access to frequently used data, typically with higher performance but higher cost. 

  • Warm storage serves data that is accessed less often but still needs to be readily available, offering a balance between cost and performance. 

  • Cold storage is used for long-term retention of infrequently accessed data; it’s slower and more cost-efficient. 

Understanding these tiers is essential because lifecycle management revolves around automatically transitioning data between them to optimize storage costs and performance over time.

Scheduling calendar reminders or recurring tasks in Jira so that engineers remember to purge old logs isn’t the best use of their time. Built-in lifecycle management tools in your cloud platform make it trivial to define rules. For example, AWS S3 can transition objects to cheaper storage tiers and then expire them automatically, as illustrated in the code snippet below, where logs transition to Standard-Infrequent Access (Standard-IA) after 30 days and then to Glacier after 90 days. Once configured, these rules run in the background without requiring human intervention.

{
  "Rules": [
    {
      "ID": "ArchiveLogsToGlacier",
      "Filter": { "Prefix": "logs/" },
      "Status": "Enabled",
      "Transitions": [
        {
          "Days": 30,
          "StorageClass": "STANDARD_IA"
        },
        {
          "Days": 90,
          "StorageClass": "GLACIER"
        }
      ],
      "Expiration": {
        "Days": 365
      }
    }
  ]
}

Example AWS S3 lifecycle policy for archiving logs to Standard-IA and Glacier

This “set-it-and-forget-it” approach saves precious engineering hours that would otherwise be spent writing one-off cleanup scripts. However, built-in lifecycle management features in logging platforms don’t always cater to specific use cases. Modern logging vendors, like Onum, can take automation further with so-called “smart expiry” policies. Rather than applying a one-size-fits-all rule, you can tag logs according to their business value or context, perhaps keeping audit and security events online for 90 days and then moving them to the archive tier for a year. With Onum’s automated lifecycle tooling, you can implement a schedule like the one outlined below.

  • Audit and security events: 90 days online, 1 year in the archive tier
  • Application logs: 60 days online, 6 months in the archive tier
  • Network and firewall logs: 30 days online, 1 year in the archive tier
  • System configuration: 180 days online, 2 years in the archive tier
  • Infrastructure access logs: 90 days online, 1 year in the archive tier

Compress and archive long-term logs

By compressing and storing log data in cold storage, you’ll pay less compared to hot storage, and it’s much easier to alert, visualize, and audit logs when they’re all in a central location.

When designing your archiving strategy for long-term logs, the choice of storage format and tier has direct consequences for both performance and cost. For example, Zstandard provides higher compression ratios but consumes more CPU during compression and decompression, so it is best suited for deep archives. Snappy, which typically achieves a 2-3x compression ratio with far less CPU overhead, is a better fit for logs that need to be accessed regularly.
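
As a small illustration of the compression step, here is a Python sketch that gzips logs destined for cold storage. Gzip is used because it ships with the standard library (for Zstandard you would reach for the third-party zstandard package), and the directory path is hypothetical.

import gzip
import shutil
from pathlib import Path

def compress_for_archive(log_path: Path) -> Path:
    """Gzip one log file and return the path of the compressed copy."""
    archived = log_path.parent / (log_path.name + ".gz")
    with log_path.open("rb") as src, gzip.open(archived, "wb", compresslevel=6) as dst:
        shutil.copyfileobj(src, dst)
    log_path.unlink()  # drop the uncompressed original once it's archived
    return archived

# Example: compress everything in a hypothetical long-term audit directory.
for path in Path("/var/log/archive/audit").glob("*.log"):
    compress_for_archive(path)

Example sketch of compressing long-term logs with Gzip before moving them to cold storage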

Manually handling the compression and migration workflow can be a burden. Platforms like Onum help automate every step. For example, with a simple retention rule, any log tagged “keep-longterm” is automatically compressed on your schedule and vaulted into your chosen archive bucket. This approach pays off when, six months from now, someone asks, “Can you show me that user’s activity from last spring?” and you can point right at your unified console.

Implement access controls

Logs can include sensitive data like tokens, IPs, and even customer information. Set up IAM policies, RBAC rules, or similar controls so that only the right people can access or manage logs. Also, don’t forget to restrict who can delete logs.
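
For example, on AWS you might enforce the “restrict who can delete logs” rule at the bucket level. The sketch below uses boto3 with placeholder bucket and role names and assumes credentials are already configured; adapt it to your own IAM model rather than treating it as a drop-in policy.

import json

import boto3  # assumes AWS credentials and region are configured in the environment

# Hypothetical bucket and admin role; substitute your own identifiers.
LOG_BUCKET = "example-company-logs"
ADMIN_ROLE_ARN = "arn:aws:iam::123456789012:role/log-admin"

# Deny log deletion for everyone except the admin role. Deny + NotPrincipal
# is a broad pattern, so review it against your own IAM setup before use.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "DenyLogDeletionExceptAdmins",
            "Effect": "Deny",
            "NotPrincipal": {"AWS": ADMIN_ROLE_ARN},
            "Action": ["s3:DeleteObject", "s3:DeleteObjectVersion"],
            "Resource": f"arn:aws:s3:::{LOG_BUCKET}/*",
        }
    ],
}

s3 = boto3.client("s3")
s3.put_bucket_policy(Bucket=LOG_BUCKET, Policy=json.dumps(policy))

Example sketch of a bucket policy that blocks log deletion for everyone except an admin role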

Real-world scenario: A junior engineer accidentally deletes logs before an incident review, a problem that could have been prevented by a simple permission policy.

Real-world fix with Onum: Give junior engineers the Viewer role in Onum: read-only access that lets them tail logs but blocks every “delete” action. Reserve the destructive permissions for an Admin. If a junior engineer tries to wipe history, the data stays put. One line of role policy prevents an incident review from turning into guesswork.

Review settings regularly

As your systems grow, new services come online, regulatory requirements evolve, and storage fees increase. In a short time, your log retention strategy can fall out of sync with reality.

Setting aside time every quarter to review your log policies can help address these issues sooner rather than later. Create a recurring task to review which logs you’re still collecting, how long you’re keeping them, and what you’re actually using. You’ll be surprised how often you uncover entire buckets of data that haven’t been touched in ages and can safely be retired.

When you can see rising trends—for example, a sudden spike in verbose debug output or a forgotten application still funneling events into your most expensive tier—you’ll catch inefficiencies before they spiral out of control. Proactive housekeeping not only keeps your logs lean and purposeful but also frees your team from scrambling to explain a big storage bill.

Alert on volume surges

A sudden, dramatic increase in log volume is often your earliest red flag that something’s gone wrong. It could be a loop churning through requests, a service streaming debug statements by mistake, or a teammate accidentally enabling verbose logging in production. If left unchecked, the influx of data can obscure the real events you care about.

Start by defining sensible thresholds (such as “fire an alert if service X’s log rate jumps 300% above its five-minute average”) or lean on pattern-recognition rules that highlight anomalous behaviors (“too many WARNs,” “sudden drop in auth successes,” and so on). With these guardrails in place, you’ll get a heads-up at the first sign of trouble.
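
A minimal sketch of such a threshold check might look like the following; the window size, surge factor, and sample counts are illustrative, and a real deployment would read counts from your metrics or log pipeline rather than an in-memory list.

from collections import deque

class LogRateMonitor:
    """Flag a surge when the latest 5-minute count jumps well above the recent average."""

    def __init__(self, window_buckets: int = 12, surge_factor: float = 3.0):
        self.history = deque(maxlen=window_buckets)  # one count per 5-minute bucket
        self.surge_factor = surge_factor

    def observe(self, count_last_5min: int) -> bool:
        surge = False
        if self.history:
            baseline = sum(self.history) / len(self.history)
            surge = baseline > 0 and count_last_5min > self.surge_factor * baseline
        self.history.append(count_last_5min)
        return surge

monitor = LogRateMonitor()
for count in [1_000, 1_100, 950, 1_050, 4_800]:  # last bucket is ~4.5x the baseline
    if monitor.observe(count):
        print(f"ALERT: log volume surge ({count} messages in 5 minutes)")

Example sketch of a simple log-rate surge check against a rolling baseline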

Using a platform like Onum, you can pair real-time anomaly detection with inline log reduction to stop the overflow at its source. As soon as Onum spots an unusual spike, it can alert your team and prune out redundant entries before they leave your servers.

Reduce noise in production logs

Be intentional about what gets logged in production. Use INFO and ERROR levels wisely and only log what adds value. 

Also, be aware of the log levels in each environment. For example, in development, capture everything for troubleshooting; in production, use INFO for routine operations and ERROR for genuine faults, only logging details that drive business value or indicate system health.
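
One lightweight way to implement this is to drive the log level from the environment, with a flag for temporary deep-dive debugging. The variable names below (APP_ENV, DEBUG_FEATURE) are illustrative, not a standard.

import logging
import os

# Verbose output stays in development; production defaults to INFO. The flag
# lets you temporarily re-enable debug detail while troubleshooting.
ENVIRONMENT = os.getenv("APP_ENV", "production")
DEBUG_FLAG = os.getenv("DEBUG_FEATURE", "off") == "on"

level = logging.DEBUG if (ENVIRONMENT == "development" or DEBUG_FLAG) else logging.INFO

logging.basicConfig(level=level,
                    format="%(asctime)s %(levelname)s %(name)s %(message)s")

log = logging.getLogger("checkout")
log.info("order 1234 processed")        # routine operation, kept in production
log.error("payment gateway timeout")    # genuine fault, always kept
log.debug("full cart payload: ...")     # emitted only in dev or with the flag on

Example sketch of environment-driven log levels with a temporary debug flag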

Platforms like Onum make this easy by letting you define policies centrally: You can move logs into your staging cluster for end-to-end testing, then apply noise suppression and automatic downgrading rules to your production nodes so irrelevant log entries get discarded before they reach the log aggregation layer. When something goes wrong in prod, your team will only see the signals that matter, resulting in faster root-cause analysis. It's a good practice to review all log levels across environments to ensure that only relevant logs are shipped to aggregation to avoid overwhelming the Onum queue.

Standardize time across logs

Once you’re aggregating logs from multiple hosts or containers, even a few seconds’ worth of clock drift can throw your whole troubleshooting session off. For example, if Service A’s server clock is ten seconds fast and Service B’s is five seconds slow, your logs won’t line up. You’ll end up guessing which event triggered which; meanwhile, valuable context slips through the cracks.

First, standardize on a timestamp format, e.g., ISO 8601 with full date, time, and time-zone information. That way, every log entry carries its own universal “when,” regardless of where it originated. Second, automate time synchronization on every node: Point your servers and containers at a reliable Network Time Protocol (NTP) service and let tools like chrony keep your clocks in lockstep.

# /etc/chrony/chrony.conf
# Use two reliable public NTP servers (or your internal pool)
pool 0.pool.ntp.org iburst
pool 1.pool.ntp.org iburst

# Allow chronyd to adjust the system clock
driftfile /var/lib/chrony/chrony.drift
rtcsync              # Sync hardware clock after adjustments
makestep 1.0 3       # If offset >1s, step the clock; limit this to first 3 updates

# Enable logging of tracking stats
log tracking measurements statistics

# Restart and enable on boot
# systemd commands (run as root):
# systemctl enable chronyd
# systemctl start chronyd

Example Chrony Configuration for Reliable NTP Time Synchronization
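
On the application side, you can pair synchronized clocks with UTC ISO 8601 timestamps in the logs themselves. Here is a minimal Python sketch; the logger name, message, and sample output are illustrative.

import logging
from datetime import datetime, timezone

class UtcIsoFormatter(logging.Formatter):
    """Emit ISO 8601 timestamps in UTC so entries line up across services."""

    def formatTime(self, record, datefmt=None):
        return datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat()

handler = logging.StreamHandler()
handler.setFormatter(UtcIsoFormatter("%(asctime)s %(levelname)s %(name)s %(message)s"))
root = logging.getLogger()
root.addHandler(handler)
root.setLevel(logging.INFO)

logging.getLogger("orders").warning("inventory check running slow")
# e.g. 2025-01-01T12:00:00.000000+00:00 WARNING orders inventory check running slow

Example sketch of a log formatter that emits UTC ISO 8601 timestamps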

In a microservices landscape, this discipline pays dividends. When you spin up hundreds of containers, every log from incoming requests to database writes becomes a smooth, timestamp-ordered narrative. You can trace requests end to end, correlate logs with distributed traces, and confidently pull precise time windows for alerts or analytics.

Encrypt logs in transit and at rest

Logs are a goldmine of operational insight, but they’re also a treasure trove of sensitive data: user identifiers, API keys, internal IPs, even PII that you’d rather keep under lock and key. Skipping encryption is a recipe for disaster. A misconfigured pipeline or a simple man-in-the-middle attack could expose credentials and private data.

The first line of defense is encryption in transit: Enforce TLS on every hop between your applications, log agents, and storage backends. For data at rest, almost every cloud provider offers server-side encryption, which you should turn on immediately. If you need extra control, bring your own keys (BYOK) and configure key rotation policies so that old keys are retired and replaced automatically. These keys can be securely stored in Azure Key Vault or AWS Secrets Manager.
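
For instance, when shipping logs to object storage on AWS, you can request KMS-backed server-side encryption on write. The bucket name, object key, and KMS alias below are placeholders, and the sketch assumes credentials are already configured.

import boto3  # assumes AWS credentials and region are configured in the environment

s3 = boto3.client("s3")  # the client talks to S3 over TLS by default

# Hypothetical bucket, object key, and KMS alias; substitute your own.
with open("app.log.gz", "rb") as body:
    s3.put_object(
        Bucket="example-company-logs",
        Key="logs/app/app.log.gz",
        Body=body,
        ServerSideEncryption="aws:kms",       # encrypt at rest with KMS
        SSEKMSKeyId="alias/log-archive-key",  # customer-managed key (BYOK)
    )

Example sketch of uploading a log archive with KMS-backed server-side encryption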

Platforms like Onum streamline all of this by enforcing encryption rules end-to-end. You define a policy like “TLS only for log transport” or “AES-256 at rest with customer-managed keys,” and Onum automatically configures your pipelines and applies the correct KMS/CMEK settings. It also audits your environments for any drifts. That way, you keep your logs both accessible and airtight without adding yet another manual checklist to your sprint backlog.

Hybrid edge-central logging

Microservices are now the de facto architecture in software development, and they offer real benefits: Teams can use the language best suited to each task and manage individual services independently. With this design, though, collating logs across microservices can be cumbersome unless you have the right tools. Logs scattered across ten services? That seems fine, at least until you find yourself troubleshooting at 2 a.m. with no idea where to look.

With Hybrid Edge-Central Logging in the Onum platform, each edge agent filters, enriches, and tags events in real time and then channels those cleaned-up logs into a single searchable repository. One query instantly reveals the full journey of any request: Source metadata, cross-service correlations, and enriched context all appear in two seconds flat. No more frantic SSH hopping at midnight or trying to work out which microservice did what.

Last thoughts

Effective log retention isn’t about wrestling with complexity; it’s about being deliberate. Start by cataloging the logs in your environment, so you know exactly what you’re collecting. Then map each type to a clear retention policy, whether days, months, or years, and use automation to archive or delete data on schedule. Don’t forget to carve out time every quarter to review your settings, surface forgotten buckets, and recalibrate as your systems and compliance landscape evolve. Follow these simple steps, and you’ll not only rein in runaway storage costs but also transform your logs from a tangle of cryptic files into a reliable, searchable source of truth.

To conclude: Plan with purpose, automate ruthlessly, and iterate regularly, and you’ll be the team that truly understands its logs, stays audit-ready, and keeps the budget under control.

Want the latest from Onum?

  • Subscribe to our LinkedIn newsletter to stay up to date on technical best practices for building resilient and scalable observability solutions and telemetry pipelines.
