3 changes: 2 additions & 1 deletion docs/administration/README.adoc
@@ -4,4 +4,5 @@
* link:clusterlogforwarder.adoc[Log Collection and Forwarding]
* Enabling event collection by link:deploy-event-router.md[Deploying the Event Router]
* link:logfilemetricexporter.adoc[Collecting Container Log Metrics]
* Example of a link:lokistack.adoc[complete Logging Solution] using LokiStack and UIPlugin
* Configuring to minimize link:high-volume-log-loss.adoc[high volume log loss]
327 changes: 327 additions & 0 deletions docs/administration/high-volume-log-loss.adoc
@@ -0,0 +1,327 @@
= High volume log loss
:doctype: article
:toc: left
:stem:

This guide explains how high log volumes in OpenShift clusters can cause log loss,
and how to configure your cluster to minimize this risk.

CAUTION: It is theoretically impossible to prevent log loss under all conditions.
You can configure log storage to avoid loss under expected average and peak loads,
and to minimize it under unexpected conditions.

WARNING: #If your data requires guaranteed delivery, *_do not send it as logs._*# +
Logs were never intended to provide guaranteed delivery or long-term storage.
Log files rotated on disk without any form of flow control are _inherently_ unreliable.
Guaranteed delivery requires modifying your application to use a reliable messaging
protocol _end-to-end_, for example Kafka, AMQP, or MQTT.

== Overview

=== Log loss

Container logs are written to `/var/log/pods`.
The forwarder reads and forwards logs as quickly as possible.
There are always some _unread logs_, written but not yet read by the forwarder.

_Kubelet_ rotates log files and deletes old files periodically to enforce per-container limits.
Kubelet and the forwarder act independently.
There is no coordination or flow-control that can ensure logs get forwarded before they are deleted.

_Log Loss_ occurs when _unread logs_ are deleted by Kubelet _before_ being read by the forwarder.
footnote:[It is also possible to lose logs _after_ forwarding; that case is not discussed here.]
Lost logs are gone from the file-system, and have not been forwarded.

=== Log rotation

Kubelet rotation parameters are:
[horizontal]
containerLogMaxSize:: Max size of a single log file (default 10MiB)
containerLogMaxFiles:: Max number of log files per container (default 5)

A container writes to one active log file.
When the active file reaches `containerLogMaxSize`, the log files are rotated:

. the old active file becomes the most recent archive
. a new active file is created
. if there are more than `containerLogMaxFiles` files, the oldest is deleted.
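
With the default values, the maximum log data retained on disk for a single container is therefore:

----
containerLogMaxSize × containerLogMaxFiles = 10MiB × 5 = 50MiB per container
----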

=== Modes of operation

[horizontal]
writeRate:: long-term average logs per second per container written to `/var/log`
sendRate:: long-term average logs per second per container forwarded to the store

During _normal operation_ sendRate keeps up with writeRate (on average).
The number of unread logs is small, and does not grow over time.

Logging is _overloaded_ when writeRate exceeds sendRate (on average) for some period of time.
This could be due to faster log writing and/or slower sending.
During overload, unread logs accumulate.
If the overload lasts long enough, log rotation may delete unread logs causing log loss.

After an overload, logging needs time to _recover_ and process the backlog of unread logs.
Until the backlog clears, the system is more vulnerable to log loss if there is another overload.
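
As a rough sketch in the terms defined above, the backlog of unread logs changes at approximately the difference between the two rates:

----
BacklogGrowthRate ≈ writeRate − sendRate   (while overloaded)
RecoveryRate      ≈ sendRate − writeRate   (after the overload ends)
----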

== Metrics for logging

Relevant metrics include:
[horizontal]
vector_*:: The `vector` process deployed by the log forwarder generates metrics for log collection, buffering and forwarding.
log_logged_bytes_total:: The `LogFileMetricExporter` measures disk writes _before_ logs are read by the forwarder.
To measure end-to-end log loss it is important to measure data that is _not_ yet read by the forwarder.
kube_*:: Metrics from the Kubernetes cluster.

=== Log File Metric Exporter

The metric `log_logged_bytes_total` is the number of bytes written to each file in `/var/log/pods` by a container.
This is independent of whether the forwarder reads or forwards the data.
To generate this metric, create a `LogFileMetricExporter`:

[,yaml]
----
apiVersion: logging.openshift.io/v1alpha1
kind: LogFileMetricExporter
metadata:
  name: instance
  namespace: openshift-logging
----
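
Once the exporter is running, a quick way to confirm the metric is being collected (and to see a rough cluster-wide write rate) is a query such as:

----
sum(rate(log_logged_bytes_total[5m]))
----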

=== Log sizes on reading and sending

The log forwarder adds metadata to the logs that it sends to the remote store, indicating where the
log came from. The size of metadata varies depending on the type of log store.

This means we cannot assume that the size of a log line written to `/var/log` matches the size of the data sent to the
store. Instead we have to estimate the _number_ of log records passing in and out of the
forwarder, to compare what is written, read, and sent.

This forces some approximations, but the overall trends are still valid.


=== Measurements from metrics

The PromQL queries below are averaged over an hour of cluster operation; you may want to take longer samples for more stable results.

NOTE: These queries focus on container logs. Node and audit logs are also forwarded and included in total sending rates,
which may cause discrepancies when comparing write vs send rates.

==== Key metrics for capacity planning

.*TotalWriteRateBytes* (bytes/sec, all containers)
----
sum(rate(log_logged_bytes_total[1h]))
----

.*TotalSendRateLogs* (logs/sec, all forwarded logs)
----
sum(rate(vector_component_sent_events_total{component_kind="sink",component_type!="prometheus_exporter"}[1h]))
----

.*LogSizeBytes* (bytes): Average size of a log record on /var/log disk
----
sum(increase(vector_component_received_bytes_total{component_type="kubernetes_logs"}[1h])) /
sum(increase(vector_component_received_events_total{component_type="kubernetes_logs"}[1h]))
----

.*MaxContainerWriteRateBytes* (bytes/sec per container): The max rate determines per-container log loss.
----
max(rate(log_logged_bytes_total[1h]))
----
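
To identify the noisiest containers (the ones that will lose logs first), a `topk` variant of the same query is useful. Note that the exact label set on `log_logged_bytes_total` may vary with the exporter version.

.*NoisiestContainers* (bytes/sec per container, top 10)
----
topk(10, rate(log_logged_bytes_total[1h]))
----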

== Recommendations

=== Estimate long-term average load

Estimate or measure your expected steady-state load and anticipated peaks, then
calculate the Kubelet rotation parameters and `/var/log` partition size accordingly.

Long-term average send rate *must* be higher than your expected long-term average write rate _including spikes_.

----
TotalWriteRateBytes < TotalSendRateLogs × LogSizeBytes
----
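
The same inequality can be checked directly in Prometheus by combining the queries above. As a sketch (subject to the node and audit log caveat noted earlier), the expression below returns a value only while the write rate exceeds the estimated send rate in bytes, that is, while logging is overloaded:

----
sum(rate(log_logged_bytes_total[1h]))
>
sum(rate(vector_component_sent_events_total{component_kind="sink",component_type!="prometheus_exporter"}[1h]))
*
(
  sum(increase(vector_component_received_bytes_total{component_type="kubernetes_logs"}[1h]))
  /
  sum(increase(vector_component_received_events_total{component_type="kubernetes_logs"}[1h]))
)
----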

=== Handling outages

To handle an outage of length `MaxOutageTime` without loss, the per-container storage required is:

----
PerContainerSizeBytes = MaxOutageTime × MaxContainerWriteRateBytes
----

Configure kubelet so that:

----
PerContainerSizeBytes < containerLogMaxSize × containerLogMaxFiles
----
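
Rearranged, this gives a lower bound for `containerLogMaxSize` once a file count is chosen:

----
containerLogMaxSize > (MaxOutageTime × MaxContainerWriteRateBytes) / containerLogMaxFiles
----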

The minimum disk required to manage an outage is:

----
DiskTotalSize = MaxOutageTime × TotalWriteRateBytes × SafetyFactor
----

IMPORTANT: Rotation parameters use the _max_ per-container rate to avoid loss for noisy containers.
Not all containers will actually use this much disk space.
The total disk size requirement is based on the total rate over all containers.

You need to include spikes in the long-term average, so that you have capacity to _recover_ after a spike.
The recovery time to clear the backlog from a maximum-length outage is:

----
RecoveryTime = (MaxOutageTime × TotalWriteRateBytes) / (TotalSendRateLogs × LogSizeBytes)
----

=== Capacity planning example

Consider a cluster with default kubelet settings:

- 5 files × 10MiB ≈ 50MB per container

During a 3-minute outage with no forwarding, log loss starts once a container has written 50MB, which corresponds to a write rate of:

----
Container writeRate = 50MB / 3min ≈ 17MB/min
----

To handle this load for 1 hour without loss, increase kubelet limits:

----
containerLogMaxSize: 100MB
containerLogMaxFiles: 10
Total capacity: 1GB per container
----

The total disk size required is:
----
MaxOutageTime × TotalWriteRateBytes × SafetyFactor
----

Note the 1GB capacity above reflects the _noisiest_ containers that begin to lose logs first,
but the disk requirement is based on the overall total write rate.
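
For illustration only, with hypothetical cluster-wide figures of TotalWriteRateBytes = 5MB/s, a send rate of 4MB/s (TotalSendRateLogs × LogSizeBytes), and SafetyFactor = 2, a 1-hour maximum outage gives:

----
DiskTotalSize = 3600s × 5MB/s × 2 = 36GB
RecoveryTime  = (3600s × 5MB/s) / 4MB/s = 4500s = 75min
----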

=== Configure kubelet log limits

For environments that support KubeletConfig (OpenShift 4.6+):

[,yaml]
----
apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: increase-log-limits
spec:
  machineConfigPoolSelector:
    matchLabels:
      machineconfiguration.openshift.io/role: worker
  kubeletConfig:
    containerLogMaxSize: 50Mi
    containerLogMaxFiles: 10
----

This configuration provides `50Mi × 10 files = 500MiB` per container.

You can also configure `MachineConfig` resources directly for older versions of OpenShift that don't support `KubeletConfig`.

=== Apply and verify configuration

*To apply the KubeletConfig:*
[,bash]
----
# Apply the configuration
oc apply -f kubelet-log-limits.yaml

# Monitor the rollout (this will cause node reboots)
oc get kubeletconfig
oc get mcp -w
----

*To verify the configuration is active:*
[,bash]
----
# Check that all nodes are updated
oc get nodes

# Verify the kubelet configuration on a node
oc debug node/<node-name>
chroot /host
grep -E "(containerLogMaxSize|containerLogMaxFiles)" /etc/kubernetes/kubelet/kubelet.conf

# List container log file sizes to confirm the rotation limits in effect
find /var/log/pods -name "*.log" -exec ls -lh {} \; | head -20
----

The configuration rollout typically takes 10-20 minutes as nodes are updated in a rolling fashion.

== Why not use large buffers in the forwarder?

An alternative approach is to use large buffers managed by the forwarder, rather than increasing kubelet limits.
This seems attractive at first, but has significant drawbacks.

=== Duplication of logs

The naive expectation is that a forwarder buffer increases the total log buffer space to `/var/log + forwarder-buffer`.
This is not the case.

When the forwarder reads a log record, it _remains in `/var/log`_ until it is deleted by rotation.
This means the forwarder buffer will usually be full of log data that is _still available in the log files_.

Contributor comment: This is a true statement, but we need to consider Vector's checkpoint system. Technically the data is 'available', and it is correct that it would be in both the original location and the disk buffer, but it is effectively "gone" unless the checkpoint is reset. We are unable to utilize ack/nack since the kubernetes_logs source does not support it.

Thus the _effective_ space for logs is more like `max(/var/log, forwarder-buffer)`.

Contributor comment: Unless we modify our buffer settings, it is always smaller than /var/log because the default is 256MB per output per collector. The difference is that we have "rescued" the log from being rotated off before it can be collected. I think we need to be careful when saying "0" protection.

If the buffer is smaller than `/var/log`, it consumes disk space while adding only limited protection from log loss.
If it is bigger, you get some extra buffer space, at the expense of roughly doubling the disk storage used for log data.

It is _much_ more efficient to expand the rotation limits for logs stored on `/var/log`.

=== Buffer design mismatch

Forwarder buffers are optimized for transmitting data efficiently, based on characteristics of the remote store.

- *Intended purpose:* Hold records that are ready-to-send or in-flight awaiting acknowledgement.
- *Typical timeframe:* Seconds to minutes of buffering for round-trip request/response times.
- *Not designed for:* Hours or days of log accumulation during extended outages.

=== Supporting other logging tools

Expanding `/var/log` benefits _any_ logging tool, including:

- `oc logs` for local debugging, or if there is a problem with log collection.
- Standard Unix tools used when debugging by `rsh` to the node.

Expanding forwarder buffers only benefits the forwarder, and costs more in disk space.

Contributor comment: I would add something to the effect that it is buffered in a component-dependent format (i.e. compression, encoding).

If you deploy multiple forwarders, each additional forwarder will need its own buffer space.

Contributor comment: Each output for each forwarder has its own buffer.

If you expand `/var/log`, all forwarders share the same storage.

=== Risk of self-sabotage

Filling the `/var/log` partition is a near-fatal problem.
Forwarder buffers are stored in `/var/lib`, but in typical deployments they share the same disk partition.
Making forwarder buffers very large (large enough to handle global logging overload) creates additional risk of filling the partition.

Modifying only the kubelet parameters means there is a single configuration, and a single calculation, for limiting the use of `/var/log`.

=== Additional considerations

*Multiple forwarders:* Deploying N forwarders multiplies all buffer-related problems by N, while expanding `/var/log` benefits all forwarders equally.

*Persistent volume buffers:* A PV is not a local disk; it is a network service. Using PVs for buffer storage still duplicates data, and introduces network dependencies that can cause additional reliability and performance issues.

== Summary

1. *Monitor log patterns:* Use Prometheus metrics to measure log generation and send rates.
2. *Calculate storage requirements:* Account for peak periods, recovery time, and overlapping spikes.
3. *Increase Kubelet log rotation limits:* Allow greater storage for noisy containers.
4. *Plan for peak scenarios:* Size storage to handle expected peak patterns without loss.

Note that the Observe > Dashboards section of the OpenShift console includes some helpful log-related dashboards.

The optimal configuration balances disk space usage with your specific operational patterns and risk tolerance.

== Limitations

This analysis focuses primarily on container logs stored in `/var/log/pods`.
The following logs are not included in the write rate calculations but are included in send rate metrics:

* Node-level logs (journal, systemd, audit)
* API Audit logs

This can cause apparent discrepancies when comparing total write rates versus send rates. The fundamental principles and recommendations still apply, but you may need to account for this additional log volume in your capacity planning.