diff --git a/docs/administration/README.adoc b/docs/administration/README.adoc
index f41c5fdc2..d1390cec5 100644
--- a/docs/administration/README.adoc
+++ b/docs/administration/README.adoc
@@ -4,4 +4,5 @@
 * link:clusterlogforwarder.adoc[Log Collection and Forwarding]
 * Enabling event collection by link:deploy-event-router.md[Deploying the Event Router]
 * link:logfilemetricexporter.adoc[Collecting Container Log Metrics]
-* Example of a link:lokistack.adoc[complete Logging Solution] using LokiStack and UIPlugin
\ No newline at end of file
+* Example of a link:lokistack.adoc[complete Logging Solution] using LokiStack and UIPlugin
+* Configuring to minimize link:high-volume-log-loss.adoc[high volume log loss]
diff --git a/docs/administration/high-volume-log-loss.adoc b/docs/administration/high-volume-log-loss.adoc
new file mode 100644
index 000000000..736b5caaf
--- /dev/null
+++ b/docs/administration/high-volume-log-loss.adoc
@@ -0,0 +1,327 @@
= High volume log loss
:doctype: article
:toc: left
:stem:

This guide explains how high log volumes in OpenShift clusters can cause log loss,
and how to configure your cluster to minimize this risk.

CAUTION: It is impossible to prevent log loss under all conditions.
You can configure log storage to avoid loss under expected average and peak loads,
and to minimize it under unexpected conditions.

WARNING: #If your data requires guaranteed delivery, *_do not send it as logs_*#

Logs were never intended to provide guaranteed delivery or long-term storage.
Log files on disk, rotated without any form of flow control, are _inherently_ unreliable.
Guaranteed delivery requires modifying your application to use a reliable messaging
protocol _end-to-end_, for example Kafka, AMQP, or MQTT.

== Overview

=== Log loss

Container logs are written to `/var/log/pods`.
The forwarder reads and forwards logs as quickly as possible.
There are always some _unread logs_: logs that have been written but not yet read by the forwarder.

_Kubelet_ rotates log files and periodically deletes old files to enforce per-container limits.
Kubelet and the forwarder act independently.
There is no coordination or flow control to ensure that logs are forwarded before they are deleted.

_Log loss_ occurs when _unread logs_ are deleted by Kubelet _before_ being read by the forwarder.footnote:[It is also possible to lose logs _after_ forwarding; that is not discussed here.]
Lost logs are gone from the file system and have not been forwarded.

=== Log rotation

The Kubelet rotation parameters are:
[horizontal]
containerLogMaxSize:: Maximum size of a single log file (default 10MiB)
containerLogMaxFiles:: Maximum number of log files per container (default 5)

A container writes to one active log file.
When the active file reaches `containerLogMaxSize`, the log files are rotated:

. The old active file becomes the most recent archive.
. A new active file is created.
. If there are more than `containerLogMaxFiles` files, the oldest is deleted.

=== Modes of operation

[horizontal]
writeRate:: Long-term average logs per second, per container, written to `/var/log`
sendRate:: Long-term average logs per second, per container, forwarded to the store

During _normal operation_, sendRate keeps up with writeRate (on average).
The number of unread logs is small and does not grow over time.

Logging is _overloaded_ when writeRate exceeds sendRate (on average) for some period of time.
This can be caused by faster log writing, slower sending, or both.
During an overload, unread logs accumulate.
If the overload lasts long enough, log rotation may delete unread logs, causing log loss.

After an overload, logging needs time to _recover_ and process the backlog of unread logs.
Until the backlog clears, the system is more vulnerable to log loss if another overload occurs.
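The metrics described in the next section can be used to watch for the overloaded mode directly.
As a rough sketch, the following PromQL expression is positive while the write rate exceeds the estimated send rate in bytes per second.
It assumes the default Vector-based forwarder and the `LogFileMetricExporter` described below; the 15-minute window is only an example.

.Approximate overload indicator (bytes/sec)
----
sum(rate(log_logged_bytes_total[15m]))
-
sum(rate(vector_component_sent_events_total{component_kind="sink",component_type!="prometheus_exporter"}[15m]))
* (
    sum(increase(vector_component_received_bytes_total{component_type="kubernetes_logs"}[15m]))
    /
    sum(increase(vector_component_received_events_total{component_type="kubernetes_logs"}[15m]))
  )
----

Sustained positive values indicate an overload; the value is roughly the rate at which unread logs are accumulating.
Because the send rate also counts node and audit logs, the send side is overestimated and brief overloads may be under-reported.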
== Metrics for logging

Relevant metrics include:
[horizontal]
vector_*:: The `vector` process deployed by the log forwarder generates metrics for log collection, buffering, and forwarding.
log_logged_bytes_total:: The `LogFileMetricExporter` measures disk writes _before_ logs are read by the forwarder.
  To measure end-to-end log loss, it is important to measure data that has _not_ yet been read by the forwarder.
kube_*:: Metrics from the Kubernetes cluster.

=== Log File Metric Exporter

The metric `log_logged_bytes_total` is the number of bytes written to each file in `/var/log/pods` by a container.
It is independent of whether the forwarder reads or forwards the data.
To generate this metric, create a `LogFileMetricExporter`:

[,yaml]
----
apiVersion: logging.openshift.io/v1alpha1
kind: LogFileMetricExporter
metadata:
  name: instance
  namespace: openshift-logging
----

=== Log sizes on reading and sending

The log forwarder adds metadata to the logs that it sends to the remote store, indicating where each
log came from. The size of this metadata varies depending on the type of log store.

This means we cannot assume that the size of a log line written to `/var/log` matches the size of the
data sent to the store. Instead, we estimate the _number_ of log records passing in and out of the
forwarder to compare what is written, read, and sent.

This forces some approximations, but the overall trends are still valid.

=== Measurements from metrics

The PromQL queries below are averaged over an hour of cluster operation; you may want to take longer samples for more stable results.

NOTE: These queries focus on container logs. Node and audit logs are also forwarded and included in the total send rates,
which may cause discrepancies when comparing write and send rates.

==== Key metrics for capacity planning

.*TotalWriteRateBytes* (bytes/sec, all containers)
----
sum(rate(log_logged_bytes_total[1h]))
----

.*TotalSendRateEvents* (events/sec, all containers)
----
sum(rate(vector_component_sent_events_total{component_kind="sink",component_type!="prometheus_exporter"}[1h]))
----

.*LogSizeBytes* (bytes): average size of a log record on the /var/log disk
----
sum(increase(vector_component_received_bytes_total{component_type="kubernetes_logs"}[1h])) /
sum(increase(vector_component_received_events_total{component_type="kubernetes_logs"}[1h]))
----

.*MaxContainerWriteRateBytes* (bytes/sec, per container): the maximum per-container rate determines per-container log loss
----
max(rate(log_logged_bytes_total[1h]))
----

== Recommendations

=== Estimate long-term average load

Estimate and/or measure your expected steady-state load and anticipated peaks, then
calculate the Kubelet rotation parameters and the size of the /var/log partition accordingly.

The long-term average send rate *must* be higher than your expected long-term average write rate, _including spikes_:

----
TotalWriteRateBytes < TotalSendRateEvents × LogSizeBytes
----
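To include spikes as well as the long-term average, you can measure peak short-term write rates with PromQL subqueries.
This is a sketch; the one-week range and five-minute resolution are assumptions to adjust for your environment:

----
# Peak 5-minute total write rate over the past week (bytes/sec)
max_over_time(sum(rate(log_logged_bytes_total[5m]))[1w:5m])

# Peak 5-minute write rate of the noisiest single container (bytes/sec)
max_over_time(max(rate(log_logged_bytes_total[5m]))[1w:5m])
----

The first value is a spike-inclusive input for the total disk sizing below; the second is a spike-inclusive
estimate of `MaxContainerWriteRateBytes` for the per-container calculations in the next section.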
=== Handling outages

To handle an outage of length `MaxOutageTime` without loss, the per-container storage required is:

----
PerContainerSizeBytes = MaxOutageTime × MaxContainerWriteRateBytes
----

Configure kubelet so that:

----
PerContainerSizeBytes < containerLogMaxSize × containerLogMaxFiles
----

The minimum disk space required to ride out an outage is:

----
DiskTotalSize = MaxOutageTime × TotalWriteRateBytes × SafetyFactor
----

IMPORTANT: The rotation parameters use the _maximum_ per-container rate to avoid loss from noisy containers.
Not all containers will actually use this much disk space.
The total disk size requirement is based on the total rate over all containers.

Include spikes in the long-term average so that there is capacity to _recover_ after a spike.
The recovery time to clear the backlog from a maximum-length outage is:

----
RecoveryTime = (MaxOutageTime × TotalWriteRateBytes) / (TotalSendRateEvents × LogSizeBytes)
----

=== Capacity planning example

Consider a cluster with default kubelet settings:

- 5 files × 10MiB ≈ 50MB per container

For a 3-minute outage with no forwarding, log loss occurs when 50MB is written in 3 minutes:

----
Container writeRate = 50MB / 3min ≈ 17MB/min
----

To handle this load for 1 hour without loss, increase the kubelet limits:

----
containerLogMaxSize: 100MB
containerLogMaxFiles: 10
Total capacity: 1GB per container
----

The total disk size is:

----
MaxOutageTime × TotalWriteRateBytes × SafetyFactor
----

Note that the 1GB capacity above reflects the _noisiest_ containers, which begin to lose logs first,
but the disk requirement is based on the overall total write rate.

=== Configure kubelet log limits

For environments that support `KubeletConfig` (OpenShift 4.6+):

[,yaml]
----
apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: increase-log-limits
spec:
  machineConfigPoolSelector:
    matchLabels:
      machineconfiguration.openshift.io/role: worker
  kubeletConfig:
    containerLogMaxSize: 50Mi
    containerLogMaxFiles: 10
----

This configuration provides `50MiB × 10 files = 500MiB` per container.

You can also configure `MachineConfig` resources directly for older versions of OpenShift that do not support `KubeletConfig`.

=== Apply and verify configuration

*To apply the KubeletConfig:*
[,bash]
----
# Apply the configuration
oc apply -f kubelet-log-limits.yaml

# Monitor the rollout (this will cause node reboots)
oc get kubeletconfig
oc get mcp -w
----

*To verify the configuration is active:*
[,bash]
----
# Check that all nodes are updated
oc get nodes

# Verify the kubelet configuration on a node
oc debug node/<node-name>
chroot /host
grep -E "(containerLogMaxSize|containerLogMaxFiles)" /etc/kubernetes/kubelet/kubelet.conf

# Check the sizes of per-container log files
find /var/log/pods -name "*.log" -exec ls -lah {} \; | head -20
----

The configuration rollout typically takes 10-20 minutes as nodes are updated in a rolling fashion.
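=== Monitor for sustained overload

Once the new limits are in place, it can be useful to alert when the collector falls behind for a sustained period,
using the write-versus-send comparison shown earlier. The following `PrometheusRule` is a sketch, not a supported
configuration: the rule name, window, `for:` duration, and severity are assumptions, and whether rules in the
`openshift-logging` namespace are discovered depends on how monitoring is configured in your cluster.

[,yaml]
----
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: log-forwarder-overload   # hypothetical name
  namespace: openshift-logging
spec:
  groups:
    - name: log-forwarding
      rules:
        - alert: LogForwarderOverloaded
          # Fires when the container log write rate exceeds the estimated
          # send rate (in bytes/sec) for 30 minutes.
          expr: |
            sum(rate(log_logged_bytes_total[15m]))
            >
            sum(rate(vector_component_sent_events_total{component_kind="sink",component_type!="prometheus_exporter"}[15m]))
            * (
                sum(increase(vector_component_received_bytes_total{component_type="kubernetes_logs"}[15m]))
                /
                sum(increase(vector_component_received_events_total{component_type="kubernetes_logs"}[15m]))
              )
          for: 30m
          labels:
            severity: warning
          annotations:
            summary: Container logs are being written faster than they are forwarded; unread logs are accumulating and may be lost.
----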
== Why not use large buffers in the forwarder?

An alternative approach is to use large buffers managed by the forwarder, rather than increasing the kubelet limits.
This seems attractive at first, but it has significant drawbacks.

=== Duplication of logs

The naive expectation is that a forwarder buffer increases the total log buffer space to `/var/log + forwarder-buffer`.
This is not the case.

When the forwarder reads a log record, the record _remains in `/var/log`_ until it is deleted by rotation.
This means the forwarder buffer is usually full of log data that is _still available in the log files_.

Thus the _effective_ space for logs is closer to `max(/var/log, forwarder-buffer)`.
If the buffer is smaller than `/var/log`, it uses disk space but provides no additional protection from log loss.
If it is bigger, you get some extra buffer at the expense of roughly doubling the disk storage required for logs.

It is _much_ more efficient to expand the rotation limits for logs stored in `/var/log`.

=== Buffer design mismatch

Forwarder buffers are optimized for transmitting data efficiently, based on the characteristics of the remote store.

- *Intended purpose:* Hold records that are ready to send or in flight, awaiting acknowledgement.
- *Typical timeframe:* Seconds to minutes of buffering, to cover round-trip request/response times.
- *Not designed for:* Hours or days of log accumulation during extended outages.

=== Supporting other logging tools

Expanding `/var/log` benefits _any_ logging tool, including:

- `oc logs` for local debugging, or if there is a problem with log collection.
- Standard Unix tools used when debugging with `rsh` on the node.

Expanding forwarder buffers only benefits the forwarder, and costs more in disk space.

If you deploy multiple forwarders, each additional forwarder needs its own buffer space.
If you expand `/var/log`, all forwarders share the same storage.

=== Risk of self-sabotage

Filling the `/var/log` partition is a near-fatal problem.
Forwarder buffers are stored in `/var/lib`, but in typical deployments they share the same disk partition.
Making forwarder buffers very large (large enough to handle a global logging overload) creates additional risk of filling the partition.

Modifying the kubelet parameters means there is only a _single_ configuration and calculation needed to limit the use of `/var/log`.

=== Additional considerations

*Multiple forwarders:* Deploying N forwarders multiplies all buffer-related problems by N, while expanding `/var/log` benefits all forwarders equally.

*Persistent volume buffers:* A PV is not a local disk; it is a network service. Using PVs for buffer storage still duplicates data, and introduces network dependencies that can cause new reliability and performance issues.

== Summary

. *Monitor log patterns:* Use Prometheus metrics to measure log generation and send rates.
. *Calculate storage requirements:* Account for peak periods, recovery time, and overlapping spikes.
. *Increase Kubelet log rotation limits:* Allow more storage for noisy containers.
. *Plan for peak scenarios:* Size storage to handle expected peak patterns without loss.

Note that the Observe > Dashboards section of the OpenShift console includes some helpful log-related dashboards.

The optimal configuration balances disk space usage against your specific operational patterns and risk tolerance.

== Limitations

This analysis focuses primarily on container logs stored in `/var/log/pods`.
The following logs are not included in the write-rate calculations, but are included in the send-rate metrics:

* Node-level logs (journal, systemd, audit)
* API audit logs

This can cause apparent discrepancies when comparing total write rates with send rates.
The fundamental principles and recommendations still apply, but you may need to account for this additional log volume in your capacity planning.
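To estimate how much of the forwarded volume comes from sources other than container logs, you can break down the
collector's received events by source type. This is a sketch assuming the default Vector-based forwarder; the
`component_type` values depend on your pipeline, and the `internal_metrics` source should be ignored:

----
sum by (component_type) (rate(vector_component_received_events_total{component_kind="source"}[1h]))
----

Comparing the `kubernetes_logs` series with the other sources gives a rough correction factor when relating the
write-rate and send-rate queries above.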