Cache entries can grow unbounded #112

@casperisfine

Initially reported in Shopify/shipit-engine#935 by @josacar

Context

Shipit is a deploy tool; as such, it frequently hits the same endpoints to grab updates, e.g. GET https://api.github.com/v3/repos/my-org/my-repo/commits?branch=master.

Additionally it uses the GitHub App API for authentication, so its Authorization header changes every hour.

Finally, GitHub's API responses specify the following Vary header: Accept, Authorization, Cookie, X-GitHub-OTP.

Problem

Since the URL never changes, the same cache entry is reused every time. However, since the Authorization token changes every hour, new entries are regularly prepended until the cache entry becomes huge and starts taking a long time to deserialize, or even exceeds the cache store's size limit (memcached limits values to 1MB).
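To make the failure mode concrete, here is a minimal sketch in plain Ruby (illustrative only, not the actual Faraday::HttpCache internals) of how varying responses pile up under a single URL key when the token rotates hourly:

```ruby
# All responses for one URL share a single cache entry, keyed by URL only.
cache = Hash.new { |h, k| h[k] = [] }
url = 'https://api.github.com/v3/repos/my-org/my-repo/commits?branch=master'

# Every hour the GitHub App token rotates. Because the response says
# "Vary: Authorization", each new token prepends a fresh entry under
# the same key; nothing is ever evicted.
24.times do |hour|
  auth  = "Bearer token-#{hour}"
  entry = { vary: { 'Authorization' => auth }, body: 'response body' }
  cache[url].unshift(entry) # prepend, list only ever grows
end

cache[url].size # 24 entries after a single day, and still growing
```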

Solutions

I can see several solutions to this, with very different tradeoffs.

Very simple: set a cap on the number of entries per key

Self-explanatory: we could simply set a configurable upper limit on the number of entries and drop the oldest entries whenever we go over it.

It's an easy fix. It's not quite ideal, as it might surprise people, but IMHO it's better to be surprised by cache evictions than by cache OOM / corruption (in case of memcached truncation).
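A minimal sketch of what the cap could look like (MAX_ENTRIES and prepend_with_cap are hypothetical names, not existing API):

```ruby
# Assumed configurable upper bound on entries stored under one key.
MAX_ENTRIES = 5

# Prepend the new entry, then truncate: entries beyond the cap
# (i.e. the oldest ones, at the tail) are dropped.
def prepend_with_cap(entries, new_entry, cap = MAX_ENTRIES)
  ([new_entry] + entries).first(cap)
end

entries = []
10.times { |i| entries = prepend_with_cap(entries, { token: i }) }
entries.size # capped at 5, newest first
```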

Somewhat simple: evict stale entries

When prepending a new entry, we could evict all the stale ones. That's easy if max-age / Expires is set; if it isn't, we have to rely on heuristics as specified in the RFC.
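A rough sketch of the eviction-on-write idea for the easy case (max-age present), assuming each entry records its response headers and storage time; all names here are hypothetical:

```ruby
# Freshness check based on Cache-Control max-age only; the heuristic
# fallback from the RFC is left out of this sketch.
def fresh?(entry, now = Time.now)
  max_age = entry[:headers]['Cache-Control'].to_s[/max-age=(\d+)/, 1].to_i
  (now - entry[:stored_at]) < max_age
end

# On write, keep only entries that are still fresh.
def prepend_evicting_stale(entries, new_entry, now = Time.now)
  [new_entry] + entries.select { |e| fresh?(e, now) }
end

now = Time.now
stale_entry = { headers: { 'Cache-Control' => 'max-age=60' },   stored_at: now - 120 }
fresh_entry = { headers: { 'Cache-Control' => 'max-age=3600' }, stored_at: now - 60 }
new_entry   = { headers: { 'Cache-Control' => 'max-age=3600' }, stored_at: now }

result = prepend_evicting_stale([fresh_entry, stale_entry], new_entry, now)
result.size # the stale entry is gone, leaving 2
```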

Harder, and potentially slower: store varying responses under different keys

Since Faraday::HttpCache implicitly relies on the underlying cache store for garbage collection, the idea here would be to store individual responses under different keys, so that rarely accessed ones can be garbage collected.

The very tricky part is that the Vary is in the response, so it means maintaining some kind of non-trivial index of which Vary values were returned for that URL, and then using that index to select a good candidate for revalidation.

Additionally it means that it will at least double the number of cache lookups, which is not a small downside.
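A sketch of the split-key layout (all names are illustrative, not the real Faraday::HttpCache internals): the URL key holds only a small index of the Vary headers seen for that endpoint, while each varying response lives under its own derived key, so the store can evict rarely used responses independently.

```ruby
require 'digest'

# Derive a per-response key from the URL plus the request headers
# named by the response's Vary header.
def response_key(url, vary_headers, request_headers)
  varying = vary_headers.map { |h| "#{h}=#{request_headers[h]}" }.join('&')
  Digest::SHA1.hexdigest("#{url}|#{varying}")
end

store = {} # stands in for the underlying cache store
url  = 'https://api.github.com/v3/repos/my-org/my-repo/commits?branch=master'
vary = %w[Authorization]
req  = { 'Authorization' => 'Bearer token-1' }

key = response_key(url, vary, req)
store[url] = { vary: vary }            # lookup 1: the index under the URL
store[key] = { body: 'response body' } # lookup 2: the actual response
```

Reading a response back then requires two lookups (index first, then response), which is exactly the doubling of cache lookups noted above.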

cc @rafaelfranca @Edouard-chin @josacar
