clarifications on intermediate result caching and load times #728

@nwoolmer

Description

Intermediate result caching

There are some impressive hot run results from SiriusDB, who have done some cool engineering to integrate GPU-accelerated queries with DuckDB. Kudos to them, and it will be interesting to see how the commoditisation of GPU compute may (or may not) increase the viability of GPU-accelerated databases.

However, it does raise a question about a grey area that still exists, which (this time) impacts hot runs too.

https://github.com/sirius-db/sirius/blob/75ad0a274150deb338c77760999993b20a0bcff7/docs/README.md?plain=1#L224-L235

The GPU caching region is a memory region where the raw data is stored in GPUs, whereas the GPU processing region is where intermediate results are stored in GPUs (hash tables, join results, etc.).
For example, to set the caching region as 1 GB and the processing region as 2 GB, we can run the following command:

Quite sensibly, the raw data is cached on the GPU, but also intermediate results. The ClickBench guidelines state the following:

General rules regarding caching:

    Query result caches should be disabled.
    Caching source data (e.g. buffer pools) is fine.
    Caches for intermediate data (e.g. hash tables) are generally okay, however if such caches are located near the end of the query execution pipeline, the effects are similar to query result caching and such caches should thus be disabled.

This doesn't explicitly rule out operator caching, but it does rule it out when such a cache effectively becomes a result cache.

In the benchmark scripts, 80 GB is allocated to raw data, and 40 GB to intermediate results:

GPU_CACHING_SIZE='80 GB'
GPU_PROCESSING_SIZE='40 GB'
CPU_PROCESSING_SIZE="100 GB"

There is also a 100 GB CPU region with similar naming; does it cache intermediate results too?

In another case, there were previous question marks about the naming of another database's execution plan cache, which does not cache results or intermediate data: #662

Other databases could certainly improve their numbers by caching partially executed operator data, yet it is unclear which operators would be acceptable to cache and which would not.

I am curious what kind of partially-executed query data is cached and how this impacts the results, and whether this can be accepted for other databases.

Load times

I am finding personally that the load time metric has become less useful. There is prior art on this issue: #678

For Cloud systems, this is a difficult metric to compare against, especially since many offerings substitute replication for fsync durability.

Additionally, the dataset is static, and not representative of use cases where ingestion is critical, i.e. real-time analytics.

I am wondering if there is more useful profiling data that could be presented instead, for example CPU impact, memory usage, and so on.
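As one possible shape for such a metric, here is a hedged sketch of how a benchmark harness could report CPU time and peak memory for a load step alongside wall-clock time. `load_dataset` is a hypothetical stand-in for a database's ingestion step, and the measurement approach (Python's `tracemalloc` and `time.process_time`) is just one option:

```python
# Sketch: report CPU time and peak memory for a load step, not just wall time.
import time
import tracemalloc

def load_dataset():
    # Stand-in ingestion workload: build an in-memory table of rows.
    return [(i, str(i)) for i in range(100_000)]

def profile_load(load_fn):
    tracemalloc.start()
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    load_fn()
    stats = {
        "wall_s": time.perf_counter() - wall_start,
        "cpu_s": time.process_time() - cpu_start,
        # Peak Python-level allocation during the load, in bytes.
        "peak_mem_bytes": tracemalloc.get_traced_memory()[1],
    }
    tracemalloc.stop()
    return stats
```

For a real system the harness would sample process-level metrics (RSS, CPU utilisation) rather than interpreter allocations, but the reported shape would be similar.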
