Caching

Crunchy Data Warehouse allows you to query files stored in object storage directly, and local caching is a critical feature for making those queries fast.

When files are cached on the database server, your queries will access them locally instead of downloading them from object storage repeatedly. It's crucial to provision your warehouse cluster with sufficient local storage to meet the caching needs of your use case (see plans and pricing for details).

How caching works

When you first query a file, the caching layer starts copying its data to your instance to speed up subsequent queries on that file. Once the files you're working with are cached, queries read them locally rather than from object storage, which improves performance significantly.

When reading Iceberg or lake analytics tables, Crunchy Data Warehouse uses range requests to minimize the amount of I/O during the query. In the background, it also starts downloading the entire file into block storage, which is a local NVMe SSD. Once the file is downloaded, new requests use the local copy, which significantly speeds up queries. When the cache drive is full, the least recently accessed files are removed to make space.
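The eviction rule described above (remove the least recently accessed files when the drive is full) can be illustrated with a small model. This is a minimal sketch, not the warehouse's actual implementation: the class `LRUFileCache`, its methods, the byte sizes, and the bucket paths are all hypothetical names chosen for illustration.

```python
from collections import OrderedDict


class LRUFileCache:
    """Toy model of a size-bounded cache that evicts the least recently accessed file.

    Hypothetical illustration only; it does not reflect the product's internals.
    """

    def __init__(self, capacity_bytes):
        self.capacity_bytes = capacity_bytes
        self.used_bytes = 0
        self._files = OrderedDict()  # path -> size in bytes, oldest access first

    def access(self, path, size_bytes):
        """Record a read of `path`; return True on a cache hit, False on a miss."""
        if path in self._files:
            self._files.move_to_end(path)  # mark as most recently accessed
            return True
        if size_bytes > self.capacity_bytes:
            return False  # the file can never fit, so it is not cached
        # Miss: evict least recently accessed files until the new file fits.
        while self.used_bytes + size_bytes > self.capacity_bytes:
            _, evicted_size = self._files.popitem(last=False)
            self.used_bytes -= evicted_size
        self._files[path] = size_bytes
        self.used_bytes += size_bytes
        return False

    def cached_paths(self):
        """Return cached paths ordered from least to most recently accessed."""
        return list(self._files)


cache = LRUFileCache(capacity_bytes=100)
cache.access("s3://bucket/a.parquet", 40)  # miss, now cached
cache.access("s3://bucket/b.parquet", 40)  # miss, now cached
cache.access("s3://bucket/a.parquet", 40)  # hit, a becomes most recent
cache.access("s3://bucket/c.parquet", 40)  # miss, evicts b to make room
print(cache.cached_paths())
# ['s3://bucket/a.parquet', 's3://bucket/c.parquet']
```

In this model, re-reading `a.parquet` protects it from eviction, so the file that has gone longest without being read (`b.parquet`) is the one removed when space runs out.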

Crunchy Data Warehouse has what is known as a “write-through” cache, which means that files written to the backing storage are immediately written to cache as well. This is especially advantageous for Iceberg, because it means we can immediately read back everything we wrote from the cache, often with millisecond latency. At the same time, the data is durably stored in object storage.

When the server restarts or fails over, the cache has to be repopulated. As a result, queries are likely to be slower immediately after a restart, until the files they read have been cached again.
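The write-through behavior and the effect of a restart can be modeled in a few lines. This is a simplified sketch under stated assumptions: two dictionaries stand in for object storage and the local cache drive, and the class `WriteThroughCache` and its methods are hypothetical, not part of the product.

```python
class WriteThroughCache:
    """Toy write-through cache: every write goes to both the durable store and the cache.

    Hypothetical illustration only; it does not reflect the product's internals.
    """

    def __init__(self):
        self.object_storage = {}  # stand-in for the durable backing store
        self.local_cache = {}     # stand-in for the local cache drive

    def write(self, path, data):
        """Store the file durably and place it in the cache at the same time."""
        self.object_storage[path] = data
        self.local_cache[path] = data

    def read(self, path):
        """Return (data, source), preferring the local copy when one exists."""
        if path in self.local_cache:
            return self.local_cache[path], "cache"
        data = self.object_storage[path]
        self.local_cache[path] = data  # repopulate the cache for later reads
        return data, "object storage"

    def restart(self):
        """Model a restart or failover: the cache is lost, durable data is not."""
        self.local_cache.clear()


store = WriteThroughCache()
store.write("s3://bucket/table/data.parquet", b"rows")
print(store.read("s3://bucket/table/data.parquet")[1])  # cache
store.restart()
print(store.read("s3://bucket/table/data.parquet")[1])  # object storage
print(store.read("s3://bucket/table/data.parquet")[1])  # cache
```

The first read after a write is served locally because the write already populated the cache. After the simulated restart, only the first read goes back to object storage.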

Automatic cache management

After you access a foreign table for the first time, Crunchy Data Warehouse automatically begins to cache the files. The system uses a simple method to manage the cache:

Fetches and stores recently accessed files
Removes the least recently accessed files from the cache to free up space when necessary

This process is automatic and works in the background, allowing you to enjoy faster query execution without manual intervention.

Manual cache management

If you want more control over which files are cached, Crunchy Data Warehouse provides functions that let you manage the cache manually. You can use these functions to specify which files to cache or remove as needed.

Function	Description	Arguments
crunchy_file_cache.add(PATH text, REFRESH boolean [default false])	Adds a single file to the cache	PATH: path (URL) of the file to add; REFRESH (optional): forces the file to be downloaded again even if it is already cached
crunchy_file_cache.remove(PATH text)	Removes a single file from the cache	PATH: path (URL) of the file to remove
crunchy_file_cache.list()	Lists all files in the cache	None

Example caching calls:

-- Manually start downloading a file into the cache; download it again if it is already cached
SELECT crunchy_file_cache.add('s3://your_bucket_name/file_to_be_cached.xx', true);

-- Manually remove a file from the cache
SELECT crunchy_file_cache.remove('s3://your_bucket_name/file_to_be_removed_from_cached.xx');

-- List all files in the cache
SELECT crunchy_file_cache.list();