Concepts
Crunchy Data Warehouse supports three general kinds of tables:
- Iceberg tables: Transactional, columnar tables stored in object storage, optimized for fast analytics
- Data lake analytics tables: Tables for querying external data, including Parquet, CSV, JSON, and geospatial file formats like Shapefiles
- Postgres tables: Regular "heap" tables for operational workloads, including as a staging table for small inserts and materialized views
Data Storage Formats
Iceberg
Apache Iceberg is an open table format for huge analytic datasets. It was designed to solve challenges in managing and querying large-scale data lakes. Iceberg provides a structured and efficient way to organize Parquet files by creating manifests and metadata of Parquet files.
Iceberg allows for schema updates without needing to rewrite entire datasets. This makes it easier to handle changes like adding or dropping columns. Crunchy Data Warehouse fully support writes to an Iceberg table. You can create, manage, query, and update Iceberg tables that are cheaply and durably stored in S3.
Iceberg uses metadata (manifest files and snapshots) to reduce the number of files scanned during queries, improving query speed and efficiency.
Parquet
Parquet is a columnar storage file format optimized for large-scale data processing. It is notably efficient at handling complex data structures and large volumes of data. Data is stored column by column rather than row by row, which significantly improves the performance of analytical queries that access only a subset of columns.
Parquet is a columnar file format with built-in compression, optimized for analytical queries. Storing Parquet files in an open table format provides a crucial benefit: interoperability. This allows many tools and query engines to access the same data. For example, Spark jobs can process large-scale data transformations from the same Iceberg table that Postgres queries for analytics.
Data lake analytics tables
These tables are for querying external data, including Parquet, CSV, JSON, and a variety of geospatial formats including Shapefile (in zip) and Geopackage. In the background, files are cached on a local drive to optimize performance.
Heap tables
Crunchy Data Warehouse also includes regular Postgres tables for operational workloads. This is ideal for staging tables for inserts and materialized views.
DuckDB
DuckDB is a query engine for remote file formats like Iceberg, Parquet, CSV, and JSON that directly reads them from cloud storage or web. DuckDB is optimized for single-node in-memory analytics. It is extremely fast for handling complex queries and large datasets without the need for heavy infrastructure. DuckDB is designed to integrate seamlessly with modern data workflows.
Crunchy Data Warehouse is a hybrid query engine using DuckDB's vectorized query engine together with Postgres. Using this hybrid approach allows us to push down queries to the object storage layer for enhanced performance. This approach also gives data analytics and developers the power to rely on remote object stores, while using Postgres and SQL syntax and tooling.