Data lake files and formats
Lake analytics tables are foreign tables with server crunchy_lake_analytics
pointing to external files in your data lake in a variety of supported formats.
For instance, you can create a table from a directory of Parquet files in S3 and immediately start querying them:
-- Load a GeoParquet file without specifying columns
-- Geometry columns are automatically recognized
-- Struct/map types are auto-generated
create foreign table ov_buildings ()
server crunchy_lake_analytics
options (path 's3://overturemaps-us-west-2/release/2024-08-20.0/theme=buildings/type=*/*.parquet');
postgres=> \d ov_buildings
Foreign table "public.ov_buildings"
┌────────────────────────┬─────────────────────────────────────────────────────────────────────────────┬───────────┬──────────┬─────────┬─────────────┐
│ Column │ Type │ Collation │ Nullable │ Default │ FDW options │
├────────────────────────┼─────────────────────────────────────────────────────────────────────────────┼───────────┼───────── ─┼─────────┼─────────────┤
│ id │ text │ │ │ │ │
│ geometry │ geometry │ │ │ │ │
│ bbox │ crunchy_struct.xmin_xmax_ymin_ymax_35464140 │ │ │ │ │
...
│ type │ text │ │ │ │ │
└────────────────────────┴─────────────────────────────────────────────────────────────────────────────┴───────────┴──────────┴─────────┴─────────────┘
Server: crunchy_lake_analytics
FDW options: (path 's3://overturemaps-us-west-2/release/2024-08-20.0/theme=buildings/type=*/*.parquet')
Lake analytics tables support multiple formats, including Parquet, CSV, and newline-delimited JSON. Additionally, each format has more specialized options.
Data lake formats
Crunchy Data Warehouse supports querying, importing, and exporting in a variety of formats. The following commands can use all supported data lake formats:
CREATE FOREIGN TABLE ... SERVER crunchy_lake_analytics ...
(querying)COPY ... FROM
(importing)CREATE TABLE ... WITH load_from / definition_from option
(importing)COPY ... TO
(exporting)
The format can be specified as an option in the command. Note that PostgreSQL has a slightly different syntax for specifying options in each case:
-- COPY command uses WITH and no =
COPY ... TO/FROM '<url>' WITH (format 'csv');
-- CREATE TABLE ... uses WITH and =
CREATE TABLE ... USING iceberg WITH (load_from = '<url>', format = 'csv');
-- CREATE FOREIGN TABLE ... uses OPTIONS and no =
CREATE FOREIGN TABLE ... SERVER crunchy_lake_analytics OPTIONS (path '<url>', format 'csv');
When using a common file name extension, the format will be automatically inferred from the extension.
The following formats are supported:
Format | Description | Extensions | Wildcard |
---|---|---|---|
parquet | Parquet (including GeoParquet) | .parquet | Yes (for s3:) |
csv | Comma-separated values | .csv .csv.gz csv.zst | Yes (for s3:) |
json | Newline-delimited JSON | .json .json.gz .json.zst | Yes (for s3:) |
gdal | Formats loaded using GDAL | See below | No |
iceberg | External Apache Iceberg snapshot | .metadata.json | No |
delta | External Delta Lake table | n/a | No |
log | Log files such as S3 logs | n/a | Yes (for s3:) |
Parquet format
Parquet files are the most versatile format for importing and exporting data in Crunchy Data Warehouse. They are self-describing in terms of column names, types, and compression, and use columnar storage.
The Parquet format supports the following additional options:
Option | Description |
---|---|
compression | Compression method to use in COPY TO ('gzip' , 'snappy' , 'zstd' , or 'none' ), defaults to 'snappy' |
CSV format
When querying or importing CSV files, the CSV format properties like the presence of a header, delimiter, and quote character are automatically detected. However, in some cases the detection fails and you can specify it manually.
-- Example: create a table pointing to a list of compressed CSV files
create foreign table your_table_csv ()
server crunchy_lake_analytics
options (path 's3://your_bucket_name/data/*.csv.gz');
When writing CSV with COPY ... TO
, the defaults are the same as PostgreSQL, but
you can use the options to override them.
The CSV format supports the following additional options:
Option | Description |
---|---|
compression | (De)compression method to use ('gzip' , 'zstd' , or 'none' ), inferred from extension, otherwise 'none' |
header | Whether to use a header ('true' or 'false' ) |
delimiter | Character to use to separate values (e.g. ';' ) |
quote | Character to use as a quote around string literals (e.g. '"' ) |
escape | Character to use to escape characters within string literals (e.g. '"' ) |
JSON format
Crunchy Data Warehouse can query, import, and export newline-delimited JSON files. For query and import, it can also support JSON arrays, though they may be less efficient.
-- Example: create a table pointing to a single compressed JSON file
create foreign table your_table_json ()
server crunchy_lake_analytics
options (path 's3://your_bucket_name/values.json.z', compression 'zstd');
The JSON format supports the following additional options:
Option | Description |
---|---|
compression | (De)compression method to use ('gzip' , 'zstd' , or 'none' ), inferred from extension, otherwise 'none' |
GDAL format
Crunchy Data Warehouse can use
GDAL to query or import
many different file formats. When using format 'gdal'
explicitly, some file
formats are recognized from the extension.
-- Example: create an Iceberg table from a .zip shapefile
create table forest_fires ()
using iceberg
with (load_from = 'https://data.fs.usda.gov/geodata/edw/edw_resources/shp/S_USA.FireOccurrence.zip');
Below is an incomplete list:
Extension | Description | GDAL format inferred |
---|---|---|
.dxf .dwg | AutoCAD files | Yes |
.fgb | FlatGeoBuf | Yes |
.gdb | Geodatabase | No |
.geojson .geojson.gz | GeoJSON | Yes |
.geojsons .geojsonl | GeoJSONSeq | No |
.gml | Geography Markup Language | Yes |
.gpkg .gpkg.gz | Geopackage | Yes |
.kml .kmz | Key-Hole Markup Language | Yes |
.map | WAsP .map format | No |
.mif .mit .tab | MapInfo datasets | No |
.shp | Shapefile | No |
.xls .xlsx | Excel files | No |
.xodr | OpenDRIVE Road Description Format | No |
.zip | Auto-detected or derived from zip_path extension (e.g. .shp for Shapefile, ) | Yes |
The GDAL format supports the following additional options:
Option | Description |
---|---|
compression | Decompression method to use ('gzip' , 'zip' , or 'none' ), inferred from extension, otherwise 'none' |
layer | Name of a layer within a file (e.g. 'Sheet 1' ) |
zip_path | Relative path within a .zip file (e.g. 'S_USA.OtherSubSurfaceRight.shp' ) |
When using format GDAL, files are downloaded immediately when creating a table, which may therefore take longer than creating other types of tables.
Log format
The log format is intended to make it easy to query and import raw log files based on a pre-defined template. Currently, only S3 access logs are supported.
create foreign table s3log ()
server crunchy_lake_analytics
options (
path 's3://my-s3-logs/000000000000/eu-central-1/**',
format 'log', log_format 's3'
);
The log format supports the following additional options:
Option | Description |
---|---|
compression | (De)compression method to use ('gzip' , 'zstd' , or 'none' ), defaults to'none' |
log_format | Format of the log file ('s3' ), required |
External Iceberg format
Crunchy Data Warehouse can create and manage Iceberg tables, but you may already have Iceberg tables managed by other systems. In that case, you can still query or import from those tables by creating a foreign table that points to the metadata JSON file that represents a specific version of the table:
create foreign table external_iceberg()
server crunchy_lake_analytics
options (path 's3://mybucket/table/v14.metadata.json');
Note that changes to the external Iceberg table will not be reflected in the foreign table unless you update the path to point to the new metadata file.
External Delta format
Similar to external Iceberg tables, you can query or import external Delta Lake tables:
create foreign table external_delta()
server crunchy_lake_analytics
options (format 'delta', path 's3://mybucket/delta_table/');
Because Delta Lake tables track the latest version in object storage, changes to the table will automatically become visible in the foreign table.
Note: Delta Lake tables are currently only supported in the same region. Otherwise, you will get an error like:
ERROR: IO Error: Hit DeltaKernel FFI error (from: While trying to read from delta table: '...'): Hit error: 8 (ObjectStoreError) with message (Error interacting with object store: Generic S3 error: Received redirect without LOCATION, this normally indicates an incorrectly configured region)
Postgres tables
You can create regular (”heap”) tables in PostgreSQL as usual.
The load_from
and definition_from
options can also be used when creating
heap tables:
create table vegitation()
with (load_from = 'https://data.fs.usda.gov/geodata/edw/edw_resources/shp/S_USA.Activity_RngVegImprove.zip');
Similarly, you can use the COPY
command with a URL to import/export data in a
heap table.
An advantage of heap tables over Iceberg tables is that you can create indexes
and more efficiently handle insert
operations, and selective select
/
update
/ delete
operations, as well as certain types of joins.
In addition to built-in storage parameters and the format-specific options, heap tables support the following options in the CREATE TABLE statement:
Option | Description |
---|---|
format | Format of the file to load and/or infer columns from |
definition_from | URL of a file to infer the columns from |
load_from | URL of a file to infer the columns from and immediately load into the table |
Querying data lake tables
Crunchy Data Warehouse lets you query files in object storage using PostgreSQL. You can query Parquet, CSV, and JSON. Under the covers, part of the query will be pushed down into an analytical query engine for maximum performance.
The ability to query CSV and JSON files is aimed at helping you to simplify integrating into your analytics pipelines using those ubiquitous formats, for data such as machine generated logs. Parquet files instead offer features like columnar storage and statistics that vastly speed up query performance, allowing better use of our state-of-art query engine. To get the best of both worlds, Crunchy Data Warehouse lets you easily convert CSV and JSON files into Parquet for future analytics work.
Explore your object store files
You can view the list of files in your supported cloud storage bucket directly
from your database, using the crunchy_lake.list_files()
utility function. Just
pass in the URL pattern for your bucket to see the results:
SELECT path FROM crunchy_lake.list_files('s3://crunchydemobucket/**/*.parquet');
path
-----------------------------------------------------------------------------
s3://crunchydemobucket/out.parquet
s3://crunchydemobucket/table1/part1.parquet
s3://crunchydemobucket/table1/part2.parquet
s3://crunchydemobucket/table1/part3.parquet
s3://crunchydemobucket/tmp/AI.parquet
s3://crunchydemobucket/tmp/map.parquet
s3://crunchydemobucket/tmp/map/dec98925-3ad9-4056-8bd6-9ec6bdb8082c.parquet
s3://crunchydemobucket/tmp/out.parquet
(8 rows)
You can use wildcards like above to find all *.parquet
files anywhere in your
S3 bucket, or use a more restrictive pattern to limit to a single directory:
SELECT path FROM crunchy_lake.list_files('s3://crunchydemobucket/table1/*');
path
---------------------------------------------
s3://crunchydemobucket/table1/part1.parquet
s3://crunchydemobucket/table1/part2.parquet
s3://crunchydemobucket/table1/part3.parquet
(3 rows)
Querying across regions
Crunchy Data Warehouse will automatically detect the region of the S3 bucket you are querying and configure itself accordingly to provide a seamless experience. Our automatic caching will download files you query to fast local drives to minimize network traffic across regions and maximize query performance. As a best practice, we still recommend that S3 buckets you access frequently and your Crunchy Data Warehouse cluster are in the same region to get the fastest caching performance and avoid network charges altogether.
Vectorized query execution
Crunchy Data Warehouse extends PostgreSQL with a vectorized query engine designed to accelerate analytics tasks. Vectorized execution improves efficiency by processing data in batches, improving computational throughput. Performance is improved by pushing down query computation to this engine when possible.
Query pushdown
When computations are pushed down, they are processed directly within the vectorized query engine. However, if certain computations cannot be handled by the vectorized engine, they are executed normally in PostgreSQL instead. See the Iceberg tables page for more information about query pushdown.