Data lake import and export
Once your credentials are set so that Crunchy Data Warehouse can access your cloud storage, you can easily import data from CSV, newline delimited JSON, or Parquet files with automatic schema detection.
Importing data from cloud object storage
The following example will create a regular PostgreSQL table where columns are inferred from the data file on S3 using the warehouse auto-detection feature.
CREATE TABLE your_table_name () WITH (definition_from = 's3://your_bucket_name/your_file_name.xx');
When you’re using definition_from, the column list must be empty. If you want to specify column definitions manually, simply follow the regular PostgreSQL CREATE TABLE syntax.
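For example, here is a minimal sketch of defining the columns yourself instead of using definition_from (the column names and types are placeholders, not taken from any sample file):
CREATE TABLE your_table_name (
    id bigint,
    recorded_at timestamptz,
    value double precision
);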
You can then use the familiar COPY command to load data into the table from S3:
COPY your_table_name FROM 's3://your_bucket_name/your_file_name.xx';
Crunchy Data Warehouse clusters support the options available in COPY, such as csv, delimiter, header, and more.
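For example, here is a sketch of loading a semicolon-delimited CSV with a header row, assuming the standard Postgres COPY option syntax applies to S3 paths as shown above (the delimiter and header values are illustrative):
COPY your_table_name FROM 's3://your_bucket_name/your_file_name.csv'
WITH (format 'csv', header true, delimiter ';');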
Alternatively, if you want to create a regular PostgreSQL table and immediately load in the data, you can use the following:
CREATE TABLE your_table_name () WITH (load_from = 's3://your_bucket_name/your_file_name.xx');
When using load_from you can specify column definitions manually as you would when creating regular tables in Postgres.
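For instance, a sketch that combines explicit column definitions with load_from (the columns are placeholders):
CREATE TABLE your_table_name (id integer, name text)
WITH (load_from = 's3://your_bucket_name/your_file_name.csv');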
For the COPY and CREATE TABLE commands, having the full Postgres syntax allows passing various options such as the file type, compression algorithm, delimiter, and CSV header information. Check the PostgreSQL documentation for the full list of the standard options.
The supported file types and compression algorithms for Crunchy Data Warehouse are:
| File Type / Compression | gzip | zstd | snappy |
|---|---|---|---|
| CSV | ☑️ | ☑️ | |
| JSON | ☑️ | ☑️ | |
| Parquet | ☑️ | ☑️ | ☑️ |
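As an illustration of the combinations above, a zstd-compressed CSV can be loaded with the compression inferred from the file extension (a sketch; bucket and file names are placeholders):
COPY your_table_name FROM 's3://your_bucket_name/your_file_name.csv.zst';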
File formats
If your data file extension is parquet, csv, or json, Crunchy Data Warehouse can automatically detect the format. However, if the file doesn’t have the proper file extension, you can provide the format in the foreign table creation options.
CREATE FOREIGN TABLE ft () SERVER crunchy_lake_analytics
OPTIONS (path 's3://your_bucket_name/data_file', format 'parquet');
Compression examples
Crunchy Data Warehouse clusters can automatically detect the file and compression type from the extension of the file. Compressed Parquet files do not have extensions that indicate compression type, but the file metadata does; auto-detection will work without any options. However, for CSV and JSON files without extensions indicating the compression algorithm, you will need to specify it in the options.
-- a gzip-compressed CSV file without a .gz extension: creating the foreign table fails
CREATE FOREIGN TABLE ft3 () SERVER crunchy_lake_analytics
OPTIONS (path 's3://your_bucket_name/data_file.csv');
ERROR: Invalid Input Error: CSV Error on Line: 1
-- now provide the compression option for gzip
CREATE FOREIGN TABLE ft3 () SERVER crunchy_lake_analytics
OPTIONS (path 's3://your_bucket_name/data_file.csv', compression 'gzip');
Copy syntax
Crunchy Data Warehouse supports a wide range of PostgreSQL’s COPY CSV options such as header, delimiter, quote, escape, and null. See the Postgres copy docs for additional options.
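For example, a sketch that combines several of these CSV options in a single COPY, assuming the standard Postgres option syntax (the specific values are illustrative):
COPY your_table_name FROM 's3://your_bucket_name/your_file_name.csv'
WITH (format 'csv', quote '"', escape '\', null 'NA');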
Here are a few examples of how to create a table and immediately load data from a file in Amazon S3. Note that this can optionally be broken into two steps, first creating the table with definition_from pointing to the source file and then copying data into the table.
--Columns, file format and compression are automatically detected from parquet file extension and metadata
CREATE TABLE your_table_name () WITH (load_from = 's3://your_bucket_name/your_file_name.parquet');
--Delimiter and header information provided in options; columns, file format and compression are automatically detected from csv.gz extension and file content
CREATE TABLE your_table_name () WITH (load_from = 's3://your_bucket_name/your_file_name.csv.gz', delimiter = ',', header = false);
--Columns provided (optional); since the file extension is missing, format and compression type need to be provided
CREATE TABLE your_table_name (id integer, name VARCHAR(50)) WITH (load_from = 's3://your_bucket_name/your_file_name', format ='json', compression ='gzip');
Exporting data out to cloud object storage
After setting up object storage credentials, you can start exporting data into cloud object storage in CSV, newline delimited JSON, or the efficient columnar Parquet format. The following examples of copy statements show how to export a table to different file types with or without compression, where the compression type is inferred from the extension in the file name:
-- data exported in uncompressed CSV format
COPY table_to_export TO 's3://your_bucket_name/your_file_name.csv';
-- data exported with gzip compression in CSV format
COPY table_to_export TO 's3://your_bucket_name/your_file_name.csv.gz';
-- data exported uncompressed in new-line-delimited JSON format
COPY table_to_export TO 's3://your_bucket_name/your_file_name.json';
-- data exported with zstd compression in new-line-delimited JSON format
COPY table_to_export TO 's3://your_bucket_name/your_file_name.json.zst';
-- data exported in Parquet format using snappy compression (default)
COPY table_to_export TO 's3://your_bucket_name/your_file_name.parquet';
-- results of a query are saved in a file
COPY (SELECT * FROM table_to_export JOIN other_table USING (id)) TO 's3://your_bucket_name/your_file_name.parquet';
You can also pass options to the COPY command based on the format of the file you want to write.
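For example, a sketch that passes CSV options explicitly on export, with gzip compression still inferred from the .csv.gz extension (the delimiter choice is illustrative):
COPY table_to_export TO 's3://your_bucket_name/your_file_name.csv.gz'
WITH (format 'csv', header true, delimiter ';');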
psql copy
Crunchy Data Warehouse also lets you import and export with the copy formats from the psql client using the \copy meta-command. Note that you’ll always want to specify format and compression when using psql \copy because the local file extension is not visible to the server.
-- Import a compressed JSON file from local disk
\copy data from '/tmp/data.json.gz' with (format 'json', compression 'gzip');
-- Export a Parquet file to local disk
\copy data to '/tmp/data.parquet' with (format 'parquet');
Converting CSV and JSON to Parquet
If you need to run high performance analytical queries against CSV and JSON files, you can convert them to the Parquet format using tools inside Crunchy Data Warehouse. The Parquet format is notably more performant.
First, load the file into a table in your warehouse database:
CREATE TABLE your_table_name () WITH (load_from = 's3://your_bucket_name/your_file_name.xx');
Then export the table to a Parquet file:
COPY your_table_name TO 's3://your_bucket_name/your_file_name.parquet';
Finally, create your foreign table:
CREATE FOREIGN TABLE your_table_name_parquet () SERVER crunchy_lake_analytics OPTIONS (path 's3://your_bucket_name/your_file_name.parquet');
Querying the resulting Parquet file will take advantage of its compression features while also making use of vectorized execution, multi-threading and data statistics to dramatically speed up your analytical queries.
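For instance, an analytical query against the new foreign table might look like the following sketch (the column names are hypothetical, not taken from the conversion example above):
-- aggregate directly over the Parquet-backed foreign table
SELECT date_trunc('day', recorded_at) AS day, avg(value) AS avg_value
FROM your_table_name_parquet
GROUP BY 1
ORDER BY 1;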
How much faster is Parquet than CSV?
Here is a small data example with the same query against both CSV and Parquet:
EXPLAIN ANALYZE SELECT * FROM thermostat_csv;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
Foreign Scan on thermostat_csv (cost=100.00..120.00 rows=1000 width=0) (actual time=38.885..59.216 rows=7205 loops=1)
Planning Time: 0.141 ms
Execution Time: 60.624 ms
EXPLAIN ANALYZE SELECT * FROM thermostat_parquet;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------
Foreign Scan on thermostat_parquet (cost=100.00..120.00 rows=1000 width=0) (actual time=5.427..21.359 rows=7205 loops=1)
Planning Time: 1.139 ms
Execution Time: 26.496 ms
This shows Parquet executing a little more than 2x faster than the same query against CSV. Results will vary by workload.