Queries
Crunchy Bridge for Analytics lets you query files in object storage using PostgreSQL. You can query Parquet, CSV, and JSON files. Under the covers, parts of the query are pushed down into an analytical query engine for maximum performance.
Querying CSV and JSON files simplifies integrating those ubiquitous formats, common for data like machine-generated logs, into your analytics pipelines. Parquet files, on the other hand, offer features like columnar storage and statistics that vastly speed up query performance and make better use of our state-of-the-art query engine. To get the best of both worlds, Bridge Analytics lets you easily convert CSV and JSON files into Parquet for future analytics work.
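As an example of the conversion path, here is a minimal sketch. It assumes a foreign table named your_table_csv already exists over your CSV files (as shown in the next section) and that COPY ... TO an s3:// path accepts a Parquet format option; check the COPY documentation for the exact options available on your version:
-- write the rows of a CSV-backed table out as a Parquet file (bucket and paths are placeholders)
COPY your_table_csv TO 's3://your_bucket_name/converted/data.parquet' WITH (format 'parquet');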
Creating foreign tables to object storage
To run queries on your CSV, JSON, or Parquet files, all you need to do is create a foreign table pointing to a single file or a list of files:
-- create a table pointing to a single file
CREATE FOREIGN TABLE your_table () SERVER crunchy_lake_analytics OPTIONS (path 's3://your_bucket_name/your_file_name.parquet');
-- create a table pointing to a list of compressed CSV files
CREATE FOREIGN TABLE your_table_csv () SERVER crunchy_lake_analytics OPTIONS (path 's3://your_bucket_name/*.csv.gz');
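With the column list left empty, the schema is derived from the file(s), and the resulting table can be queried like any regular Postgres table:
-- inspect a few rows and get a row count
SELECT * FROM your_table LIMIT 10;
SELECT count(*) FROM your_table_csv;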
Explore your object store files
You can view the list of files in your supported cloud storage bucket directly from your database using the crunchy_lake.list_files() utility function. Just pass in the URL pattern for your bucket to see the results:
SELECT crunchy_lake.list_files('s3://crunchydavidtest2/**/*.parquet');
filename
-----------------------------------------------------------------------------
s3://crunchydavidtest2/out.parquet
s3://crunchydavidtest2/table1/part1.parquet
s3://crunchydavidtest2/table1/part2.parquet
s3://crunchydavidtest2/table1/part3.parquet
s3://crunchydavidtest2/tmp/AI.parquet
s3://crunchydavidtest2/tmp/map.parquet
s3://crunchydavidtest2/tmp/map/dec98925-3ad9-4056-8bd6-9ec6bdb8082c.parquet
s3://crunchydavidtest2/tmp/out.parquet
(8 rows)
You can use wildcards as above to find all *.parquet files anywhere in your S3 bucket, or use a more restrictive pattern to limit the listing to a single directory:
SELECT crunchy_lake.list_files('s3://crunchydavidtest2/table1/*');
filename
---------------------------------------------
s3://crunchydavidtest2/table1/part1.parquet
s3://crunchydavidtest2/table1/part2.parquet
s3://crunchydavidtest2/table1/part3.parquet
(3 rows)
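Since crunchy_lake.list_files() returns ordinary rows, you can also use it in the FROM clause and refine the listing with regular SQL. A small sketch, assuming it behaves like other set-returning functions:
-- filter the file listing with a standard WHERE clause
SELECT filename FROM crunchy_lake.list_files('s3://crunchydavidtest2/**/*.parquet') WHERE filename LIKE '%/tmp/%';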
Querying across regions
Analytics automatically detects the region of the S3 bucket you are querying and configures itself accordingly to provide a seamless experience. Automatic caching downloads the files you query to fast local drives, minimizing network traffic across regions and maximizing query performance. As a best practice, we still recommend keeping frequently accessed S3 buckets and your Bridge Analytics cluster in the same region to get the fastest caching performance and avoid network charges altogether.
Vectorized query execution
Crunchy Bridge for Analytics extends PostgreSQL with a vectorized query engine designed to accelerate analytics workloads. Vectorized execution processes data in batches, which increases computational throughput. Whenever possible, query computation is pushed down to this engine to improve performance.
How query pushdown works
When computations are pushed down, they are processed directly within the vectorized query engine. Computations that the vectorized engine cannot handle are executed normally in PostgreSQL instead.
Monitoring query pushdown
To monitor which parts of your query are pushed down, you can use the EXPLAIN command with the VERBOSE option. The output shows how the query is executed and whether each part is processed by the vectorized engine or on the Postgres server. Look for Vectorized SQL in the output to see what is executed by the vectorized engine. If you see Custom Scan (Query Pushdown) in the output, you are all set: the computation is delegated to the vectorized engine.
Full pushdown example
Here is an example where the entire computation is pushed down:
EXPLAIN (VERBOSE) SELECT inv_warehouse_sk, count(*) FROM inventory GROUP BY inv_warehouse_sk ORDER BY count(*) DESC;
QUERY PLAN
-----------------------------------------------------------------------------------------------------
Custom Scan (Query Pushdown) (cost=0.00..0.00 rows=0 width=0)
Output: pushdown_query.inv_warehouse_sk, pushdown_query.count
Vectorized SQL: SELECT inv_warehouse_sk,
count(*) AS count
FROM public.inventory inventory(inv_date_sk, inv_item_sk, inv_warehouse_sk, inv_quantity_on_hand)
GROUP BY inv_warehouse_sk
ORDER BY (count(*)) DESC
Partial pushdown example
In this example, not all computations are pushed down:
EXPLAIN (VERBOSE) SELECT count(*) FROM inventory WHERE sqrt(inv_item_sk) > 19;
QUERY PLAN
-----------------------------------------------------------------------------------------------------
Aggregate (cost=128.33..128.34 rows=1 width=8)
Output: count(*)
-> Foreign Scan on public.inventory (cost=100.00..127.50 rows=333 width=0)
Output: inv_date_sk, inv_item_sk, inv_warehouse_sk, inv_quantity_on_hand
Filter: (sqrt((inventory.inv_item_sk)::double precision) > '19'::double precision)
Vectorized SQL: SELECT inv_item_sk
FROM public.inventory inventory(inv_date_sk, inv_item_sk, inv_warehouse_sk, inv_quantity_on_hand)
Here the sqrt function is not yet supported by the vectorized engine, hence it is executed on the Postgres server.
Crunchy Bridge for Analytics currently enables vectorized query execution for nearly all SQL, with a few exceptions. One gap in pushing computations down to the vectorized engine involves certain functions and operators, such as the sqrt function above.
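You can often work around such a gap by rewriting the predicate into a form the engine does support. A minimal sketch for the query above: assuming inv_item_sk is non-negative, sqrt(inv_item_sk) > 19 is equivalent to inv_item_sk > 361, which uses only a plain comparison and should be eligible for pushdown:
-- equivalent filter without sqrt: for non-negative x, sqrt(x) > 19 is the same as x > 361
EXPLAIN (VERBOSE) SELECT count(*) FROM inventory WHERE inv_item_sk > 361;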
Future versions of Crunchy Bridge for Analytics aim to support more SQL functions and operators within the vectorized engine. Reach out to Support if there are specific functions or operators you need for your use case.