Queries

Crunchy Bridge for Analytics lets you query files in object storage using PostgreSQL. You can query Parquet, CSV, and JSON. Under the covers, part of the query will be pushed down into an analytical query engine for maximum performance.

The ability to query CSV and JSON files is aimed at simplifying the integration of analytics pipelines that use those ubiquitous formats, for data like machine-generated logs. Parquet files, by contrast, offer features like columnar storage and statistics that vastly speed up queries and make better use of our state-of-the-art query engine. To get the best of both worlds, Bridge Analytics lets you easily convert CSV and JSON files into Parquet for future analytics work.
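
As a minimal sketch of that conversion path (assuming your cluster supports COPY to object storage URLs with the output format inferred from the .parquet extension; the bucket, file, and table names are placeholders, and your_table_csv is the CSV-backed foreign table defined in the next section):

-- read rows through a CSV-backed foreign table and write them
-- back to object storage as a single Parquet file
COPY (SELECT * FROM your_table_csv)
TO 's3://your_bucket_name/converted/your_file_name.parquet';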

Creating foreign tables to object storage

To run queries on your CSV, JSON, or Parquet files, all you need to do is create a foreign table pointing to a single file or a list of files:

-- create a table pointing to a single file
CREATE FOREIGN TABLE your_table ()
SERVER crunchy_lake_analytics
OPTIONS (path 's3://your_bucket_name/your_file_name.parquet');

-- create a table pointing to a list of compressed CSV files
CREATE FOREIGN TABLE your_table_csv ()
SERVER crunchy_lake_analytics
OPTIONS (path 's3://your_bucket_name/*.csv.gz');
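
Once created, a foreign table behaves like any other PostgreSQL table in queries. As a quick usage sketch against the your_table definition above:

-- count rows across all files matched by the table's path option
SELECT count(*) FROM your_table;

The empty column list in these examples leaves column detection to the server, which infers the schema from the files; you can also spell the columns out explicitly to control names and types.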

Exploring your object store files

You can view the list of files in your supported cloud storage bucket directly from your database using the crunchy_lake.list_files() utility function. Just pass in a URL pattern for your bucket to see the results:

SELECT crunchy_lake.list_files('s3://crunchydavidtest2/**/*.parquet');
                                  filename
-----------------------------------------------------------------------------
 s3://crunchydavidtest2/out.parquet
 s3://crunchydavidtest2/table1/part1.parquet
 s3://crunchydavidtest2/table1/part2.parquet
 s3://crunchydavidtest2/table1/part3.parquet
 s3://crunchydavidtest2/tmp/AI.parquet
 s3://crunchydavidtest2/tmp/map.parquet
 s3://crunchydavidtest2/tmp/map/dec98925-3ad9-4056-8bd6-9ec6bdb8082c.parquet
 s3://crunchydavidtest2/tmp/out.parquet
(8 rows)

You can use wildcards, as above, to find all *.parquet files anywhere in your S3 bucket, or use a more restrictive pattern to limit the results to a single directory:

SELECT crunchy_lake.list_files('s3://crunchydavidtest2/table1/*');
                  filename
---------------------------------------------
 s3://crunchydavidtest2/table1/part1.parquet
 s3://crunchydavidtest2/table1/part2.parquet
 s3://crunchydavidtest2/table1/part3.parquet
(3 rows)
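
The path option of a foreign table accepts glob patterns too (as in the *.csv.gz example earlier), so once a pattern lists exactly the files you expect, you can point a table at it directly. A short sketch reusing the pattern above (the table name is a placeholder):

-- create a foreign table over every Parquet file under the table1/ prefix
CREATE FOREIGN TABLE table1 ()
SERVER crunchy_lake_analytics
OPTIONS (path 's3://crunchydavidtest2/table1/*.parquet');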

Querying across regions

Analytics automatically detects the region of the S3 bucket you are querying and configures itself accordingly to provide a seamless experience. Automatic caching downloads the files you query to fast local drives to minimize network traffic across regions and maximize query performance. As a best practice, we still recommend that S3 buckets you access frequently and your Bridge Analytics cluster be in the same region, both for the fastest caching performance and to avoid network charges altogether.

Vectorized query execution

Crunchy Bridge for Analytics extends PostgreSQL with a vectorized query engine designed to accelerate analytics tasks. Vectorized execution processes data in batches, improving computational throughput. Performance improves when query computation can be pushed down to this engine.

How query pushdown works

When computations are pushed down, they are processed directly within the vectorized query engine. However, if certain computations cannot be handled by the vectorized engine, they are executed normally in PostgreSQL instead.

Monitoring query pushdown

To monitor which parts of your query are pushed down, use the EXPLAIN command with the VERBOSE option. The output shows how the query is executed, including whether each element is processed by the vectorized engine or on the Postgres server. Look for Vectorized SQL in the output to see what is executed by the vectorized engine. If you see Custom Scan (Query Pushdown), you are all set: the entire computation is delegated to the vectorized engine.

Full pushdown example

Here is an example where the entire computation is pushed down:

EXPLAIN (VERBOSE)
SELECT inv_warehouse_sk, count(*)
FROM inventory
GROUP BY inv_warehouse_sk
ORDER BY count(*) DESC;
                                             QUERY PLAN
-----------------------------------------------------------------------------------------------------
 Custom Scan (Query Pushdown)  (cost=0.00..0.00 rows=0 width=0)
   Output: pushdown_query.inv_warehouse_sk, pushdown_query.count
   Vectorized SQL:  SELECT inv_warehouse_sk,
     count(*) AS count
    FROM public.inventory inventory(inv_date_sk, inv_item_sk, inv_warehouse_sk, inv_quantity_on_hand)
   GROUP BY inv_warehouse_sk
   ORDER BY (count(*)) DESC

Partial pushdown example

In this example, not all computations are pushed down:

EXPLAIN (VERBOSE)
SELECT count(*)
FROM inventory
WHERE sqrt(inv_item_sk) > 19;
                                             QUERY PLAN
-----------------------------------------------------------------------------------------------------
 Aggregate  (cost=128.33..128.34 rows=1 width=8)
   Output: count(*)
   ->  Foreign Scan on public.inventory  (cost=100.00..127.50 rows=333 width=0)
         Output: inv_date_sk, inv_item_sk, inv_warehouse_sk, inv_quantity_on_hand
         Filter: (sqrt((inventory.inv_item_sk)::double precision) > '19'::double precision)
         Vectorized SQL:  SELECT inv_item_sk
    FROM public.inventory inventory(inv_date_sk, inv_item_sk, inv_warehouse_sk, inv_quantity_on_hand)

Here the sqrt function is not yet supported by the vectorized engine, so the filter is executed on the Postgres server while the scan itself remains vectorized.

Crunchy Bridge for Analytics currently enables vectorized query execution for nearly all SQL, with a few exceptions. The main gap in pushing computations down to the vectorized engine involves certain functions and operators, such as the sqrt function above.
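
When a single function blocks pushdown, you can sometimes restore full pushdown by rewriting the predicate into a form the engine supports. A minimal sketch, assuming inv_item_sk is non-negative (so sqrt(inv_item_sk) > 19 is equivalent to inv_item_sk > 361):

-- a plain comparison avoids sqrt entirely, so the filter can be
-- delegated to the vectorized engine along with the rest of the query
EXPLAIN (VERBOSE)
SELECT count(*)
FROM inventory
WHERE inv_item_sk > 361;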

Future versions of Crunchy Bridge for Analytics aim to support more SQL functions and operators within the vectorized engine. Reach out to Support if there are specific functions or operators you need for your use case.