Queries

Crunchy Bridge for Analytics lets you query files in object storage using PostgreSQL, with support for Parquet, CSV, and JSON. Under the covers, parts of the query are pushed down into an analytical query engine for maximum performance.

The ability to query CSV and JSON files is aimed at simplifying integration with analytics pipelines that rely on those ubiquitous formats, for data such as machine-generated logs. Parquet files, on the other hand, offer features like columnar storage and statistics that vastly speed up queries and make better use of our state-of-the-art query engine. To get the best of both worlds, Bridge Analytics lets you easily convert CSV and JSON files into Parquet for future analytics work.
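
For example, you can write the contents of a CSV-backed foreign table back out as a Parquet file with COPY. A minimal sketch, reusing the your_table_csv foreign table created in the next section and assuming the output format is inferred from the .parquet extension:

-- convert CSV data to Parquet by copying it to a .parquet destination
COPY (SELECT * FROM your_table_csv)
TO 's3://your_bucket_name/your_file_name.parquet';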

Creating foreign tables to object storage

To run queries on your CSV, JSON, or Parquet files, all you need to do is create a foreign table pointing to a file or a list of files:

-- create a table pointing to a single file
CREATE FOREIGN TABLE your_table () SERVER crunchy_lake_analytics OPTIONS (path 's3://your_bucket_name/your_file_name.parquet');

-- create a table pointing to a list of compressed CSV files
CREATE FOREIGN TABLE your_table_csv () SERVER crunchy_lake_analytics OPTIONS (path 's3://your_bucket_name/*.csv.gz');
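
Once created, the foreign table behaves like any other PostgreSQL table in queries. A minimal sketch, assuming the column list was inferred from the files and includes a column named status (a hypothetical placeholder):

-- query the foreign table like a regular table
SELECT status, count(*)
FROM your_table_csv
GROUP BY status
ORDER BY count(*) DESC;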

Querying across regions

We recommend that your S3 bucket and your Bridge Analytics cluster are in the same region to get the best performance and avoid network charges. However, you can still access a bucket in a different region by specifying the source bucket's region, adding ?s3_region=[bucket_region] to the URL:

-- create a table pointing to a region different than the Analytics instance
CREATE FOREIGN TABLE your_table () SERVER crunchy_lake_analytics OPTIONS (path 's3://your_bucket_name/your_file_name.parquet?s3_region=region-name');

Vectorized query execution

Crunchy Bridge for Analytics extends PostgreSQL with a vectorized query engine designed to accelerate analytics tasks. Vectorized execution processes data in batches, which improves computational throughput. Performance is improved by pushing query computation down to this engine whenever possible.

How query pushdown works

When computations are pushed down, they are processed directly within the vectorized query engine. Computations the vectorized engine cannot handle are executed normally in PostgreSQL instead.

Monitoring query pushdown

To monitor which parts of your query are pushed down, use the EXPLAIN command with the VERBOSE option. The output shows how the query is executed and whether each step is processed by the vectorized engine or on the Postgres server. Look for Remote SQL in the output to see what is executed by the vectorized engine.

Full pushdown example

Here is an example where the entire computation is pushed down:

EXPLAIN (VERBOSE) SELECT inv_warehouse_sk, count(*) FROM inventory GROUP BY inv_warehouse_sk ORDER BY count(*) DESC;
                                              QUERY PLAN
------------------------------------------------------------------------------------------------------
 Foreign Scan  (cost=101.00..121.20 rows=1000 width=12)
   Output: inv_warehouse_sk, (count(*))
   Relations: Aggregate on (public.inventory)
   Remote SQL:  SELECT inv_warehouse_sk,
     count(*) AS count
    FROM public.inventory inventory(inv_date_sk, inv_item_sk, inv_warehouse_sk, inv_quantity_on_hand)
   GROUP BY inv_warehouse_sk
   ORDER BY (count(*)) DESC
(8 rows)

Partial pushdown example

In this example, not all computations are pushed down:

EXPLAIN (VERBOSE) SELECT count(*) FROM inventory WHERE sqrt(inv_item_sk) > 19;
                                              QUERY PLAN
------------------------------------------------------------------------------------------------------
 Aggregate  (cost=128.33..128.34 rows=1 width=8)
   Output: count(*)
   ->  Foreign Scan on public.inventory  (cost=100.00..127.50 rows=333 width=0)
         Output: inv_date_sk, inv_item_sk, inv_warehouse_sk, inv_quantity_on_hand
         Filter: (sqrt((inventory.inv_item_sk)::double precision) > '19'::double precision)
         Remote SQL:  SELECT inv_item_sk
    FROM public.inventory inventory(inv_date_sk, inv_item_sk, inv_warehouse_sk, inv_quantity_on_hand)
(7 rows)

Here, the sqrt function is not supported by the vectorized engine, so the filter is executed on the Postgres server instead.
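
When full pushdown matters for a hot query, you can sometimes rewrite such a predicate into an equivalent form the engine does support. A minimal sketch: assuming inv_item_sk is never negative, sqrt(inv_item_sk) > 19 is equivalent to inv_item_sk > 361, which is a plain comparison:

-- equivalent filter that avoids sqrt (assumes inv_item_sk >= 0)
EXPLAIN (VERBOSE) SELECT count(*) FROM inventory WHERE inv_item_sk > 361;

With the rewritten filter, the Remote SQL in the plan should include the WHERE clause, indicating the filter runs in the vectorized engine.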

Future versions of Crunchy Bridge for Analytics will support more SQL functions in the vectorized query engine, allowing for greater computational pushdown and enhanced performance across a wider range of queries.