Spatial
Setting up Geospatial
To get started with geospatial data, you need to do one extra setup step on your Crunchy Data Warehouse cluster.
Connect to your cluster as the postgres user and run the following:
-- disable pgaudit in the current session, required for PostGIS creation
SET pgaudit.log TO 'none';
-- create the spatial analytics extension and PostGIS:
CREATE EXTENSION crunchy_spatial_analytics CASCADE;
-- to re-enable audit logging in the current session
RESET pgaudit.log;
After these commands you can connect as a regular database user and start creating geospatial analytics tables.
Geospatial features
Crunchy Data Warehouse adds advanced geospatial features on top of PostGIS. The most powerful feature is the ability to instantly import almost any geospatial data set with a single command. We also leverage DuckDB to accelerate spatial queries, especially on (Geo)Parquet files.
Geospatial features:
- Query and import directly from public data sets (available over https or S3) or private data sets in S3 without downloading to your computer.
- Long-lived caching of geospatial files on the NVMe drives of your warehouse cluster
- Query acceleration using DuckDB
- Support for many geospatial file formats:
- GeoParquet and Overture
- GeoJSON and GeoJSONSeq
- WKB in regular Parquet
- WKT in CSV/JSON/other
- GDAL-supported formats, including Shapefile in zip, Geopackage, Geodatabase, FlatGeoBuf, and more
In addition, all of PostgreSQL and PostGIS can be directly used with the new geospatial analytics features. For instance, you might use analytics tables to import and transform a data set, and then convert it into a regular PostgreSQL materialized view with a spatial index. You can periodically rebuild the view using the job scheduler.
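As a rough sketch of that pattern, assuming a foreign analytics table named countries_remote with shapename and geom columns, and assuming pg_cron provides the job scheduler (all names here are hypothetical), it could look like this:
-- Materialize the remote data set as a regular PostgreSQL materialized view
create materialized view countries as
select shapename, geom
from countries_remote;
-- Add a spatial index for fast lookups
create index on countries using gist (geom);
-- Rebuild the view periodically (assumes the pg_cron scheduler)
select cron.schedule('refresh-countries', '0 3 * * *',
                     'refresh materialized view countries');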
Connecting remote data sets to PostGIS
Crunchy Data Warehouse has three different ways of interacting with geospatial data sets:
- Creating foreign tables
- Creating local tables with indexes
- Loading data with COPY into an existing table
-- 1) Create a remote table for querying a geospatial data set
create foreign table world ()
server crunchy_lake_analytics
options (path 'https://github.com/wmgeolab/geoBoundaries/raw/main/releaseData/CGAZ/geoBoundariesCGAZ_ADM0.zip');
\d world
Foreign table "public.world"
┌────────────┬──────────┬───────────┬──────────┬─────────┬─────────────┐
│ Column │ Type │ Collation │ Nullable │ Default │ FDW options │
├────────────┼──────────┼───────────┼──────────┼─────────┼─────────────┤
│ shapegroup │ text │ │ │ │ │
│ shapetype │ text │ │ │ │ │
│ shapename │ text │ │ │ │ │
│ geom │ geometry │ │ │ │ │
└────────────┴──────────┴───────────┴──────────┴─────────┴─────────────┘
-- 2) Create a regular table (and index) directly from a geospatial data set
create table world ()
with (load_from = 'https://github.com/wmgeolab/geoBoundaries/raw/main/releaseData/CGAZ/geoBoundariesCGAZ_ADM0.zip');
create index on world using gist (geom);
-- 3) Load data from a geospatial data set into an existing table
copy world from 'https://github.com/wmgeolab/geoBoundaries/raw/main/releaseData/CGAZ/geoBoundariesCGAZ_ADM0.zip';
You can combine these mechanisms in different ways, depending on what makes sense for your scenario.
Queries across large (Geo)Parquet data sets benefit from analytics tables, which use compression, caching, enhanced parallelism, and vectorized execution.
Precise geospatial lookups (e.g. which ZIP code contains this point?) benefit a lot more from regular tables with spatial indexes.
For smaller data sets, the differences are less pronounced. It is worth noting that joins between the same table types (e.g. analytics join analytics or regular join regular) may be more efficient than joins across two table types (e.g. analytics join regular).
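For example, a point-in-polygon lookup against the indexed world table created above might look like this (the coordinates are arbitrary placeholders):
-- Find the country polygon containing a given point
select shapename
from world
where st_contains(geom, st_point(-90.2, 38.6));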
GeoParquet example: querying Overture data
GeoParquet is a relatively new format for storing geospatial data that enables compression and efficient analytics queries via columnar storage.
When creating a crunchy_lake_analytics
table from a URL pointing to a
GeoParquet file, geometry columns are automatically detected and mapped
into the PostGIS geometry type, such that you can immediately use PostGIS
functions and operators.
A large and well-known GeoParquet data set is
Overture Maps. Crunchy Data Warehouse is a
good choice for working with Overture, because
bigger instances
can cache the full data set on a local drive, the vectorized query execution and
parallelism give excellent performance when querying Parquet files, and clusters
are available in the us-west-2
region where Amazon hosts the Overture data set.
You can easily query Overture data by creating a table:
-- Create a table to query the Overture (GeoParquet) Places data set
create foreign table ov_places ()
server crunchy_lake_analytics
options (path 's3://overturemaps-us-west-2/release/2024-08-20.0/theme=places/type=*/*.parquet');
\d ov_places
Foreign table "public.ov_places"
Column | Type | Collation | Nullable | Default | FDW options
------------+-----------------------------------------------------------------------------+-----------+----------+---------+-------------
id | text | | | |
geometry | geometry | | | |
bbox | crunchy_struct.xmin_xmax_ymin_ymax_35464140 | | | |
version | integer | | | |
sources | crunchy_struct.property_dataset_record_id_update_time_confidence_ff84a944[] | | | |
names | crunchy_struct.primary_common_rules_a6b9407b | | | |
categories | crunchy_struct.primary_alternate__7ead223 | | | |
confidence | double precision | | | |
websites | text[] | | | |
socials | text[] | | | |
emails | text[] | | | |
phones | text[] | | | |
brand | crunchy_struct.wikidata_names_55767648 | | | |
addresses | crunchy_struct.freeform_locality_postcode_region_country__4d9a9db[] | | | |
theme | text | | | |
type | text | | | |
Server: crunchy_lake_analytics
FDW options: (path 's3://overturemaps-us-west-2/release/2024-08-20.0/theme=places/type=*/*.parquet', format 'parquet')
When querying Overture data sets, it's often a good idea to use a bounding box (bbox) filter. The query engine will use Parquet metadata to reduce the amount of data that needs to be read.
-- Query for gas station locations in St. Louis, Missouri
SELECT (categories).primary AS cat, (addresses[1]).locality as city,
geometry
FROM ov_places
WHERE (bbox).xmin >= -90.24
AND (bbox).xmax <= -90.15
AND (bbox).ymin >= 38.59
AND (bbox).ymax <= 38.66
AND (categories).primary = 'gas_station';
Note: Only WKB-encoded geometries are currently supported.
GDAL data sets: Shapefile zip, Geopackage, and more
Geospatial data sets are published in a large variety of formats. The
GDAL library is able to read most of these
formats, and you can use it to import data into PostgreSQL using the
format 'gdal'
option when creating a table:
-- Create a table from a zip containing a Shapefile,
-- specifying the compression option explicitly
create foreign table nld ()
server crunchy_lake_analytics
options (
format 'gdal',
compression 'zip',
path 'https://www.eea.europa.eu/data-and-maps/data/eea-reference-grids-2/gis-files/netherlands-shapefile/at_download/file'
);
For some commonly used formats, the GDAL format will be inferred:
Extension | GDAL format | Compression |
---|---|---|
.zip | Shapefile, Geodatabase, other GDAL formats | zip |
.geojson | GeoJSON | none |
.geojson.gz | GeoJSON | gzip |
.gpkg | Geopackage | none |
.gpkg.gz | Geopackage | gzip |
.kml | Keyhole Markup Language | none |
.kmz | Keyhole Markup Language | zip |
.fgb | FlatGeoBuf | none |
When working with a URL that does not have an explicit extension, it may be
necessary to explicitly specify both the format and the compression (usually
compression 'zip'). When a .zip archive contains multiple geospatial data
sets, you can distinguish between them by adding a zip_path option:
-- Map of the Netherlands at 100km granularity from
-- a .zip containing multiple shape files
create foreign table nld_100km ()
server crunchy_lake_analytics
options (
format 'gdal',
compression 'zip',
path 'https://www.eea.europa.eu/data-and-maps/data/eea-reference-grids-2/gis-files/netherlands-shapefile/at_download/file'
);
-- Map of the Netherlands at 1km granularity,
-- by selecting a specific shapefile within the .zip
create foreign table nld_1km ()
server crunchy_lake_analytics
options (
format 'gdal',
compression 'zip',
zip_path 'nl_1km.shp',
path 'https://www.eea.europa.eu/data-and-maps/data/eea-reference-grids-2/gis-files/netherlands-shapefile/at_download/file'
);
GDAL files are downloaded when creating the table. This means it may take a little longer, but it avoids repeatedly requesting the same file from the web server.
The file is only re-downloaded if it is evicted from the cache or the PostgreSQL server is replaced.
GeoJSON
There are roughly two forms of GeoJSON: a single JSON object containing a FeatureCollection, or newline-delimited JSON with one feature per JSON object. The latter kind is sometimes referred to as GeoJSONSeq.
Both forms of GeoJSON are supported in Crunchy Data Warehouse, though in
different ways. For GeoJSON with a FeatureCollection, you always want to use
format 'gdal'
, which will unwind the attributes into columns:
create foreign table gemeentes ()
server crunchy_lake_analytics
options (format 'gdal', path 'https://service.pdok.nl/cbs/gebiedsindelingen/2017/wfs/v1_0?request=GetFeature&service=WFS&version=2.0.0&typeName=gemeente_gegeneraliseerd&srsName=EPSG:4326&outputFormat=json');
\d gemeentes
Foreign table "public.gemeentes"
┌────────────┬──────────┬───────────┬──────────┬─────────┬─────────────┐
│ Column │ Type │ Collation │ Nullable │ Default │ FDW options │
├────────────┼──────────┼───────────┼──────────┼─────────┼─────────────┤
│ id │ integer │ │ │ │ │
│ statcode │ text │ │ │ │ │
│ statnaam │ text │ │ │ │ │
│ jrstatcode │ text │ │ │ │ │
│ rubriek │ text │ │ │ │ │
│ geom │ geometry │ │ │ │ │
└────────────┴──────────┴───────────┴──────────┴─────────┴─────────────┘
Server: crunchy_lake_analytics
FDW options: (format 'gdal', path 'https://service.pdok.nl/cbs/gebiedsindelingen/2017/wfs/v1_0?request=GetFeature&service=WFS&version=2.0.0&typeName=gemeente_gegeneraliseerd&srsName=EPSG:4326&outputFormat=json')
For GeoJSONSeq, GDAL also works, but format 'json'
is likely to give better
performance. However, it does not automatically infer the geometry and attribute
columns. You can use the ST_GeomFromGeoJSON
function to convert a feature to
the geometry type.
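As a rough sketch, assuming a newline-delimited GeoJSON file at a placeholder URL whose top-level feature keys (properties, geometry) are declared as jsonb columns, the conversion could look like this (the URL, column list, and name property are placeholders):
-- Read GeoJSONSeq with the plain JSON reader, declaring columns explicitly
create foreign table places_seq (properties jsonb, geometry jsonb)
server crunchy_lake_analytics
options (format 'json', path 'https://example.com/places.geojsonseq');
-- Convert each feature's geometry into the PostGIS geometry type
select properties->>'name' as name,
       ST_GeomFromGeoJSON(geometry::text) as geom
from places_seq;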
Reading and writing regular Parquet/CSV/JSON files with geometry data
Sometimes you have Parquet, CSV, or JSON files that contain encoded geometries. You can use PostGIS functions to decode them.
For example:
-- create a CSV file with a geometry, will encode as WKT
copy (select 'POINT(3.14 6.28)'::geometry) to 's3://mybucket/demo.csv' with header;
-- Create a table from the CSV, will use text column
create foreign table csv_table ()
server crunchy_lake_analytics options (path 's3://mybucket/demo.csv');
-- Convert the WKT into a geometry
select sum(ST_Area(ST_GeomFromText(wkt))) from csv_table;
-- Alternatively, create a table from the CSV with an explicit geometry column
create foreign table csv_table (g geometry)
server crunchy_lake_analytics options (path 's3://mybucket/demo.csv');
-- Select the geometry, will use ST_GeomFromText under the covers
select sum(ST_Area(g)) from csv_table;
The expected encodings and functions to use for each format are given below:
Format | Geometry encoding | Read function |
---|---|---|
Parquet | WKB | ST_GeomFromWKB |
CSV | WKT | ST_GeomFromText |
JSON | GeoJSON | ST_GeomFromGeoJSON |
GDAL | Native | n/a |
Apart from querying files with geometry data, you can also write geometry
columns in Parquet/CSV/JSON using COPY ... TO
. They will automatically use the
encoding listed in the table. Writing using GDAL is not yet supported.
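For instance, a minimal sketch of writing geometries out to Parquet, reusing the world table from earlier and assuming the format is inferred from the .parquet extension (the bucket path is a placeholder), could be:
-- Write geometries to Parquet; the geometry column is encoded as WKB
copy (select shapename, geom from world) to 's3://mybucket/world.parquet';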
Use QGIS with Crunchy Data Warehouse
Using QGIS with Crunchy Data Warehouse is straightforward. First, add a PostgreSQL connection that points to your cluster.
You'll find your tables and views in the Browser under your schema. You can add layers to the project to visualize them.
Finally, you may need to set your coordinate reference system to match the data set. Note that the SRID is currently set to 0 on all analytics tables, so it cannot be automatically detected.
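If needed, one workaround is a view that stamps the expected SRID onto the geometry so that QGIS can pick it up; this is a sketch assuming the data is in EPSG:4326 and reusing the world table from earlier:
-- Stamp an explicit SRID on the geometry for QGIS to detect
create view world_4326 as
select shapename, st_setsrid(geom, 4326) as geom
from world;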
If your data set uses an alternative coordinate reference system, you can also
create a view that transforms it using the st_transform
function:
create view points_transformed as
select point_id, st_transform(geom, 'epsg:27700', 'epsg:4326') as geom
from points;
Advanced scenarios: combining geospatial data sets in QGIS
With the ability to create tables for almost any geospatial data set, you can also combine them in interesting ways.
For instance, the following queries create tables for data sets containing national forest boundaries and fire occurrences over the past few years, and then create views to find the forests that had fires in 2022:
-- National Forest System boundaries
create foreign table forests ()
server crunchy_lake_analytics
options (path 'https://data.fs.usda.gov/geodata/edw/edw_resources/shp/S_USA.AdministrativeForest.zip');
-- Fire occurrence points in the US
create foreign table fires ()
server crunchy_lake_analytics
options (path 'https://data.fs.usda.gov/geodata/edw/edw_resources/shp/S_USA.MTBS_FIRE_OCCURRENCE_PT.zip');
-- Only consider fires in national forests in 2022
create view nfs_fires_in_2022 as
select fires.*, forests.adminfores
from forests, fires
where st_within(fires.geom, forests.geom)
and date_trunc('year', ig_date) = '2022-01-01';
-- Find the forests which had fires in 2022
create view forests_with_fires_in_2022 as
select *
from forests
where adminfores in (
select adminfores from nfs_fires_in_2022
);
Loading these tables and views as layers into QGIS gives a good overview of which national forests suffered from fires in 2022.
Limitations of spatial query acceleration
Crunchy Data Warehouse takes advantage of some of the logic in DuckDB spatial, which is still in an early stage of development. While DuckDB spatial aims to mimic PostGIS, it still has many inconsistencies. We correct for those inconsistencies so that you always get the default PostGIS experience, even if DuckDB changes in the future. That does mean some functions may fall back to the PostGIS implementation by transferring geometry data from DuckDB into PostgreSQL, which adds some overhead.
For instance, if you use the ST_SetSRID
function, then we'll execute that in
PostGIS, since DuckDB does not currently track SRID information. In most cases,
that will cause other functions to also be executed in PostGIS instead of
DuckDB. That means everything always works correctly, but the use of certain
functions can sometimes cause unexpected performance degradation.