Automatic Iceberg Replication

Crunchy Data Warehouse supports the ability to automatically replicate data from a Crunchy Bridge standard Postgres instance into a Crunchy Data Warehouse instance–converting data in the process to Iceberg. This capability allows you to:

  • Use Postgres for your operational (OLTP) database while keeping data in sync with Crunchy Data Warehouse to enable fast analytical queries.
  • Test your Postgres data inside Crunchy Data Warehouse.
  • Migrate from an "accidental" Postgres data warehouse.

Support for logical replication into Iceberg builds on top of core PostgreSQL logical replication features in order to provide an easy and streamlined approach for setting up your Data warehouse.

Concepts

Tables on the source server are grouped into a publication, basically a list of table and policies as to what changes to replicate.

On the warehouse side, a subscription stores connection information to the publisher database, as well as which publications to replicate.

Logical replication in Postgres and Crunchy Data Warehouse operate in two main phases:

  • data sync (the copy_data phase)
  • change application (the apply phase)

The copy_data phase ensures that the data in the target table exists on the remote side as of the initial start, while the apply phase ensures that all changes made since the start of the copy_data phase get applied.

The apply worker applies the changes that are made on the publishing side to the Iceberg tables in batches every 30 seconds, which leads to an end-to-end lag which is typically under 60 seconds. The order of transactions is preserved, so you will always have a consistent view of the data.

Initial setup

Setup on the publisher

Publication setup

In order to use logical replication into Iceberg, you must define one or more publications on the source PostgreSQL server.

You can do this by explicitly defining the set of tables in the publication via explicit list:

CREATE PUBLICATION my_pub_name FOR TABLE
    table_1, table_2, table_3;

You can also define publications for entire schemas in the database like so:

CREATE PUBLICATION my_public_tables_pub FOR TABLES IN SCHEMA public;

Or even the entire database:

CREATE PUBLICATION my_all_tables_pub FOR ALL TABLES;

Note: creating ALL TABLES or TABLES IN SCHEMA publications require a superuser account to create the publication.

By default, publications will replicate insert, update, delete, and truncate against the source table. You can specify a different set of these, for instance just insert,truncate to define a publication as insert-only. See the postgres docs for more details here.

Table identity

Whether update/delete commands can be replicated depends on whether the table has a primary key and whether it uses REPLICA IDENTITY FULL. The following table outlines what is supported:

Primary KeyREPLICA IDENTITY FULLSupportedStorage use in warehouse
YesYesYes (preferred)Medium
YesNoYesHigh
NoYesOnly with (publish = ‘insert,truncate’)Low
NoNoYes (update/delete blocked on source)Low

If both a Primary Key and REPLICA IDENTITY FULL are set for a relation, Crunchy Data Warehouse can minimize data needed to track changes to the table, which is why this setting is preferred. However, it may increase the amount of WAL on the source database.

User setup

You will also need a user account that you can connect to that has permissions to read the tables that are being replicated, as well as having the REPLICATION privilege. This can (and probably should) be a dedicated account:

-- create a new user
CREATE USER my_user WITH REPLICATION PASSWORD 'abc123';

-- or modify an existing user
ALTER USER my_user WITH REPLICATION;

-- grant appropriate permissions
GRANT SELECT ON ALL TABLES IN SCHEMA public TO my_user;

This user does not need write permissions on the tables.

Remote access settings

You will need to ensure that the replication user is allowed to connect to the source database using the pg_hba.conf settings or equivalent.

Setup on the subscriber

Once the publisher setup has been done, you will create a subscription on your Crunchy Data Warehouse instance.

Subscription setup

To start the replication, you use the regular CREATE SUBSCRIPTION command with some extra options.

-- Simple: create a subscription and auto-create Iceberg tables
CREATE SUBSCRIPTION my_sub_name
CONNECTION 'postgres://...'
PUBLICATION my_pub_name
WITH (create_tables_using = 'iceberg');

-- Advanced: Create a subscription for existing Iceberg tables, but replace their contents
-- Use advanced options to improve performance and reliability
CREATE SUBSCRIPTION my_sub_name
CONNECTION 'postgres://...'
PUBLICATION my_pub_name
WITH (overwrite_iceberg_data, streaming, binary, failover);

Most CREATE SUBSCRIPTION options are supported. We few options that are worth considering are:

  • failover - if the source is Postgres 17 or above, resume replication after automatic failover
  • binary - use binary instead of text encoding in the wire protocol, often improves performance
  • streaming - avoids buffering transactions on the source, though requires more storage in warehouse

Replicating to the same Iceberg table from multiple subscriptions is currently not supported.

Target table creation

To replicate data, you must have target tables. These can either be created ahead of time, or you can use the create_tables_using = 'iceberg' option to automatically create Iceberg tables for all tables in the publication that do not exist yet.

You can also replicate into a mixture of heap and Iceberg tables, but heap tables must be created manually.

User setup

You will need a user that has permissions to create subscriptions. In Postgres, this is granted by the pg_create_subscription permission.

For instance, you can connect to your Crunchy Data Warehouse as the superuser (postgres) and grant permission to the application user:

GRANT pg_create_subscription TO application;

The Iceberg table can be owned by users other than the subscription owner, but tables created via create_tables_using = 'iceberg' will have the same owner as the subscription.

Subscription options

The following are custom subscription options that can be provided in either a CREATE SUBSCRIPTION .. WITH () or ALTER SUBSCRIPTION ... WITH () statements.

OptionValid ValuesNotes
create_tables_using'iceberg'Creates any missing iceberg tables
overwrite_iceberg_datatrueOverwrites existing data in an iceberg table

For other options consult the postgres documentation.

Considerations

While Crunchy Data Warehouse stores Iceberg tables in remote storage, but it uses regular disk storage to stage changes and data, and for regular heap tables. We recommend that you use the same disk size on your Crunchy Data Warehouse instance as on the source PostgreSQL instance.

There are some limitations compared to regular logical replication to be aware of:

  • Multiple subscriptions per table are not supported
  • Generated columns are not yet replicated
  • Iceberg tables must be empty or use overwrite_iceberg_data when a subscription is created
  • You cannot perform schema changes or other writes to an Iceberg table while replication is active.