Major version upgrades
PostgreSQL minor version upgrades are generally safe and straightforward, so much so that Crunchy Bridge automatically rolls them into other cluster management operations. Major version upgrades, on the other hand, should be fully tested in advance of being performed in production. In this guide, we walk you through our basic recommendations for PostgreSQL major version upgrades.
How Crunchy Bridge does major version upgrades
Postgres major version upgrades on Crunchy Bridge work by provisioning a hidden hot standby of the cluster being upgraded, an "upgrade standby." This upgrade standby is created as soon as the upgrade is scheduled, but it is not actually upgraded until after the maintenance begins.
When the appointed upgrade time arrives, the platform does the following:
- Halts writes to the active cluster server
- Confirms that replication to the upgrade standby is current with the active cluster
- Detaches the upgrade standby from the active cluster
- Upgrades the upgrade standby to the desired PostgreSQL version using pg_upgrade
- Promotes the upgrade standby to be the new active cluster server if the pg_upgrade run is successful; otherwise, re-enables writes to the active cluster server and terminates the maintenance
- Re-enables writes to the new, upgraded active cluster server
- Removes the old cluster server as it is no longer needed
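Once the maintenance completes, you can do a quick sanity check from psql. This is a minimal sketch: the queries confirm the server is running the expected major version and is accepting writes (i.e. is not in recovery).
-- confirm the running server version after the upgrade
SELECT version();
-- a promoted primary should return 'f' (not in recovery)
SELECT pg_is_in_recovery();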
Major version upgrades in practice
Crunchy Bridge aims to make it easy to perform major version upgrades. The available methods for scheduling and running the upgrade are outlined in this section. Note that testing your upgrade in a non-production environment before running it in production is extremely important, and is covered next.
Testing the upgrade
For any type of testing, we always recommend using a fork of your production cluster. This will give you a more accurate view of the test results than testing against a staging or development server, since these typically have smaller data sets.
Testing a major version upgrade on a fork of production gives a good estimate of the time needed to complete the upgrade process in production. This will help you determine how much time to allocate for your application maintenance window. Upgrading on a fork first also surfaces any issues that will arise during the upgrade process, allowing you to work through them in advance.
Once you have successfully upgraded a fork, you can point your development or staging application environment at the upgraded cluster. You should run your application's test suite against it, and follow your usual QA process to ensure that your application still functions correctly when connecting to an upgraded cluster.
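As part of this testing, it can also help to inventory the extensions installed on your production cluster (or its fork) so you can confirm each one is compatible with the target major version. A minimal query from psql:
-- list installed extensions and their versions
SELECT extname, extversion FROM pg_extension ORDER BY extname;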
Info
Validating application compatibility in advance is critically important. PostgreSQL major version upgrades are generally one-way trips.
There is no easy way to roll back to a previous version if your application is not compatible. The upgrade process we use allows for a graceful landing for your cluster only if the upgrade itself does not succeed.
Scheduling and running the upgrade
Major version upgrades can be run and managed via the cluster's dashboard or the CLI. Each method provides different options for managing the time at which the upgrade standby is actually upgraded and promoted.
Upgrades via the dashboard
When running a major version upgrade via the Dashboard, the actual upgrade and failover are done during the active cluster's next maintenance window, as long as the upgrade standby has finished building by the time that maintenance window arrives. If no maintenance window is set, or if you click "Run Now," the upgrade will begin as soon as it is ready.
Upgrades via the CLI
Scheduling and running major version upgrades via the CLI allows for more fine-grained control over when the upgrade and failover happen.
Let's say that it's currently Monday, May 12th, 2025 and you want to upgrade your cluster to PG17 at 8pm UTC on the following day, Tuesday, May 13th. You can specify the date and time when creating the maintenance:
cb upgrade start --cluster <clusterID> --version 17 --starting-from '2025-05-13T20:00:00Z'
The upgrade standby will be built as soon as the upgrade is requested, and it will be kept up to date with the active cluster until the maintenance begins. Note that you can schedule a maintenance up to 3 days in advance.
If nothing else is done, this maintenance will begin at the specified date and time: 2025-05-13 20:00:00. If you decide you want the maintenance to run at a different date or time, for example on Wednesday, May 14th, 2025 at 6am, you can update the maintenance any time before it begins:
cb upgrade update --cluster <clusterID> --starting-from '2025-05-14T06:00:00Z'
If you decide that you actually want the maintenance to run now (or as soon as it is ready), you can update it to run "now":
cb upgrade update --cluster <clusterID> --now
You can also mix methods: use the CLI to update a maintenance that was requested through the Dashboard, or use the Dashboard to "Run Now" or cancel a maintenance that was requested through the CLI.
Post-upgrade considerations
Immediately after the upgrade process completes successfully, the platform will:
- Run vacuumdb --analyze-in-stages on the upgraded cluster (because planner statistics are not carried forward across major version upgrades)
- Take a fresh backup
- Rebuild the HA standby or any read replicas from the backup once it is complete
This has a few important disk IO implications for large databases:
- Both the backup and the ANALYZE work done by the vacuumdb --analyze-in-stages job will require a lot of read IOPS.
- Until the ANALYZE work is completed, many queries will run with sub-par execution plans that use sequential scans instead of indexes.
- Read replicas will be stale at first. You can send that traffic to the upgraded primary while waiting for new read replicas to be built (after the backup completes), but that will add additional disk IO and other pressure to the new primary.
Given these facts, for very large databases it can be a good idea to delay restoring application traffic to the upgraded cluster until the ANALYZE work is completed. Alternatively, you can restore traffic in stages if possible, e.g. resume web functionality first but wait to re-enable any background jobs.
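If you want to watch for the ANALYZE work to finish before restoring traffic, you can poll the server from psql. A minimal sketch: pg_stat_progress_analyze shows tables currently being analyzed, and last_analyze in pg_stat_user_tables shows which tables already have fresh statistics.
-- tables currently being analyzed
SELECT relid::regclass, phase FROM pg_stat_progress_analyze;
-- tables that have not yet been analyzed since the upgrade
SELECT relname, last_analyze FROM pg_stat_user_tables ORDER BY last_analyze NULLS FIRST;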
Upgrade issues
Upgrade failures
Upgrades can fail for a variety of reasons, such as issues with shared_preload_libraries or extensions that are not compatible with the upgraded server version. If an upgrade does fail, the pg_upgrade failure logs will be sent to your cluster's log stream. If you have a logging integration set up with a logging-as-a-service provider, you can look there for the failure logs.
If your upgrade is failing and you're not sure how to fix the problems shown in the failure logs, or you do not yet have a logging provider configured, please reach out to support.
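To check one common source of failures before the maintenance begins (and again when testing on a fork), you can inspect which libraries the server loads at startup; each of these must also be available on the target major version. A minimal check from psql:
-- libraries loaded at server start must also exist on the new major version
SHOW shared_preload_libraries;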
Rolling back a successful major version upgrade
A PostgreSQL server that has successfully completed a major version upgrade cannot be rolled back to a prior major version. Thus, a major version rollback strategy on Crunchy Bridge would be one of these options:
- Use the CLI upgrade strategy and, when performing your major upgrade maintenance, detach (promote) a read replica before you finalize the upgrade. This detached read replica becomes an independent primary that remains on the prior major version, with its data current as of just before the original primary was upgraded.
- If you did not provision and detach a read replica before upgrading your primary, you can restore a PITR fork from a backup of the primary that is replayed to the point in time just before the upgrade was finalized. This is not ideal since you will have to wait for the fork to be provisioned and PITR replay to complete before you can use the forked server.
Writes to the primary server after it was upgraded will not make it onto a detached replica or PITR fork. Those writes will have to be salvaged manually if they are critical and cannot be lost. In order to avoid this, you can extend the detached read replica method:
- Start your application maintenance window and halt all writes to the primary.
- Detach your replica.
- Finalize the primary upgrade.
- Establish a logical replication link from the upgraded server to the promoted read replica for all databases on the server.
- Re-enable application writes to the primary.
This will ensure that writes to the upgraded server are replicated back to the detached/promoted read replica. However, logical replication setup and failover add a lot of complexity to the upgrade process, so this method should also be fully tested before you use it in production.
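As a rough illustration of the logical replication step above, the link between the upgraded primary and the promoted replica could be set up per database as sketched below. This is a minimal example using standard PostgreSQL commands, not Crunchy Bridge specific tooling; the publication/subscription names and connection parameters are placeholders. Because the promoted replica already contains all data from just before the upgrade, the subscription skips the initial table copy.
-- on the upgraded primary, in each database: publish all tables
CREATE PUBLICATION rollback_pub FOR ALL TABLES;
-- on the promoted read replica, in the matching database: subscribe to the upgraded primary
CREATE SUBSCRIPTION rollback_sub
    CONNECTION 'host=<upgraded-primary-host> dbname=<dbname> user=<user> password=<password>'
    PUBLICATION rollback_pub
    WITH (copy_data = false);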