Metrics and monitoring
Dashboard metrics
You can find system metrics on the Crunchy Bridge Dashboard under the Metrics tab.
CPU
This graph displays processing load broken out into system load, user load, iowait, and percent CPU steal. System CPU time reflects operating system (i.e. kernel) functions while user time reflects processing in the actual running instance of Postgres.
Hobby-tier burst credit exhaustion
Info
Hobby-tier plans have burstable vCPUs: you can temporarily use up to 20x the CPU your instance is allotted, but the CPU is throttled back to baseline once all burst credits are depleted. This typically manifests as a sudden, significant drop in performance on a hobby-tier database.
If you're having performance challenges on a hobby-tier cluster, look for spikes in percent steal in the CPU graph, which typically follow a spike in another measure of CPU load. High percent CPU steal indicates CPU burst credit exhaustion.
Burst credits will accumulate again over time, but you may need to upgrade your cluster to achieve more consistent performance. Review plans and pricing to determine which tier is right for your use case.
Note: If you don't see % CPU steal in your cluster Metrics, you may need to refresh your cluster to receive the latest Crunchy Bridge features.
Disk usage
The disk usage graph shows the individual sizes (in MB) of the major components of your Postgres storage, including:
- Data files: These should be monitored for overall disk size and storage planning.
- Log files: Log file management is generally not a concern for Crunchy Bridge. Excessive log size may indicate an issue (for example, a problematic query or logging misconfiguration).
- Temporary files: Excessive use of temp files may be an indication that Postgres does not have enough memory to complete queries. You may need additional memory to improve query performance.
- WAL (write-ahead log) files: WAL size is normally not a concern because WAL is continuously generated and archived. Excessive WAL growth may indicate a larger issue with your instance, such as an inactive replication slot retaining WAL.
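If you want to check some of these components from SQL, the standard Postgres statistics views expose temporary-file activity per database and any replication slots that may be retaining WAL. A sketch:

```sql
-- Temporary file count and total bytes written per database
-- since statistics were last reset.
SELECT datname, temp_files, temp_bytes
FROM pg_stat_database
ORDER BY temp_bytes DESC;

-- Replication slots: an inactive slot can retain WAL indefinitely.
SELECT slot_name, active, restart_lsn
FROM pg_replication_slots;
```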
IOPS
IOPS (input/output operations per second) is available as I/O RTPS (read transactions per second) and I/O WTPS (write transactions per second). IOPS capacity varies by plan; the Plans and Pricing page shows the specifications of each plan.
To determine which queries are contributing to IOPS usage, look for ones that use a lot of disk. Crunchy Bridge runs `pg_stat_statements` by default on all instances, so statistics are available to review. You can query `pg_stat_statements` to look for a low hit rate on shared blocks, which would indicate that proportionally more data is being read from disk than served from cache: a low `shared_blks_hit / (shared_blks_hit + shared_blks_read)`.
Here's an example query you can use to find queries with a low hit rate:
```sql
SELECT
    pd.datname AS db_name,
    pss.rows AS total_row_count,
    (pss.total_exec_time / 1000 / 60) AS total_exec_mins,
    ((pss.total_exec_time / 1000 / 60) / calls) AS avg_exec_mins,
    calls,
    shared_blks_hit / nullif(shared_blks_hit + shared_blks_read, 0)::float AS hit_rate,
    queryid
FROM pg_stat_statements AS pss
INNER JOIN pg_database AS pd ON pss.dbid = pd.oid
WHERE calls > 1000
ORDER BY hit_rate
LIMIT 10;
```
To dig into a query shown in the output, you can run the following statement with a given `queryid`:

```sql
SELECT query FROM pg_stat_statements WHERE queryid = <queryid>;
```
Query and index tuning can be a big help in increasing the cache hit rate and thereby reducing IOPS usage. For a deeper dive, check out Query Optimization in Postgres with pg_stat_statements on the blog.
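To see where a suspect query hits cache versus disk, you can run it under EXPLAIN with the BUFFERS option, which reports shared-block hits and reads per plan node. This is a sketch; the table and filter below are hypothetical:

```sql
-- Hypothetical query: "shared hit" blocks come from cache,
-- "read" blocks come from disk (and count toward IOPS).
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM orders WHERE customer_id = 42;
```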
Load average
Load average shows average CPU load over the indicated time period. A load average equal to your vCPU count indicates full utilization of all CPUs. A load average in excess of your CPU count means that processes had to wait for CPU time, with higher values meaning more time spent waiting.
The number of vCPUs varies by plan; check the Plans and Pricing page for details about specific plans. If you consistently see a high load average, look at tuning expensive queries or consider upgrading to a larger plan.
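As a rough SQL companion to the load average graph, you can count backends by state using the standard `pg_stat_activity` view; many concurrently active backends relative to your vCPU count generally corresponds to a high load average. A sketch:

```sql
-- Backends grouped by state ('active', 'idle', 'idle in transaction', ...).
SELECT state, count(*)
FROM pg_stat_activity
WHERE state IS NOT NULL
GROUP BY state
ORDER BY count(*) DESC;
```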
Memory
This shows the amount of process memory and the amount of swap you are using based on the plan you have provisioned. Check the Plans and Pricing page for details about specific plans.
Note that swap usage is not necessarily a bad thing. However, if you often need swap and your baseline memory usage is high, you likely need additional memory.
Postgres uses memory at a few different levels. If you're interested in the details, check out our blog post on data storage and flow.
Additionally, Postgres memory usage can be tricky to interpret: there are three main components at play.
Process Memory - This is memory being taken up by each backend process for its own use, including:
- the main Postmaster process
- utility processes (checkpointer, archiver, autovacuum launcher, etc)
- any client processes, i.e. those executing query statements
These processes allocate (by default) 4 MB each for process memory, but they also reserve additional memory based on parameters like `work_mem`, `maintenance_work_mem`, and `temp_buffers`.
Shared Memory - This is memory used by all processes for data and transaction log caching: the sum of `shared_buffers`, `wal_buffers`, CLOG buffers, etc. By default we allocate 25% of system memory to `shared_buffers`.
Kernel Memory - Memory not being used by Postgres processes is generally used by the kernel for disk cache. The kernel is (generally) smarter about what to keep and what to push to disk.
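To see how these memory parameters are set on your instance, you can read them from `pg_settings`. A sketch; these are standard Postgres settings:

```sql
-- Per-backend and shared memory settings, with their units.
SELECT name, setting, unit
FROM pg_settings
WHERE name IN ('work_mem', 'maintenance_work_mem', 'temp_buffers',
               'shared_buffers', 'wal_buffers');
```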
Info
The memory graph on your Crunchy Bridge dashboard currently shows a `memory_used` metric that includes all memory allocated by processes. The PostgreSQL server process allocates various buffers shared by all processes, so this value includes the sum of all the Process Memory and Shared Memory described above.
On Standard and Memory instances this will usually account for 25-30% of memory usage, although it may be larger if you have a high connection count or query activity that consumes a lot of memory. However, on Hobby instances this process memory represents a larger fraction of overall memory usage, and it's not uncommon to see this value consistently reporting 80-85% of memory in use.
The important thing to note is that Linux makes intelligent use of available memory, using it to reduce load on disks. If processes need the memory, the OS will give up some of its disk cache.
Network usage
This shows the inbound and outbound network traffic for your instance. In general this should be consistent and match the behavior you see in other metrics. Unusual activity or dramatic changes could be a sign of unintentional application changes.