
PostgreSQL 17: part 5 or CommitFest 2024-03

With PostgreSQL 17 RC1 out, we are in the home stretch before the official PostgreSQL 17 release, scheduled for September 26, 2024.

Letʼs take a look at the patches that came in during the March CommitFest. Previous articles about PostgreSQL 17 CommitFests: 2023-07, 2023-09, 2023-11, 2024-01.

Together, these give an idea of what the new PostgreSQL will look like.

Reverts after code freeze

Unfortunately, some previously accepted patches were reverted after the code freeze and didnʼt make it into PostgreSQL 17.

Now, letʼs get to the new stuff.

SQL commands

Performance

Monitoring and management

Server

Replication

Client applications

New features of the MERGE command

commit:5f2e179b,0294df2f,c649fa24

The MERGE command, first introduced in PostgreSQL 15, receives a bunch of improvements.

Here are two tables:


SELECT * FROM source;

 id | descr
----+-------
  0 | Zero
  1 | First
(2 rows)

SELECT * FROM target;

 id | descr
----+-------
  1 | One
  2 | Two
(2 rows)

The first improvement is for target. Now the target can be not just a table, but also a view. The view must be updatable or must have INSTEAD OF triggers defined for it.
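
For illustration, a simple updatable view over target could itself serve as the merge target. This is a sketch only and is not executed here, so target keeps the contents shown above for the example that follows; the view name target_v is made up.


CREATE VIEW target_v AS SELECT * FROM target;

MERGE INTO target_v AS t
     USING source AS s
     ON (s.id = t.id)
WHEN MATCHED THEN
    UPDATE SET descr = s.descr;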

Letʼs run MERGE so that target becomes identical to source. To do this, we need to add the row with id=0 to target, update the row with id=1 and delete the row with id=2. Before PostgreSQL 17, MERGE couldnʼt process target rows that have no match in the source. The second improvement is the new WHEN NOT MATCHED BY SOURCE condition for exactly such rows:


MERGE INTO target AS t
     USING source AS s
     ON (s.id = t.id)
WHEN MATCHED AND s.descr IS DISTINCT FROM t.descr
THEN
    UPDATE SET descr = s.descr
WHEN NOT MATCHED BY TARGET
THEN
    INSERT (id, descr) VALUES (s.id, s.descr)
WHEN NOT MATCHED BY SOURCE
THEN
    DELETE
RETURNING merge_action(), s.*, t.*;

 merge_action | id | descr | id | descr
--------------+----+-------+----+-------
 INSERT       |  0 | Zero  |  0 | Zero
 UPDATE       |  1 | First |  1 | One
 DELETE       |    |       |  2 | Two
(3 rows)

MERGE 3

The third improvement is in the output: MERGE now supports the RETURNING clause, including the new merge_action function that tells you whether each row was inserted, updated or deleted. Note that within RETURNING you can reference both source and target rows.


SELECT * FROM target;

 id | descr
----+-------
  0 | Zero
  1 | One
(2 rows)

See also:

Waiting for PostgreSQL 17 -- Add RETURNING support to MERGE (Hubert ʻdepeszʼ Lubaczewski)

COPY ... FROM: messages about discarded rows

commit: f5a22789

In the January CommitFest article I talked about how the COPY command can now ignore format conversion errors.

This patch adds a new parameter log_verbosity. If set to verbose, the COPY output will include NOTICE-level messages for every discarded row.


CREATE TABLE t (id int PRIMARY KEY);

COPY t FROM STDIN (on_error 'ignore', log_verbosity verbose);

Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself, or an EOF signal.
>> 1
>> two
>> three
>> 4
>> \.
NOTICE:  skipping row due to data type incompatibility at line 2 for column id: "two"
NOTICE:  skipping row due to data type incompatibility at line 3 for column id: "three"
NOTICE:  2 rows were skipped due to data type incompatibility
COPY 2

SELECT * FROM t;

 id
----
  1
  4
(2 rows)

See also:

Waiting for PostgreSQL 17 -- Add new COPY option LOG_VERBOSITY (Hubert ʻdepeszʼ Lubaczewski)

Exploring PostgreSQL 17: A Developerʼs Guide to New Features -- Part 3: The COPY Command Gets More User-Friendly (Deepak Mahto)

The SQL/JSON standard support

commit: 6185c973, de360045

The SQL standard defines how to work with JSON using dedicated constructors, predicates and functions. All of them were implemented and included in PostgreSQL 15 before its code freeze. However, just before the release, SQL/JSON support was retracted due to performance concerns.

The issue is that some of these functions must be able to handle invalid arguments: for example, a function may expect JSON but receive a string that is not valid JSON. Such errors used to be handled with savepoints and rollbacks. This approach hurt performance badly, but it was the only way to catch an error without interrupting the whole query.

In PostgreSQL 16, the pg_input_error_info and pg_input_is_valid functions were added for soft type conversion error handling. They enabled the SQL/JSON functionality to be refined and, ultimately, get into the 16th release.

Now, most of the standard is implemented, with a few exceptions (see the second commit description.)
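
For example, the JSON_TABLE function, which turns a JSON document into a relational rowset, is now available. A minimal sketch with inline data:


SELECT *
FROM JSON_TABLE(
    '[{"id": 1, "name": "Alice"}, {"id": 2, "name": "Bob"}]'::jsonb,
    '$[*]' COLUMNS (id int PATH '$.id', name text PATH '$.name')
);

 id | name
----+-------
  1 | Alice
  2 | Bob
(2 rows)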

See also:

Waiting for ...: SQL/JSON is coming back. Hopefully (Hubert ʻdepeszʼ Lubaczewski)

SLRU cache configuration

commit: bcdfa5f2

In high-load systems, the size of SLRU caches, specifically the inability to increase the size of a particular cache, can become a bottleneck.

In PostgreSQL 17, each cache gets a configuration parameter (with the _buffers suffix in the name), which allows users to fine-tune the system to a specific load. In addition, the cache lookup has been significantly optimized.

New SLRU cache configuration parameters: commit_timestamp_buffers, multixact_member_buffers, multixact_offset_buffers, notify_buffers, serializable_buffers, subtransaction_buffers and transaction_buffers.

The names of the caches in the pg_stat_slru view were adjusted slightly (in a separate commit) to correlate with the parameter names.


SELECT name FROM pg_stat_slru;

       name       
------------------
 commit_timestamp
 multixact_member
 multixact_offset
 notify
 serializable
 subtransaction
 transaction
 other
(8 rows)

Planner: Merge Append for the UNION implementation

commit: 66c0185a

In queries with UNION, the planner has always used the Append node to combine the results of subqueries. At the next step, either Sort or Hash Aggregate was applied to the results, then duplicates were eliminated with Unique.

Hereʼs an impractical but simple example:


16=# EXPLAIN (costs off)
SELECT book_ref FROM bookings
UNION
SELECT book_ref FROM bookings;

                    QUERY PLAN                     
---------------------------------------------------
 Unique
   ->  Sort
         Sort Key: bookings.book_ref
         ->  Append
               ->  Seq Scan on bookings
               ->  Seq Scan on bookings bookings_1
(6 rows)

However, if there are indexes or an ORDER BY clause, the results of subqueries may come in sorted even before they are combined. So, if Merge Append is used in place of Append, the full sort can be avoided and you can go straight to deduplication.

Hereʼs the plan of the same query after the patch:


                               QUERY PLAN                               
------------------------------------------------------------------------
 Unique
   ->  Merge Append
         Sort Key: bookings.book_ref
         ->  Index Only Scan using bookings_pkey on bookings
         ->  Index Only Scan using bookings_pkey on bookings bookings_1
(5 rows)

Planner: materialized CTE statistics (continued)

commit: a65724df

As mentioned in the November CommitFest article, the planner has learned to use some column statistics from a materialized CTE when building the outer parts of the plan. Now the planner also takes into account the sort order of the rows returned from the CTE, which avoids unnecessary re-sorting in some queries.

In this example, index access to the bookings table is used inside the CTE, so the output comes back already sorted. Older versions of the planner do not know that and add a Sort node anyway.


16=# EXPLAIN (costs off)
WITH b AS MATERIALIZED (
    SELECT * FROM bookings ORDER BY book_ref
)
SELECT * FROM b ORDER BY book_ref;

                     QUERY PLAN                     
----------------------------------------------------
 Sort
   Sort Key: b.book_ref
   CTE b
     ->  Index Scan using bookings_pkey on bookings
   ->  CTE Scan on b
(5 rows)

In PostgreSQL 17, no additional sorting is required.



                     QUERY PLAN                     
----------------------------------------------------
 CTE Scan on b
   CTE b
     ->  Index Scan using bookings_pkey on bookings
(3 rows)

Optimization of parallel plans with DISTINCT

commit: 7e0ade0f, 902900b3

The parallel execution of DISTINCT was introduced in PostgreSQL 15. Each worker processes its portion of data then passes the result to the leader process into the Gather node.

The new patch allows the planner to choose between Gather and Gather Merge as the collection point. Gather Merge may be more efficient in cases where unique values need to be sorted.

In the next example, Gather collects the results from the workers. At that point parallelization stops: PostgreSQL sorts all the distinct values and returns the first ten. (While PostgreSQL is very efficient at sorting and returning the first several rows, it still has to sort the whole set.)

PostgreSQL 16 plan:


16=# EXPLAIN (costs off)
SELECT DISTINCT flight_id FROM ticket_flights ORDER BY 1 LIMIT 10;

                            QUERY PLAN                             
-------------------------------------------------------------------
 Limit
   ->  Sort
         Sort Key: flight_id
         ->  HashAggregate
               Group Key: flight_id
               ->  Gather
                     Workers Planned: 2
                     ->  HashAggregate
                           Group Key: flight_id
                           ->  Parallel Seq Scan on ticket_flights
(10 rows)

In PostgreSQL 17, workers sort their batches before feeding them to Gather Merge. Since the leader process receives the rows already sorted, all it has to do is deduplicate (Unique) and return the first 10 rows, saving time on sorting the whole set.

PostgreSQL 17 plan:


                            QUERY PLAN                             
-------------------------------------------------------------------
 Limit
   ->  Unique
         ->  Gather Merge
               Workers Planned: 2
               ->  Sort
                     Sort Key: flight_id
                     ->  HashAggregate
                           Group Key: flight_id
                           ->  Parallel Seq Scan on ticket_flights
(9 rows)

Optimizing B-tree scans for sets of values

commit: 5bf748b8

This patch optimizes searches for sets of values, especially over multicolumn indexes. Hereʼs an example for the ticket_flights table, which has a unique index for two columns: ticket_no and flight_id.

Letʼs look up records matching a set of tickets and flights.


16=# EXPLAIN (analyze, buffers, costs off, timing off, summary off)
SELECT *
FROM ticket_flights
WHERE ticket_no IN ('123456789012', '234567890123', '345678901234')
  AND flight_id IN (12345, 23456, 34567);

                                                                   QUERY PLAN                                                                     
---------------------------------------------------------------------------------------------------------------------------------------------------
 Index Scan using ticket_flights_pkey on ticket_flights (actual rows=0 loops=1)
   Index Cond: ((ticket_no = ANY ('{123456789012,234567890123,345678901234}'::bpchar[])) AND (flight_id = ANY ('{12345,23456,34567}'::integer[])))
   Buffers: shared hit=36

The query plan in PostgreSQL 17 does not change, but only four buffers have been read instead of 36:


                                                                    QUERY PLAN                                                                     
---------------------------------------------------------------------------------------------------------------------------------------------------
 Index Scan using ticket_flights_pkey on ticket_flights (actual rows=0 loops=1)
   Index Cond: ((ticket_no = ANY ('{123456789012,234567890123,345678901234}'::bpchar[])) AND (flight_id = ANY ('{12345,23456,34567}'::integer[])))
   Buffers: shared hit=4

VACUUM: new dead tuples storage

commit: 30e14428, 667e65aa

VACUUM scans a table (taking the visibility map into account) for tuples that are no longer needed. The identifiers of these tuples (TIDs) are collected in a dedicated array. The array is then used first to free up space in all the indexes (which have to be scanned in full), and then in the table itself. If the TIDs do not fit into the array, the whole process has to be repeated. This is bad and should be avoided by configuring autovacuum accordingly.

Before this patch, the storage array was the bottleneck of the process. Its size is defined by the maintenance_work_mem parameter, 64MB by default. The memory is allocated all at once, even if the dead tuples are few. On the other hand, the array size cannot exceed 1GB, even if a higher value is set.

What has changed? The first commit implements a memory structure and an interface for working with TIDs based on a compressed prefix tree (radix tree). According to the patch authorʼs assessment (link below), TIDs take up about twenty times less space in the new storage than in the array, while the scan speed becomes several times faster. In addition, there is no longer a 1GB limit (although with such a compact layout, this is no longer so relevant), and memory is allocated as needed, rather than exactly in the amount defined by maintenance_work_mem.

All this should help avoid most situations when one pass is not enough to vacuum a table.

The example below demonstrates the results.

First, prepare a table with one index:


CREATE TABLE t(id integer) WITH (autovacuum_enabled = off);
INSERT INTO t SELECT gen.id FROM generate_series(1,3_500_000) gen(id);
CREATE INDEX t_id ON t(id);
VACUUM;

Update the table and set a low maintenance_work_mem, then run vacuuming. Watch how many passes it takes to vacuum the indexes.


UPDATE t SET id = id + 1;
SET maintenance_work_mem = '1MB';
VACUUM VERBOSE t;

The VACUUM VERBOSE output in PostgreSQL 16 starts with the lines:


INFO:  vacuuming "postgres.public.t"
INFO:  finished vacuuming "postgres.public.t": index scans: 21
...

The rest of the output is irrelevant. The dead TIDs array size of 1MB is too low, so index vacuuming was done 21 times.

PostgreSQL 17 output:


INFO:  vacuuming "postgres.public.t"
INFO:  finished vacuuming "postgres.public.t": index scans: 1
...

All the TIDs fit into the new storage at once.

Another change was made to the pg_stat_progress_vacuum view. The columns max_dead_tuples and num_dead_tuples have been replaced with max_dead_tuple_bytes and dead_tuple_bytes. As the column names suggest, vacuum progress is now tracked in bytes rather than in tuples, since there is no longer a predictable relationship between the number of dead tuples and the storage space occupied by their TIDs.
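
While a vacuum is in progress, the new columns can be monitored with a query like this (the view is empty when no vacuum is running):


SELECT pid, phase, max_dead_tuple_bytes, dead_tuple_bytes
FROM pg_stat_progress_vacuum;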

See also:

PostgreSQL meets ART --- Using Adaptive Radix Tree to speed up vacuuming (Masahiko Sawada)

VACUUM: combine WAL records for vacuuming and freezing

commit: f83d7097, 6dbb4902

If rows are frozen during vacuuming, VACUUM will add one combined record to WAL instead of two separate ones. This decreases the size of WAL, which helps not only the server, but also replicas and the WAL archive.

We can use VACUUM VERBOSE to assess the optimization, since it reports WAL usage.

Freeze all pages in the test table:


CREATE TABLE t(id integer) WITH (autovacuum_enabled = off);
INSERT INTO t SELECT gen.id FROM generate_series(1,3_500_000) gen(id);
VACUUM FREEZE t;

Update the rows and run vacuum with freeze. Watch the WAL usage stats:


UPDATE t SET id = id + 1;
VACUUM FREEZE VERBOSE t;

PostgreSQL 16:


INFO:  aggressively vacuuming "postgres.public.t"
...
WAL usage: 77437 records, 28718 full page images, 142916493 bytes
...

PostgreSQL 17:


INFO:  aggressively vacuuming "postgres.public.t"
...
WAL usage: 61949 records, 27227 full page images, 116712997 bytes
...

~62 thousand WAL records and ~111 MB of WAL versus ~77 thousand records and ~136 MB is a significant saving.

Functions with subtransactions in parallel processes

commit: 0075d789

This is a bug fix. The interesting part is that functions with subtransactions (plpgsql functions with an EXCEPTION block) can now be marked as PARALLEL SAFE, as long as the other restrictions, such as not modifying the database, are observed.

For example, if the EXCEPTION is used to catch a division by zero, and at the same time nothing is written to the database, the function can be marked safe for parallelization.
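
Here is a minimal sketch of such a function (the name and body are made up for the example):


CREATE FUNCTION safe_div(a numeric, b numeric) RETURNS numeric
LANGUAGE plpgsql
PARALLEL SAFE  -- the EXCEPTION block no longer prevents parallel execution
AS $$
BEGIN
    RETURN a / b;
EXCEPTION
    WHEN division_by_zero THEN  -- nothing was written, so there is nothing to roll back
        RETURN NULL;
END;
$$;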

However, using EXCEPTION in such situations is not the best idea. If the function doesnʼt write anything to the database, then there is nothing to roll back to the savepoint in the EXCEPTION block, and creating a savepoint isnʼt free by any means.

EXPLAIN (analyze, serialize): data conversion costs

commit: 06286709

EXPLAIN ANALYZE executes a query and shows the actual time spent. However, not all actions are performed during this execution.

If you compare the execution time in the plan:


=# EXPLAIN (costs off, analyze) SELECT * FROM tickets;

                              QUERY PLAN                               
-----------------------------------------------------------------------
 Seq Scan on tickets (actual time=0.080..212.350 rows=2949857 loops=1)
 Planning Time: 0.060 ms
 Execution Time: 275.396 ms
(3 rows)

... to the actual execution time, you will see a significant difference:


\timing on

Timing is on.

=# SELECT * FROM tickets\g /dev/null

Time: 1853,612 ms (00:01,854)

EXPLAIN ANALYZE neither converts the query result into text or binary form for sending to the client, nor assembles TOASTed values from their individual chunks. Therefore, the more rows the query returns, the greater the margin of error.

The new parameter SERIALIZE forces EXPLAIN to do these tasks as well. The cost of decompression and conversion is displayed under Serialization:


=# EXPLAIN (costs off, analyze, serialize) SELECT * FROM tickets;

                              QUERY PLAN                               
-----------------------------------------------------------------------
 Seq Scan on tickets (actual time=0.074..232.690 rows=2949857 loops=1)
 Planning Time: 0.057 ms
 Serialization: time=1215.195 ms  output=338792kB  format=text
 Execution Time: 1568.133 ms
(4 rows)

The difference between calculated and actual time is significantly lower. Itʼs still not exact, since EXPLAIN cannot track the time spent sending data from server to client and writing to disk.

See also:

Waiting for PostgreSQL 17 -- Invent SERIALIZE option for EXPLAIN (Hubert ʻdepeszʼ Lubaczewski)

EXPLAIN: improved display of SubPlan and InitPlan nodes

commit: fd0398fc

The EXPLAIN BASICS section of the documentation is supplemented with examples of plans with SubPlan, hashed SubPlan and InitPlan nodes. Letʼs consider the changes by using the following query as an example:


EXPLAIN (costs off)
SELECT * FROM bookings
WHERE book_date > (SELECT current_date) AND
      total_amount NOT IN (VALUES (1),(2));

PostgreSQL 16 plan:

                        QUERY PLAN                         
-----------------------------------------------------------
 Seq Scan on bookings
   Filter: ((book_date > $0) AND (NOT (hashed SubPlan 2)))
   InitPlan 1 (returns $0)
     ->  Result
   SubPlan 2
     ->  Values Scan on "*VALUES*"
(6 rows)

PostgreSQL 17 plan:

                                                   QUERY PLAN                                                    
-----------------------------------------------------------------------------------------------------------------
 Seq Scan on bookings
   Filter: ((book_date > (InitPlan 1).col1) AND (NOT (ANY (total_amount = ((hashed SubPlan 2).col1)::numeric))))
   InitPlan 1
     ->  Result
   SubPlan 2
     ->  Values Scan on "*VALUES*"
(6 rows)

Look at the two differences in the second line.

  • The output parameters of the InitPlan and SubPlan nodes are now displayed not with the dollar sign ($0, $1, ..., $n) but as (InitPlan N).colX or (SubPlan N).colX, where N is the ordinal number of the InitPlan or SubPlan node, and X is the ordinal number of the returned column.
  • The second condition is displayed more clearly. The uninformative NOT (hashed SubPlan 2) has been replaced by NOT (ANY (total_amount = ((hashed SubPlan 2).col1)::numeric)). Now it correlates with the condition in the query.

pg_buffercache: eviction from cache

commit: 13453eed

The new function pg_buffercache_evict evicts a buffer from the buffer cache.


CREATE EXTENSION pg_buffercache;

The test table will take up one block. Autovacuum is disabled for clarity of the experiment.


CREATE TABLE t (id int) WITH (autovacuum_enabled=off);
INSERT INTO t VALUES (1);

Find the buffer related to the table:


SELECT bufferid, isdirty, pinning_backends
FROM pg_buffercache
WHERE relfilenode = (
        SELECT relfilenode
        FROM pg_class
        WHERE oid = 't'::regclass
);

 bufferid | isdirty | pinning_backends
----------+---------+------------------
      824 | t       |                0
(1 row)

Evict the buffer by passing its number to the new pg_buffercache_evict function:


SELECT pg_buffercache_evict(824);

 pg_buffercache_evict
----------------------
 t
(1 row)

Verify that there are no buffers for the table left in the cache.


SELECT bufferid, isdirty, pinning_backends
FROM pg_buffercache
WHERE relfilenode = (
        SELECT relfilenode
        FROM pg_class
        WHERE oid = 't'::regclass
);

 bufferid | isdirty | pinning_backends
----------+---------+------------------
(0 rows)

random: a random number in the specified range

commit: e6341323

The function random returns a random number in the range from 0 to 1. However, sometimes you want a random number within an arbitrary range. Now you can get just that!


SELECT random(2,5), random(0,3.14) FROM generate_series(1,5);

 random | random
--------+--------
      2 |   0.43
      5 |   0.98
      3 |   0.12
      5 |   2.18
      2 |   2.43
(5 rows)

The function random now has two parameters to specify the minimum and maximum values of the returned random number. The range can be specified for int, bigint, and numeric types.

transaction_timeout: session termination when the transaction timeout is reached

commit: 51efe38c

We already have two parameters to control long transactions: statement_timeout and idle_in_transaction_session_timeout. However, a transaction can last indefinitely if it executes sufficiently short commands that do not exceed statement_timeout, and the pauses between commands fit into idle_in_transaction_session_timeout.

In such situations, transaction_timeout ensures that the duration of the transaction does not exceed the specified timeout. If exceeded, the transaction and the session that executed it will be terminated.
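
A quick illustration (the exact wording of the error message is an assumption here):


=# SET transaction_timeout = '5s';
=# BEGIN;
=# SELECT pg_sleep(10);

FATAL:  terminating connection due to transaction timeout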

Prohibit the use of ALTER SYSTEM

commit: d3ae2a24

Cloud providers usually come with their own configuration management tools, which often prohibit changing certain parameters or only accept certain parameter values. This is important for the proper service operation. For example, if the client administrator sets the jsonlog server log format instead of the expected csvlog, the log analysis tools will stop working.

To avoid the temptation to edit postgresql.conf, access to the server file system can be restricted. However, the ALTER SYSTEM command can still modify postgresql.auto.conf without access to the file system.

The new configuration parameter allow_alter_system can be used to restrict the use of ALTER SYSTEM for everyone, including superusers:


=# SHOW allow_alter_system;

 allow_alter_system
--------------------
 off
(1 row)

=# ALTER SYSTEM SET wal_level = minimal;

ERROR:  ALTER SYSTEM is not allowed in this environment

Important note: this is not a server security feature. A superuser has multiple ways to change the server configuration and perform other potentially dangerous actions. The parameter is there to help prevent unintentional, non-malicious configuration changes.

See also:

The PostgreSQL community debates ALTER SYSTEM (Jonathan Corbet)

The MAINTAIN privilege and the pg_maintain predefined role

commit: ecb0fd33

This patch was already discussed in the PostgreSQL 16 CommitFest articles, but it was rolled back after the code freeze due to security concerns. The concerns have since been resolved, and PostgreSQL 17 brings the new MAINTAIN privilege for tables and materialized views.

The privilege allows users to execute the following commands: ANALYZE, VACUUM (including VACUUM FULL), CLUSTER, REINDEX, REFRESH MATERIALIZED VIEW and LOCK TABLE.

Membership in the new role pg_maintain grants the MAINTAIN privilege for all relations within a database.
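
For example (the role and table names are made up):


GRANT MAINTAIN ON bookings TO alice;  -- maintenance of a single table
GRANT pg_maintain TO bob;             -- maintenance of all relations in the database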

Built-in locale provider for C.UTF8

commit: f69319f2

For a long time, libc was the only locale provider in PostgreSQL, offering the locales of the C system library. In PostgreSQL 10, the icu provider was added, relying on the external ICU library. PostgreSQL 17 brings a built-in provider with support for the C and C.UTF8 locales.

Functionally, nothing has changed: both locales have long been available through libc. The ascetic C locale still understands only the Latin alphabet, but C.UTF8 is a bit more interesting.


CREATE DATABASE test_builtin_provider
  LOCALE_PROVIDER = 'builtin'
  BUILTIN_LOCALE = 'C.UTF8'
  TEMPLATE = 'template0';

In the C.UTF8 locale, character classification works correctly for any language (which characters are letters, are they upper or lower case), but letters are sorted by character codes, disregarding any linguistic nuances. For example, the sorting for the Cyrillic letter ʻёʼ is incorrect:


WITH russian_alphabet AS (
  SELECT regexp_split_to_table('абвгдеёжзийклмнопрстуфхцчшщъыьэюя', '') AS letter
)
SELECT string_agg(lower(letter), '' ORDER BY lower(letter)) AS lower,
       string_agg(upper(letter), '' ORDER BY upper(letter)) AS upper
FROM russian_alphabet\gx

-[ RECORD 1 ]----------------------------
lower | абвгдежзийклмнопрстуфхцчшщъыьэюяё
upper | ЁАБВГДЕЖЗИЙКЛМНОПРСТУФХЦЧШЩЪЫЬЭЮЯ

If this sorting is acceptable, the use of C.UTF8 from the built-in provider gives an advantage:

  • Sorting and case change are quicker due to internal optimizations. See the link below for some timings.
  • The locale is independent from external libraries and works identically on any platform on the same major PostgreSQL version.

See also:

Looking at the new built-in collation provider in PostgreSQL 17 (Daniel Vérité)

pg_column_toast_chunk_id: ID of the TOAST value

commit: d1162cfd

The values of columns with variable length types can be transferred to a TOAST table. But which values didnʼt fit into the main table and were TOASTed?

For the demonstration, letʼs create a table with three rows containing oversized numeric values.


CREATE TABLE t(
    id integer,
    n numeric
);

INSERT INTO t VALUES
    (1,    123456789::numeric ^ 123::numeric),
    (2,    123456789::numeric ^ 1234::numeric),
    (3,    123456789::numeric ^ 12345::numeric);

Are these values stored in the main table or in the TOAST table? The new function pg_column_toast_chunk_id helps answer this question: it returns chunk_id, the identifier of a large value in the TOAST table.


SELECT id, pg_column_toast_chunk_id(n)
FROM t;

 id | pg_column_toast_chunk_id
----+--------------------------
  1 |                         
  2 |                         
  3 |                    24587
(3 rows)

As you can see, only the value from the third row did not fit in the main table.

pg_basetype function: basic domain type

commit: b154d8a6

The pg_basetype function returns the base type of a domain. For nested domains, the function recursively descends until it reaches the base type:


CREATE DOMAIN posint AS integer CHECK (VALUE > 0);
CREATE DOMAIN dozen AS posint CHECK (VALUE < 13);

SELECT pg_basetype('dozen'::regtype);

 pg_basetype
-------------
 integer
(1 row)

pg_constraint: NOT NULL restrictions for domains

commit: e5da0fe3

In the September CommitFest article I mentioned how pg_constraint now records NOT NULL constraints for table columns. Unfortunately, that patch was reverted.

But the NOT NULL constraints will still appear in pg_constraint, if only for domains.


CREATE DOMAIN int_nn AS integer NOT NULL;

SELECT * FROM pg_constraint WHERE conname ~ 'int_nn'\gx

-[ RECORD 1 ]--+----------------
oid            | 16573
conname        | int_nn_not_null
connamespace   | 2200
contype        | n
condeferrable  | f
condeferred    | f
convalidated   | t
conrelid       | 0
contypid       | 16572
conindid       | 0
conparentid    | 0
confrelid      | 0
confupdtype    |  
confdeltype    |  
confmatchtype  |  
conislocal     | t
coninhcount    | 0
connoinherit   | f
conkey         |
confkey        |
conpfeqop      |
conppeqop      |
conffeqop      |
confdelsetcols |
conexclop      |
conbin         |

New function to_regtypemod

commit: 1218ca99

A combination of the already existing functions format_type and to_regtype, together with the new function to_regtypemod, will help convert the name of a data type specified in any valid form to the canonicalized form.


WITH types (alias) AS (
    VALUES ('timestamp(5)'),
           ('char(10)'),
           ('varchar(30)')
)
SELECT alias,
       format_type(to_regtype(alias), to_regtypemod(alias))
FROM types;

    alias     |          format_type           
--------------+--------------------------------
 timestamp(5) | timestamp(5) without time zone
 char(10)     | character(10)
 varchar(30)  | character varying(30)
(3 rows)

Hash indexes for ltree

commit: 485f0aa8

The ltree data type (of the ltree extension) now supports hash indexes.
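
A minimal example (the table and column names are made up):


CREATE EXTENSION ltree;

CREATE TABLE paths (path ltree);
CREATE INDEX ON paths USING hash (path);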

pg_createsubscriber: quickly create a logical replica from a physical one

commit: d44032d0

The usual logical replication setup process for a large database can take a long time due to the initial data synchronization. Itʼs even worse if the database is actively updated.

If that is the case, you can now use a physical replica as a foundation, since it already has all the tables synchronized. If you switch the physical replica to read/write mode, create publication/subscription pairs in one or more databases on both servers, and specify the correct position in the subscription properties from which to continue receiving changes, then a logical replica will be created fairly quickly. This is a simplified description of what the new pg_createsubscriber server utility does. See the link below for a full description.

Initialize the primary server on port 5400. Set wal_level to logical. Test it with a single table in the postgres database.


$ PGPORT=5400; initdb -U postgres -D ./primary -c port=5400 -c wal_level=logical
$ pg_ctl start -D ./primary -l ./primary/log

primary=# CREATE TABLE test(id int);
primary=# INSERT INTO test (id) VALUES (1);

The physical replica standby will run on port 5401.


$ pg_basebackup -d 'port=5400 user=postgres' -c fast -R -D ./standby
$ echo 'port=5401' >>./standby/postgresql.auto.conf
$ pg_ctl start -D ./standby -l ./standby/log

standby=# SELECT pg_is_in_recovery();

 pg_is_in_recovery
-------------------
 t
(1 row)

The replica must be stopped before conversion.


$ pg_ctl stop -D ./standby

The pg_createsubscriber tool runs on the physical replica server.


$ pg_createsubscriber -d postgres -U postgres -D ./standby -P 'port=5400 user=postgres'

As it works, the tool outputs a lot of diagnostics data about the actions being performed. If the conversion completes without errors, the logical replica is ready.


$ pg_ctl start -D ./standby -l ./standby/log

A publication has been created in the postgres database on the primary server for all tables.


primary=# \x \dRp

Expanded display is on.
List of publications
-[ RECORD 1 ]------------------------------
Name       | pg_createsubscriber_5_d3ba035a
Owner      | postgres
All tables | t
Inserts    | t
Updates    | t
Deletes    | t
Truncates  | t
Via root   | f

primary=# INSERT INTO test (id) VALUES (2);

And a subscription to the publication has been created on the standby server in the same database.


standby=# SELECT * FROM test;

 id
----
  1
  2
(2 rows)

standby=# \x \dRs

Expanded display is on.
List of subscriptions
-[ RECORD 1 ]---------------------------------
Name        | pg_createsubscriber_5_dec1faeb
Owner       | postgres
Enabled     | t
Publication | {pg_createsubscriber_5_dec1faeb}

Logical slots: tracking the causes of replication conflicts

commit: 007693f2, 6ae701b4

In the pg_replication_slots view, the conflicting column shows whether a logical slot has been invalidated, and the new invalidation_reason column tells you why. Among the reasons that can currently be reported are wal_removed, rows_removed and wal_level_insufficient. See the documentation for the full list.
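
For example:


SELECT slot_name, conflicting, invalidation_reason
FROM pg_replication_slots;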

pg_basebackup -R: dbname in primary_conninfo

commit: a145f424

In the September CommitFest article I talked about the new ability to specify the dbname parameter when launching pg_basebackup and pg_receivewal.

Now, if a database was specified in the connection string when running pg_basebackup with the -R parameter, dbname will be included in the primary_conninfo parameter in postgresql.auto.conf. This helps when configuring a replica with replication slots synchronization.
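
For example (the paths are illustrative; the primary_conninfo line is abridged):


$ pg_basebackup -d 'port=5400 user=postgres dbname=postgres' -R -D ./replica
$ grep dbname ./replica/postgresql.auto.conf

primary_conninfo = 'user=postgres ... port=5400 ... dbname=postgres'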

Synchronization of logical replication slots between the primary server and replicas

commit: c393308b, 73292404, 776621a5, ddd5f4f5, 93db6cbd, bf279ddd, a11f330b, 2ec005b4, 6d49c8d4, 6f132ed6

This set of patches aims to ensure smooth logical replication when switching from the publisher to a replica. The problem: logical slots are not transferred to replicas, so when a failure or a planned switchover occurs and a replica becomes the publisher, logical replication on subscribers stops.

The solution: to transfer the logical slots to the replica and keep them up to date. Then, after switching to the replica, all you need to do is to set the connection string to the new server in the subscription settings and continue receiving changes without data loss.

Letʼs set up three local servers on different ports:

  • pub_primary ― publishing server (port 5400)
  • pub_standby ― physical replica of the publishing server (port 5401)
  • sub_server ― subscriber to the publication on pub_primary (port 5402)

After switchover from pub_primary to pub_standby, we want sub_server to start listening to updates from pub_standby with no data loss.

pub_primary

Initialize pub_primary and immediately set the wal_level required for logical replication.


$ PGPORT=5400; initdb -U postgres -D ./pub_primary -c port=5400 -c wal_level=logical -c cluster_name=pub_primary
$ pg_ctl start -D ./pub_primary -l ./pub_primary/log

The future replica pub_standby must operate through a physical slot. Set it up manually:


pub_primary=# SELECT pg_create_physical_replication_slot('standby_slot');

The publication consists of a single table in the schema public in the database postgres. The descr column is filled with the name of the current cluster by default.


pub_primary=# CREATE TABLE test(
    id int,
    descr text DEFAULT current_setting('cluster_name')
);
pub_primary=# INSERT INTO test (id) VALUES (1) RETURNING *;

 id |    descr    
----+-------------
  1 | pub_primary
(1 row)

pub_primary=# CREATE PUBLICATION test_pub FOR TABLE test;

pub_standby

Create a replica from a physical backup of pub_primary, which was created with pg_basebackup. Use the slot created earlier.


$ pg_basebackup -d 'port=5400 user=postgres dbname=postgres' -S standby_slot -R -D ./pub_standby

pg_basebackup with the flag -R will create a postgresql.auto.conf file with parameters required to run the backup as a replica. Adjust the configuration:


$ echo 'port=5401' >>./pub_standby/postgresql.auto.conf
$ echo 'cluster_name=pub_standby' >>./pub_standby/postgresql.auto.conf
$ echo 'sync_replication_slots=on' >>./pub_standby/postgresql.auto.conf
$ echo 'hot_standby_feedback=on' >>./pub_standby/postgresql.auto.conf

The parameters port and cluster_name are self-explanatory. The new parameter sync_replication_slots deserves a closer look. It starts a dedicated slotsync worker process on the replica. This process creates logical slots on pub_standby mirroring those on pub_primary and keeps them synchronized. (The slots can also be synchronized manually by running the function pg_sync_replication_slots on the replica.)

Note the dbname parameter in the pg_basebackup connection string. It is not necessary to specify a database to create a copy of the cluster. The parameter dbname is there in order for pg_basebackup to transfer it to the primary_conninfo connection string. This is necessary to synchronize logical slots that are defined at the database level.

In addition to that, the replica must have feedback enabled, hence the parameter hot_standby_feedback.

Start the replica and check the configuration.


pg_ctl start -D ./pub_standby -l ./pub_standby/log

pub_standby=# \dconfig primary_*|hot_*|sync_*|wal_level

                                                                                                                                                            List of configuration parameters
       Parameter        |                                                                                                                                                             Value                                                                                                                                                              
------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 hot_standby            | on
 hot_standby_feedback   | on
 primary_conninfo       | user=postgres passfile='/home/pal/.pgpass' channel_binding=disable port=5400 sslmode=disable sslnegotiation=postgres sslcompression=0 sslcertmode=disable sslsni=1 ssl_min_protocol_version=TLSv1.2 gssencmode=disable krbsrvname=postgres gssdelegation=0 target_session_attrs=any load_balance_hosts=disable dbname=postgres
 primary_slot_name      | standby_slot
 sync_replication_slots | on
 wal_level              | logical
(6 rows)

The new process slotsync worker is already running:


SELECT datname, pid, wait_event_type, wait_event, state
FROM pg_stat_activity
WHERE backend_type = 'slotsync worker'
\gx

-[ RECORD 1 ]---+------------------------
datname         | postgres
pid             | 171422
wait_event_type | Activity
wait_event      | ReplicationSlotsyncMain
state           |

There are no logical replication slots on pub_primary yet, so thereʼs nothing to sync. Now is the time to set up the subscription server.

sub_server

Cluster initialization and startup:


$ PGPORT=5402; initdb -U postgres -D ./sub_server -c port=5402 -c cluster_name=sub_server
$ pg_ctl start -D ./sub_server -l ./sub_server/log

Create a table and subscribe to updates from pub_primary:


sub_server=# CREATE TABLE test(
    id int,
    descr text DEFAULT current_setting('cluster_name')
);
sub_server=# CREATE SUBSCRIPTION test_sub
    CONNECTION 'port=5400 user=postgres dbname=postgres'
    PUBLICATION test_pub WITH (failover = true);

Note the new subscription parameter failover. It adds the failover attribute to the logical replication slot that CREATE SUBSCRIPTION creates on pub_primary. Only slots with this attribute are synchronized to the replica. (The function pg_create_logical_replication_slot gets the same parameter for creating failover slots.)

Verify that the slots synchronize

Check that the created logical slot on pub_primary has the failover attribute:


pub_primary=# SELECT slot_name, slot_type, database, failover, synced, inactive_since
FROM pg_replication_slots
WHERE slot_name = 'test_sub'\gx

-[ RECORD 1 ]--+---------
slot_name      | test_sub
slot_type      | logical
database       | postgres
failover       | t
synced         | f
inactive_since |

The slot on the replica exists and synchronizes with pub_primary (synced). The last sync time is listed under inactive_since.


pub_standby=# SELECT slot_name, slot_type, database, failover, synced, inactive_since
FROM pg_replication_slots\gx

-[ RECORD 1 ]--+------------------------------
slot_name      | test_sub
slot_type      | logical
database       | postgres
failover       | t
synced         | t
inactive_since | 2024-08-12 14:20:07.351486+03

Thereʼs one more thing to be taken care of. An asynchronous replica may receive updates later than a subscriber does because of lag or other reasons. If the publisher crashes and switches over to the replica, the subscriber will no longer receive updates from the replica, since it is now ahead of it. To ensure that the subscriber does not get ahead, the name of the physical slot is set in the new parameter synchronized_standby_slots on the publication server:


pub_primary=# ALTER SYSTEM SET synchronized_standby_slots = 'standby_slot';
pub_primary=# SELECT pg_reload_conf();

The parameter stores a list of physical replication slots. WAL sender processes serving logical replication will send data to subscribers only after all slots from this list confirm receipt of WAL records. In case of lag on the replica, subscribers will not receive records ahead of it.

Now all the preparations have been completed. Add a second row to the table at the publisher:


pub_primary=# INSERT INTO test (id) VALUES (2) RETURNING *;

 id |    descr    
----+-------------
  2 | pub_primary
(1 row)

Crash and switchover to replica

To simulate a crash, stop pub_primary in immediate mode:


$ pg_ctl stop -m immediate -D ./pub_primary

Before switching to the replica, the subscription should be disabled.


sub_server=# ALTER SUBSCRIPTION test_sub DISABLE;

Switch the replica to read/write mode:


$ pg_ctl promote -D ./pub_standby

Change the connection string in the subscription and enable it:


sub_server=# ALTER SUBSCRIPTION test_sub
    CONNECTION 'port=5401 user=postgres dbname=postgres';
sub_server=# ALTER SUBSCRIPTION test_sub ENABLE;

Verify that the logical replication works

On the new publisher, add a third row to the table:


pub_standby=# INSERT INTO test (id) VALUES (3) RETURNING *;

 id |    descr    
----+-------------
  3 | pub_standby
(1 row)

The subscriber receives updates:


sub_server=# SELECT * FROM test;

 id |    descr    
----+-------------
  1 | pub_primary
  2 | pub_primary
  3 | pub_standby
(3 rows)

Conclusion. In order for the subscriber to continue receiving updates after switchover to a physical replica, you need to do the following:

  • Add the failover attribute to the logical slot at the publisher by setting the corresponding parameter in the subscription configuration.
  • The replica must run on a physical slot with feedback enabled.
  • The replica must have the parameter sync_replication_slots enabled to start the slotsync worker process. This process synchronizes slots with the main publishing server.
  • On the publishing server, the physical slot should be added to synchronized_standby_slots. This ensures that subscribers will not apply updates before the replica does.
  • After switchover, you just need to change the connection string in the subscription configuration.

See also:

Postgres 17 highlight: Logical replication slots synchronization (Bertrand Drouvot)

Logical decoding optimization for subtransactions

commit: 5bec1d6b

Logical decoding and, consequently, logical replication have been optimized significantly when it comes to processing a lot of subtransactions.

The developersʼ own tests on a transaction containing 100,000 subtransactions show a more than 30-fold performance increase.

libpq: non-blocking query cancellation

commit: 61461a30

libpq gets a new interface for non-blocking query cancellation. The existing PQcancel function is now deprecated: it blocks the application while the cancel request is being sent, and it is not secure, since the cancel request goes over an unencrypted connection.

The new interface can be used by drivers working through libpq. In particular, there are plans to implement non-blocking query cancellation in psycopg 3.2.

See also:

Improved query cancellation in PostgreSQL 17 and Psycopg 3.2 (Denis Laxalde)

libpq: direct connection via TLS

commit: d39a49c1

Client applications running through libpq can use the new parameter sslnegotiation in the connection string.

With sslnegotiation=direct, when connecting to a server running PostgreSQL 17 or newer (which supports ALPN), the client skips the separate round trip that requests and approves the use of TLS and starts the TLS handshake right away, which reduces connection time.

Direct connection is disabled by default.
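
For example, in a connection string (the host name is made up):


$ psql 'host=db.example.com dbname=postgres sslmode=require sslnegotiation=direct'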

vacuumdb, clusterdb, reindexdb: processing individual objects in multiple databases

commit: 648928c7, 1b49d56d, 24c928ad

If there are identical objects in different databases in the same cluster, then they can be processed in one call by the vacuumdb, clusterdb and reindexdb utilities. For example, vacuum all pg_catalog schemas:


$ vacuumdb --all --schema=pg_catalog

vacuumdb: vacuuming database "demo"
vacuumdb: vacuuming database "postgres"
vacuumdb: vacuuming database "template1"

reindexdb: --jobs and --index at the same time

commit: 47f99a40

reindexdb could not work in parallel mode (--jobs) when individual indexes were specified with several --index parameters, because indexes belonging to the same table cannot be rebuilt in parallel. As a result, the utility couldnʼt rebuild indexes from different tables in parallel either.

Now the restriction has been lifted. reindexdb can be run with the --jobs and --index parameters simultaneously, while the indexes belonging to the same table will be reindexed by the same process.
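
For example, using two indexes from the demo database:


$ reindexdb --jobs=2 --index=bookings_pkey --index=ticket_flights_pkey demo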

psql: new implementation of FETCH_COUNT

commit: 90f51782

The psql variable FETCH_COUNT makes psql fetch query results in batches of the given size rather than retrieve the whole result set at once.

How did it work before? For a SELECT, psql declared a cursor and then fetched from it with a series of FETCH FORWARD commands, each returning the number of rows specified in FETCH_COUNT.

However, not all queries that return sets of rows support cursors. One example is DML commands with the keyword RETURNING, which always returned the full output, regardless of the FETCH_COUNT value.

In the new implementation, FETCH_COUNT uses libpqʼs single-row mode instead of cursors, which works with any type of query.
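
Usage stays the same; with the setting below, rows are fetched 100 at a time regardless of the query type:


=# \set FETCH_COUNT 100
=# SELECT * FROM bookings;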

pg_dump --exclude-extension

commit: 522ed12f

In PostgreSQL 14, pg_dump got the --extension parameter. It defines which extensions should be included into the archive. PostgreSQL 17 adds the reverse parameter --exclude-extension: extensions that do not need to be included.
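
For example (the extension and file names are illustrative):


$ pg_dump --exclude-extension=postgis -d demo -f demo_dump.sql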

Backup and restore large objects

commit: a45c78e3, 959b38d7

Most system catalog objects are designed to define a data schema: tables, views, sequences, data types, functions, etc. The number of these objects in production systems can be quite large, but not unimaginably so.

The large objects stand out among them. These unusual system catalog objects are, in fact, user data, and there can be a lot of them. The discussion around this patch was initiated by an email from someone unable to run pg_upgrade on a database containing ~200 million large objects. The pg_restore utility, which restores system catalog objects in the new cluster, failed. Each large object is restored by a separate command, and therefore in a separate transaction. This makes the system burn through the transaction counter rapidly, which may lead to wraparound.

Besides, even if pg_dump/pg_restore completes without errors, its speed goes down as the number of large objects goes up. pg_dump includes every large object in the archive table of contents, bloating the archive and slowing down table of contents access. And pg_restore not only blasts through the transaction counter, but also writes each transaction to WAL with subsequent synchronization to disk (fsync).

To optimize pg_dump and pg_restore, the developers took the following steps. pg_dump now groups large objects in the archive table of contents into chunks of 1,000 objects each (the first commit).

pg_restore gets a new parameter --transaction-size (second commit). As the name implies, it defines the number of SQL commands that are grouped into a single transaction during recovery. This reduces WAL and disk sync overhead, increasing the recovery speed, and also goes easier on the transaction counter. In PostgreSQL 16, by default, each command is executed in a separate transaction, unless the --single-transaction flag is set.

With the new patch, pg_upgrade will call pg_restore with a hard-coded value of --transaction-size=1000.
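
When running pg_restore manually, the parameter can be set explicitly (the file and database names are made up):


$ pg_restore --transaction-size=1000 -d newdb demo.dump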

 

Thatʼs all for the upcoming PostgreSQL 17 release. Hopefully, the release will come out on time, with no unexpected delays.

Meanwhile, further development is already under way: the first, July CommitFest for PostgreSQL 18 has concluded and the second, September one, has begun. More to research and share.


Pavel Luzanov