PostgreSQL 19: part 3 of CommitFest 2025-11
This article reviews the November 2025 CommitFest.
For the highlights of the previous two CommitFests, check out our last posts: 2025-07, 2025-09.
- Planner: eager aggregation
- Converting COUNT(1) and COUNT(not_null_col) to COUNT(*)
- Parallel TID Range Scan
- COPY … TO with partitioned tables
- New function error_on_null
- Planner support functions for optimizing set-returning functions (SRF)
- SQL-standard style functions with temporary objects
- BRIN indexes: using the read stream interface for vacuuming
- WAIT FOR: waiting for synchronization between replica and primary
- Logical replication of sequences
- pg_stat_replication_slots: a counter for memory limit overruns during logical decoding
- pg_buffercache: buffer distribution across OS pages
- pg_buffercache: marking buffers as dirty
- Statistics reset time for individual relations and functions
- Monitoring the volume of full page images written to WAL
- New parameter log_autoanalyze_min_duration
- psql: search path in the prompt
- psql: displaying boolean values
- pg_rewind: skip copying WAL segments already present on the target server
- pgbench: continue running after SQL command errors
Planner: eager aggregation
commit: 8e11859102f
Consider the following query. Here, a large table of factual data (routes) is joined with a small table of reference data (airplanes_data). The result is aggregated by reference table columns.
EXPLAIN (costs off)
SELECT a.airplane_code, a.model, count(*)
FROM routes r
JOIN airplanes_data a USING (airplane_code)
GROUP BY a.airplane_code, a.model;

Query plan in PostgreSQL 18:
QUERY PLAN
--------------------------------------------------------
HashAggregate
Group Key: a.airplane_code
-> Hash Join
Hash Cond: (r.airplane_code = a.airplane_code)
-> Seq Scan on routes r
-> Hash
-> Seq Scan on airplanes_data a
(7 rows)

There is a problem with this plan. The tables are joined first, and the aggregation happens after. However, we could aggregate the large table first and then join the remaining rows with the reference table. We can get the planner to do that by rewriting the query:
EXPLAIN (costs off)
SELECT a.airplane_code, a.model, r.total
FROM (SELECT r.airplane_code, count(*) AS total
FROM routes r
GROUP BY r.airplane_code
) r
JOIN airplanes_data a USING (airplane_code);
QUERY PLAN
--------------------------------------------------
Hash Join
Hash Cond: (a.airplane_code = r.airplane_code)
-> Seq Scan on airplanes_data a
-> Hash
-> Subquery Scan on r
-> HashAggregate
Group Key: r_1.airplane_code
-> Seq Scan on routes r_1
(8 rows)

Now let’s look at the execution plan for the original query in PostgreSQL 19:
EXPLAIN (costs off)
SELECT a.airplane_code, a.model, count(*)
FROM routes r
JOIN airplanes_data a USING (airplane_code)
GROUP BY a.airplane_code, a.model;
QUERY PLAN
----------------------------------------------------------------------------
Finalize GroupAggregate
Group Key: a.airplane_code
-> Sort
Sort Key: a.airplane_code
-> Nested Loop
-> Partial HashAggregate
Group Key: r.airplane_code
-> Seq Scan on routes r
-> Index Scan using airplanes_data_pkey on airplanes_data a
Index Cond: (airplane_code = r.airplane_code)
(10 rows)

With this new optimization, the planner figured out on its own that it’s better to first aggregate the large table to dramatically reduce the number of rows, and only then join the result with the reference table. After that, the final aggregation is performed.
The optimization can be disabled using the new enable_eager_aggregate parameter. It’s enabled by default.
However, if eager aggregation does not significantly reduce the number of rows, it’s better to avoid it altogether. Another new parameter, min_eager_agg_group_size, controls when eager aggregation kicks in. If the planner estimates that the average number of rows per group is at least the parameter value, the optimization is applied. The default is 8, meaning the optimization only kicks in if aggregation is expected to reduce the original dataset by a factor of at least 8.
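Both parameters can be adjusted per session when experimenting with the optimization. A minimal sketch (the threshold value here is purely illustrative):

```sql
-- Turn eager aggregation off for the current session
SET enable_eager_aggregate = off;

-- Turn it back on, but require a larger estimated
-- average group size before the planner applies it
SET enable_eager_aggregate = on;
SET min_eager_agg_group_size = 16;  -- the default is 8
```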
Converting COUNT(1) and COUNT(not_null_col) to COUNT(*)
commit: 42473b3b312
If you pass a constant or a NOT NULL column as a parameter to the COUNT aggregate function, the planner will automatically convert the call to COUNT(*):
EXPLAIN(verbose, costs off)
SELECT count(1), count(seat_no)
FROM seats;
QUERY PLAN
---------------------------------------------------------
Aggregate
Output: count(), count()
-> Seq Scan on bookings.seats
Output: airplane_code, seat_no, fare_conditions
(4 rows)

This optimization improves query performance by eliminating unnecessary overhead from tuple deforming.
The substitution is performed by a planner support function attached to the COUNT aggregate. The SupportRequestSimplifyAggref structure has been added to the list of request types that support functions can handle. For more details, see the src/include/nodes/supportnodes.h file.
Parallel TID Range Scan
commit: 0ca3b16973a
The TID Range Scan operation was introduced in PostgreSQL 14. Now this operation can be executed in parallel by multiple processes (Parallel TID Range Scan):
EXPLAIN(costs off)
SELECT count(*)
FROM bookings
WHERE ctid > '(42,0)';
QUERY PLAN
-------------------------------------------------------
Finalize Aggregate
-> Gather
Workers Planned: 2
-> Partial Aggregate
-> Parallel Tid Range Scan on bookings
TID Cond: (ctid > '(42,0)'::tid)
(6 rows)

COPY … TO with partitioned tables
commit: 4bea91f21f6, 266543a6205
Before PostgreSQL 19, the COPY … TO command did not accept partitioned tables:
CREATE TABLE p (id int) PARTITION BY RANGE (id);
CREATE TABLE p1 PARTITION OF p FOR VALUES FROM (0) TO (100);
INSERT INTO p VALUES (42);
18=# COPY p TO stdout;
ERROR: cannot copy from partitioned table "p"
HINT: Try the COPY (SELECT ...) TO variant.

Now it does:
19=# COPY p TO stdout;
42

This approach will also be used for the initial synchronization of partitioned tables in logical replication (second commit).
New function error_on_null
commit: 2b75c38b707
If the value passed to the new polymorphic function is not NULL, the function returns that value:
SELECT error_on_null(42),
error_on_null(ARRAY[1,NULL]),
error_on_null(ROW());
error_on_null | error_on_null | error_on_null
---------------+---------------+---------------
42 | {1,NULL} | ()
(1 row)

Otherwise, it raises an error:
SELECT error_on_null(NULL::int);
ERROR: null value not allowed

Planner support functions for optimizing set-returning functions (SRF)
commit: b140c8d7a3f
When a query calls an SQL-language function, the function’s query can be inlined into the main query under certain conditions. This can significantly improve the query execution plan. However, this capability isn’t available for functions written in other languages, such as PL/pgSQL.
This patch enables you to write a C support function for a custom set-returning function (by implementing the SupportRequestInlineInFrom structure) that tells the planner what query can be used in place of the function call. With this information, the planner can inline the output of the support function into the main query.
The patch doesn’t include any ready-made support functions – you’ll need to write your own.
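A support function is attached to its target with the SUPPORT clause of CREATE FUNCTION, same as with the existing support request types. A hedged sketch of the SQL side, assuming a C function my_srf_support that handles SupportRequestInlineInFrom has been compiled into a shared library my_extension (both names are hypothetical):

```sql
-- Register the support function (implemented in C)
CREATE FUNCTION my_srf_support(internal) RETURNS internal
    AS 'my_extension', 'my_srf_support'
    LANGUAGE c STRICT;

-- Attach it to a set-returning function written in PL/pgSQL
CREATE FUNCTION my_srf(n int) RETURNS SETOF int
    LANGUAGE plpgsql SUPPORT my_srf_support
AS $$
BEGIN
    RETURN QUERY SELECT generate_series(1, n);
END;
$$;
```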
SQL-standard style functions with temporary objects
commit: 572c40ba9, 698fa924b11
Views and functions referencing temporary objects are created as temporary.
CREATE TEMP TABLE tmp AS SELECT 42 col;

Creating a view will issue a notice about this:
CREATE VIEW v_tmp AS SELECT * FROM tmp;
NOTICE: view "v_tmp" will be a temporary view
CREATE VIEW

However, creating a function in SQL-standard style produced no such notice. The function was created in the public schema without issue:
18=# CREATE FUNCTION f_tmp() RETURNS bigint LANGUAGE sql RETURN (SELECT count(*) FROM tmp);
CREATE FUNCTION
18=# \df
List of functions
Schema | Name | Result data type | Argument data types | Type
--------+-------+------------------+---------------------+------
public | f_tmp | bigint | | func
(1 row)But it would disappear in new sessions:
18=# \c
You are now connected to database "postgres" as user "postgres".
18=# SELECT f_tmp();
ERROR: function f_tmp() does not exist
LINE 1: SELECT f_tmp();

In PostgreSQL 19, creating such a function now issues a notice that the function is effectively temporary:
19=# CREATE FUNCTION f_tmp() RETURNS bigint LANGUAGE sql RETURN (SELECT count(*) FROM tmp);
NOTICE: function "f_tmp" will be effectively temporary
DETAIL: It depends on temporary table tmp.
CREATE FUNCTION

BRIN indexes: using the read stream interface for vacuuming
commit: a6eac2273e6
In PostgreSQL 18, the streaming read interface was extended to enable vacuuming of Btree, GiST, and SP-GiST indexes, among other uses. Now it also enables vacuuming of BRIN indexes.
WAIT FOR: waiting for synchronization between replica and primary
commit: 447aae13b03
To distribute the load, applications can write data to the primary server while reading from a replica. However, if you need to retrieve changes from a replica that were just written by the application, you need a mechanism to ensure those changes are already synchronized to the replica. With asynchronous replication, delays in applying WAL records happen all the time.
The new SQL command WAIT FOR LSN provides a mechanism for handling these situations. Here’s how it works: after making changes on the primary, the application saves the current LSN position, then waits for that position to be applied on the replica using the WAIT FOR LSN command.
Let’s create a table on the primary.
primary=# CREATE TABLE test (id int);
primary=# INSERT INTO test VALUES (42);

Save the current LSN position. We’ll wait for this position to be applied on the replica.
primary=# SELECT pg_current_wal_insert_lsn() AS lsn
\gset

Without leaving psql, connect to the replica.
primary=# \connect postgres postgres localhost 5402
replica=# SELECT pg_is_in_recovery();
pg_is_in_recovery
-------------------
t
(1 row)

For demonstration purposes, the replica is configured to apply WALs with a 30-second delay. The test table hasn’t appeared here yet.
replica=# \dconfig recovery_min_apply_delay
List of configuration parameters
Parameter | Value
--------------------------+-------
recovery_min_apply_delay | 30s
(1 row)

replica=# SELECT * FROM test;
ERROR: relation "test" does not exist

Run the WAIT FOR LSN command and wait for the replica to apply the record with the LSN we saved.
replica=# WAIT FOR LSN :'lsn' WITH (MODE 'standby_replay');
status
---------
success
(1 row)

Once the command completes successfully, you can be sure that the changes made on the primary are now available on the replica.
replica=# SELECT * FROM test;
id
----
42
(1 row)

Logical replication of sequences
commit: b93172ca59f, 96b37849734, f0b3573c3aa, 5509055d695, f6a4c498dcf, 55cefadde87
Until now, logical replication subscribers only received changes to table rows. Now it’s possible to synchronize sequence values on the subscriber server with the values on the publisher server. Synchronization is performed manually and applies to all sequences. Let’s see how it works.
First, create a sequence on the publisher server:
pub=# CREATE SEQUENCE seq;

You cannot publish individual sequences (at least not yet); the CREATE PUBLICATION command includes all sequences in the publication:
pub=# CREATE PUBLICATION pub FOR ALL SEQUENCES;

Now let’s create the same sequence on the subscriber and subscribe to the publication:
sub=# CREATE SEQUENCE seq;
sub=# CREATE SUBSCRIPTION sub CONNECTION 'port=5401 dbname=postgres' PUBLICATION pub;

Let’s fetch a few values from the sequence:
pub=# SELECT nextval('seq') FROM generate_series(1,3);
nextval
---------
1
2
3
(3 rows)

The new pg_get_sequence_data function displays information about the current state of a sequence:
pub=# SELECT * FROM pg_get_sequence_data('seq');
last_value | is_called | page_lsn
------------+-----------+------------
3 | t | 2/650B51C0
(1 row)

To obtain the first two columns, just query the sequence as you would a table. The last column, page_lsn, shows the LSN of the WAL record that most recently modified the sequence value.
You can check the sequence state on the subscriber in pg_subscription_rel:
sub=# SELECT srrelid::regclass, srsubstate, srsublsn
FROM pg_subscription_rel;
srrelid | srsubstate | srsublsn
---------+------------+------------
seq | r | 2/650B3270
(1 row)

When sequences are synchronized between servers, the srsublsn value should match the page_lsn shown above on the publication server. Currently, that’s not the case: advancing a sequence on the publication server doesn’t automatically propagate to the subscription server. You need to synchronize manually:
sub=# ALTER SUBSCRIPTION sub REFRESH SEQUENCES;
sub=# SELECT srrelid::regclass, srsubstate, srsublsn FROM pg_subscription_rel;
srrelid | srsubstate | srsublsn
---------+------------+------------
seq | r | 2/650B51C0
(1 row)

Another command, ALTER SUBSCRIPTION ... REFRESH PUBLICATION, additionally includes and synchronizes any new sequences that have appeared on the publication server. The sequence synchronization is handled by the sequence synchronization worker process.
Now the subscriber has the same state as the publication server:
sub=# SELECT * FROM seq;
last_value | log_cnt | is_called
------------+---------+-----------
3 | 0 | t
(1 row)

If synchronization fails, you can learn about it not only from the server log, but also from the pg_stat_subscription_stats view: there is a new column with a counter for sequence synchronization errors.
SELECT subname, sync_seq_error_count
FROM pg_stat_subscription_stats
WHERE subname = 'sub';
subname | sync_seq_error_count
---------+----------------------
sub | 0
(1 row)

pg_stat_replication_slots: a counter for memory limit overruns during logical decoding
commit: d3b6183dd98
The logical_decoding_work_mem parameter sets the amount of memory allocated for logical decoding of transactions. If decoding requires more memory, some of the data spills to disk. How often does this actually happen in practice?
A new column, mem_exceeded_count, in the pg_stat_replication_slots view holds the answer. It tracks how many times the logical_decoding_work_mem limit was exceeded, forcing data to spill to disk. This is useful for tuning the logical_decoding_work_mem setting.
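For example, to compare the new counter with the existing spill statistics for each slot (a simple query over the view; slot names will of course depend on your setup):

```sql
SELECT slot_name,
       mem_exceeded_count,  -- times the memory limit was exceeded
       spill_txns,          -- transactions spilled to disk
       spill_bytes          -- bytes spilled to disk
FROM pg_stat_replication_slots;
```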
\d pg_stat_replication_slots
View "pg_catalog.pg_stat_replication_slots"
Column | Type | Collation | Nullable | Default
---------------------+--------------------------+-----------+----------+---------
slot_name | text | | |
spill_txns | bigint | | |
spill_count | bigint | | |
spill_bytes | bigint | | |
stream_txns | bigint | | |
stream_count | bigint | | |
stream_bytes | bigint | | |
mem_exceeded_count | bigint | | |
total_txns | bigint | | |
total_bytes | bigint | | |
slotsync_skip_count | bigint | | |
slotsync_last_skip | timestamp with time zone | | |
stats_reset | timestamp with time zone | | |

pg_buffercache: buffer distribution across OS pages
commit: 4b203d499c6
The pg_buffercache_numa view introduced in PostgreSQL 18 shows how each buffer in the shared buffer cache is distributed across NUMA nodes (numa_node) and operating system pages (os_page_num):
CREATE EXTENSION pg_buffercache;
SELECT *
FROM pg_buffercache_numa
WHERE bufferid = 42;
bufferid | os_page_num | numa_node
----------+-------------+-----------
42 | 82 | 0
42 | 83 | 0
(2 rows)

This view is not available on systems without NUMA support, but information about buffer distribution across OS pages can still be useful. You can check this out in the new pg_buffercache_os_pages view:
SELECT *
FROM pg_buffercache_os_pages
WHERE bufferid = 42;
bufferid | os_page_num
----------+-------------
42 | 82
42 | 83
(2 rows)

pg_buffercache: marking buffers as dirty
commit: 9ccc049dfe6
In PostgreSQL 17 and 18, the pg_buffercache extension gained functions for evicting buffers from the shared buffer cache: pg_buffercache_evict, pg_buffercache_evict_relation, and pg_buffercache_evict_all.
Now, primarily for testing purposes, functions have been added to mark buffers as dirty: pg_buffercache_mark_dirty, pg_buffercache_mark_dirty_relation, and pg_buffercache_mark_dirty_all. The first marks a single specified buffer as dirty, the second marks all buffers of a specified relation, and the third marks all cache buffers.
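Judging by the eviction counterparts, the calling convention should look roughly like this; a sketch, assuming the argument types mirror pg_buffercache_evict and pg_buffercache_evict_relation (the buffer number and table are arbitrary):

```sql
CREATE EXTENSION IF NOT EXISTS pg_buffercache;

-- Mark a single buffer as dirty
SELECT pg_buffercache_mark_dirty(42);

-- Mark all cached buffers of one relation as dirty
SELECT pg_buffercache_mark_dirty_relation('bookings'::regclass);

-- Mark the entire buffer cache as dirty
SELECT pg_buffercache_mark_dirty_all();
```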
Statistics reset time for individual relations and functions
commit: a5b543258aa, b71bae41a0c
The pg_stat_reset_single_table_counters function allows you to reset cumulative statistics for an individual relation (table, index, etc.). Similarly, the function pg_stat_reset_single_function_counters resets statistics for an individual function.
However, the date and time of the reset were not recorded anywhere.
Now the pg_stat_{all|user}_{tables|indexes} views, as well as pg_stat_user_functions, include a stats_reset column.
SELECT pg_stat_reset_single_table_counters('bookings'::regclass);
pg_stat_reset_single_table_counters
-------------------------------------
(1 row)

SELECT stats_reset
FROM pg_stat_user_tables
WHERE relid = 'bookings'::regclass; stats_reset
-------------------------------
2026-01-13 22:08:12.686371+03
(1 row)

Monitoring the volume of full page images written to WAL
commit: ad25744f436, f9a09aa29520
The column wal_fpi_bytes has been added to the pg_stat_wal view that holds system-wide statistics, and to the pg_stat_get_backend_wal function. The column shows the total size of full page images written to WAL.
CREATE TABLE bookings_copy
AS SELECT * FROM bookings;
SELECT 4905238

SELECT *
FROM pg_stat_get_backend_wal(pg_backend_pid())
\gx
-[ RECORD 1 ]----+----------
wal_records | 9842578
wal_fpi | 160
wal_bytes | 759246449
wal_fpi_bytes | 1014192
wal_buffers_full | 84566
stats_reset |

The full page image bytes value has also been added to the output of the VACUUM and ANALYZE commands when run with the verbose option, as well as to the server log:
VACUUM(verbose) bookings_copy;
INFO: vacuuming "demo.bookings.bookings_copy"
…
WAL usage: 31248 records, 4 full page images, 1876574 bytes, 32768 full page image bytes, 0 buffers full
…
VACUUM

New parameter log_autoanalyze_min_duration
commit: dd3ae378301
Previously, the log_autovacuum_min_duration parameter triggered logging whenever an autovacuum worker process exceeded the specified run time for either vacuuming or statistics collection (analyze).
Each of the two autovacuum operations now has its own dedicated logging parameter. The log_autovacuum_min_duration parameter now applies exclusively to vacuum operations, while the new log_autoanalyze_min_duration parameter controls logging for statistics collection.
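Both parameters have sighup context, so changing them doesn’t require a restart. For example, to log only long analyze runs while logging every vacuum run (the thresholds here are illustrative):

```sql
ALTER SYSTEM SET log_autoanalyze_min_duration = '10min';
ALTER SYSTEM SET log_autovacuum_min_duration = 0;  -- log all vacuum runs
SELECT pg_reload_conf();
```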
\dconfig+ log_auto*_min_duration
List of configuration parameters
Parameter | Value | Type | Context | Access privileges
------------------------------+-------+---------+---------+-------------------
log_autoanalyze_min_duration | 10min | integer | sighup |
log_autovacuum_min_duration | 10min | integer | sighup |
(2 rows)

psql: search path in the prompt
commit: b3ce55f413c
The psql prompt can now include the search path, or more precisely, the value of the search_path parameter. To do this, add %S to the PROMPT1 and PROMPT2 variables.
postgres@demo=# \set PROMPT1 '%n%@%/:%S%R%x%# '
postgres@demo:bookings, "$user", public=#
postgres@demo:bookings, "$user", public=# SET search_path = pg_catalog;
SET
postgres@demo:pg_catalog=#

psql: displaying boolean values
commit: 645cb44c549
psql displays true and false as t and f:
SELECT 1=1, 1<>1;
?column? | ?column?
----------+----------
t | f
(1 row)

New \pset options allow you to customize how boolean values are displayed.
\pset display_true 'True'
\pset display_false 'False'
SELECT 1=1, 1<>1;
?column? | ?column?
----------+----------
True | False
(1 row)

As a reminder, NULL value display is controlled by \pset null.
pg_rewind: skip copying WAL segments already present on the target server
commit: 6ae08d9583e, 5173bfd0443
pg_rewind will run faster by skipping the copying of WAL segments that were created before the source and target servers diverged. Previously, WAL segments were always copied; now they’re only copied if the file is missing or the file sizes differ.
pgbench: continue running after SQL command errors
commit: 0ab208fa505
The pgbench utility can now continue running after any SQL command errors, not just serialization errors or deadlocks. This is particularly useful when working with custom test scripts.
Let’s say we have a table with a primary key:
CREATE TABLE test (id int PRIMARY KEY);

And a script to populate the table with random numbers:
$ cat load.sql
INSERT INTO test VALUES (random(1,100));

Running this script multiple times will obviously cause unique constraint violations. To ignore these errors, we’ll run pgbench with the new --continue-on-error option, while --failures-detailed will show the percentage of failed transactions in the number of other failures line.
$ pgbench -d postgres -f load.sql --continue-on-error --failures-detailed -t 100
pgbench (19devel)
starting vacuum...end.
transaction type: load.sql
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
maximum number of tries: 1
number of transactions per client: 100
number of transactions actually processed: 65/100
number of failed transactions: 35 (35%)
number of serialization failures: 0 (0%)
number of deadlock failures: 0 (0%)
number of other failures: 35 (35%)
latency average = 519 ms (including failures)
initial connection time = 4.194 ms
tps = 1251.974267 (without initial connection time)

Out of 100 transactions executed, 35 failed, which means the table now contains 65 rows. This is easy to verify:
$ psql -d postgres -c 'SELECT count(*) FROM test'
count
-------
65
(1 row)
That’s all for now. The review for the fourth, January CommitFest is coming up.