PostgreSQL 18: part 2 or CommitFest 2024-09

Statistically, September CommitFests feature the fewest commits. Apparently, the version 18 CommitFest is an outlier. There are many accepted patches and many interesting new features to talk about.

If you missed the July CommitFest, get up to speed here: 2024-07.

Conflict detection and logging in logical replication

commit: 9758174e2, edcb71258, 640178c92, 6c2b5edec

The data on a subscriber can be changed independently of the publisher, and the publisher has no way of knowing about these changes. This can lead to logical replication conflicts. For example: inserting a row on the publisher when the subscriber already has a row with the same primary key, or updating or deleting a row on the publisher when there is no matching row on the subscriber. There are six conflict variants in total, all now listed in the Conflicts section of the Logical Replication chapter. Note that not every conflict variant interrupts logical replication entirely.

Along with the documentation, there are new logging tools to monitor these conflicts. Consider the update_missing conflict, for example.

First, both on the publisher and the subscriber, create a table:

CREATE TABLE test(id int PRIMARY KEY);

On the publisher, add a row:

pub=# INSERT INTO test (id) VALUES (1);

On the subscriber, immediately delete it:

sub=# DELETE FROM test WHERE id = 1;

Now, the conflict is just waiting to happen. Update the row on the publisher:

pub=# UPDATE test SET id = 2 WHERE id = 1 RETURNING *;
 id
----
  2
(1 row)

The update comes through on the publisher, but not on the subscriber: there is no row with id=1 there. Logical replication ignores the command on the subscriber and keeps going, just as it did before the patch. The difference is that before the patch we would not even know this had happened. Now there are two new tools to catch it.

First, the pg_stat_subscription_stats view. It simply counts conflicts of each type. After the update above, the confl_update_missing counter is no longer zero:

sub=# SELECT * FROM pg_stat_subscription_stats WHERE subname = 'sub' \gx
-[ RECORD 1 ]---------------+------
subid                       | 16390
subname                     | sub
apply_error_count           | 0
sync_error_count            | 0
confl_insert_exists         | 0
confl_update_origin_differs | 0
confl_update_exists         | 0
confl_update_missing        | 1
confl_delete_origin_differs | 0
confl_delete_missing        | 0
stats_reset                 |

Second, detailed information about the conflict appears in the server log:

2025-01-23 23:33:35.763 MSK [170775] LOG:  conflict detected on relation "public.test": conflict=update_missing
2025-01-23 23:33:35.763 MSK [170775] DETAIL:  Could not find the row to be updated.
    Remote tuple (2); replica identity (id)=(1).
2025-01-23 23:33:35.763 MSK [170775] CONTEXT:  processing remote data for replication origin "pg_16390" during message type "UPDATE" for replication target relation "public.test" in transaction 746, finished at 0/183DC60

As a side note, I would love to see the developers go a step further and implement strategies that automatically resolve the conflicts that interrupt logical replication. There is hope that they will: the discussion around this patch mentions, among other things, two such strategies, apply (remote_apply) and skip (keep_local). But tracking conflicts is valuable in and of itself, and the developers decided to start with that. Looking forward to further updates!

Planner: no more 10000000000

commit: e22253467, c01743aa4, 161320b4b, 84b8fccbe

The enable_* parameter family tells the planner to avoid certain query execution paths, such as sequential scans:

CREATE TABLE t (id int);

SET enable_seqscan = off;
EXPLAIN (settings) SELECT * FROM t;

However, the planner will still choose sequential scanning because the table has no indexes, making index scanning impossible.

In PostgreSQL 17, the resulting plan comes with a gargantuan cost:

                               QUERY PLAN                               
------------------------------------------------------------------------
 Seq Scan on t  (cost=10000000000.00..10000000035.50 rows=2550 width=4)
 Settings: enable_seqscan = 'off'
(2 rows)

The actual Seq Scan cost is inflated by the prohibitive constant of 10,000,000,000. This constant makes any other access method preferable to the planner, if one exists. Since there are no other options here, the extreme number ends up in the selected plan.

The new patches remove the constant and teach the planner to work with enable_* parameters without it. The query plan now looks like this:

                     QUERY PLAN                      
-----------------------------------------------------
 Seq Scan on t  (cost=0.00..35.50 rows=2550 width=4)
   Disabled: true
 Settings: enable_seqscan = 'off'
(3 rows)

The Seq Scan is still the selected option, but without artificial inflation of the cost. The Disabled: true line indicates that the planner is forced to select a prohibited node due to the lack of other non-prohibited options.

If there is another option, it will be used:

CREATE INDEX ON t(id);
ANALYZE t;

EXPLAIN (settings) SELECT * FROM t;
                              QUERY PLAN                               
-----------------------------------------------------------------------
 Index Only Scan using t_id_idx on t  (cost=0.12..8.14 rows=1 width=4)
 Settings: enable_seqscan = 'off'
(2 rows)

In addition, the settings option of EXPLAIN (first introduced in PostgreSQL 12) makes the command a great tool for figuring out which parameter changes affect the query execution plan.
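
For example (a minimal sketch; t is the table from above, and any planner-related parameter changed from its default will show up the same way):

SET enable_seqscan = off;
SET work_mem = '64MB';
EXPLAIN (settings) SELECT * FROM t;
-- the Settings line lists every non-default parameter that affects planning,
-- for example: Settings: enable_seqscan = 'off', work_mem = '64MB'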

Planner: memory management and usage monitoring for temporary tuple storage

commit: 1eff8279d, 53abb1e0e, 590b045c3, 97651b013, 04bcf9e19, 908a96861, 9fba1ed29, 95d6e9af0, 5d56d07ca, 40708acd6

This tremendous effort spans multiple patch discussions, hence the long list of commits. Still, it is best to talk about them all together.

Some query execution plan nodes (listed below) use a common interface for temporary tuple storage (tuplestore). The patches introduce two major changes to it.

First, rows are stored within a separate memory context of the Generation type, instead of the more general CurrentMemoryContext. Without going into details, this increases speed and memory management efficiency.

Second, EXPLAIN (analyze) now shows the storage type used (Memory or Disk) and the maximum storage allocated for the nodes that use the common interface. These nodes are:

  • Materialize – saving an intermediate set of rows for later reuse;
  • CTE Scan – for queries with common table expressions;
  • WindowAgg – for queries with window functions;
  • Recursive Union – for recursive queries;
  • Table Function Scan – for queries to the JSON_TABLE and XMLTABLE functions (not to be confused with Function Scan).

In this example, the data for the Materialize node fits into memory. Note the new line Storage:

EXPLAIN (costs off, analyze, timing off, summary off, buffers off)            
SELECT * FROM airports a1 CROSS JOIN airports a2;
                              QUERY PLAN                              
----------------------------------------------------------------------
 Nested Loop (actual rows=10816 loops=1)
   ->  Seq Scan on airports_data ml (actual rows=104 loops=1)
   ->  Materialize (actual rows=104 loops=104)
         Storage: Memory  Maximum Storage: 35kB
         ->  Seq Scan on airports_data ml_1 (actual rows=104 loops=1)
(5 rows) 

But for the CTE in the next example, the rows did not fit into memory, so temporary files were used:

EXPLAIN (analyze, costs off, summary off, timing off, buffers off)
WITH b AS MATERIALIZED (
  SELECT * FROM bookings
)
SELECT * FROM b;
                         QUERY PLAN                         
------------------------------------------------------------
 CTE Scan on b (actual rows=2111110 loops=1)
   Storage: Disk  Maximum Storage: 67856kB
   CTE b
     ->  Seq Scan on bookings (actual rows=2111110 loops=1)
(4 rows)

Additional optimizations have been made for the WindowAgg node when switching to the next partition (PARTITION BY). The more partitions, the greater the effect. For example, this query is 12% faster in PostgreSQL 18 than in 17:

EXPLAIN (analyze, costs off, summary off, timing off, buffers off)
SELECT book_ref, count(*) OVER (PARTITION BY book_ref)
FROM tickets;
                                        QUERY PLAN                                         
-------------------------------------------------------------------------------------------
 WindowAgg (actual rows=2949857 loops=1)
   Storage: Memory  Maximum Storage: 17kB
   ->  Index Only Scan using tickets_book_ref_idx on tickets (actual rows=2949857 loops=1)
         Heap Fetches: 0
(4 rows)

The recursive query plan also shows the type and amount of memory used by Recursive Union and CTE Scan nodes.

EXPLAIN (analyze, costs off, summary off, timing off, buffers off)
WITH RECURSIVE t(n) AS (
    VALUES (1)
  UNION ALL
    SELECT n+1 FROM t WHERE n < 100
)
SELECT sum(n) FROM t;
                           QUERY PLAN                            
-----------------------------------------------------------------
 Aggregate (actual rows=1 loops=1)
   CTE t
     ->  Recursive Union (actual rows=100 loops=1)
           Storage: Memory  Maximum Storage: 33kB
           ->  Result (actual rows=1 loops=1)
           ->  WorkTable Scan on t t_1 (actual rows=1 loops=100)
                 Filter: (n < 100)
                 Rows Removed by Filter: 0
   ->  CTE Scan on t (actual rows=100 loops=1)
         Storage: Memory  Maximum Storage: 20kB
(10 rows)

Finally, the Table Function Scan node used for the JSON_TABLE and XMLTABLE functions:

EXPLAIN (analyze, costs off, summary off, timing off, buffers off)
SELECT * FROM JSON_TABLE('{"a":1}'::jsonb, '$[*]' COLUMNS (a int));
                         QUERY PLAN                          
-------------------------------------------------------------
 Table Function Scan on "json_table" (actual rows=1 loops=1)
   Storage: Memory  Maximum Storage: 17kB
(2 rows)

Reading buffers during backward index scans

commit: 3f44959f4, 1bd4bc85c, b5ee4e520, caca6d8d2, 4e6e375b0

When scanning an index backward, buffers that had already been read were locked again unnecessarily, which led to extra buffer reads. Compare the number of buffers read during the regular index scan in the first query and the backward scan in the second:

17=# EXPLAIN (analyze, buffers, costs off, summary off, timing off)
SELECT * FROM flights ORDER BY flight_id ASC;
                              QUERY PLAN                               
-----------------------------------------------------------------------
 Index Scan using flights_pkey on flights (actual rows=214867 loops=1)
   Buffers: shared hit=3499
17=# EXPLAIN (analyze, buffers, costs off, summary off, timing off)
SELECT * FROM flights ORDER BY flight_id DESC;
                                   QUERY PLAN                                   
--------------------------------------------------------------------------------
 Index Scan Backward using flights_pkey on flights (actual rows=214867 loops=1)
   Buffers: shared hit=4087

In PostgreSQL 18, scanning the index in either direction requires the same, minimal number of buffer reads: 3499 in this case.
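
For comparison, a sketch of the same backward scan in PostgreSQL 18, based on the numbers above:

18=# EXPLAIN (analyze, buffers, costs off, summary off, timing off)
SELECT * FROM flights ORDER BY flight_id DESC;
                                   QUERY PLAN                                   
--------------------------------------------------------------------------------
 Index Scan Backward using flights_pkey on flights (actual rows=214867 loops=1)
   Buffers: shared hit=3499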

pg_upgrade: asynchronous operations in multiple databases

commit: 40e2e5e92, 6d3d2e8e5, 7baa36de5, 46cad8b31, 6ab8f27bc, bbf83cab9, 9db3018cf, c34eabfbb, cf2f82a37, f93f5f7b9, c880cf258

At various points, pg_upgrade executes the same operation in all databases of the upgraded cluster. Historically, it performed these operations sequentially, connecting to each database one after another.

The first commit creates an infrastructure for asynchronous execution of such operations using libpq features. The remaining commits make individual tasks work with the infrastructure.

The asynchronous mode is controlled by the --jobs parameter, and its efficiency increases with the number of databases in the upgraded cluster.
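
For example (a hypothetical invocation; the paths are placeholders), eight parallel jobs let the per-database tasks run concurrently:

$ pg_upgrade \
    --old-datadir=/var/lib/pgsql/17/data \
    --new-datadir=/var/lib/pgsql/18/data \
    --old-bindir=/usr/pgsql-17/bin \
    --new-bindir=/usr/pgsql-18/bin \
    --jobs=8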

This is a follow-up on the pg_upgrade optimization described in the previous article.

The following four optimizations are notable in that there is nothing to configure: things just run faster.

Hash Join speed-up

commit: adf97c156

Faster Hash Join, especially on multiple columns. See the commit message for estimates.

Faster text value processing in JSON

commit: ca6fde922

Improved conversion from JSON to text by using SIMD for escaping property names and values. The longer the text value, the greater the optimization. See the commit message for estimates.
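
As a minimal illustration (no timings shown), this is the kind of conversion that exercises the escaping code path:

SELECT '{"title": "a long text value with \"quotes\" and other characters to escape"}'::jsonb::text;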

Optimized numeric-type multiplication

commit: ca481d3c9, c4e44224c, 8dc28d7eb

Multiplication of numeric values is now faster.

Optimized numeric-type division

commit: 9428c001f

Division of numeric values is now faster.

ANALYZE ONLY and VACUUM ONLY

commit: 62ddf7ee9

Autovacuum and autoanalyze process the partitions of a partitioned table, but not the partitioned table itself. It stores no rows of its own, so there is nothing to trigger the automatic operations.

And while there is no need to vacuum a partitioned table, collecting statistics for it can be useful for some queries, since certain statistics are gathered for the partitioned table as a whole. However, the ANALYZE command collects statistics not only for the partitioned table, but also for all of its partitions, which can take a while. The patch adds the ONLY keyword, which lets ANALYZE and VACUUM skip the partitions:

ANALYZE ONLY partitioned_table;
ANALYZE

Same for VACUUM:

VACUUM (analyze) ONLY partitioned_table;
WARNING:  VACUUM ONLY of partitioned table "partitioned_table" has no effect
VACUUM

The warning says that vacuuming just the partitioned table itself has no effect; the command here is only useful because of the analyze option.

ONLY with VACUUM and ANALYZE affects not only partitioned tables, but also parent tables with children. Note that the change is incompatible with previous versions.

Previously, the command

ANALYZE parent_table;

would analyze only the parent table. Now, it will go through all of its children. To avoid that, the ONLY keyword should be used:

ANALYZE ONLY parent_table;

The same is true for VACUUM.

Improved checkpointer statistics

commit: 17cc5f666

In PostgreSQL 17, checkpointer statistics can be found in the pg_stat_checkpointer view and in the server log.

Turns out, the number of recorded buffers differs between these two. This is because the server log messages include buffers from both the shared buffer cache (shared buffers) and SLRU, while pg_stat_checkpointer.buffers_written tracks only shared buffers.

To fix it, the developers split the information in the server log to record shared buffers and SLRU buffers separately.

LOG:  checkpoint complete: wrote 47 buffers (0.3%), wrote 0 SLRU buffers; …

And in pg_stat_checkpointer, a new column slru_written is added to complement the existing buffers_written.

Let’s see if the information in the two sources now matches.

CHECKPOINT;
SELECT pg_stat_reset_shared('checkpointer');

Generate some write activity and immediately perform a checkpoint:

CREATE TABLE bookings_copy AS SELECT * FROM bookings;
CHECKPOINT;

Check the information in the server log and pg_stat_checkpointer:

\! tail -n 1 logfile
2024-11-12 16:56:33.443 MSK [63957] LOG:  checkpoint complete: wrote 2016 buffers (12.3%), wrote 1 SLRU buffers; 0 WAL file(s) added, 5 removed, 5 recycled; write=0.044 s, sync=0.311 s, total=0.773 s; sync files=20, longest=0.226 s, average=0.016 s; distance=165531 kB, estimate=444398 kB; lsn=1/4A6A8E20, redo lsn=1/4A6A8DC8
SELECT buffers_written, slru_written FROM pg_stat_checkpointer;
 buffers_written | slru_written
-----------------+--------------
            2016 |            1
(1 row)

pg_stat_statements: normalization of SET commands

commit: ba90eac7a, dc6851596

Another follow-up on query normalization, this time for the parameter-setting command SET. Commands that differ only in formatting or in parameter values are now reduced to the same form, which helps cut down the number of entries stored in pg_stat_statements.

SELECT pg_stat_statements_reset();

SET SEARCH_PATH = pg_catalog;
SET   search_path  =  public;
set search_path=bookings,public;

SET CUSTOM.PARAMETER = 1;
SET   CUSTOM.parameter  =  2;
set custom.parameter=42;

SELECT queryid, query, calls
FROM pg_stat_statements
WHERE query ILIKE 'SET%';
       queryid        |           query           | calls 
----------------------+---------------------------+-------
 -5443897209013767274 | SET CUSTOM.PARAMETER = $1 |     3
 -1735366501441382531 | SET SEARCH_PATH = $1      |     3
(2 rows)

postgres_fdw_get_connections and remote connection status

commit: c297a47c5, 857df3cef, 4f08ab554

Let’s access a remote table within a transaction.

BEGIN;
SELECT count(*) FROM remote_bookings.tickets;
  count  
---------
 2949857
(1 row)

The remote connection should now be active. We can check the status of all connections with the postgres_fdw_get_connections function; three new columns will help us with that: user_name, used_in_xact and closed.

SELECT * FROM postgres_fdw_get_connections(check_conn => true);
 server_name | user_name | valid | used_in_xact | closed
-------------+-----------+-------+--------------+--------
 demo_srv    | postgres  | t     | t            | f
(1 row)

The column user_name is the name of the local user, used_in_xact is whether the connection is in use in the current transaction, and closed shows the connection status.

If the connection is intact, the local transaction can carry on. Otherwise, it should be rolled back: there is no point in continuing, because the transaction cannot complete successfully once the remote connection has been closed.
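
For example (a minimal sketch), the application logic could check the flag before continuing:

SELECT bool_or(closed) AS any_closed
FROM postgres_fdw_get_connections(check_conn => true);
-- if any_closed is true, issue ROLLBACK instead of continuing the transaction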

file_fdw: ignore format conversion errors

commit: e7834a1a2, a1c4c8a9e

In PostgreSQL 17, the COPY command gained the ability to ignore format conversion errors when reading rows, controlled by the on_error and log_verbosity parameters. The file_fdw extension uses COPY to read files, so bringing these features to the extension was a natural next step.

For example, here’s a file in which not every line is a number:

$ cat /tmp/t.txt
1
two
three
4

Now, create a foreign table for the file. It has an integer column and is configured to ignore format conversion errors.

CREATE EXTENSION file_fdw;
CREATE SERVER file_server
    FOREIGN DATA WRAPPER file_fdw;


CREATE FOREIGN TABLE ft (
    id integer
)
SERVER file_server
OPTIONS (
    filename '/tmp/t.txt',
    on_error 'ignore',
    log_verbosity 'verbose'
);

The query works. It shows the successfully loaded rows and warnings for the rest:

SELECT * FROM ft;
NOTICE:  skipping row due to data type incompatibility at line 2 for column "id": "two"
NOTICE:  skipping row due to data type incompatibility at line 3 for column "id": "three"
NOTICE:  2 rows were skipped due to data type incompatibility
 id
----
  1
  4
(2 rows)

Warnings can be disabled by using the new silent value for the log_verbosity parameter:

ALTER FOREIGN TABLE ft OPTIONS (SET log_verbosity 'silent');

SELECT * FROM ft;
 id
----
  1
  4
(2 rows)

New function has_largeobject_privilege

commit: 4eada203a

A new addition to the has_*_privilege family of functions, has_largeobject_privilege allows you to check access rights to large objects.

\lo_import 'logfile'
lo_import 24578

GRANT SELECT ON LARGE OBJECT 24578 TO public;

SELECT has_largeobject_privilege('alice', 24578, 'SELECT');
 has_largeobject_privilege
---------------------------
 t
(1 row)

An important advantage of this family of functions is that they check for privileges granted not only directly, but also indirectly through membership in a privileged role, including membership in the role that owns the object.

Note that the only difference between the list of object types you can grant privileges on and the list of has_*_privilege functions is the absence of has_domain_privilege. The reason is that domains are not quite an independent object type: they are regular types stored in pg_type, so has_type_privilege works fine with domains. On the other hand, domains are created with their own CREATE DOMAIN command, and privileges on them are granted with GRANT ... ON DOMAIN, so the absence of a has_domain_privilege function feels inconsistent. Perhaps it is time to at least mention domains in the has_type_privilege documentation.
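
A quick sketch to confirm that has_type_privilege accepts a domain just like any other type (the domain name here is made up):

CREATE DOMAIN positive_int AS integer CHECK (VALUE > 0);

SELECT has_type_privilege('positive_int', 'USAGE');
 has_type_privilege
--------------------
 t
(1 row)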

Functions crc32 and crc32c

commit: 760162fed

New functions for calculating CRC-32 and CRC-32C checksums:

SELECT crc32('42'::bytea), crc32c('42'::bytea);
   crc32   |  crc32c   
-----------+-----------
 841265288 | 779959755
(1 row)

Client-server protocol: report search_path changes to the client

commit: 28a1121fd, 0d06a7eac

Under the client-server protocol, immediately after connecting, clients receive ParameterStatus messages with the values of certain parameters important to applications, such as client_encoding and DateStyle. Whenever one of these parameters changes during a session (usually via the SET command), the server process automatically notifies the client of the new value.

The list of these parameters is fixed, and the first commit adds search_path to it.

This parameter is important for developers of drivers that implement the client-server protocol, especially for developers of connection pools. In transactional mode, when different clients are served by a single server process, the pooler will be able to automatically match the search_path on the server with the stored value for the current client.

psql: support for named prepared statements

commit: d55322b0d

Support for the extended query protocol in psql first appeared in PostgreSQL 16, but at that point only unnamed prepared statements were supported.

With the new commands \parse, \bind_named, and \close, prepared statements can now be used to a fuller extent. Here’s how it works.

SELECT $2||$1 AS str \parse q

SELECT * FROM pg_prepared_statements \gx
-[ RECORD 1 ]---+------------------------------
name            | q
statement       | SELECT $2||$1 AS str
prepare_time    | 2024-11-08 09:30:50.326934+03
parameter_types | {text,text}
result_types    | {text}
from_sql        | f
generic_plans   | 0
custom_plans    | 0

The first command declares a prepared statement q with two text parameters. The second query verifies that the prepared statement now exists.

The command \bind_named binds the parameters to the actual values and executes the prepared statement. The prepared statement name goes first, then the two parameter values:

\bind_named q 42 'Answer: ' \g
    str     
------------
 Answer: 42
(1 row)
\bind_named q 'World!' 'Hello,' \g
     str      
--------------
 Hello,World!
(1 row)

Finally, the prepared statement is closed (DEALLOCATE):

\close q

pg_verifybackup: integrity verification for tar backups

commit: 8dfd31290

pg_verifybackup can now verify cluster backups not only in plain format, but also in tar format. A tar-format copy can also be compressed:

$ pg_basebackup -D backup --format=tar --gzip
$ ls -l backup
total 569416
-rw------- 1 pal pal    190742 dec 24 18:57 backup_manifest
-rw------- 1 pal pal 582862489 dec 24 18:57 base.tar.gz
-rw------- 1 pal pal     17120 dec 24 18:57 pg_wal.tar.gz

But there is a catch: pg_verifybackup cannot verify WAL in tar-format backups, so you need to explicitly specify that WAL verification is not required:

$ pg_verifybackup backup --format=tar --no-parse-wal
backup successfully verified

 

That’s all for now. The best commits of the November CommitFest are coming next time.

Pavel Luzanov