PostgreSQL 18: part 2 or CommitFest 2024-09
Statistically, September CommitFests feature the fewest commits. The version 18 September CommitFest, however, appears to be an outlier: many patches were accepted, and there are plenty of interesting new features to talk about.
If you missed the July CommitFest, get up to speed here: 2024-07.
- Conflict detection and logging in logical replication
- Planner: no more 10000000000
- Planner: memory management and usage monitoring for temporary tuple storage
- Reading buffers during index backwards scan
- pg_upgrade: asynchronous operations in multiple databases
- Hash Join speed-up
- Faster text value processing in JSON
- Optimized numeric-type multiplication
- Optimized numeric-type division
- ANALYZE ONLY and VACUUM ONLY
- Improved checkpointer statistics
- pg_stat_statements: normalization of SET commands
- postgres_fdw_get_connections and remote connection status
- file_fdw: ignore format conversion errors
- New function has_largeobject_privilege
- Functions crc32 and crc32c
- Client-server protocol: report search_path changes to the client
- psql: support for named prepared statements
- pg_verifybackup: integrity verification for tar backups
Conflict detection and logging in logical replication
commit: 9758174e2, edcb71258, 640178c92, 6c2b5edec
The data on a subscriber can be changed independently from the publisher. The publisher has no way of knowing about these changes. This can lead to logical replication conflicts. Some examples: adding a row to a table on the publisher while there is a row with the same primary key on the subscriber, or modifying or removing a row on the publisher while there is no matching row on the subscriber to modify. There are six conflict variants in total, all now listed in the Conflicts section in the Logical Replication chapter. Note that not every conflict variant interrupts logical replication entirely.
Along with the documentation, there are new logging tools to monitor these conflicts. Consider the update_missing
conflict, for example.
First, both on the publisher and the subscriber, create a table:
CREATE TABLE test(id int PRIMARY KEY);
On the publisher, add a row:
pub=# INSERT INTO test (id) VALUES (1);
On the subscriber, immediately delete it:
sub=# DELETE FROM test WHERE id = 1;
Now, the conflict is just waiting to happen. Update the row on the publisher:
pub=# UPDATE test SET id = 2 WHERE id = 1 RETURNING *;
id
----
2
(1 row)
The update comes through, but not on the subscriber: there is no row with id=1
there. Logical replication ignores this command on the subscriber and keeps on going, same as before the patch. And before the patch, we wouldn’t even know that this has happened. Now, though, there are two new tools to catch it.
First, the pg_stat_subscription_stats view. It simply counts conflicts of every type. Looking at the view, the confl_update_missing
counter is no longer zero:
sub=# SELECT * FROM pg_stat_subscription_stats WHERE subname = 'sub' \gx
-[ RECORD 1 ]---------------+------
subid | 16390
subname | sub
apply_error_count | 0
sync_error_count | 0
confl_insert_exists | 0
confl_update_origin_differs | 0
confl_update_exists | 0
confl_update_missing | 1
confl_delete_origin_differs | 0
confl_delete_missing | 0
stats_reset |
Secondly, there is detailed information about the conflict in the server log:
2025-01-23 23:33:35.763 MSK [170775] LOG: conflict detected on relation "public.test": conflict=update_missing
2025-01-23 23:33:35.763 MSK [170775] DETAIL: Could not find the row to be updated.
Remote tuple (2); replica identity (id)=(1).
2025-01-23 23:33:35.763 MSK [170775] CONTEXT: processing remote data for replication origin "pg_16390" during message type "UPDATE" for replication target relation "public.test" in transaction 746, finished at 0/183DC60
As a side note, I would really love to see the developers go a step further and implement strategies that automatically resolve the conflicts that interrupt logical replication. There is hope that they will deliver, since the discussion around this patch mentions, among other things, two such strategies: apply (remote_apply) and skip (keep_local). But tracking the conflicts is valuable in and of itself, and the developers have decided to start with it. Looking forward to further updates!
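In the meantime, for the conflict types that do stop the apply worker (such as insert_exists), the finish LSN reported in the CONTEXT log line can already be used to skip the offending transaction manually. This is a sketch of the existing SKIP mechanism available since PostgreSQL 15, not a new feature, and it reuses the LSN from the log excerpt above:
sub=# ALTER SUBSCRIPTION sub SKIP (lsn = '0/183DC60');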
Planner: no more 10000000000
commit: e22253467, c01743aa4, 161320b4b, 84b8fccbe
The parameter family enable_* tells the planner not to use some query execution paths, for example, sequential scanning:
CREATE TABLE t (id int);
SET enable_seqscan = off;
EXPLAIN (settings) SELECT * FROM t;
However, the planner will still choose sequential scanning because the table has no indexes, making index scanning impossible.
In the plan in PostgreSQL 17, we will end up with a gargantuan cost:
QUERY PLAN
------------------------------------------------------------------------
Seq Scan on t (cost=10000000000.00..10000000035.50 rows=2550 width=4)
Settings: enable_seqscan = 'off'
(2 rows)
The actual Seq Scan cost is inflated by the prohibitive constant of 10,000,000,000. This constant makes other access methods, if any exist, preferable to the planner. But since there are no other options here, the extreme number ends up in the selected plan.
The new patches remove the constant and teach the planner to work with enable_* parameters without it. The query plan now looks like this:
QUERY PLAN
-----------------------------------------------------
Seq Scan on t (cost=0.00..35.50 rows=2550 width=4)
Disabled: true
Settings: enable_seqscan = 'off'
(3 rows)
The Seq Scan
is still the selected option, but without artificial inflation of the cost. The Disabled: true
line indicates that the planner is forced to select a prohibited node due to the lack of other non-prohibited options.
If there is another option, it will be used:
CREATE INDEX ON t(id);
ANALYZE t;
EXPLAIN (settings) SELECT * FROM t;
QUERY PLAN
-----------------------------------------------------------------------
Index Only Scan using t_id_idx on t (cost=0.12..8.14 rows=1 width=4)
Settings: enable_seqscan = 'off'
(2 rows)
In addition, the settings option of the EXPLAIN command (first introduced in PostgreSQL 12) becomes a great tool for figuring out which parameter changes affect the query execution plan.
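For example, after changing a couple of parameters that can influence planning, a single EXPLAIN call shows which non-default settings were in effect. A small sketch reusing the same table t; work_mem is picked arbitrarily here:
SET enable_seqscan = off;
SET work_mem = '64MB';
EXPLAIN (settings, costs off) SELECT * FROM t WHERE id = 1;
The Settings line of the output will list both parameters, making it easy to see what was changed from the defaults.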
Planner: memory management and usage monitoring for temporary tuple storage
commit: 1eff8279d, 53abb1e0e, 590b045c3, 97651b013, 04bcf9e19, 908a96861, 9fba1ed29, 95d6e9af0, 5d56d07ca, 40708acd6
This tremendous effort spans multiple patch discussions, hence the long list of commits. Still, it is best to talk about them all together.
Some query execution plan nodes (listed below) use a common interface for temporary tuple storage (tuplestore). The patches introduce two major changes to it.
First, rows are stored within a separate memory context of the Generation type, instead of the more general CurrentMemoryContext. Without going into details, this increases speed and memory management efficiency.
Secondly, EXPLAIN (analyze) now shows the storage type used (Memory or Disk) and the maximum allocated storage for the nodes that use the common interface. These nodes are:
- Materialize – saves an intermediate set of rows for later reuse;
- CTE Scan – for queries with common table expressions;
- WindowAgg – for queries with window functions;
- Recursive Union – for recursive queries;
- Table Function Scan – for queries using the JSON_TABLE and XMLTABLE functions (not to be confused with Function Scan).
In this example, the data for the Materialize node fits into memory. Note the new Storage line:
EXPLAIN (costs off, analyze, timing off, summary off, buffers off)
SELECT * FROM airports a1 CROSS JOIN airports a2;
QUERY PLAN
----------------------------------------------------------------------
Nested Loop (actual rows=10816 loops=1)
-> Seq Scan on airports_data ml (actual rows=104 loops=1)
-> Materialize (actual rows=104 loops=104)
Storage: Memory Maximum Storage: 35kB
-> Seq Scan on airports_data ml_1 (actual rows=104 loops=1)
(5 rows)
But for the CTE in the next example, the rows did not fit into memory, so temporary files were used:
EXPLAIN (analyze, costs off, summary off, timing off, buffers off)
WITH b AS MATERIALIZED (
SELECT * FROM bookings
)
SELECT * FROM b;
QUERY PLAN
------------------------------------------------------------
CTE Scan on b (actual rows=2111110 loops=1)
Storage: Disk Maximum Storage: 67856kB
CTE b
-> Seq Scan on bookings (actual rows=2111110 loops=1)
(4 rows)
Additional optimizations have been made for the WindowAgg
node when switching to the next partition (PARTITION BY
). The more partitions, the greater the effect. For example, this query is 12% faster in PostgreSQL 18 than in 17:
EXPLAIN (analyze, costs off, summary off, timing off, buffers off)
SELECT book_ref, count(*) OVER (PARTITION BY book_ref)
FROM tickets;
QUERY PLAN
-------------------------------------------------------------------------------------------
WindowAgg (actual rows=2949857 loops=1)
Storage: Memory Maximum Storage: 17kB
-> Index Only Scan using tickets_book_ref_idx on tickets (actual rows=2949857 loops=1)
Heap Fetches: 0
(4 rows)
The recursive query plan also shows the type and amount of memory used by Recursive Union
and CTE Scan
nodes.
EXPLAIN (analyze, costs off, summary off, timing off, buffers off)
WITH RECURSIVE t(n) AS (
VALUES (1)
UNION ALL
SELECT n+1 FROM t WHERE n < 100
)
SELECT sum(n) FROM t;
QUERY PLAN
-----------------------------------------------------------------
Aggregate (actual rows=1 loops=1)
CTE t
-> Recursive Union (actual rows=100 loops=1)
Storage: Memory Maximum Storage: 33kB
-> Result (actual rows=1 loops=1)
-> WorkTable Scan on t t_1 (actual rows=1 loops=100)
Filter: (n < 100)
Rows Removed by Filter: 0
-> CTE Scan on t (actual rows=100 loops=1)
Storage: Memory Maximum Storage: 20kB
(10 rows)
Finally, the Table Function Scan node for the JSON_TABLE and XMLTABLE functions:
EXPLAIN (analyze, costs off, summary off, timing off, buffers off)
SELECT * FROM JSON_TABLE('{"a":1}'::jsonb, '$[*]' COLUMNS (a int));
QUERY PLAN
-------------------------------------------------------------
Table Function Scan on "json_table" (actual rows=1 loops=1)
Storage: Memory Maximum Storage: 17kB
(2 rows)
Reading buffers during index backwards scan
commit: 3f44959f4, 1bd4bc85c, b5ee4e520, caca6d8d2, 4e6e375b0
When scanning an index backwards, previously read buffers were unnecessarily locked again, which led to additional buffer reads. Compare the number of buffers read during the regular index scan in the first query and the backward scan in the second query:
17=# EXPLAIN (analyze, buffers, costs off, summary off, timing off)
SELECT * FROM flights ORDER BY flight_id ASC;
QUERY PLAN
-----------------------------------------------------------------------
Index Scan using flights_pkey on flights (actual rows=214867 loops=1)
Buffers: shared hit=3499
17=# EXPLAIN (analyze, buffers, costs off, summary off, timing off)
SELECT * FROM flights ORDER BY flight_id DESC;
QUERY PLAN
--------------------------------------------------------------------------------
Index Scan Backward using flights_pkey on flights (actual rows=214867 loops=1)
Buffers: shared hit=4087
In PostgreSQL 18, scanning the index in either direction requires the same minimal number of buffer reads: 3499 in this case.
pg_upgrade: asynchronous operations in multiple databases
commit: 40e2e5e92, 6d3d2e8e5, 7baa36de5, 46cad8b31, 6ab8f27bc, bbf83cab9, 9db3018cf, c34eabfbb, cf2f82a37, f93f5f7b9, c880cf258
At various points, pg_upgrade
executes the same operation in all databases of the upgraded cluster. Historically, it performed these operations sequentially, connecting to each database one after another.
The first commit creates an infrastructure for asynchronous execution of such operations using libpq features. The remaining commits make individual tasks work with the infrastructure.
The asynchronous mode is controlled by the --jobs
parameter, and its efficiency increases with the number of databases in the upgraded cluster.
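A sketch of an invocation (the paths and job count here are made up for illustration):
$ pg_upgrade \
    --old-bindir=/usr/lib/postgresql/17/bin \
    --new-bindir=/usr/lib/postgresql/18/bin \
    --old-datadir=/var/lib/postgresql/17/data \
    --new-datadir=/var/lib/postgresql/18/data \
    --jobs=8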
This is a follow-up on the pg_upgrade optimization described in the previous article.
The following four optimizations are notable because they come with no parameters to tune: things just run faster.
Hash Join speed-up
commit: adf97c156
Hash Join is now faster, especially when joining on multiple columns. See the commit message for performance estimates.
Faster text value processing in JSON
commit: ca6fde922
Improved conversion from JSON to text by using SIMD for escaping property names and values. The longer the text value, the greater the optimization. See the commit message for estimates.
Optimized numeric-type multiplication
commit: ca481d3c9, c4e44224c, 8dc28d7eb
Multiplication of numeric values is now faster.
Optimized numeric-type division
commit: 9428c001f
Division of numeric values is now faster.
ANALYZE ONLY and VACUUM ONLY
commit: 62ddf7ee9
Autovacuum and autoanalyze process the partitions of a partitioned table, but not the table itself. The partitioned table stores no rows of its own, so there is no way to define thresholds for when the automatic operations should kick in.
But while there is no need to vacuum a partitioned table, collecting statistics for it might prove useful for some queries: after all, some statistics are gathered for the partitioned table as a whole. However, the ANALYZE command collects statistics not only for the partitioned table, but also for all of its partitions, and this can take a while. The patch adds the ONLY keyword, which tells ANALYZE and VACUUM to skip the partitions:
ANALYZE ONLY partitioned_table;
ANALYZE
Same for VACUUM
:
VACUUM (analyze) ONLY partitioned_table;
WARNING: VACUUM ONLY of partitioned table "partitioned_table" has no effect
VACUUM
The warning says that it makes no sense to vacuum a partitioned table without ANALYZE.
ONLY with VACUUM and ANALYZE affects not only partitioned tables, but also parent tables with children. Note that this change is incompatible with previous versions.
Previously, the command
ANALYZE parent_table;
would analyze only the parent table. Now, it will go through all of its children. To avoid that, the ONLY
keyword should be used:
ANALYZE ONLY parent_table;
The same is true for VACUUM
.
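For example, to restrict vacuuming to the parent table alone:
VACUUM ONLY parent_table;
Since a regular parent table has its own storage, there is no warning here, unlike in the partitioned case above.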
Improved checkpointer statistics
commit: 17cc5f666
In PostgreSQL 17, checkpointer statistics can be found in the pg_stat_checkpointer
view and in the server log.
It turns out that the number of written buffers reported by the two sources differs. This is because the server log message counts buffers from both the shared buffer cache (shared buffers) and the SLRU caches, while pg_stat_checkpointer.buffers_written tracks only shared buffers.
To fix this, the server log message now reports shared buffers and SLRU buffers separately:
LOG: checkpoint complete: wrote 47 buffers (0.3%), wrote 0 SLRU buffers; …
And in pg_stat_checkpointer
, a new column slru_written
is added to complement the existing buffers_written
.
Let's check whether the two sources now agree. First, perform a checkpoint and reset the statistics:
CHECKPOINT;
SELECT pg_stat_reset_shared('checkpointer');
Start a task and immediately perform a checkpoint:
CREATE TABLE bookings_copy AS SELECT * FROM bookings;
CHECKPOINT;
Check the information in the server log and pg_stat_checkpointer
:
\! tail -n 1 logfile
2024-11-12 16:56:33.443 MSK [63957] LOG: checkpoint complete: wrote 2016 buffers (12.3%), wrote 1 SLRU buffers; 0 WAL file(s) added, 5 removed, 5 recycled; write=0.044 s, sync=0.311 s, total=0.773 s; sync files=20, longest=0.226 s, average=0.016 s; distance=165531 kB, estimate=444398 kB; lsn=1/4A6A8E20, redo lsn=1/4A6A8DC8
SELECT buffers_written, slru_written FROM pg_stat_checkpointer;
buffers_written | slru_written
-----------------+--------------
2016 | 1
(1 row)
pg_stat_statements: normalization of SET commands
Another follow-up on command normalization, this time for SET, the command that sets configuration parameters. Commands that differ only in formatting and parameter values are now normalized to the same query text, which helps reduce the number of entries stored in pg_stat_statements.
SELECT pg_stat_statements_reset();
SET SEARCH_PATH = pg_catalog;
SET search_path = public;
set search_path=bookings,public;
SET CUSTOM.PARAMETER = 1;
SET CUSTOM.parameter = 2;
set custom.parameter=42;
SELECT queryid, query, calls
FROM pg_stat_statements
WHERE query ILIKE 'SET%';
queryid | query | calls
----------------------+---------------------------+-------
-5443897209013767274 | SET CUSTOM.PARAMETER = $1 | 3
-1735366501441382531 | SET SEARCH_PATH = $1 | 3
(2 rows)
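A side note: SET is a utility command, so it only shows up in pg_stat_statements when utility-command tracking is enabled, which is the default. A quick way to check:
SHOW pg_stat_statements.track_utility;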
postgres_fdw_get_connections and remote connection status
commit: c297a47c5, 857df3cef, 4f08ab554
Let’s access a remote table within a transaction.
BEGIN;
SELECT count(*) FROM remote_bookings.tickets;
count
---------
2949857
(1 row)
The remote connection should now be active. We can check the status of all connections with the postgres_fdw_get_connections function, where three new columns help us: user_name, used_in_xact, and closed.
SELECT * FROM postgres_fdw_get_connections(check_conn => true);
server_name | user_name | valid | used_in_xact | closed
-------------+-----------+-------+--------------+--------
demo_srv | postgres | t | t | f
(1 row)
The user_name column shows the name of the local user, used_in_xact shows whether the connection is used in the current transaction, and closed shows whether the connection has been closed (we requested the check with check_conn => true).
If the connection is intact, the local transaction can carry on. Otherwise, it should be rolled back: there is no point in continuing, because the transaction cannot complete successfully over a closed remote connection.
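A minimal sketch of such a check inside the transaction (the decision to roll back is left to the application):
SELECT coalesce(bool_or(closed), false) AS remote_connection_lost
FROM postgres_fdw_get_connections(check_conn => true);
-- if remote_connection_lost is true, issue ROLLBACK instead of continuing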
file_fdw: ignore format conversion errors
In PostgreSQL 17, the COPY command got the on_error and log_verbosity parameters, which let it ignore format conversion errors when reading rows. The file_fdw extension uses COPY to read files, so bringing these features to the extension is the next logical step.
For example, here's a file in which not every line is a number:
$ cat /tmp/t.txt
1
two
three
4
Now, create a foreign table over this file. It has a single integer column and is configured to ignore format conversion errors.
CREATE EXTENSION file_fdw;
CREATE SERVER file_server
FOREIGN DATA WRAPPER file_fdw;
CREATE FOREIGN TABLE ft (
id integer
)
SERVER file_server
OPTIONS (
filename '/tmp/t.txt',
on_error 'ignore',
log_verbosity 'verbose'
);
The query works: it returns the successfully converted rows and reports notices for the skipped ones:
SELECT * FROM ft;
NOTICE: skipping row due to data type incompatibility at line 2 for column "id": "two"
NOTICE: skipping row due to data type incompatibility at line 3 for column "id": "three"
NOTICE: 2 rows were skipped due to data type incompatibility
id
----
1
4
(2 rows)
The notices can be suppressed with the new silent value of the log_verbosity parameter:
ALTER FOREIGN TABLE ft OPTIONS (SET log_verbosity 'silent');
SELECT * FROM ft;
id
----
1
4
(2 rows)
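For comparison, the same file can be loaded with plain COPY using the PostgreSQL 17 options that file_fdw now reuses (a sketch; the table t_copy is created just for this example):
CREATE TABLE t_copy (id integer);
COPY t_copy FROM '/tmp/t.txt' WITH (on_error 'ignore', log_verbosity 'verbose');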
New function has_largeobject_privilege
commit: 4eada203a
A new addition to the has_*_privilege family of functions, has_largeobject_privilege
allows you to check access rights to large objects.
\lo_import 'logfile'
lo_import 24578
GRANT SELECT ON LARGE OBJECT 24578 TO public;
SELECT has_largeobject_privilege('alice', 24578, 'SELECT');
has_largeobject_privilege
---------------------------
t
(1 row)
An important advantage of this family of functions is that they check for privileges granted not only directly, but also indirectly through membership in a privileged role, including membership in the role that owns the object.
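A sketch of an indirect grant (the roles bob and lo_writers are made up for this example):
CREATE ROLE lo_writers;
CREATE ROLE bob;
GRANT lo_writers TO bob;
GRANT UPDATE ON LARGE OBJECT 24578 TO lo_writers;

-- bob has no direct grant, but inherits UPDATE through membership in lo_writers
SELECT has_largeobject_privilege('bob', 24578, 'UPDATE');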
Note that between the list of object types you can grant privileges on and the list of has_*_privilege functions, the single difference is has_domain_privilege. The reason is that domains are not quite an independent object type: they are regular types and are stored in pg_type. Therefore, instead of a has_domain_privilege function, has_type_privilege works fine with domains. On the other hand, there is a separate CREATE DOMAIN command for creating domains, and privileges on domains are granted with GRANT ... ON DOMAIN, so the absence of has_domain_privilege does not feel quite right. Perhaps it is high time to at least mention domains in the has_type_privilege documentation.
Functions crc32 and crc32c
commit: 760162fed
New functions for calculating CRC-32 and CRC-32C checksums:
SELECT crc32('42'::bytea), crc32c('42'::bytea);
crc32 | crc32c
-----------+-----------
841265288 | 779959755
(1 row)
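Both functions take bytea and return bigint. To checksum a text value, convert it explicitly, for example with convert_to:
SELECT crc32(convert_to('Hello, world!', 'UTF8'));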
Client-server protocol: report search_path changes to the client
In the client-server protocol, immediately after connecting, clients receive ParameterStatus messages carrying the values of certain parameters important to the application, such as client_encoding and DateStyle. When one of these parameters changes during a session (usually via the SET command), the server process automatically notifies the client of the new value.
The list of reported parameters is fixed, but the patch adds search_path to it.
This parameter is important for developers of drivers that implement the client-server protocol, especially for developers of connection pools. In transactional mode, when different clients are served by a single server process, the pooler will be able to automatically match the search_path on the server with the stored value for the current client.
psql: support for named prepared statements
commit: d55322b0d
Support for the extended query protocol in psql first appeared in PostgreSQL 16, but at that point only unnamed prepared statements were supported.
With the new commands \parse, \bind_named, and \close, prepared statements can now be used to a fuller extent. Here’s how it works.
SELECT $2||$1 AS str \parse q
SELECT * FROM pg_prepared_statements \gx
-[ RECORD 1 ]---+------------------------------
name | q
statement | SELECT $2||$1 AS str
prepare_time | 2024-11-08 09:30:50.326934+03
parameter_types | {text,text}
result_types | {text}
from_sql | f
generic_plans | 0
custom_plans | 0
The first command declares a prepared statement q
with two text parameters. The second query verifies that the prepared statement now exists.
The command \bind_named
binds the parameters to the actual values and executes the prepared statement. The prepared statement name goes first, then the two parameter values:
\bind_named q 42 'Answer: ' \g
str
------------
Answer: 42
(1 row)
\bind_named q 'World!' 'Hello,' \g
str
--------------
Hello,World!
(1 row)
Finally, the prepared statement is closed (the protocol-level equivalent of DEALLOCATE):
\close q
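After that, the statement should be gone from pg_prepared_statements; a quick check:
SELECT count(*) FROM pg_prepared_statements;
-- expect 0 in this session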
pg_verifybackup: integrity verification for tar backups
commit: 8dfd31290
pg_verifybackup can now verify cluster backups not only in plain
format, but also in tar
. A copy in tar
format can be compressed:
$ pg_basebackup -D backup --format=tar --gzip
$ ls -l backup
total 569416
-rw------- 1 pal pal 190742 dec 24 18:57 backup_manifest
-rw------- 1 pal pal 582862489 dec 24 18:57 base.tar.gz
-rw------- 1 pal pal 17120 dec 24 18:57 pg_wal.tar.gz
But there is a catch: pg_verifybackup cannot verify WAL files in this mode, so you need to state explicitly that WAL verification is not required:
$ pg_verifybackup backup --format=tar --no-parse-wal
backup successfully verified
That’s all for now. The best commits of the November CommitFest are coming next time.