PostgreSQL 18: part 1 or CommitFest 2024-07
This article is the first in the series about the upcoming PostgreSQL 18 release. Let us take a look at the features introduced in the July CommitFest.
- Planner: Hash Right Semi Join support
- Planner: materializing the inner row set for a parallel nested loop join
- Planner support functions for generate_series
- EXPLAIN (analyze): statistics for Parallel Bitmap Heap Scan node workers
- Functions min and max for composite types
- Parameter names for regexp* functions
- Debug mode in pgbench
- pg_get_backend_memory_contexts: column path instead of parent, new column type
- Function pg_get_acl
- pg_upgrade: pg_dump optimization
- Predefined role pg_signal_autovacuum_worker
Planner: Hash Right Semi Join support
commit: aa86129e1
The PostgreSQL 17 plan below uses a hash join and sequential scans of both tables.
17=# EXPLAIN (costs off, analyze)
SELECT * FROM flights
WHERE flight_id IN (SELECT flight_id FROM ticket_flights);
QUERY PLAN
------------------------------------------------------------------------------------------------
Hash Join (actual time=2133.122..2195.619 rows=150588 loops=1)
Hash Cond: (flights.flight_id = ticket_flights.flight_id)
-> Seq Scan on flights (actual time=0.018..10.301 rows=214867 loops=1)
-> Hash (actual time=2132.969..2132.970 rows=150588 loops=1)
Buckets: 262144 (originally 131072) Batches: 1 (originally 1) Memory Usage: 7343kB
-> HashAggregate (actual time=1821.476..2114.218 rows=150588 loops=1)
Group Key: ticket_flights.flight_id
Batches: 5 Memory Usage: 10289kB Disk Usage: 69384kB
-> Seq Scan on ticket_flights (actual time=7.200..655.356 rows=8391852 loops=1)
Planning Time: 0.325 ms
Execution Time: 2258.237 ms
(11 rows)
However, the hash table is built from the larger table ticket_flights, which obviously requires more resources than it would for the smaller table flights.
In PostgreSQL 18, it becomes possible to choose which table is hashed. In the fourth line of the plan below, the planner goes with a Hash Right Semi Join, and in parallel mode at that. As a result, the execution time is noticeably reduced.
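Here is the same query again, this time executed on PostgreSQL 18:
18=# EXPLAIN (costs off, analyze)
SELECT * FROM flights
WHERE flight_id IN (SELECT flight_id FROM ticket_flights);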
QUERY PLAN
---------------------------------------------------------------------------------------------------
Gather (actual time=56.771..943.233 rows=150590 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Parallel Hash Right Semi Join (actual time=41.754..909.894 rows=50197 loops=3)
Hash Cond: (ticket_flights.flight_id = flights.flight_id)
-> Parallel Seq Scan on ticket_flights (actual time=0.047..221.511 rows=2797284 loops=3)
-> Parallel Hash (actual time=40.309..40.309 rows=71622 loops=3)
Buckets: 262144 Batches: 1 Memory Usage: 23808kB
-> Parallel Seq Scan on flights (actual time=0.008..6.631 rows=71622 loops=3)
Planning Time: 0.555 ms
Execution Time: 949.831 ms
(11 rows)
Planner: materializing the inner row set for a parallel nested loop join
commit: 22d946b0f
Before PostgreSQL 18, the planner never considered materializing the inner row set for parallel nested loop joins.
17=# EXPLAIN (costs off)
SELECT *
FROM ticket_flights tf
JOIN flights f USING (flight_id)
WHERE f.flight_id = 12345;
QUERY PLAN
----------------------------------------------------
Nested Loop
-> Index Scan using flights_pkey on flights f
Index Cond: (flight_id = 12345)
-> Gather
Workers Planned: 2
-> Parallel Seq Scan on ticket_flights tf
Filter: (flight_id = 12345)
(7 rows)
In PostgreSQL 18, the plan for the same query materializes the inner row set from flights:
QUERY PLAN
--------------------------------------------------------------
Gather
Workers Planned: 2
-> Nested Loop
-> Parallel Seq Scan on ticket_flights tf
Filter: (flight_id = 12345)
-> Materialize
-> Index Scan using flights_pkey on flights f
Index Cond: (flight_id = 12345)
(8 rows)
Notably, performance does not change much with the new plan: both plans do essentially the same work. However, the planner now has more options to consider, which may produce better plans in other situations, and that is good news.
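To compare with the old behavior, you can disable materialization for the session with the existing planner parameter enable_material (a quick experiment; the planner will then pick some plan without a Materialize node):
SET enable_material = off;
EXPLAIN (costs off)
SELECT *
FROM ticket_flights tf
JOIN flights f USING (flight_id)
WHERE f.flight_id = 12345;
RESET enable_material;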
Planner support functions for generate_series
commit: 036bdcec9
The new support function tells the planner how many rows generate_series will return for a range of dates and times:
EXPLAIN
SELECT *
FROM generate_series(current_date, current_date + '1 day'::interval, '1 hour');
QUERY PLAN
---------------------------------------------------------------------
Function Scan on generate_series (cost=0.01..0.26 rows=25 width=8)
(1 row)
Previously, the number of rows was always estimated at 1000 (the default ROWS value for functions).
Note that a similar support function for integers has long existed:
EXPLAIN
SELECT *
FROM generate_series(1, 42, 2);
QUERY PLAN
---------------------------------------------------------------------
Function Scan on generate_series (cost=0.00..0.21 rows=21 width=4)
(1 row)
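As an aside, you can check which generate_series variants have a support function attached by looking at the prosupport column of pg_proc (the exact output depends on the server version):
SELECT oid::regprocedure AS function, prosupport
FROM pg_proc
WHERE proname = 'generate_series';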
EXPLAIN (analyze): statistics for Parallel Bitmap Heap Scan node workers
commit: 5a1e6df3b
The EXPLAIN command now shows statistics (the numbers of exact and lossy heap blocks) for each worker involved in a parallel bitmap scan. In the example below, these are the lines starting with Worker 0 and Worker 1.
EXPLAIN (analyze, costs off, timing off, summary off)
SELECT count(*) FROM bookings
WHERE total_amount < 20000 AND book_date > '2017-07-15';
QUERY PLAN
-----------------------------------------------------------------------------------------------------
Finalize Aggregate (actual rows=1 loops=1)
-> Gather (actual rows=3 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Partial Aggregate (actual rows=1 loops=3)
-> Parallel Bitmap Heap Scan on bookings (actual rows=6415 loops=3)
Recheck Cond: (total_amount < '20000'::numeric)
Filter: (book_date > '2017-07-15 00:00:00+03'::timestamp with time zone)
Rows Removed by Filter: 67072
Heap Blocks: exact=6345
Worker 0: Heap Blocks: exact=3632
Worker 1: Heap Blocks: exact=3470
-> Bitmap Index Scan on bookings_total_amount_idx (actual rows=220460 loops=1)
Index Cond: (total_amount < '20000'::numeric)
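Lossy blocks appear when the bitmap grows beyond work_mem and is degraded to page-level granularity. If you want to see lossy statistics in the worker lines, you can artificially lower work_mem (a quick experiment; the exact numbers depend on your data):
SET work_mem = '64kB';
EXPLAIN (analyze, costs off, timing off, summary off)
SELECT count(*) FROM bookings
WHERE total_amount < 20000 AND book_date > '2017-07-15';
RESET work_mem;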
Functions min and max for composite types
commit: a0f1fce80
Composite type values have been comparable with each other for a while now; the comparison is performed element by element. For example, let us sort the following rows:
CREATE TABLE seats(
line text,
number integer
);
INSERT INTO seats VALUES
('A', 42), ('B', 1), ('C', 27);
SELECT * FROM seats s ORDER BY s DESC;
line | number
------+--------
C | 27
B | 1
A | 42
(3 rows)
However, there were no min and max aggregate functions for composite types. Until now.
SELECT min(s.*), max(s.*) FROM seats s;
min | max
--------+--------
(A,42) | (C,27)
(1 row)
Parameter names for regexp* functions
commit: 580f8727
The regular expression functions now have named parameters, for improved clarity:
SELECT regexp_replace(
pattern => '$',
replacement => 'Postgres!',
string => 'All You Need Is '
);
regexp_replace
---------------------------
All You Need Is Postgres!
The parameter names are listed in the documentation or in the \df command output in psql. Previously, the parameters could only be passed positionally.
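For example, psql shows the signatures with their parameter names like this:
\df regexp_replace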
Debug mode in pgbench
commit: 3ff01b2b
Most server utilities use the -d option to specify the database to connect to. In pgbench, however, this option used to enable debug mode. The option names have been changed for consistency: now you specify the database with -d or --dbname, and enable debug mode with the long option --debug.
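For example (bench is a hypothetical database name here):
$ pgbench -i -d bench        # initialize; -d/--dbname selects the database
$ pgbench -d bench --debug   # debug output now requires the long option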
This change already shipped in PostgreSQL 17, although it was formally committed during the July CommitFest for PostgreSQL 18.
pg_get_backend_memory_contexts: column path instead of parent, new column type
commit: 32d3ed81, f0d11275, 12227a1d
Core developers and enthusiasts researching memory usage will appreciate the changes in the pg_backend_memory_contexts view.
The patch makes linking child and parent contexts more convenient and reliable. There is a new column path, which holds an array of context IDs. The first element of the array is always the top-level context TopMemoryContext, and the last element is the ID of the current row's context. In addition, the numbering in the level column now starts with 1 instead of 0, which is more convenient for writing queries.
Below is an example of a query showing the memory context TopPortalContext and all its child contexts.
WITH memory_contexts AS (
SELECT * FROM pg_backend_memory_contexts
)
SELECT child.name, child.type, child.level, child.path, child.total_bytes
FROM memory_contexts parent, memory_contexts child
WHERE parent.name = 'TopPortalContext' AND
child.path[parent.level] = parent.path[parent.level];
name | type | level | path | total_bytes
--------------------------------+------------+-------+-----------------------+-------------
TopPortalContext | AllocSet | 2 | {1,20} | 8192
PortalContext | AllocSet | 3 | {1,20,31} | 1024
ExecutorState | AllocSet | 4 | {1,20,31,136} | 65632
tuplestore tuples | Generation | 5 | {1,20,31,136,138} | 32768
printtup | AllocSet | 5 | {1,20,31,136,139} | 8192
ExprContext | AllocSet | 5 | {1,20,31,136,140} | 8192
ExprContext | AllocSet | 5 | {1,20,31,136,141} | 8192
tuplestore tuples | Generation | 5 | {1,20,31,136,142} | 8192
ExprContext | AllocSet | 5 | {1,20,31,136,143} | 8192
Table function arguments | AllocSet | 5 | {1,20,31,136,144} | 8192
ExprContext | AllocSet | 5 | {1,20,31,136,145} | 32768
pg_get_backend_memory_contexts | AllocSet | 6 | {1,20,31,136,145,146} | 16384
(12 rows)
Previously, the contexts could be joined by linking the parent column (removed by the second commit) to the name column. This approach required a more complex, recursive query. More importantly, such linking was less reliable, as context names are not guaranteed to be unique.
Lastly, the third commit adds the type column. Currently, four context types are used: AllocSet, Generation, Slab, and Bump. For more information about memory contexts, see the source code: src/backend/utils/mmgr/README.
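The new type column also makes quick summaries easy, for example memory usage per context type (the numbers will vary from session to session):
SELECT type, count(*) AS contexts, sum(total_bytes) AS total
FROM pg_backend_memory_contexts
GROUP BY type
ORDER BY total DESC;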
Function pg_get_acl
commit: 4564f1ce, e311c6e5, d898665b
The following commands make the table test depend on the role alice.
CREATE ROLE alice;
CREATE TABLE test (id int);
GRANT SELECT ON test TO alice;
Now the role alice cannot be dropped without first removing the dependent privileges:
DROP ROLE alice;
ERROR: role "alice" cannot be dropped because some objects depend on it
DETAIL: privileges for table test
The object dependency data is stored in two system catalogs: pg_depend (for dependencies between objects within the same database) and pg_shdepend (for shared cluster objects, such as roles). Currently, pg_shdepend contains just one record, for the dependency we have just created.
SELECT * FROM pg_shdepend\gx
-[ RECORD 1 ]-----
dbid | 5
classid | 1259
objid | 16528
objsubid | 0
refclassid | 1260
refobjid | 16527
deptype | a
This information is not easily readable, since the objects are represented by identifiers: the system catalog table ID (classid), the row ID within that catalog (objid), and the ordinal column number for tables (objsubid). The referenced objects are represented by refclassid and refobjid, respectively (there are no references to table columns, so there is no refobjsubid column).
There are a number of convenient functions for accessing objects by their IDs. For example, the function pg_identify_object returns the object record:
SELECT * FROM pg_identify_object(1259,16528,0);
type | schema | name | identity
-------+--------+------+-------------
table | public | test | public.test
The object on which the table depends:
SELECT * FROM pg_identify_object(1260,16527,0);
type | schema | name | identity
------+--------+-------+----------
role | | alice | alice
There is a new addition to this set of functions. The new function pg_get_acl returns an object's access privileges without the need to query a specific system catalog. This query shows what privileges the role alice has:
SELECT *,
pg_identify_object(classid, objid, objsubid) AS object,
pg_identify_object(refclassid, refobjid, 0) AS ref_object,
pg_get_acl(classid, objid, objsubid)
FROM pg_shdepend\gx
-[ RECORD 1 ]---------------------------------------------
dbid | 5
classid | 1259
objid | 16528
objsubid | 0
refclassid | 1260
refobjid | 16527
deptype | a
object | (table,public,test,public.test)
ref_object | (role,,alice,alice)
pg_get_acl | {postgres=arwdDxtm/postgres,alice=r/postgres}
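pg_get_acl can also be called directly; for a table, the classid is the OID of pg_class. A minimal illustration using the same objects:
SELECT pg_get_acl('pg_class'::regclass, 'test'::regclass::oid, 0);
                  pg_get_acl
-----------------------------------------------
 {postgres=arwdDxtm/postgres,alice=r/postgres}
(1 row)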
pg_upgrade: pg_dump optimization
commit: 6e16b1e4, 64f34eb2e, 2329cad1b, 23687e925, 68e962998, c8b06bb96, bd15b7db4, 6ebeeae29
There are several new optimizations for the server version upgrade process. Let us consider them together, since they are all more or less related to pg_dump, the tool that dumps the contents of the system catalogs during an upgrade.
The first optimization is that pg_upgrade now runs pg_dump with the --no-sync option. During an upgrade, the guarantee that the files are flushed to disk is redundant: if anything goes wrong, the whole upgrade process has to be restarted from scratch anyway, so skipping the sync saves time.
The second optimization makes database creation commands use the FILE_COPY strategy instead of the default WAL_LOG:
CREATE DATABASE .. STRATEGY = FILE_COPY;
With this strategy, the contents of the template database are not logged; durability is guaranteed by performing checkpoints before and after copying the files. In normal operation, this is not the most efficient approach, but during a binary upgrade the server will not perform extra checkpoints anyway (there will be just one, after creating the template database template0).
The effectiveness of these two optimizations will increase with the number of databases in the upgraded cluster.
The third optimization speeds up the dumping of table descriptions in pg_dump --binary-upgrade mode (the mode in which pg_upgrade runs pg_dump). Now, at startup, pg_dump fetches information about all tables with a single query to pg_class and stores it in an ordered array in memory, saving time. Before this patch, a separate query to pg_class was issued for each table, which increased the costs considerably.
The fourth optimization speeds up the dumping of sequences in the same way the previous patch does for tables: at startup, pg_dump puts all the sequence data into an ordered array, eliminating repeated system catalog lookups later on. Notably, this optimization is not restricted to --binary-upgrade mode and works just as well for regular dumps.
The third and fourth optimizations come with a small tradeoff: the RAM required to hold the table and sequence information. The developers behind the patches believe this is well worth it. In databases with a small number of tables and sequences, the last two optimizations will not have much of an effect, but in databases with tens or hundreds of thousands of tables and sequences, the speedup is clearly noticeable.
Let us test it. Create a database in PostgreSQL 17 with 100,000 tables, each with a column linked to a sequence, then do the same in PostgreSQL 18.
$ psql -d test -o /dev/null <<EOF
SELECT format('CREATE TABLE t%s (id int GENERATED ALWAYS AS IDENTITY)', g.i)
FROM generate_series(1,100_000) AS g(i)
\gexec
EOF
Now, time how long pg_dump --binary-upgrade takes.
PostgreSQL 17:
$ time pg_dump -d test --schema-only --binary-upgrade > /dev/null
real 1m40,677s
user 0m5,403s
sys 0m3,265s
PostgreSQL 18:
$ time pg_dump -d test --schema-only --binary-upgrade > /dev/null
real 0m7,427s
user 0m2,220s
sys 0m0,264s
The patches make it run more than ten times faster.
Predefined role pg_signal_autovacuum_worker
commit: ccd38024b
Members of the predefined role pg_signal_backend can send the query termination signal (pg_cancel_backend) and the process termination signal (pg_terminate_backend) to backend processes, with one exception: the target process must not belong to a superuser.
However, there are maintenance scenarios where you may need to interrupt an autovacuum process. Members of the new predefined role pg_signal_autovacuum_worker can, without being superusers, use the same two functions to signal autovacuum workers, either to cancel vacuuming of the current table or to terminate the worker process entirely.
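A minimal sketch (the role name and PID here are illustrative; autovacuum workers can be found in pg_stat_activity):
GRANT pg_signal_autovacuum_worker TO alice;

-- as alice: locate an autovacuum worker
SELECT pid, query FROM pg_stat_activity
WHERE backend_type = 'autovacuum worker';

-- cancel the current table vacuuming job, or terminate the worker
SELECT pg_cancel_backend(12345);
SELECT pg_terminate_backend(12345);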
That is all for now. News from the September CommitFest will follow soon!