
PostgreSQL 18: part 1 or CommitFest 2024-07

This article is the first in a series about the upcoming PostgreSQL 18 release. Let us take a look at the features introduced in the July CommitFest.

Planner: Hash Right Semi Join support

commit: aa86129e1

The plan below, produced by PostgreSQL 17, uses a hash join and sequential scans of both tables.


17=# EXPLAIN (costs off, analyze)
SELECT * FROM flights
WHERE flight_id IN (SELECT flight_id FROM ticket_flights);

                                           QUERY PLAN                                           
------------------------------------------------------------------------------------------------
 Hash Join (actual time=2133.122..2195.619 rows=150588 loops=1)
   Hash Cond: (flights.flight_id = ticket_flights.flight_id)
   ->  Seq Scan on flights (actual time=0.018..10.301 rows=214867 loops=1)
   ->  Hash (actual time=2132.969..2132.970 rows=150588 loops=1)
         Buckets: 262144 (originally 131072)  Batches: 1 (originally 1)  Memory Usage: 7343kB
         ->  HashAggregate (actual time=1821.476..2114.218 rows=150588 loops=1)
               Group Key: ticket_flights.flight_id
               Batches: 5  Memory Usage: 10289kB  Disk Usage: 69384kB
               ->  Seq Scan on ticket_flights (actual time=7.200..655.356 rows=8391852 loops=1)
 Planning Time: 0.325 ms
 Execution Time: 2258.237 ms
(11 rows)

However, the hash table is built from the larger table, ticket_flights, which obviously requires more resources than building it from the smaller flights table would.

In PostgreSQL 18, the planner can choose which table to hash for such a query. In the plan below, the Hash Right Semi Join node shows that the hash is now built on the smaller flights table, and the join even runs in parallel. As a result, the execution time is noticeably reduced.
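
For reference, here is the same query again, this time run on PostgreSQL 18:


18=# EXPLAIN (costs off, analyze)
SELECT * FROM flights
WHERE flight_id IN (SELECT flight_id FROM ticket_flights);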


                                            QUERY PLAN                                             
---------------------------------------------------------------------------------------------------
 Gather (actual time=56.771..943.233 rows=150590 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   ->  Parallel Hash Right Semi Join (actual time=41.754..909.894 rows=50197 loops=3)
         Hash Cond: (ticket_flights.flight_id = flights.flight_id)
         ->  Parallel Seq Scan on ticket_flights (actual time=0.047..221.511 rows=2797284 loops=3)
         ->  Parallel Hash (actual time=40.309..40.309 rows=71622 loops=3)
               Buckets: 262144  Batches: 1  Memory Usage: 23808kB
               ->  Parallel Seq Scan on flights (actual time=0.008..6.631 rows=71622 loops=3)
 Planning Time: 0.555 ms
 Execution Time: 949.831 ms
(11 rows)

Planner: materializing the inner row set for a parallel nested loop join

commit: 22d946b0f

Before PostgreSQL 18, the planner never considered materializing the inner row set for a parallel nested loop join.


17=# EXPLAIN (costs off)
SELECT *
FROM ticket_flights tf
     JOIN flights f USING (flight_id)
WHERE f.flight_id = 12345;

                     QUERY PLAN                     
----------------------------------------------------
 Nested Loop
   ->  Index Scan using flights_pkey on flights f
         Index Cond: (flight_id = 12345)
   ->  Gather
         Workers Planned: 2
         ->  Parallel Seq Scan on ticket_flights tf
               Filter: (flight_id = 12345)
(7 rows)

In PostgreSQL 18, the plan for the same query materializes the rows fetched from flights:


                          QUERY PLAN                          
--------------------------------------------------------------
 Gather
   Workers Planned: 2
   ->  Nested Loop
         ->  Parallel Seq Scan on ticket_flights tf
               Filter: (flight_id = 12345)
         ->  Materialize
               ->  Index Scan using flights_pkey on flights f
                     Index Cond: (flight_id = 12345)
(8 rows)

Notably, performance does not change much with the new plan: both plans perform essentially the same work. However, the planner now has more opportunities to choose the optimal plan in other situations, which is good news.
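
If you want to compare against the old plan shape on the same server, the Materialize node can be discouraged with the long-standing enable_material planner parameter (a generic setting, not something added by this patch):


SET enable_material = off;

EXPLAIN (costs off)
SELECT *
FROM ticket_flights tf
     JOIN flights f USING (flight_id)
WHERE f.flight_id = 12345;

RESET enable_material;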

Planner support functions for generate_series

commit: 036bdcec9

The new support function tells the planner how many rows generate_series will return for a range of dates and times:


EXPLAIN
SELECT *
FROM generate_series(current_date, current_date + '1 day'::interval, '1 hour');

                             QUERY PLAN                              
---------------------------------------------------------------------
 Function Scan on generate_series  (cost=0.01..0.26 rows=25 width=8)
(1 row)

Previously, the number of rows was always estimated at 1000 (the default ROWS value for functions).

Note that a similar support function for integers has long existed:


EXPLAIN
SELECT *
FROM generate_series(1, 42, 2);

                             QUERY PLAN                              
---------------------------------------------------------------------
 Function Scan on generate_series  (cost=0.00..0.21 rows=21 width=4)
(1 row)
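
In both cases the estimate comes from a planner support function attached to the corresponding generate_series variant. One way to list the variants that have one is a quick catalog query against the prosupport column of pg_proc (not from the original article, just a simple check):


SELECT oid::regprocedure AS variant, prosupport
FROM pg_proc
WHERE proname = 'generate_series' AND prosupport <> 0;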

EXPLAIN (analyze): statistics for Parallel Bitmap Heap Scan node workers

commit: 5a1e6df3b

The EXPLAIN command now shows statistics (the number of exact and lossy heap blocks) for each worker involved in a parallel bitmap scan. In the example below, these are the lines starting with Worker 0 and Worker 1.


EXPLAIN (analyze, costs off, timing off, summary off)
SELECT count(*) FROM bookings
  WHERE total_amount < 20000 AND book_date > '2017-07-15';

                                             QUERY PLAN                                              
-----------------------------------------------------------------------------------------------------
 Finalize Aggregate (actual rows=1 loops=1)
   ->  Gather (actual rows=3 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         ->  Partial Aggregate (actual rows=1 loops=3)
               ->  Parallel Bitmap Heap Scan on bookings (actual rows=6415 loops=3)
                     Recheck Cond: (total_amount < '20000'::numeric)
                     Filter: (book_date > '2017-07-15 00:00:00+03'::timestamp with time zone)
                     Rows Removed by Filter: 67072
                     Heap Blocks: exact=6345
                     Worker 0:  Heap Blocks: exact=3632
                     Worker 1:  Heap Blocks: exact=3470
                     ->  Bitmap Index Scan on bookings_total_amount_idx (actual rows=220460 loops=1)
                           Index Cond: (total_amount < '20000'::numeric)

Functions min and max for composite types

commit: a0f1fce80

Composite type values have been comparable with each other for a while now; the comparison is performed element by element. For example, let us sort the following rows:


CREATE TABLE seats(
    line text,
    number integer
);

INSERT INTO seats VALUES
    ('A', 42), ('B',  1), ('C', 27);

SELECT * FROM seats s ORDER BY s DESC;

 line | number
------+--------
 C    |     27
 B    |      1
 A    |     42
(3 rows)

However, there were no aggregate functions min and max for composite types. Until now.


SELECT min(s.*), max(s.*) FROM seats s;

  min   |  max   
--------+--------
 (A,42) | (C,27)
(1 row)

Parameter names for regexp* functions

commit: 580f8727

The regular expression functions now have parameter names for improved clarity:


SELECT regexp_replace(
    pattern => '$',
    replacement => 'Postgres!',
    string => 'All You Need Is '
);

      regexp_replace       
---------------------------
 All You Need Is Postgres!

The parameter names are listed in the documentation and shown in the \df command output in psql. Previously, arguments could be passed only positionally.
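
Named notation can also be mixed with positional notation, as with any SQL function, as long as the named arguments come last. For example, using the same parameter name as above:


SELECT regexp_replace('All You Need Is ', '$', replacement => 'Postgres!');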

Debug mode in pgbench

commit: 3ff01b2b

Most PostgreSQL command-line utilities use the -d option to specify the database to connect to. In pgbench, however, this option used to enable debug mode. The options have been renamed for consistency: now -d or --dbname specifies the database, and debug mode is enabled with the long option --debug.

The change was already included in PostgreSQL 17, even though the patch formally went through the July CommitFest for PostgreSQL 18.
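
A quick sketch of the new behavior (the database name bench is just an example):


$ pgbench --initialize -d bench       # -d/--dbname now selects the database
$ pgbench -d bench -T 10 --debug      # debug output is enabled with --debug only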

pg_get_backend_memory_contexts: column path instead of parent, new column type

commit: 32d3ed81, f0d11275, 12227a1d

Core developers and enthusiasts researching memory usage will appreciate the changes in the pg_backend_memory_contexts view.

The patch makes linking child and parent contexts more convenient and reliable. There is a new column, path, which holds an array of context IDs. The first element of the array is always the ID of the top-level context, TopMemoryContext, and the last element is the ID of the current row's context. In addition, the numbering in the level column now starts at 1 instead of 0, so that the value of level matches the corresponding position in the path array, which makes queries easier to write.

Below is an example of a query showing the memory context of TopPortalContext and all its child contexts.


WITH memory_contexts AS (
    SELECT * FROM pg_backend_memory_contexts
)
SELECT child.name, child.type, child.level, child.path, child.total_bytes
FROM memory_contexts parent, memory_contexts child
WHERE parent.name = 'TopPortalContext' AND
      child.path[parent.level] = parent.path[parent.level];

              name              |    type    | level |         path          | total_bytes
--------------------------------+------------+-------+-----------------------+-------------
 TopPortalContext               | AllocSet   |     2 | {1,20}                |        8192
 PortalContext                  | AllocSet   |     3 | {1,20,31}             |        1024
 ExecutorState                  | AllocSet   |     4 | {1,20,31,136}         |       65632
 tuplestore tuples              | Generation |     5 | {1,20,31,136,138}     |       32768
 printtup                       | AllocSet   |     5 | {1,20,31,136,139}     |        8192
 ExprContext                    | AllocSet   |     5 | {1,20,31,136,140}     |        8192
 ExprContext                    | AllocSet   |     5 | {1,20,31,136,141}     |        8192
 tuplestore tuples              | Generation |     5 | {1,20,31,136,142}     |        8192
 ExprContext                    | AllocSet   |     5 | {1,20,31,136,143}     |        8192
 Table function arguments       | AllocSet   |     5 | {1,20,31,136,144}     |        8192
 ExprContext                    | AllocSet   |     5 | {1,20,31,136,145}     |       32768
 pg_get_backend_memory_contexts | AllocSet   |     6 | {1,20,31,136,145,146} |       16384
(12 rows)

Previously, the contexts could be joined by linking the parent column (removed by the second commit) to the name column. This approach required a more complex, recursive query. More importantly, this sort of linking was less reliable, as the context names were not guaranteed to be unique.

Lastly, the third commit adds the type column. Currently, four types of context are used: AllocSet, Generation, Slab, Bump. For more information about memory contexts, see the source code: src/backend/utils/mmgr/README.
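
For example, the new column makes it easy to see how much memory each context type accounts for in the current backend (a simple aggregate over the view):


SELECT type, count(*) AS contexts, sum(total_bytes) AS bytes
FROM pg_backend_memory_contexts
GROUP BY type
ORDER BY bytes DESC;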

Function pg_get_acl

commit: 4564f1ce, e311c6e5, d898665b

The following commands make the table test depend on the role alice.


CREATE ROLE alice;
CREATE TABLE test (id int);
GRANT SELECT ON test TO alice;

Now the role alice cannot be dropped without revoking the dependent privileges first:


DROP ROLE alice;

ERROR:  role "alice" cannot be dropped because some objects depend on it
DETAIL:  privileges for table test

Object dependency data is stored in two system catalogs: pg_depend (for dependencies between objects within the same database) and pg_shdepend (for dependencies on shared cluster objects, such as roles).

At this point, pg_shdepend contains just one record, describing the dependency we have just created.


SELECT * FROM pg_shdepend\gx

-[ RECORD 1 ]-----
dbid       | 5
classid    | 1259
objid      | 16528
objsubid   | 0
refclassid | 1260
refobjid   | 16527
deptype    | a

This information is not easy to read, since the objects are represented by identifiers: the system catalog table ID (classid), the row ID within that catalog (objid) and, for tables, the ordinal column number (objsubid). The referenced object is identified by refclassid and refobjid (there are no references to table columns, so there is no refobjsubid column).

There are a number of convenient functions for looking up objects by their IDs. For example, the function pg_identify_object returns a record describing the object:


SELECT * FROM pg_identify_object(1259,16528,0);

 type  | schema | name |  identity   
-------+--------+------+-------------
 table | public | test | public.test

The object on which the table depends:


SELECT * FROM pg_identify_object(1260,16527,0);

 type | schema | name  | identity
------+--------+-------+----------
 role |        | alice | alice

There is a new addition to this set of functions: pg_get_acl returns an object's access privileges without the need to query the corresponding system catalog.
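
The function can also be called directly with the identifiers from pg_shdepend shown above; it returns the table's ACL as an array of aclitem values (the same value appears in the pg_get_acl column of the query below):


SELECT pg_get_acl(1259, 16528, 0);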

This query shows what privileges the role alice has:


SELECT *,
    pg_identify_object(classid, objid, objsubid) AS object,
    pg_identify_object(refclassid, refobjid, 0) AS ref_object,
    pg_get_acl(classid, objid, objsubid)
FROM pg_shdepend\gx

-[ RECORD 1 ]---------------------------------------------
dbid       | 5
classid    | 1259
objid      | 16528
objsubid   | 0
refclassid | 1260
refobjid   | 16527
deptype    | a
object     | (table,public,test,public.test)
ref_object | (role,,alice,alice)
pg_get_acl | {postgres=arwdDxtm/postgres,alice=r/postgres}

pg_upgrade: pg_dump optimization

commit: 6e16b1e4, 64f34eb2e, 2329cad1b, 23687e925, 68e962998, c8b06bb96, bd15b7db4, 6ebeeae29

There are several new optimizations for the server upgrade process. Let us consider them together, since they are all more or less related to pg_dump, the tool that dumps the contents of the system catalogs during an upgrade.

The first optimization is that pg_upgrade now runs pg_dump with the --no-sync option. During an upgrade, the guarantee that the dumped files are flushed to disk is redundant: if anything goes wrong, the whole upgrade has to be restarted from scratch anyway, so the time spent syncing files to disk is simply wasted.

The second optimization makes database creation commands use the FILE_COPY strategy instead of the default WAL_LOG:


CREATE DATABASE .. STRATEGY = FILE_COPY;

With this strategy, the contents of the template database are not logged, and durability is guaranteed by performing checkpoints before and after copying the files. In normal operation, this is not the most efficient approach, but during binary upgrade the server will not perform extra checkpoints anyway (there will be just one, after creating the template database template0).

The effectiveness of these two optimizations will increase with the number of databases in the upgraded cluster.

The third optimization speeds up the dumping of table descriptions in pg_dump --binary-upgrade mode, which is the mode pg_upgrade runs pg_dump in. Now pg_dump fetches the information about all tables with a single query against pg_class at startup and keeps it in an ordered array in memory. Before this patch, a separate pg_class query was issued for each table, which added considerable overhead.

The fourth optimization speeds up the dumping of sequences in the same way the previous patch handles tables: at startup, pg_dump loads all sequence data into an ordered array, so it does not have to query the system catalog again later. Notably, this optimization is not restricted to --binary-upgrade mode and works in regular dumps as well.

The third and fourth optimizations come with a small tradeoff in the form of the RAM required to store the table and sequence information, but the developers behind the patches believe that this is absolutely worth it.

In databases with a small number of tables and sequences, the last two optimizations will not have much of an effect. On the other hand, in databases with tens or hundreds of thousands of tables and sequences, the speedup is quite noticeable.

Let us test it. Create a database in PostgreSQL 17 with 100 000 tables, each with a column linked to a sequence. Then do the same in PostgreSQL 18.


$ psql -d test -o /dev/null <<EOF
SELECT format('CREATE TABLE t%s (id int GENERATED ALWAYS AS IDENTITY)', g.i)
FROM generate_series(1,100_000) AS g(i)
\gexec
EOF

Now, time how long pg_dump --binary-upgrade takes.

PostgreSQL 17:


$ time pg_dump -d test --schema-only --binary-upgrade > /dev/null

real    1m40,677s
user    0m5,403s
sys    0m3,265s

PostgreSQL 18:


$ time pg_dump -d test --schema-only --binary-upgrade > /dev/null

real    0m7,427s
user    0m2,220s
sys    0m0,264s

The patches make it run more than ten times faster.

Predefined role pg_signal_autovacuum_worker

commit: ccd38024b

Members of the predefined role pg_signal_backend can send the query termination signal (pg_cancel_backend) and process termination signal (pg_terminate_backend) to server backend processes. With one exception: the target process may not belong to a superuser.

However, there are maintenance scenarios where you need to interrupt an autovacuum process. Members of the new predefined role pg_signal_autovacuum_worker can, without being superusers, use the same two functions to signal autovacuum workers: either to cancel vacuuming of the current table or to terminate the worker process entirely.
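
A minimal sketch of how this could be used (the role monitoring and the PID are hypothetical):


-- as superuser: allow a non-superuser role to signal autovacuum workers
GRANT pg_signal_autovacuum_worker TO monitoring;

-- as monitoring: find autovacuum workers...
SELECT pid, query FROM pg_stat_activity WHERE backend_type = 'autovacuum worker';

-- ...and either cancel the current vacuuming job or terminate the worker
SELECT pg_cancel_backend(12345);
SELECT pg_terminate_backend(12345);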

 

This is all for now. News from the September CommitFest will follow soon!


Pavel Luzanov