Home > Blog

November 5, 2023 • PostgreSQL • By Pavel Luzanov

PostgreSQL 17: part 2 or CommitFest 2023-09

PostgreSQL

Source:
habr.com

Subscribe to blog

Source:
habr.com

Subscribe to blog

We continue to follow the news of the PostgreSQL 17 development. Let’s find out what the September commitfest brings to the table.

If you missed our July commitfest review, you can check it out here: 2023-07.

Removed the parameter old_snapshot_threshold
New parameter event_triggers
New functions to_bin and to_oct
New system view pg_wait_events
EXPLAIN: a JIT compilation time counter for tuple deforming
Planner: better estimate of the initial cost of the WindowAgg node
pg_constraint: NOT NULL constraints
Normalization of CALL, DEALLOCATE and two-phase commit control commands
unaccent: the target rule expressions now support values in quotation marks
COPY FROM: FORCE_NOT_NULL * and FORCE_NULL *
Audit of connections without authentication
pg_stat_subscription: new column worker_type
The behaviour of pg_promote in case of unsuccessful switchover to a replica
Choosing the disk synchronization method in server utilities
pg_restore: optimization of parallel recovery of a large number of tables
pg_basebackup and pg_receivewal with the parameter dbname
Parameter names for a number of built-in functions
psql: \watch min_rows

Removed the parameter old_snapshot_threshold

commit: f691f5b8

The parameter old_snapshot_threshold first appeared in patch 9.6. The parameter defines the time threshold for a snapshot during which old row versions will not be deleted. After the threshold expires, all visible row versions in the snapshot become vacuumable. When querying the vacuumed row versions, PostgreSQL returns an error that is all too familiar to Oracle users: “Snapshot too old”.

As it turns out, there are issues with the parameter’s implementation, including some performance-related ones. With no clear-cut solution in sight so far, the parameter had to be scrapped.

New parameter event_triggers

commit: 7750fefd

The parameter event_triggers enables and disables event triggers.


\dconfig+ event_triggers


               List of configuration parameters
   Parameter    | Value | Type |  Context  | Access privileges
----------------+-------+------+-----------+-------------------
 event_triggers | on    | bool | superuser |
(1 row)

It is intended as a trigger debugging tool. Previously, the CREATE EVENT TRIGGER command documentation suggested to run the server in single-user mode if a malfunctioning trigger could not be deleted.

New functions to_bin and to_oct

commit: 260a1f18

The existing decimal to hexadecimal conversion function to_hex gets its binary and octal siblings:


SELECT to_bin(2), to_oct(8), to_hex(16);


 to_bin | to_oct | to_hex
--------+--------+--------
 10     | 10     | 10
(1 row)

New system view pg_wait_events

commit: 1e68e43d

In the documentation for the pg_stat_activity view, the wait_event_type and wait_event columns are described in a table explaining the types of wait events and several tables describing each wait event type. You can now access these descriptions by querying the new pg_wait_events system view.

This can be useful in monitoring systems to get a description of what current processes are waiting for. What is the checkpoint process doing now?


SELECT a.pid, a.state, a.wait_event_type, a.wait_event, w.description
FROM pg_stat_activity a JOIN pg_wait_events w
     ON (a.wait_event_type = w.type AND a.wait_event = w.name)
WHERE a.backend_type = 'checkpointer'\gx


-[ RECORD 1 ]---+---------------------------------------------
pid             | 21090
state           |
wait_event_type | Activity
wait_event      | CheckpointerMain
description     | Waiting in main loop of checkpointer process

EXPLAIN: a JIT compilation time counter for tuple deforming

commit: 5a3423ad

The patch adds specific time spent on JIT compilation for tuple deforming to EXPLAIN ANALYZE and pg_stat_statements outputs.

In the following query, JIT compilation is used to calculate the 𝜋:


EXPLAIN (analyze, summary off)
WITH pi AS (
  SELECT random() x, random() y
  FROM generate_series(1,10000000)
)
SELECT 4*sum(1-floor(x*x+y*y))/count(*) val FROM pi;


                                                                  QUERY PLAN                                                                   
-----------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=525000.00..525000.02 rows=1 width=8) (actual time=3934.330..3934.331 rows=1 loops=1)
   CTE pi
     ->  Function Scan on generate_series  (cost=0.00..150000.00 rows=10000000 width=16) (actual time=725.126..1654.054 rows=10000000 loops=1)
   ->  CTE Scan on pi  (cost=0.00..200000.00 rows=10000000 width=16) (actual time=725.129..3429.925 rows=10000000 loops=1)
 JIT:
   Functions: 6
   Options: Inlining true, Optimization true, Expressions true, Deforming true
   Timing: Generation 0.954 ms (Deform 0.161 ms), Inlining 8.476 ms, Optimization 25.632 ms, Emission 22.098 ms, Total 57.160 ms
(8 rows)

The Generation value in the last line of the plan is the sum of the expression compilation time (toggled by the jit_expressions parameter) and the deforming compilation time (jit_tuple_deforming). The deforming compilation time specifically is listed in the parentheses (Deform).

Since there are no tables in this example, tuple deforming compilation can be switched off to save time. Disable the parameter to see how much we save:


SET jit_tuple_deforming = off;


The JIT section of the plan:

 JIT:
   Functions: 5
   Options: Inlining true, Optimization true, Expressions true, Deforming false
   Timing: Generation 0.665 ms (Deform 0.000 ms), Inlining 15.209 ms, Optimization 25.897 ms, Emission 16.664 ms, Total 58.435 ms

Planner: better estimate of the initial cost of the WindowAgg node

commit: 3900a02c

The planner greatly underestimated the cost of getting the first row from a window function in cases where the window function processes a large number of rows.

In the following (slightly contrived) example, the planner selects a nested loop to join two tables, because the query returns only one row, and the cost of getting the first row is very small:


16=> EXPLAIN (analyze, settings)
SELECT t.ticket_no, SUM(tf.amount) OVER ()
FROM tickets t JOIN ticket_flights tf ON t.ticket_no = tf.ticket_no
LIMIT 1;


                                                                           QUERY PLAN                                                                           
----------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.56..0.86 rows=1 width=46) (actual time=30416.237..30416.239 rows=1 loops=1)
   ->  WindowAgg  (cost=0.56..2525161.21 rows=8391708 width=46) (actual time=30416.236..30416.237 rows=1 loops=1)
         ->  Nested Loop  (cost=0.56..2420264.86 rows=8391708 width=20) (actual time=0.049..27742.105 rows=8391852 loops=1)
               ->  Seq Scan on tickets t  (cost=0.00..78913.45 rows=2949845 width=14) (actual time=0.015..287.636 rows=2949857 loops=1)
               ->  Index Scan using ticket_flights_pkey on ticket_flights tf  (cost=0.56..0.76 rows=3 width=20) (actual time=0.007..0.009 rows=3 loops=2949857)
                     Index Cond: (ticket_no = t.ticket_no)
 Settings: jit = 'off', search_path = 'bookings, public'
 Planning Time: 0.456 ms
 Execution Time: 30454.862 ms
(9 rows)

Note the second line of the plan, the one where the WindowAgg node appears. The initial cost is 0.56, but the actual cost of getting the first row was about 30 seconds.

The new patch amends the assessment algorithm, so the planner uses a more appropriate hash join instead, immediately increasing the speed of the query:


17=> EXPLAIN (analyze, settings)
SELECT t.ticket_no, SUM(tf.amount) OVER ()
FROM tickets t JOIN ticket_flights tf ON t.ticket_no = tf.ticket_no
LIMIT 1;


                                                                   QUERY PLAN                                                                    
-------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=523778.31..523778.31 rows=1 width=46) (actual time=7157.639..7157.642 rows=1 loops=1)
   ->  WindowAgg  (cost=523778.31..523778.36 rows=8392150 width=46) (actual time=7157.638..7157.640 rows=1 loops=1)
         ->  Hash Join  (cost=130215.56..418876.48 rows=8392150 width=20) (actual time=740.005..5075.390 rows=8391852 loops=1)
               Hash Cond: (tf.ticket_no = t.ticket_no)
               ->  Seq Scan on ticket_flights tf  (cost=0.00..153881.50 rows=8392150 width=20) (actual time=0.066..794.699 rows=8391852 loops=1)
               ->  Hash  (cost=78938.47..78938.47 rows=2949847 width=14) (actual time=737.972..737.973 rows=2949857 loops=1)
                     Buckets: 262144  Batches: 32  Memory Usage: 6190kB
                     ->  Seq Scan on tickets t  (cost=0.00..78938.47 rows=2949847 width=14) (actual time=0.029..307.040 rows=2949857 loops=1)
 Settings: search_path = 'bookings, public', jit = 'off'
 Planning Time: 0.331 ms
 Execution Time: 7202.957 ms
(11 rows)

pg_constraint: NOT NULL constraints

commit: b0e96f31

A new, but well-known, type of integrity constraint, NOT NULL is now officially introduced into the pg_constraint system catalog. For various reasons, NOT NULL constraints were not written to pg_constraint before.

This change does not bring new application development opportunities, but the information in the system catalog becomes more complete, and working with constraints becomes more consistent.

The NOT NULL constraints in pg_constraint have the contype = 'n' type:


CREATE TABLE t (
    col1 int NOT NULL,
    col2 int NOT NULL
);

SELECT conname, contype, conkey
FROM pg_constraint
WHERE conrelid = 't'::regclass;


     conname     | contype | conkey
-----------------+---------+--------
 t_col1_not_null | n       | {1}
 t_col2_not_null | n       | {2}
(2 rows)

Just like any other constraint, the NOT NULL constraint can be removed by name:


ALTER TABLE t DROP CONSTRAINT t_col2_not_null;

The constraints are listed in a separate section of the \d+ output:


postgres@postgres(17.0)=# \d+ t


                                            Table "public.t"
 Column |  Type   | Collation | Nullable | Default | Storage | Compression | Stats target | Description
--------+---------+-----------+----------+---------+---------+-------------+--------------+-------------
 col1   | integer |           | not null |         | plain   |             |              |
 col2   | integer |           |          |         | plain   |             |              |
Not-null constraints:
    "t_col1_not_null" NOT NULL "col1"
Access method: heap

Currently, NOT NULL constraints cannot be created without checking existing data (NOT VALID) or declared deferred (DEFERRED). In addition, NOT NULL constraints for system catalog tables are not written in pg_constraint.

Normalization of CALL, DEALLOCATE and two-phase commit control commands

commit: 11c34b34, 638d42a3, bb45156f

The command normalization mechanism can now place constants in place of procedure parameters in CALL statements. With this patch, a procedure call with different parameters will be recorded in pg_stat_statements as a single line:


CALL p(1);
CALL p(2);
CALL p(3);

SELECT queryid, query, calls
FROM pg_stat_statements
WHERE query ILIKE 'CALL%';


       queryid       |   query    | calls
---------------------+------------+-------
 7076031282516132618 | CALL p($1) |     3
(1 row)

In addition, transaction identifiers in two-phase commit control commands (see the second commit) and names of prepared statements in DEALLOCATE (the third commit) can now be replaced with constants as well.

Normalization of DDL and service commands, continued

unaccent: the target rule expressions now support values in quotation marks

commit: 59f47fb9

To eliminate diacritics, the built-in full-text search suggests using the unaccent dictionary supplied with the server as an extension under the same name.

When defining custom dictionary rules, target expressions can now be enclosed in quotation marks. This is necessary if the target expression must start with a whitespace or end with it. Previously, the leading and trailing spaces were cut off.

COPY FROM: FORCE_NOT_NULL * and FORCE_NULL *

commit: f6d4c9cf

The COPY command parameters FORCE_NOT_NULL and FORCE_NULL appeared a long time ago, in PostgreSQL 9.0 and 9.4, respectively. However, they required an explicit enumeration of columns, even if they were to be applied to all columns in the table.

You can now use the asterisk character * instead.


CREATE TABLE t (col1 text NOT NULL, col2 text NOT NULL);

COPY t FROM STDIN WITH (FORMAT csv, FORCE_NOT_NULL *);


Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself, or an EOF signal.
>> "",""
>> \.
COPY 1


SELECT * FROM t WHERE col1 = '' and col2 = '';


 col1 | col2
------+------
      |
(1 row)

Audit of connections without authentication

commit: e48b19c5

If a user connects without authentication while the parameter log_connections is enabled, an entry about the connection will be recorded in the server log, indicating the triggered line number (117) with the trust method in pg_hba.conf:


2023-10-19 11:40:07.609 MSK [111233] LOG:  connection authenticated: user="postgres" method=trust (/home/pal/master/data/pg_hba.conf:117)
2023-10-19 11:40:07.609 MSK [111233] LOG:  connection authorized: user=postgres database=demo application_name=psql

Previously, only the second line was recorded for the trust method.

Note that the SYSTEM_USER function still returns NULL for such connections.

pg_stat_subscription: new column worker_type

commit: 13aeaf07

The new pg_stat_subscription view column worker_type helps to quickly identify what task is being performed by the process serving the subscription. Possible values are self-explanatory: apply, parallel apply and table synchronization.

Remember that from PostgreSQL 16 and onwards, the server can now use multiple processes to apply transactions.

The behaviour of pg_promote in case of unsuccessful switchover to a replica

commit: f593c551

One of the means of switching to a replica is the pg_promote function. If the switchover is unsuccessful, the function returns false, no matter what caused it:

the switchover timeout,
failure to send the SIGUSR1 signal to the postmaster,
or even the postmaster crashing mid-switch.

The two latter cases resulted in a warning in the server log, though. With the new patch, the warnings are changed to errors, so the function will only return false upon timeout, and will error out in other cases.

Choosing the disk synchronization method in server utilities

commit: 8c16ad3b

Server utilities that write files to disk now have a new parameter --synch-method. These include initdb, pg_basebackup, pg_checksums, pg_dump, pg_rewind and pg_upgrade.

By default, --synch-method=fsync is used.

On linux systems, you can use syncfs to synchronize the entire file system at once, instead of calling fsync on each file. It will be faster in most cases. However, if other applications that modify files are actively working with the same file system, some side effects are possible, because their writing to disk must also be synchronized. There is a new appendix in the documentation warning about this: Appendix O. syncfs() Caveats.

pg_restore: optimization of parallel recovery of a large number of tables

commit: c103d073, 5af0263a, 9bfd44bb

Parallel recovery in pg_restore was performed suboptimally on backups containing a large number of tables (tens of thousands and more).

The problem is that the parent coordinator process sorted the list of tables each time to find the largest one and pass it to a worker process for recovery. The sorting took up so many resources that the parent process itself became a bottleneck while workers were idle waiting for work.

To optimize the selection of the next table, the list of tables was placed in a binary heap. Now pg_restore will be much faster when recovering a large number of tables.

pg_basebackup and pg_receivewal with the parameter dbname

commit: cca97ce6

pg_basebackup and pg_receivewal accept dbname as a connection parameter. Generally, you don’t need to specify it, since backups and WAL are at the database cluster level. But in case of connection via proxy, particularly with pgbouncer, the database must be specified.

Whether you should connect to the system over pgbouncer to perform backups may be up for debate, but if there is no other way to connect to the database, then this patch gives you an out.

Parameter names for a number of built-in functions

commit: b575a26c

When defining a function, you don’t have to define its parameter names – just their types. This is why most of the built-in functions don’t have parameter names. But if a function has more than one parameter of the same type, then distinguishing which parameter does what in the \df output becomes sinuous, forcing you to look it up in the documentation.


16=> \df string_agg


                            List of functions
   Schema   |    Name    | Result data type | Argument data types | Type
------------+------------+------------------+---------------------+------
 pg_catalog | string_agg | bytea            | bytea, bytea        | agg
 pg_catalog | string_agg | text             | text, text          | agg
(2 rows)

The patch adds parameter names to definitions of some aggregate functions that have multiple parameters. The clarity has improved:


17=> \df string_agg


                                List of functions
   Schema   |    Name    | Result data type |     Argument data types      | Type
------------+------------+------------------+------------------------------+------
 pg_catalog | string_agg | bytea            | value bytea, delimiter bytea | agg
 pg_catalog | string_agg | text             | value text, delimiter text   | agg
(2 rows)

psql: \watch min_rows

commit: f347ec76

PostgreSQL 16 added the count parameter to the \watch command. The parameter allows the user to specify how many times should the query repeat.

The patch adds a new way to interrupt the execution of \watch. If the query returns less rows than specified by the parameter min_rows, the query will no longer repeat. This can be convenient when monitoring job execution.


CREATE TABLE job (id int, status text DEFAULT 'NEW');
INSERT INTO job VALUES (1),(2),(3);

For the demonstration, job processing is done together with monitoring:


WITH process_job AS (
    UPDATE job SET status = 'PROCESSED'
    WHERE id = (SELECT id
                FROM job
                WHERE status = 'NEW'
                ORDER BY id LIMIT 1)
)
SELECT * FROM job WHERE status = 'NEW'
\watch min_rows=1


Mon 16 Oct 2023 05:55:53 PM MSK (every 2s)

 id | status
----+--------
  1 | NEW
  2 | NEW
  3 | NEW
(3 rows)

Mon 16 Oct 2023 05:55:55 PM MSK (every 2s)

 id | status
----+--------
  2 | NEW
  3 | NEW
(2 rows)

Mon 16 Oct 2023 05:55:57 PM MSK (every 2s)

 id | status
----+--------
  3 | NEW
(1 row)

Mon 16 Oct 2023 05:55:59 PM MSK (every 2s)

 id | status
----+--------
(0 rows)

This is all for now. Looking forward to the November commitfest!

← Back to all articles