LISTEN/NOTIFY bug: VACUUM sets frozenxid past a xid in async queue - Mailing list pgsql-hackers

From Alexandra Wang
Subject LISTEN/NOTIFY bug: VACUUM sets frozenxid past a xid in async queue
Date
Msg-id CAK98qZ3wZLE-RZJN_Y+TFjiTRPPFPBwNBpBi5K5CU8hUHkzDpw@mail.gmail.com
Whole thread Raw
Responses Re: LISTEN/NOTIFY bug: VACUUM sets frozenxid past a xid in async queue
List pgsql-hackers
Hi,

I'm bringing up a bug that was reported multiple times [1][2][3] in
the bugs list here, for a broader audience.

The issue is that an ERROR like the one below occurs when trying to
register any listener in the database.

test=# listen c21;
ERROR:  58P01: could not access status of transaction 14279685
DETAIL:  Could not open file "pg_xact/000D": No such file or directory.
LOCATION:  SlruReportIOError, slru.c:1087

In [1], Andrei Varashen provided detailed reproduction steps. I’m
copying and pasting his example from thread [1] here, with slight
simplification.

Pre-conditions:
- Disable autovacuum to avoid its intervention.

Steps to reproduce:

1. Create a test table and notify (but not listen) to a channel
(backend 1):

create table test (id int);
insert into test values (1);
notify c1;

2. List pg_xact files so we know its starting state:

➜  ls -lah ~/pg-devel/data/pg_xact
total 16
drwx------@  3 alex.wang  staff    96B Aug  5 20:10 .
drwx------@ 26 alex.wang  staff   832B Aug  5 20:12 ..
-rw-------@  1 alex.wang  staff   8.0K Aug  5 20:12 0000

3. Prepare a "test.sql" file for pgbench, and then run pgbench to
generate lots of transactions, so that pg_xact/0000 is completely
filled and leads to pg_xact/0001.

> cat test.sql
UPDATE test SET id = 1;

> pgbench -n -c 80 -j 10 -t 15000 -f ~/workspace/test.sql postgres

4. Verify that pg_xact/0001 is created:

➜  ls -lah ~/pg-devel/data/pg_xact
total 560
drwx------@  4 alex.wang  staff   128B Aug  5 20:25 .
drwx------@ 26 alex.wang  staff   832B Aug  5 20:12 ..
-rw-------@  1 alex.wang  staff   256K Aug  5 20:25 0000
-rw-------@  1 alex.wang  staff    16K Aug  5 20:25 0001

5. Execute VACUUM FREEZE on every database on the server to freeze
rows and purge pg_xact/0000.

postgres=# VACUUM FREEZE;
VACUUM

postgres=# \c template1
You are now connected to database "template1" as user "postgres".
template1=# VACUUM FREEZE;
VACUUM

template1=# ALTER DATABASE template0 WITH ALLOW_CONNECTIONS true;
ALTER DATABASE
template1=# \c template0
You are now connected to database "template0" as user "postgres".
template0=# VACUUM FREEZE;
VACUUM
postgres=# select datname, datfrozenxid from pg_database;
  datname  | datfrozenxid
-----------+--------------
 postgres  |      1200774
 template0 |      1200775
 template1 |      1200774
(3 rows)

6. Ensure that pg_xact/0000 is gone:

➜   ls -lah ~/pg-devel/data/pg_xact
total 80
drwx------@  3 alex.wang  staff    96B Aug  5 20:29 .
drwx------@ 26 alex.wang  staff   832B Aug  5 20:12 ..
-rw-------@  1 alex.wang  staff    40K Aug  5 20:30 0001

7. Try to listen to any channel from any backend connection on the
same database:

postgres=# listen c1;
ERROR:  58P01: could not access status of transaction 773
DETAIL:  Could not open file "pg_xact/0000": No such file or directory.
LOCATION:  SlruReportIOError, slru.c:1087
postgres=# listen c2;
ERROR:  58P01: could not access status of transaction 773
DETAIL:  Could not open file "pg_xact/0000": No such file or directory.
LOCATION:  SlruReportIOError, slru.c:1087

Here's the stack trace from the master branch:
(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint 21.1
  * frame #0: 0x0000000102c798c8 postgres`SlruReportIOError(ctl=0x000000010378e5e0, pageno=0, xid=773) at slru.c:1084:4
    frame #1: 0x0000000102c792b0 postgres`SimpleLruReadPage(ctl=0x000000010378e5e0, pageno=0, write_ok=true, xid=773) at slru.c:603:4
    frame #2: 0x0000000102c79f7c postgres`SimpleLruReadPage_ReadOnly(ctl=0x000000010378e5e0, pageno=0, xid=773) at slru.c:661:9
    frame #3: 0x0000000102c6a6bc postgres`TransactionIdGetStatus(xid=773, lsn=0x000000016d2975e8) at clog.c:745:11
    frame #4: 0x0000000102c7e924 postgres`TransactionLogFetch(transactionId=773) at transam.c:79:14
    frame #5: 0x0000000102c7e74c postgres`TransactionIdDidCommit(transactionId=773) at transam.c:130:14
    frame #6: 0x0000000102dc8b94 postgres`asyncQueueProcessPageEntries(current=0x000000016d297720, stop=QueuePosition @ 0x000000016d297690, page_buffer="\U00000014", snapshot=0x000000012d812398) at async.c:2069:13
    frame #7: 0x0000000102dc8954 postgres`asyncQueueReadAllNotifications at async.c:1981:18
    frame #8: 0x0000000102dc6b5c postgres`Exec_ListenPreCommit at async.c:1127:3
    frame #9: 0x0000000102dc664c postgres`PreCommit_Notify at async.c:881:6
    frame #10: 0x0000000102c8c77c postgres`CommitTransaction at xact.c:2341:2
    frame #11: 0x0000000102c87b2c postgres`CommitTransactionCommandInternal at xact.c:3214:4
    frame #12: 0x0000000102c87a44 postgres`CommitTransactionCommand at xact.c:3175:10
    frame #13: 0x000000010321da94 postgres`finish_xact_command at postgres.c:2833:3
    frame #14: 0x000000010321b64c postgres`exec_simple_query(query_string="listen c1;") at postgres.c:1298:4
    frame #15: 0x000000010321a714 postgres`PostgresMain(dbname="postgres", username="alex.wang") at postgres.c:4767:7
    frame #16: 0x0000000103211a14 postgres`BackendMain(startup_data=0x000000016d299e48, startup_data_len=24) at backend_startup.c:124:2
    frame #17: 0x00000001030e938c postgres`postmaster_child_launch(child_type=B_BACKEND, child_slot=56, startup_data=0x000000016d299e48, startup_data_len=24, client_sock=0x000000016d299ed8) at launch_backend.c:290:3
    frame #18: 0x00000001030f0d60 postgres`BackendStartup(client_sock=0x000000016d299ed8) at postmaster.c:3587:8
    frame #19: 0x00000001030eebc4 postgres`ServerLoop at postmaster.c:1702:6
    frame #20: 0x00000001030ed67c postgres`PostmasterMain(argc=3, argv=0x00006000021f14e0) at postmaster.c:1400:11
    frame #21: 0x0000000102f73e40 postgres`main(argc=3, argv=0x00006000021f14e0) at main.c:231:4
    frame #22: 0x00000001940a2b98 dyld`start + 6076

Daniil Davydov has analyzed the root cause in thread [4] and I agree with what he said:

On Thu, Jul 31, 2025 at 8:21 PM Daniil Davydov <3danissimo@gmail.com> wrote:
We have the following logic in the notify queue :
If there are no listeners within all databases, and we are calling
LISTEN, then we must iterate from 'tail' to 'head' of the queue and
check statuses of transactions (see Exec_ListenPreCommit).
If there is a pruned-away xid in the queue, we will try to access its
status and get an error.

Because the tail of the queue is not necessarily always advanced
forward by the listeners, we can get such error without any long lived
transactions.
 
The fix and workarounds were discussed in [5] and [6]: In [5], Daniil
proposed a patch, which I’ve attached. The patch adds a call to
asyncQueueAdvanceTail() in vac_update_datfrozenxid(), so that VACUUM
advances the async queue tail.

Similarly, calling the built-in function pg_notification_queue_usage()
also advances the async queue tail. So when the issue occurs, calling
this function could also make the error go away. However, this doesn’t
prevent the error from happening in the first place.

My questions:

1. Is it acceptable to drop notifications from the async queue if
there are no active listeners? There might still be notifications that
haven’t been read by any previous listener.
Attachment

pgsql-hackers by date:

Previous
From: Fujii Masao
Date:
Subject: Re: Bug in pg_dump --filter? - Invalid object types can be misinterpreted as valid
Next
From: Chao Li
Date:
Subject: Re: Enhance Makefiles to rebuild objects on map file changes