Re: LISTEN/NOTIFY bug: VACUUM sets frozenxid past a xid in async queue - Mailing list pgsql-hackers
From | Masahiko Sawada |
---|---|
Subject | Re: LISTEN/NOTIFY bug: VACUUM sets frozenxid past a xid in async queue |
Date | |
Msg-id | CAD21AoCD+HXoc2QZCAS9d8ahDeikNqbnU0i6cQzpMFOEurkPPg@mail.gmail.com |
In response to | Re: LISTEN/NOTIFY bug: VACUUM sets frozenxid past a xid in async queue ("Matheus Alcantara" <matheusssilv97@gmail.com>) |
Responses | Re: LISTEN/NOTIFY bug: VACUUM sets frozenxid past a xid in async queue |
List | pgsql-hackers |
On Wed, Aug 20, 2025 at 2:18 PM Matheus Alcantara <matheusssilv97@gmail.com> wrote:
>
> On Tue Aug 19, 2025 at 2:37 PM -03, Daniil Davydov wrote:
> > Hi,
> >
> > On Tue, Aug 19, 2025 at 6:31 PM Matheus Alcantara
> > <matheusssilv97@gmail.com> wrote:
> >>
> >> On Tue Aug 19, 2025 at 12:57 AM -03, Daniil Davydov wrote:
> >> > You have started a very long transaction, which holds its xid and prevents
> >> > vacuum from freezing it. But what if the backend is stuck not inside a
> >> > transaction? Maybe we can just hardcode a huge delay (not inside the
> >> > transaction) or stop process execution via a breakpoint in gdb. If we use that
> >> > instead of a long query, I think this error may be reproducible.
> >> >
> >> But how could this happen in real scenarios? I mean, how could the backend
> >> be stuck outside a transaction?
> >>
> >
> > For now, I cannot come up with a situation where that would be possible.
> > Perhaps such lagging may occur during network communication,
> > but I couldn't reproduce it. Maybe other people know how we can achieve
> > this?
> >
> Reading the code further, I understand that once a NOTIFY command is
> received by a backend (and the transaction is committed) it will
> immediately signal all other listener backends, and if a listener backend
> is idle it will consume the notification and send it on to its
> client as a PqMsg_NotificationResponse. So if there is a network delay
> in sending the notification from the listener backend to the client, I
> don't think it would be possible to get this error, because the
> message has already been dispatched by the backend and will eventually reach
> the client, and once the notification is dispatched the backend
> no longer needs to track it (the backend's queue pointers are
> advanced after the dispatch).
>
> Assuming that every SQL command is wrapped in a transaction (if it's
> not already inside one), I think a busy listener backend will always
> prevent vacuum from freezing clog files past its current xid,
> so any notification that is sent while the backend is busy will not have
> its transaction status removed from the clog files anyway.

What about backend processes that don't have any xid or xmin (i.e., ones running read-only queries or sitting idle in transaction)? IIUC we process the notification entries at the beginning of the server loop (see L4608 in postgres.c) and when reading a command (via ProcessClientReadInterrupt()), but it seems to me that if a process is in the idle-in-transaction state it doesn't process the entries until the transaction is committed.
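(Not part of the reproduction below, just an illustration: while the listener sits idle in transaction, one can confirm from another session that it holds neither an xid nor an xmin, and that the notification queue still has unconsumed entries, with queries along these lines:)

-- illustrative checks from another session; not part of the original steps
-- the stuck listener shows up as 'idle in transaction' with neither
-- backend_xid nor backend_xmin set, so it does not hold back the freeze horizon
select pid, state, backend_xid, backend_xmin
  from pg_stat_activity
 where state = 'idle in transaction';

-- fraction of the async notification queue occupied by notifications
-- that are still waiting to be processed by all listeners
select pg_notification_queue_usage();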
I've reproduced the missing clog entry error even when there is a notification in the queue with a valid listener, with the following steps:

1. Initialize the database cluster.

2. Execute "ALTER DATABASE template0 WITH ALLOW_CONNECTIONS true;".

3. Start one psql session and execute:

-- Session 1
=# listen s;
LISTEN
=# begin;
BEGIN

(Keep this session open to leave the process idle-in-transaction.)

4. Open another psql session and execute:

-- Session 2
=# begin;
BEGIN
=# select txid_current();
 txid_current
--------------
          756
(1 row)

=# notify s;
NOTIFY
=# commit;
COMMIT

The notification on channel 's' should now be pending for session 1's transaction.

5. Consume enough XIDs to truncate clog entries:

-- Session 2
=# create extension xid_wraparound;
CREATE EXTENSION
=# select consume_xids(10_000_000);
NOTICE:  consumed 10000000 / 10000000 XIDs, latest 0:10000757
 consume_xids
--------------
     10000757
(1 row)

=# select txid_current();
 txid_current
--------------
     10000758
(1 row)

6. Execute vacuum freeze on all databases:

$ vacuumdb --all --freeze
vacuumdb: vacuuming database "postgres"
vacuumdb: vacuuming database "template0"
vacuumdb: vacuuming database "template1"

$ psql -d postgres -c "select datname, datfrozenxid, age(datfrozenxid) from pg_database"
  datname  | datfrozenxid | age
-----------+--------------+-----
 postgres  |     10000759 |  11
 template0 |     10000759 |  11
 template1 |     10000759 |  11
(3 rows)

7. On the first psql session:

-- Session 1
=# commit;
COMMIT
ERROR:  could not access status of transaction 756
DETAIL:  Could not open file "pg_xact/0000": No such file or directory.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com