Re: LISTEN/NOTIFY bug: VACUUM sets frozenxid past a xid in async queue - Mailing list pgsql-hackers
From | Masahiko Sawada |
---|---|
Subject | Re: LISTEN/NOTIFY bug: VACUUM sets frozenxid past a xid in async queue |
Date | |
Msg-id | CAD21AoCD+HXoc2QZCAS9d8ahDeikNqbnU0i6cQzpMFOEurkPPg@mail.gmail.com |
In response to | Re: LISTEN/NOTIFY bug: VACUUM sets frozenxid past a xid in async queue ("Matheus Alcantara" <matheusssilv97@gmail.com>) |
Responses | Re: LISTEN/NOTIFY bug: VACUUM sets frozenxid past a xid in async queue |
List | pgsql-hackers |
On Wed, Aug 20, 2025 at 2:18 PM Matheus Alcantara <matheusssilv97@gmail.com> wrote:
>
> On Tue Aug 19, 2025 at 2:37 PM -03, Daniil Davydov wrote:
> > Hi,
> >
> > On Tue, Aug 19, 2025 at 6:31 PM Matheus Alcantara
> > <matheusssilv97@gmail.com> wrote:
> >>
> >> On Tue Aug 19, 2025 at 12:57 AM -03, Daniil Davydov wrote:
> >> > You have started a very long transaction, which holds its xid and prevents
> >> > vacuum from freezing it. But what if the backend is stuck not inside a
> >> > transaction? Maybe we can just hardcode a huge delay (not inside the
> >> > transaction) or stop process execution via a breakpoint in gdb. If we use that
> >> > instead of a long query, I think this error may be reproducible.
> >> >
> >> But how could this happen in real scenarios? I mean, how could the backend
> >> be stuck outside a transaction?
> >>
> >
> > For now, I cannot come up with a situation where that would be possible.
> > Perhaps such lagging may occur during network communication,
> > but I couldn't reproduce it. Maybe other people know how we can achieve
> > this?
> >
> Reading the code further, I understand that once a NOTIFY command is
> received by a backend (and the transaction is committed) it will
> immediately signal all other listener backends, and if a listener backend
> is idle it will consume the notification and send it on to its
> client as a PqMsg_NotificationResponse. So if there is a network delay
> in sending the notification from the listener backend to the client, I
> don't think it would be possible to get this error, because the
> message has already been dispatched by the backend and will eventually reach
> the client, and once the notification is dispatched the backend
> no longer needs to track it (the backend's queue pointers are
> advanced after the dispatch).
>
> Assuming that every SQL command is wrapped in a transaction (if it's
> not already inside one), I think a busy listener backend will always
> prevent vacuum from freezing clog files past its current xid,
> so any notification that is sent while the backend is busy will not have
> its transaction status removed from the clog files anyway.

What about backend processes that don't have any xid or xmin (i.e., ones running read-only queries or sitting idle in transaction)? IIUC we process the notification entries at the beginning of the server loop (see L4608 in postgres.c) and when reading a command (via ProcessClientReadInterrupt()), but it seems to me that if a process is in the idle-in-transaction state it doesn't process the entries until the transaction is committed.
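(Not part of the reproduction below, just an illustration: while the listener sits idle in transaction, one can confirm from another session that it holds neither an xid nor an xmin, and that the notification queue still has unconsumed entries, with queries along these lines:)

-- illustrative checks from another session; not part of the original steps
-- the stuck listener shows up as 'idle in transaction' with neither
-- backend_xid nor backend_xmin set, so it does not hold back the freeze horizon
select pid, state, backend_xid, backend_xmin
  from pg_stat_activity
 where state = 'idle in transaction';

-- fraction of the async notification queue occupied by notifications
-- that are still waiting to be processed by all listeners
select pg_notification_queue_usage();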
I've reproduced the missing clog entry error even when there is a notification in the queue with a valid listener, with the following steps:

1. Initialize the database cluster.

2. Execute "ALTER DATABASE template0 WITH ALLOW_CONNECTIONS true;".

3. Start one psql session and execute:

-- Session 1
=# listen s;
LISTEN
=# begin;
BEGIN

(Keep this session open to leave the process idle-in-transaction.)

4. Open another psql session and execute:

-- Session 2
=# begin;
BEGIN
=# select txid_current();
 txid_current
--------------
          756
(1 row)

=# notify s;
NOTIFY
=# commit;
COMMIT

The notification on channel 's' should now be pending for session 1's transaction.

5. Consume enough XIDs to truncate clog entries:

-- Session 2
=# create extension xid_wraparound;
CREATE EXTENSION
=# select consume_xids(10_000_000);
NOTICE:  consumed 10000000 / 10000000 XIDs, latest 0:10000757
 consume_xids
--------------
     10000757
(1 row)

=# select txid_current();
 txid_current
--------------
     10000758
(1 row)

6. Execute vacuum freeze on all databases:

$ vacuumdb --all --freeze
vacuumdb: vacuuming database "postgres"
vacuumdb: vacuuming database "template0"
vacuumdb: vacuuming database "template1"

$ psql -d postgres -c "select datname, datfrozenxid, age(datfrozenxid) from pg_database"
  datname  | datfrozenxid | age
-----------+--------------+-----
 postgres  |     10000759 |  11
 template0 |     10000759 |  11
 template1 |     10000759 |  11
(3 rows)

7. On the first psql session:

-- Session 1
=# commit;
COMMIT
ERROR:  could not access status of transaction 756
DETAIL:  Could not open file "pg_xact/0000": No such file or directory.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com