Re: BUG #16961: Could not access status of transaction - Mailing list pgsql-bugs
From: Alexandra Wang
Subject: Re: BUG #16961: Could not access status of transaction
Msg-id: CAK98qZ1Xvu-FketrQGiYWCLosSBf3S8POYyZ_Uh+no1K00zv+Q@mail.gmail.com
In response to: Re: BUG #16961: Could not access status of transaction (Daniil Davydov <3danissimo@gmail.com>)
List: pgsql-bugs
Hello,
A customer encountered this issue, and thanks to Andrei Varashen's
very well-described steps in the duplicate thread [1], I can reproduce
it on 16.3 and on the master branch.
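For the archives, the shape of the reproduction looks roughly like
this (a condensed sketch of the idea rather than Andrei's exact
recipe; the channel name, the xid-burning loop, and the counts are
illustrative):

-- Session A: become the only listener in the cluster, receive one
-- notification, then disconnect. Nothing advances the tail past the entry.
LISTEN ch;
-- Session B:
NOTIFY ch;
-- Session A disconnects here.

-- Burn enough xids that the pg_xact segment holding the sender's xid
-- can be truncated (one SLRU segment covers about a million xids); in
-- psql, each generated statement commits as its own transaction:
SELECT 'SELECT pg_current_xact_id()' FROM generate_series(1, 2000000) \gexec

-- Advance datfrozenxid so VACUUM can truncate pg_xact (run in every
-- database of the cluster):
VACUUM FREEZE;

-- A brand-new listener now scans from the stale tail and hits the error:
LISTEN ch;
-- ERROR:  could not access status of transaction ...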
On Thu, Jul 31, 2025 at 8:21 PM Daniil Davydov <3danissimo@gmail.com> wrote:
> One of our clients faced this problem, and I found out that long
> transactions may have nothing to do with it.
> We have the following logic in the notify queue:
> If there are no listeners within all databases, and we are calling
> LISTEN, then we must iterate from 'tail' to 'head' of the queue and
> check statuses of transactions (see Exec_ListenPreCommit).
> If there is a pruned-away xid in the queue, we will try to access its
> status and get an error.
> Because the tail of the queue is not necessarily always advanced
> forward by the listeners, we can get such an error without any
> long-lived transactions.
Thank you, Daniil, you are exactly correct!
Here's the stack trace on the master branch:
(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint 8.1
* frame #0: 0x0000000102c958c8 postgres`SlruReportIOError(ctl=0x00000001037aa5e0, pageno=0, xid=757) at slru.c:1084:4
frame #1: 0x0000000102c952b0 postgres`SimpleLruReadPage(ctl=0x00000001037aa5e0, pageno=0, write_ok=true, xid=757) at slru.c:603:4
frame #2: 0x0000000102c95f7c postgres`SimpleLruReadPage_ReadOnly(ctl=0x00000001037aa5e0, pageno=0, xid=757) at slru.c:661:9
frame #3: 0x0000000102c866bc postgres`TransactionIdGetStatus(xid=757, lsn=0x000000016d27b5e8) at clog.c:745:11
frame #4: 0x0000000102c9a924 postgres`TransactionLogFetch(transactionId=757) at transam.c:79:14
frame #5: 0x0000000102c9a74c postgres`TransactionIdDidCommit(transactionId=757) at transam.c:130:14
frame #6: 0x0000000102de4a2c postgres`asyncQueueProcessPageEntries(current=0x000000016d27b720, stop=QueuePosition @ 0x000000016d27b690, page_buffer="\U0000001c", snapshot=0x00000001310439a0) at async.c:2069:13
frame #7: 0x0000000102de47ec postgres`asyncQueueReadAllNotifications at async.c:1981:18
frame #8: 0x0000000102de29f4 postgres`Exec_ListenPreCommit at async.c:1127:3
frame #9: 0x0000000102de24e4 postgres`PreCommit_Notify at async.c:881:6
frame #10: 0x0000000102ca877c postgres`CommitTransaction at xact.c:2341:2
frame #11: 0x0000000102ca3b2c postgres`CommitTransactionCommandInternal at xact.c:3214:4
frame #12: 0x0000000102ca3a44 postgres`CommitTransactionCommand at xact.c:3175:10
frame #13: 0x0000000103238d1c postgres`finish_xact_command at postgres.c:2833:3
frame #14: 0x00000001032368d4 postgres`exec_simple_query(query_string="LISTEN test_chan;") at postgres.c:1298:4
frame #15: 0x000000010323599c postgres`PostgresMain(dbname="test", username="alex.wang") at postgres.c:4767:7
frame #16: 0x000000010322cca8 postgres`BackendMain(startup_data=0x000000016d27de48, startup_data_len=24) at backend_startup.c:124:2
frame #17: 0x0000000103104bcc postgres`postmaster_child_launch(child_type=B_BACKEND, child_slot=78, startup_data=0x000000016d27de48, startup_data_len=24, client_sock=0x000000016d27ded8) at launch_backend.c:290:3
frame #18: 0x000000010310c5a0 postgres`BackendStartup(client_sock=0x000000016d27ded8) at postmaster.c:3587:8
frame #19: 0x000000010310a404 postgres`ServerLoop at postmaster.c:1702:6
frame #20: 0x0000000103108ebc postgres`PostmasterMain(argc=3, argv=0x00006000018854e0) at postmaster.c:1400:11
frame #21: 0x0000000102f8f6b8 postgres`main(argc=3, argv=0x00006000018854e0) at main.c:231:4
frame #22: 0x00000001940a2b98 dyld`start + 6076
This stack trace is from a LISTEN command issued by a new listener on
the channel. This new listener needs to process previously committed
notifications. If no other active connections are listening to the
same channel, it starts processing from the tail of the queue.
This problem occurs when an AsyncQueueEntry is still present in the
queue, but the sender's xid it contains is older than the database's
frozenxid. This can happen if a previous listener disconnects, too few
messages are sent afterwards for the notifier to advance the async
queue's tail, but enough transactions run for VACUUM to truncate the
pg_xact logs.
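Both halves of that race are visible from SQL, which makes it easy to
watch the two conditions line up (the queries are standard; reading a
persistently nonzero queue usage as a stuck tail is my
interpretation):

-- Fraction of the notification queue between tail and head; a value
-- stuck above zero while nothing is listening suggests a departed
-- listener left the tail behind:
SELECT pg_notification_queue_usage();

-- pg_xact truncation is driven by the minimum datfrozenxid across all
-- databases; once that passes a queued entry's sender xid, a new
-- LISTEN that scans the entry will fail:
SELECT datname, datfrozenxid, age(datfrozenxid) FROM pg_database;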
On Thu, Jul 31, 2025 at 8:21 PM Daniil Davydov <3danissimo@gmail.com> wrote:
> > > > We could prevent the trouble if vac_truncate_clog() had access to the oldest
> > > > xid in the notification queue; it would set frozenXID to that value if
> > > > frozenXID would otherwise be older.
> > >
> > > Perhaps. I'm not sure how hard it is to extract the oldest xid in the
> > > queue (where "oldest" is defined as "numerically smallest"). The entries
> > > are in xid commit order which is a different thing.
> >
> > Yeah, it wouldn't be cheap in the general case. The value could be a field in
> > pg_control, updated by a separate VACUUM NOTIFY, which autovacuum would also
> > run roughly as often as autovacuum visits template0.
>
> As a temporary solution, which almost completely eliminates the
> possibility of such a situation, I suggest adding an
> asyncQueueAdvanceTail call to vacuum (inside vac_update_datfrozenxid).
> I mean the 'static asyncQueueAdvanceTail()' that should be made
> 'external'. Thus, we can get rid of the problem that I described
> above.
> But there is one more: if the listener client is lagging, the
> asyncQueueAdvanceTail call will not save us (because it cannot advance
> the tail any further than the listener's position in the queue).
> Again, this may be due to a very high load, but not because the client
> keeps the transaction open for a very long time.
> The best solution is to teach vacuum to recognize the minimum xid in
> constant time, but I didn't come up with any sane implementations.
> What do you think?
On the VACUUM side: when vac_truncate_clog()
determines the cutoff xid, it doesn't check asyncQueueControl for the
sender's xids stored in each AsyncQueueEntry. It makes sense that the
async queue itself isn't expected to survive a database restart, but
should the sender's xids it stores be considered?
What are your thoughts?
Best,
Alex