Re: LISTEN/NOTIFY bug: VACUUM sets frozenxid past a xid in async queue - Mailing list pgsql-hackers

From Matheus Alcantara
Subject Re: LISTEN/NOTIFY bug: VACUUM sets frozenxid past a xid in async queue
Date
Msg-id DDQZB2AD34V4.3RH2USCA72AS8@gmail.com
Whole thread Raw
In response to Re: LISTEN/NOTIFY bug: VACUUM sets frozenxid past a xid in async queue  (Masahiko Sawada <sawada.mshk@gmail.com>)
List pgsql-hackers
Thanks for the review!

On Thu Oct 23, 2025 at 8:42 PM -03, Masahiko Sawada wrote:
> On Wed, Oct 22, 2025 at 10:25 AM Matheus Alcantara
> <matheusssilv97@gmail.com> wrote:
>>
>> On Wed Oct 22, 2025 at 12:49 PM -03, Álvaro Herrera wrote:
>> > On 2025-Oct-22, Matheus Alcantara wrote:
>> >
>> >> I'm wondering if the 002_aborted_tx_notifies.pl is still needed with
>> >> this architecture being used. I think that it's not, but perhaps is a
>> >> good test to keep it?
>> >
>> > I'd rather have tests than not, but I'd think it needs to be behind
>> > PG_TEST_EXTRA because of things like
>> >
>> > +$node->safe_psql('postgres', 'select consume_xids(10000000);');
>> >
>> Attached v10 with wrapping into PG_TEST_EXTRA.
>
> Thank you for updating the patches!
>
> I've reviewed the patches and here are the comments.
>
> v10-0001-Prevent-VACUUM-from-truncating-XIDs-still-presen.patch:
>
> + *
> + * Returns InvalidTransactionId if there are no unprocessed notifications.
>
> This comment is not accurate since the function returns
> InvalidTransactionId even if all unprocessed notifications are created
> on other databases.
>
Fixed

> ---
> +TransactionId
> +GetOldestQueuedNotifyXid(void)
> +{
>
> How about renaming it to something like GetOldestNotifyTransactionId()?
>
Yeap, sounds better, fixed.

> ---
> +   /* page_buffer must be adequately aligned, so use a union */
> +   union
> +   {
> +       char        buf[QUEUE_PAGESIZE];
> +       AsyncQueueEntry align;
> +   }           page_buffer;
>
> asyncQueueReadAllNotifications() uses this union too, so how about
> define this type as AlignedQueueEntryPage or something and use it in
> both functions?
>
Good point, fixed.

> ---
> +
> +   LWLockAcquire(NotifyQueueLock, LW_EXCLUSIVE);
> +
>
> I don't think we need an exclusive lock here.
>
Fixed.

> ---
> In GetOldestQueuedNotifyXid() why do we keep holding NotifyQueueLock
> while calculating the oldest XID in the queue?
>
Yeah, I don't think that it's necessary. The
asyncQueueReadAllNotifications() for example only hold the lock when
reading the QUEUE_BACKEND_POS and QUEUE_HEAD. I think that it's a
similar case here, fixed.

> ---
> +
> +       /* Process all entries on this page up to head */
> +       while (curoffset + QUEUEALIGN(AsyncQueueEntryEmptySize) <=
> QUEUE_PAGESIZE &&
> +              !QUEUE_POS_EQUAL(pos, head))
> +       {
> +           qe = (AsyncQueueEntry *) (page_buffer.buf + curoffset);
> (snip)
> +           /* Advance to next entry */
> +           if (asyncQueueAdvance(&pos, qe->length))
> +               break;          /* advanced to next page */
> +
>
> I think the check "curoffset + QUEUEALIGN(AsyncQueueEntryEmptySize) <=
> QUEUE_PAGESIZE" is essentially the same as the one we do in
> asyncQueueAdvance(). I think we can refactor the inner loop in
> GetOldestQueuedNotifyXid() to something like:
>
> for (;;)
> {
>     bool reachedEndOfPage;
>
>     qe = (AsyncQueueEntry *) (page_buffer.buf + curoffset);
>
>     // check qe->xid here...
>
>     reachedEndOfPage = asyncQueueAdvance(&pos, qe->length);
>
>     if (reachedEndOfPage || QUEUE_POS_EQUAL(pos, head))
>         break;
> }
>
Fixed

> ---
> v10-0002-Add-tap-tests-for-listen-notify-vacuum-freeze.patch:
>
> This new test directory needs to have .gitignore.
>
Fixed

> ---
> +# Copyright (c) 2024-2025, PostgreSQL Global Development Group
> +
>
> It's better to have a short description of this test here.
>
Fixed

> ---
> +use File::Path qw(mkpath);
>
> It seems not necessary.
>
Fixed

> ---
> +$node->safe_psql('postgres',
> +   'CREATE TABLE t AS SELECT g AS a, g+2 AS b from
> generate_series(1,100000) g;'
>
> Why does it need to insert many rows?
>
I think that this value was left when I was writing the first versions of
this test, I don't think that it's necessary. I reduced to just 10 rows.

> ---
> +# --- Session 2, multiple notify's, and commit ---
> +for my $i (1 .. 10)
> +{
> +   $node->safe_psql(
> +       'postgres', "
> +       BEGIN;
> +       NOTIFY s, '$i';
> +       COMMIT;");
> +}
>
> Why does it need to send a notification 10 times?
>
I just wanted to test with multiple notifications, we can reduce to two
or three, there is not specific reason to send 10 notifications.

> ---
> +# Consume enough XIDs to trigger truncation
> +$node->safe_psql('postgres', 'select consume_xids(10000000);');
> +
>
> I guess it consumes XID too much to trigger truncation. Given the one
> clog segment file is 256kB in size, it's enough to consume 1,050,000
> XIDs to move to the next segment file.
>
Fixed

> ---
> +# Get the new datfrozenxid after vacuum freeze to ensure that is advanced but
> +# we can still get the notification status of the notification
> +my $datafronzenxid_freeze = $node->safe_psql('postgres', "select
> datfrozenxid from pg_database where datname = 'postgres'");
> +ok($datafronzenxid_freeze > $datafronzenxid, 'datfrozenxid is advanced');
>
> I think this test passes in all cases, so it is meaningless. Instead,
> what we need to check in terms of datfrozenxid is that its value
> doesn't get greater than the XID we used for the notification, even
> after vacuum freeze. I think we remember the XID used for NOTIFY and
> check if the old/new datfrozenxid values are older than that value.
>
I think that was the goal of this test. I've swap the values to make it
more clear. Please let me know if I misunderstood your point here.

> ---
> +is($notifications_count, 10, 'received all committed notifications');
> +
> +done_testing();
>
> After consuming the unconsumed notifications, let's do vacuum freeze
> and check if datfrozenxid now can be advanced.
>
Good point, during this I realize that the datafrozenxid was not being
advanced even after the queue being consumed. This was because when a
backend listener is consuming the notifications it only update their own
queue tail pointer and not the global shared tail pointer, so I've
included a call to asyncQueueAdvanceTail() at the beginning of
GetOldestNotifyTransactionId() to do this.

> ---
> I would expect to add 002_aborted_tx_notifies.pl in a separate patch
> since it's not related to this bug fix.
>
On the new attached version I've moved this test to a separated patch.

> ---
> +# Test checks that listeners do not receive notifications from aborted
> +# transaction even if notifications have been added to the listen/notify
> +# queue. To reproduce it we use the fact that serializable conflicts
> +# are checked after tx adds notifications to the queue.
>
> I wonder if we could implement this test using the isolation test
> instead of the tap test. Is there any reason why you used a tap test
> for that?
>
On this new version I just moved the test into a separated patch but I
agree that we can implement this using the isolation test,

I don't know how we should handle the follow-up patches that Joel have
sent on [1], if we should incoportate on 0001 or not. I still need to
review and test them.

[1] https://www.postgresql.org/message-id/b6f6cbb8-1903-4ca0-ae64-01d84d1e12a3%40app.fastmail.com

--
Matheus Alcantara

Attachment

pgsql-hackers by date:

Previous
From: Sami Imseih
Date:
Subject: Re: Bug in pg_stat_statements
Next
From: "Matheus Alcantara"
Date:
Subject: Re: LISTEN/NOTIFY bug: VACUUM sets frozenxid past a xid in async queue