Re: [PATCH] Write Notifications Through WAL - Mailing list pgsql-hackers
From: Arseniy Mukhin
Subject: Re: [PATCH] Write Notifications Through WAL
Msg-id: CAE7r3MLwRNyvmMd2kaefeNYTRjGD8wOxLzyGn8vexaeTJdASaw@mail.gmail.com
In response to: Re: [PATCH] Write Notifications Through WAL (Rishu Bagga <rishu.postgres@gmail.com>)
List: pgsql-hackers
Hi,

Thank you for the new patch version!

On Tue, Oct 14, 2025 at 12:14 AM Rishu Bagga <rishu.postgres@gmail.com> wrote:
>
> Thank you Arseniy for the thoughtful and detailed feedback, I have
> addressed the concerns in the new patch.
>
> On Wed, Oct 8, 2025 at 3:07 AM Arseniy Mukhin
> <arseniy.mukhin.dev@gmail.com> wrote:
>
> > Yes, we hold the lock while we are writing the commit record to wal,
> > but other transactions don't see our transaction as committed and
> > can't see changes that our transaction done besides notifications
> > (inserts/updates/deletes) right after we wrote the commit record.
> > There are a few more things that need to be done before it. IIUC the
> > moment when a fresh snapshot will show our transaction as 'not in
> > progress' is right after updating procarray (ProcArrayEndTransaction()
> > call in CommitTransaction()). So I think we still need
> > XidInMVCCSnapshot() check in reading path. Without it we can end up in
> > a situation where the listener got the notification but still can't
> > see changes notify transaction done because notification was processed
> > too early.
>
> Good catch, I've fixed this in the new patch.

Hmm, it looks like we still have the same problem in the new version,
if I'm not missing something:

    Snapshot snap = ActiveSnapshotSet() ? GetActiveSnapshot() : NULL;
    ...
    if (qe->dboid == MyDatabaseId &&
        snap != NULL &&
        XidInMVCCSnapshot(qe->xid, snap))

Here we almost always have snap = NULL (except maybe when we are in
Exec_ListenPreCommit), since listeners read notifications while they
are not inside an active transaction. Why do we need the snap NULL
check here at all? IIUC we can do the XidInMVCCSnapshot() check the
same way as we do it in master.

Also, the comment

    /*----------
     * Get snapshot we'll use to decide which xacts are still in progress.
     * This is trickier than it might seem, because of race conditions.

seems to still be relevant, so I think it's worth keeping.
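To be concrete, what I mean by "the same way as we do it in master" is
roughly the following (simplified from memory, the actual code in
master's async.c differs in details):

    /* before reading the queue entries */
    snapshot = RegisterSnapshot(GetLatestSnapshot());

    ...

    /* for every entry belonging to our database */
    if (qe->dboid == MyDatabaseId)
    {
        if (XidInMVCCSnapshot(qe->xid, snapshot))
        {
            /* source xact still in progress, stop here and retry later */
        }
        else if (TransactionIdDidCommit(qe->xid))
        {
            /* deliver the notification to the frontend */
        }
    }

    ...

    UnregisterSnapshot(snapshot);

i.e. the snapshot is always taken before reading, so there is no need
for the NULL branch at all.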
> > I think reserving slots is a great idea. But I have a question about
> > checking if the queue is full. Why does PreCommit_Notify need three
> > checks? Isn't it enough to simply check that we don't exceed
> > max_total_entries?
>
> After taking a closer look, you are right in that we don't need all three
> checks; however, it isn't enough to check that we don't exceed
> max_total_entries.
> The reason for this is that we can exceed max_total_pages without
> necessarily exceeding max_total_entries, in the case that the oldest
> active entry is not the first entry on the tail page. So actually the
> only test we need is to confirm that when we zero a new page, that the
> total number of pages is still <= max_total_pages.
> I have updated this in the new patch.

LGTM

> > I don't have enough knowledge to say if it's the right approach to
> > prevent wal truncating with unread notifications. But there are two
> > things about current implementation I noticed. Not sure if it's worth
> > doing something with it before you know it's the right approach in
> > general. But anyway, there are two points about it:
> >
> > We hold NotifyQueueLock while adding the notifications data record to
> > wal. IIUC one of the points of using wal here was that it lets us to
> > write notification data in parallel as wal is good for parallel
> > writes. But with lock it seems like we don't take advantage of it.
> > IIUC we hold the lock here to prevent truncation of the new record
> > after we wrote it and before we save its lsn. Is this correct? Maybe
> > we can solve it somehow differently? For example before writing the
> > notification data record we can save the current lsn (something like
> > XactLastRecEnd) in UncommittedNotifies so we can be sure anything
> > after it will not be truncated, including our record that we are gonna
> > write. This way we don't need to hold the lock.
> >
> > We iterate over all entries in UncommittedNotifies every time we want
> > to add/remove data. What do you think about using MyProcNumber as an
> > array index instead of iteration as we do with listener positions for
> > instance? Sounds possible as every backend can't have more than one
> > uncommitted lsn. Also maybe backends can remove entries from
> > UncommittedNotifies in AtAbort_Notify() the same way committed
> > transactions do it in asyncQueueAddCompactEntry()? This way
> > AsyncNotifyOldestRequiredLSN() will not need to be bothered with
> > cleaning UncommittedNotifies and checking transaction statuses.
>
> After taking another look, I realized there is a cleaner approach that
> does not need the use of UncommittedNotifiesArray in the first place,
> and also allows us to keep the parallel WAL writes for notifications.
> We can simply have the recycler iterate through the procarray instead,
> and find the lowest value of a proc's notifyDataLsn. (renamed from
> notifyCommitLsn in prev patch)
> There is a small window where the notify data WAL record has been
> written, and its lsn hasn't been assigned to MyProc->notifyDataLsn, so
> to make sure we don't lose any records, before we write the data record,
> we set MyProc->notifyDataLsn to the current lsn pointer using
> GetXLogInsertRecPtr(). If a Proc has a non-invalid value for
> notifyDataLsn, then we know that it has uncommitted notification data
> written to the WAL, at or above that value.

Looks like you have solved both issues with it.

Several points about the wal pin logic:

1) I noticed that we can block wal truncation sometimes. The
NotifyPageMins array works great when we have a consistent flow of
notifications and can truncate old queue segments regularly. But what
if we don't have it? Please try the next example.

Send a notification first:

    NOTIFY ch, 'a';

Then do something that makes the wal grow:

    create table if not exists t1 (a text);
    insert into t1 (a)
        select gen_random_uuid() from generate_series(1,10000000);

You will see that the wal grows without truncation. It seems that to
solve the issue we should not consider queue entries before the queue
tail in AsyncNotifyOldestRequiredLSN(), as we don't need their data
anymore.

2) I'm worried if it's ok to iterate like this:

    /* Scan per-backend pins under ProcArrayLock */
    LWLockAcquire(ProcArrayLock, LW_SHARED);
    for (int i = 0; i < MaxBackends; i++)
    {
        PGPROC *proc = GetPGProcByNumber(i);

We use MaxBackends here, so it looks like we can access a PGPROC that
isn't assigned to an existing backend. I failed to find any examples in
the code that do the same. Maybe we don't need to deal with the MyProc
structure here at all? We could do the same thing that we do for
listener positions: store the lsn in QueueBackendStatus and access it
directly with MyProcNumber when we want to update it for our own
backend, and in AsyncNotifyOldestRequiredLSN() iterate over backends
the same way we do it in asyncQueueAdvanceTail(). What do you think?
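To show what I mean, a very rough sketch (the dataLsn field in
QueueBackendStatus and the QUEUE_BACKEND_DATA_LSN() accessor are names
I just made up here, they are not from the patch):

    XLogRecPtr  oldest = InvalidXLogRecPtr;

    LWLockAcquire(NotifyQueueLock, LW_SHARED);
    for (int i = 0; i < MaxBackends; i++)
    {
        XLogRecPtr  lsn = QUEUE_BACKEND_DATA_LSN(i);  /* hypothetical */

        if (!XLogRecPtrIsInvalid(lsn) &&
            (XLogRecPtrIsInvalid(oldest) || lsn < oldest))
            oldest = lsn;
    }
    LWLockRelease(NotifyQueueLock);

Unused slots would simply keep InvalidXLogRecPtr, so scanning the whole
array under NotifyQueueLock should be safe.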
3) It seems to me that we can miss some notification data record's lsn
in AsyncNotifyOldestRequiredLSN(). Let's consider the next schedule:

1. Transaction: writes its notification data wal record and updates its
   notifyDataLsn.
2. Checkpointer: AsyncNotifyOldestRequiredLSN() is called. We iterate
   over NotifyPageMins; it doesn't contain the transaction's
   notifyDataLsn yet.
3. Transaction: commits; NotifyPageMins is updated and notifyDataLsn is
   cleared.
4. Checkpointer: we iterate over the notifyDataLsns. Again we don't see
   the notifyDataLsn here, as it has already been removed.

Is it real, or am I missing something? I see there was the lock in v6
that could prevent it, but probably we can just swap the NotifyPageMins
and notifyDataLsns parts in AsyncNotifyOldestRequiredLSN() to solve the
problem: the committing transaction updates NotifyPageMins before it
clears its notifyDataLsn, so if we scan the notifyDataLsns first and
NotifyPageMins only after that, at least one of the two scans should
see the lsn.
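In code the swap would look something like this (a toy sketch only; the
two helpers are made-up names standing for the two existing scans, and
I'm assuming the function returns an XLogRecPtr):

    XLogRecPtr
    AsyncNotifyOldestRequiredLSN(void)
    {
        XLogRecPtr  backend_min;
        XLogRecPtr  pagemin_min;

        /* scan the per-backend notifyDataLsn slots first ... */
        backend_min = OldestLsnFromNotifyDataLsns();    /* made-up helper */

        /* ... and NotifyPageMins only after that */
        pagemin_min = OldestLsnFromNotifyPageMins();    /* made-up helper */

        if (XLogRecPtrIsInvalid(backend_min))
            return pagemin_min;
        if (XLogRecPtrIsInvalid(pagemin_min))
            return backend_min;
        return Min(backend_min, pagemin_min);
    }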
> The performance remains similar with these changes. I have also
> rebased the patch on the latest HEAD.

It was surprising that the exclusive locking while writing notification
data to the wal doesn't affect the result. I tried to do some
benchmarking too.

All benchmarks were done without listeners, as we are trying to improve
the notify path.

Test scripts:

insert_only:
    BEGIN;
    INSERT INTO t1 VALUES(1);
    COMMIT;

insert_with_notify:
    BEGIN;
    INSERT INTO t1 VALUES(1);
    NOTIFY benchmark_channel, :payload;
    COMMIT;

notify_only:
    BEGIN;
    NOTIFY benchmark_channel, :payload;
    COMMIT;

Command:

    pgbench -n -c 100 -j 10 -f $script.sql postgres -T 100

Results:

version   test                 payload_len   tps
master    insert_only          10            30981.603164
v6        insert_only          10            34940.927688
v7        insert_only          10            30933.015773
master    insert_with_notify   10            817.851691
v6        insert_with_notify   10            33761.706615
v7        insert_with_notify   10            33870.474923
master    notify_only          10            87750.506562
v6        notify_only          10            36141.169978
v7        notify_only          10            34261.854909

I think insert_only is interesting as a reference point that we want to
get as close to as possible with the patch when we measure
insert_with_notify. insert_with_notify shows a stable 30-40x boost with
the patch, and it's quite close to the insert_only tps for v6 and v7,
which is great. I haven't seen differences between v6 and v7. I thought
maybe the payload size would make the difference, but there was no
difference with a payload length of 2000 either, which is surprising.

Another interesting thing is the degradation in the notify_only test:
master is 2.5 times faster than the patched version. I don't know why
that happens.

Also, there was some instability in the results (sometimes master
insert_only shows 30k, sometimes 38k). I don't know why, but I think it
doesn't affect the conclusions in general.

Thank you!

Best regards,
Arseniy Mukhin