Re: [PATCH] Write Notifications Through WAL - Mailing list pgsql-hackers
From: Arseniy Mukhin
Subject: Re: [PATCH] Write Notifications Through WAL
Msg-id: CAE7r3MLwRNyvmMd2kaefeNYTRjGD8wOxLzyGn8vexaeTJdASaw@mail.gmail.com
In response to: Re: [PATCH] Write Notifications Through WAL (Rishu Bagga <rishu.postgres@gmail.com>)
List: pgsql-hackers
Hi,

Thank you for the new patch version!

On Tue, Oct 14, 2025 at 12:14 AM Rishu Bagga <rishu.postgres@gmail.com> wrote:
>
> Thank you Arseniy for the thoughtful and detailed feedback, I have
> addressed the concerns in the new patch.
>
> On Wed, Oct 8, 2025 at 3:07 AM Arseniy Mukhin
> <arseniy.mukhin.dev@gmail.com> wrote:
>
> > Yes, we hold the lock while we are writing the commit record to wal,
> > but other transactions don't see our transaction as committed and
> > can't see changes that our transaction done besides notifications
> > (inserts/updates/deletes) right after we wrote the commit record.
> > There are a few more things that need to be done before it. IIUC the
> > moment when a fresh snapshot will show our transaction as 'not in
> > progress' is right after updating procarray (ProcArrayEndTransaction()
> > call in CommitTransaction()). So I think we still need
> > XidInMVCCSnapshot() check in reading path. Without it we can end up in
> > a situation where the listener got the notification but still can't
> > see changes notify transaction done because notification was processed
> > too early.
>
> Good catch, I've fixed this in the new patch.

Hmm, it looks like we still have the same problem in the new version,
if I'm not missing something:

    Snapshot snap = ActiveSnapshotSet() ? GetActiveSnapshot() : NULL;
    ...
    if (qe->dboid == MyDatabaseId &&
        snap != NULL &&
        XidInMVCCSnapshot(qe->xid, snap))

Here we almost always have snap = NULL (except maybe when we are in
Exec_ListenPreCommit), since listeners read notifications while they
are not inside an active transaction. Why do we need the snap NULL
check here at all? IIUC we can do the XidInMVCCSnapshot() check the
same way as we do it in master.

Also, the comment

    /*----------
     * Get snapshot we'll use to decide which xacts are still in progress.
     * This is trickier than it might seem, because of race conditions.

seems to still be relevant, so I think it's worth keeping.
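To be concrete, what I mean by "the same way as we do it in master" is
roughly the following (simplified from memory, the actual code in
master's async.c differs in details):

    /* before reading the queue entries */
    snapshot = RegisterSnapshot(GetLatestSnapshot());

    ...

    /* for every entry belonging to our database */
    if (qe->dboid == MyDatabaseId)
    {
        if (XidInMVCCSnapshot(qe->xid, snapshot))
        {
            /* source xact still in progress, stop here and retry later */
        }
        else if (TransactionIdDidCommit(qe->xid))
        {
            /* deliver the notification to the frontend */
        }
    }

    ...

    UnregisterSnapshot(snapshot);

i.e. the snapshot is always taken before reading, so there is no need
for the NULL branch at all.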
> > I think reserving slots is a great idea. But I have a question about
> > checking if the queue is full. Why does PreCommit_Notify need three
> > checks? Isn't it enough to simply check that we don't exceed
> > max_total_entries?
>
> After taking a closer look, you are right in that we don't need all three
> checks; however, it isn't enough to check that we don't exceed
> max_total_entries.
> The reason for this is that we can exceed max_total_pages without
> necessarily exceeding max_total_entries, in the case that the oldest
> active entry is not the first entry on the tail page. So actually the
> only test we need is to confirm that when we zero a new page, that the
> total number of pages is still <= max_total_pages.
> I have updated this in the new patch.

LGTM

> > I don't have enough knowledge to say if it's the right approach to
> > prevent wal truncating with unread notifications. But there are two
> > things about current implementation I noticed. Not sure if it's worth
> > doing something with it before you know it's the right approach in
> > general. But anyway, there are two points about it:
> >
> > We hold NotifyQueueLock while adding the notifications data record to
> > wal. IIUC one of the points of using wal here was that it lets us to
> > write notification data in parallel as wal is good for parallel
> > writes. But with lock it seems like we don't take advantage of it.
> > IIUC we hold the lock here to prevent truncation of the new record
> > after we wrote it and before we save its lsn. Is this correct? Maybe
> > we can solve it somehow differently? For example before writing the
> > notification data record we can save the current lsn (something like
> > XactLastRecEnd) in UncommittedNotifies so we can be sure anything
> > after it will not be truncated, including our record that we are gonna
> > write. This way we don't need to hold the lock.
> >
> > We iterate over all entries in UncommittedNotifies every time we want
> > to add/remove data. What do you think about using MyProcNumber as an
> > array index instead of iteration as we do with listener positions for
> > instance? Sounds possible as every backend can't have more than one
> > uncommitted lsn. Also maybe backends can remove entries from
> > UncommittedNotifies in AtAbort_Notify() the same way committed
> > transactions do it in asyncQueueAddCompactEntry()? This way
> > AsyncNotifyOldestRequiredLSN() will not need to be bothered with
> > cleaning UncommittedNotifies and checking transaction statuses.
>
> After taking another look, I realized there is a cleaner approach that
> does not need the use of UncommittedNotifiesArray in the first place,
> and also allows us to keep the parallel WAL writes for notifications.
> We can simply have the recycler iterate through the procarray instead,
> and find the lowest value of a proc's notifyDataLsn. (renamed from
> notifyCommitLsn in prev patch)
> There is a small window where the notify data WAL record has been
> written, and its lsn hasn't been assigned to MyProc->notifyDataLsn, so
> to make sure we don't lose any records, before we write the data record,
> we set MyProc->notifyDataLsn to the current lsn pointer using
> GetXLogInsertRecPtr(). If a Proc has a non-invalid value for
> notifyDataLsn, then we know that it has uncommitted notification data
> written to the WAL, at or above that value.

Looks like you have solved both issues with it.

Several points about the wal pin logic:

1) I noticed that we can block wal truncation sometimes. The
NotifyPageMins array works great when we have a consistent flow of
notifications and can truncate old queue segments regularly. But what
if we don't have it? Please try the next example.

Send a notification first:

    NOTIFY ch, 'a';

Then do something that makes the wal grow:

    create table if not exists t1 (a text);
    insert into t1 (a)
        select gen_random_uuid() from generate_series(1,10000000);

You will see that the wal grows without truncation. It seems that to
solve the issue we should not consider queue entries before the queue
tail in AsyncNotifyOldestRequiredLSN(), as we don't need their data
anymore.

2) I'm worried if it's ok to iterate like this:

    /* Scan per-backend pins under ProcArrayLock */
    LWLockAcquire(ProcArrayLock, LW_SHARED);
    for (int i = 0; i < MaxBackends; i++)
    {
        PGPROC *proc = GetPGProcByNumber(i);

We use MaxBackends here, so it looks like we can access a PGPROC that
isn't assigned to an existing backend. I failed to find any examples in
the code that do the same. Maybe we don't need to deal with the MyProc
structure here at all? We could do the same thing that we do for
listener positions: store the lsn in QueueBackendStatus and access it
directly with MyProcNumber when we want to update it for our own
backend, and in AsyncNotifyOldestRequiredLSN() iterate over backends
the same way we do it in asyncQueueAdvanceTail(). What do you think?
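To show what I mean, a very rough sketch (the dataLsn field in
QueueBackendStatus and the QUEUE_BACKEND_DATA_LSN() accessor are names
I just made up here, they are not from the patch):

    XLogRecPtr  oldest = InvalidXLogRecPtr;

    LWLockAcquire(NotifyQueueLock, LW_SHARED);
    for (int i = 0; i < MaxBackends; i++)
    {
        XLogRecPtr  lsn = QUEUE_BACKEND_DATA_LSN(i);  /* hypothetical */

        if (!XLogRecPtrIsInvalid(lsn) &&
            (XLogRecPtrIsInvalid(oldest) || lsn < oldest))
            oldest = lsn;
    }
    LWLockRelease(NotifyQueueLock);

Unused slots would simply keep InvalidXLogRecPtr, so scanning the whole
array under NotifyQueueLock should be safe.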
3) It seems to me that we can miss some notification data record's lsn
in AsyncNotifyOldestRequiredLSN(). Let's consider the next schedule:

1. Transaction: writes its notification data wal record and updates its
   notifyDataLsn.
2. Checkpointer: AsyncNotifyOldestRequiredLSN() is called. We iterate
   over NotifyPageMins; it doesn't contain the transaction's
   notifyDataLsn yet.
3. Transaction: commits; NotifyPageMins is updated and notifyDataLsn is
   cleared.
4. Checkpointer: we iterate over the notifyDataLsns. Again we don't see
   the notifyDataLsn here, as it has already been removed.

Is it real, or am I missing something? I see there was the lock in v6
that could prevent it, but probably we can just swap the NotifyPageMins
and notifyDataLsns parts in AsyncNotifyOldestRequiredLSN() to solve the
problem: the committing transaction updates NotifyPageMins before it
clears its notifyDataLsn, so if we scan the notifyDataLsns first and
NotifyPageMins only after that, at least one of the two scans should
see the lsn.
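In code the swap would look something like this (a toy sketch only; the
two helpers are made-up names standing for the two existing scans, and
I'm assuming the function returns an XLogRecPtr):

    XLogRecPtr
    AsyncNotifyOldestRequiredLSN(void)
    {
        XLogRecPtr  backend_min;
        XLogRecPtr  pagemin_min;

        /* scan the per-backend notifyDataLsn slots first ... */
        backend_min = OldestLsnFromNotifyDataLsns();    /* made-up helper */

        /* ... and NotifyPageMins only after that */
        pagemin_min = OldestLsnFromNotifyPageMins();    /* made-up helper */

        if (XLogRecPtrIsInvalid(backend_min))
            return pagemin_min;
        if (XLogRecPtrIsInvalid(pagemin_min))
            return backend_min;
        return Min(backend_min, pagemin_min);
    }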
> The performance remains similar with these changes. I have also
> rebased the patch on the latest HEAD.

It was surprising that the exclusive locking while writing notification
data to the wal doesn't affect the result. I tried to do some
benchmarking too.

All benchmarks were done without listeners, as we are trying to improve
the notify path.

Test scripts:

insert_only:
    BEGIN;
    INSERT INTO t1 VALUES(1);
    COMMIT;

insert_with_notify:
    BEGIN;
    INSERT INTO t1 VALUES(1);
    NOTIFY benchmark_channel, :payload;
    COMMIT;

notify_only:
    BEGIN;
    NOTIFY benchmark_channel, :payload;
    COMMIT;

Command:

    pgbench -n -c 100 -j 10 -f $script.sql postgres -T 100

Results:

version   test                 payload_len   tps
master    insert_only          10            30981.603164
v6        insert_only          10            34940.927688
v7        insert_only          10            30933.015773
master    insert_with_notify   10            817.851691
v6        insert_with_notify   10            33761.706615
v7        insert_with_notify   10            33870.474923
master    notify_only          10            87750.506562
v6        notify_only          10            36141.169978
v7        notify_only          10            34261.854909

I think insert_only is interesting as a reference point that we want to
get as close to as possible with the patch when we measure
insert_with_notify. insert_with_notify shows a stable 30-40x boost with
the patch, and it's quite close to the insert_only tps for v6 and v7,
which is great. I haven't seen differences between v6 and v7. I thought
maybe the payload size would make the difference, but there was no
difference with a payload length of 2000 either, which is surprising.

Another interesting thing is the degradation in the notify_only test:
master is 2.5 times faster than the patched version. I don't know why
that happens.

Also, there was some instability in the results (sometimes master
insert_only shows 30k, sometimes 38k). I don't know why, but I think it
doesn't affect the conclusions in general.

Thank you!

Best regards,
Arseniy Mukhin