[PATCH] Write Notifications Through WAL - Mailing list pgsql-hackers

From:    Rishu Bagga
Subject: [PATCH] Write Notifications Through WAL
Msg-id:  CAK80=jjvgWJL5EdzH_Bxc37gn53PJaB5sDXB3VPfJuN7G1xL5Q@mail.gmail.com
List:    pgsql-hackers

Hello all,

Creating a new thread here in continuation of [1], as the patch has changed direction significantly from the subject there.

TLDR: The attached patch aims to improve performance of the NOTIFY codepath, in the case of concurrent notifying transactions, by eliminating the need for a global lock to be held from PreCommit_Notify to the end of the transaction. We accomplish this with a method inspired by Tom Lane's suggestion in [2]. The high level is as follows:

Write path:

1. Write the notification data into WAL in PreCommit_Notify. Store the LSN in MyProc.
2. Acquire NotifyQueueLock in exclusive mode.
3. Write a compact entry into the notify queue, referencing the LSN from step 1.
4. Commit the transaction, writing the notify LSN from step 1 inside the commit record.
5. Release NotifyQueueLock.

Read path:

Readers scan the notification queue, where they see just the notify_data_lsn, and then fetch the actual notification data from WAL.

(End TLDR)

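To make the queue format concrete, here is a minimal sketch of what a fixed-size compact entry could look like once the payload itself lives in WAL. This is my illustration, not code from the patch; the struct and field names are hypothetical, and plain stdint types stand in for PostgreSQL's XLogRecPtr / TransactionId / Oid so the sketch is self-contained:

    #include <stdint.h>

    /*
     * Hypothetical compact notify-queue entry.  The variable-length payload
     * (channel name + message) is written to WAL in PreCommit_Notify, so the
     * queue entry only needs a fixed-size reference to it.
     */
    typedef struct CompactNotifyEntry
    {
        uint64_t    notify_data_lsn; /* WAL location of the notification data */
        uint32_t    xid;             /* sending transaction */
        uint32_t    dboid;           /* sender's database, so listeners can filter */
    } CompactNotifyEntry;

    /*
     * With fixed-size entries, per-page capacity is a compile-time constant,
     * which is also why the old end-of-page padding logic can be dropped
     * (see point 6 below).
     */
    #define NOTIFY_QUEUE_PAGESIZE   8192
    #define NOTIFY_ENTRIES_PER_PAGE (NOTIFY_QUEUE_PAGESIZE / sizeof(CompactNotifyEntry))
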
Addressing the concerns from the previous patch, and explaining the latest patch:

On Wed, Sep 10, 2025 at 6:05 AM Matheus Alcantara <matheusssilv97@gmail.com> wrote:

> There is also some other tests failing, like isolation, regress and
> others.

These are passing in the new patch, after removing the extraneous SignalBackends call.

On Wed, Sep 10, 2025 at 12:00 PM Arseniy Mukhin <arseniy.mukhin.dev@gmail.com> wrote:

> I agree that listeners don't need to check if the notify transaction
> was committed or not anymore, but it seems that listeners still need
> to wait until the notify transaction is completed before sending its
> notifications. I believe we should continue using XidInMVCCSnapshot as
> the current version does.

Since we hold NotifyQueueLock in exclusive mode while writing the compact entry into the notification queue, and don't release it until after commit, it is impossible for a listener to come across a notification that is not yet committed. In the case of a backend crash after writing the notification into the queue and before committing, we would be in a critical section, so even that is not a scenario where we might read an uncommitted notification.

> 2) Now we have two calls to SignalBackends (is it intentional?). The
> first in AtCommit_Notify() which seems correct to me. The second in
> XactLogCommitRecord() seems too early, because other backends can't
> process new notifications at this point (if the point about
> XidInMVCCSnapshot is correct). And there is Assert failure (Matheus'
> report is about it).

Fixed this in the new patch (removed the SignalBackends call in xact.c).

> 3) I think we still need to check if the queue is full or not while
> adding new entries and ereport if it is (entries are smaller now, but
> we still can run out of space). And it seems to me that we should do
> this check before entering the critical section, because in the
> critical section it will result in PANIC. Master results in ERROR in
> case of the full queue, so should we do the same here?

Thanks for bringing this up - in the new patch I have added a check to make sure we have enough space in the queue before entering the critical section. The logic is as follows (a rough sketch in C follows the list): in PreCommit_Notify, before writing the WAL record for the notify data, we:

- Take NotifyQueueLock EXCLUSIVE.
- Look at the current head, tail, and how many entries are already reserved by other backends.
- If adding one more would exceed:
  - the allowed page window (max_notify_queue_pages), or
  - the total entry cap (pages × entries per page),
  then ERROR out: "queue is full".
- If our reserved slot would be the last on the page, pre-create the next page so commit won't stall.
- Increment reservedEntries (our claim) and drop the lock.

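Purely for illustration, here is a simplified version of that reservation check, with my own stand-in names for the shared bookkeeping (none of these are claimed to match the patch; the real code would ereport(ERROR) rather than return false):

    #include <stdbool.h>
    #include <stdint.h>

    /* Hypothetical stand-ins for the shared queue state. */
    typedef struct NotifyQueueState
    {
        int64_t headPage;        /* page currently being filled */
        int     headOffset;      /* entries already written to the head page */
        int64_t tailPage;        /* oldest page a listener still needs */
        int     reservedEntries; /* slots claimed but not yet written */
    } NotifyQueueState;

    #define MAX_NOTIFY_QUEUE_PAGES 1024 /* stand-in for the GUC */
    #define ENTRIES_PER_PAGE       256  /* constant, thanks to fixed-size entries */

    static void
    precreate_next_page(NotifyQueueState *q)
    {
        (void) q;                /* zero-initialize the next SLRU page here */
    }

    /*
     * Claim one queue slot, or fail if the queue would overflow.  The caller
     * holds NotifyQueueLock in exclusive mode and, on false, raises the
     * "queue is full" ERROR -- crucially before entering the critical
     * section, so a full queue can never escalate into a PANIC.
     */
    static bool
    reserve_queue_slot(NotifyQueueState *q)
    {
        /* Entries occupying the head page(s): written + reserved + ours. */
        int64_t pending = (int64_t) q->headOffset + q->reservedEntries + 1;
        int64_t pagesNeeded = (q->headPage - q->tailPage)
            + (pending + ENTRIES_PER_PAGE - 1) / ENTRIES_PER_PAGE;

        /* Check the allowed page window (max_notify_queue_pages)... */
        if (pagesNeeded > MAX_NOTIFY_QUEUE_PAGES)
            return false;

        /* ...and the total entry cap (pages x entries per page). */
        if ((q->headPage - q->tailPage) * ENTRIES_PER_PAGE + pending >
            (int64_t) MAX_NOTIFY_QUEUE_PAGES * ENTRIES_PER_PAGE)
            return false;

        /* Last slot on the page?  Pre-create the next one so commit won't stall. */
        if (pending % ENTRIES_PER_PAGE == 0)
            precreate_next_page(q);

        q->reservedEntries++;    /* our claim; consumed when the entry is written */
        return true;
    }
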
I've included a TAP test to ensure we error out if we exceed the max notifications: src/test/modules/test_listen_notify/t/002_queue_full.pl

> 4) I don't know, but It seems to me that XactLogCommitRecord is not
> the right place for adding listen/notify logic, it seems to be only
> about wal stuff.

Hmm, I understand the concern, but we do need to ensure the commit WAL record and the notify queue entry are written atomically, and we need to hold the lock for the shortest time possible. So I think maybe we can make a tradeoff here for performance over code compartmentalization?

> 5) Do we want to use NotifyQueueLock as a lock that provides the
> commit order? Maybe we can introduce another lock to avoid unnecessary
> contention? For example, if we use NotifyQueueLock, we block listeners
> that want to read new notifications while we are inserting a commit
> record, which seems unnecessary.

I think we should stick with this lock, as we write to the notify queue while holding it. We aren't holding it for a particularly long time, so this should be okay. Additionally, this allows listeners to assume that entries in the queue are from committed transactions. In the current implementation we also use this lock for writing to the queue.

> 6) We have fixed size queue entries, so I think we don't need this
> "padding" logic at the end of the page anymore, because we know how
> many entries we can have on each page.

Thanks, I've removed this in the new patch.

On Wed, Sep 10, 2025 at 4:31 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

> With your patch, since the backends get the notification by reading
> WAL records do we need to prevent WAL records that potentially have
> unconsumed notification from being removed by the checkpointer? Or we
> can save unconsumed notifications in WAL records to somewhere during
> the checkpoint as we do for 2PC transactions.

Yes, we will need to prevent notify data in the WAL from being removed by the checkpointer. I have come up with the following "WAL pinning" solution to make sure we do not remove needed WAL segments (a sketch of the cutoff computation follows the list):

1. Maintain a notify_data LSN array of MAX_BACKENDS entries for uncommitted notify data. It is possible for the LSN of an uncommitted notification's data to be lower than the minimum LSN across the committed entries; if we only looked at the committed data and removed WAL records below that minimum, and the uncommitted transaction eventually committed, we would not be able to read its notification data. When we enter PreCommit_Notify:
   - grab NotifyQueueLock in exclusive mode,
   - log the notification record and get the notify data LSN,
   - find a slot in the UncommittedNotifies array, and record the notify data LSN along with the xid,
   - if we come across any entries whose transaction is no longer in progress, mark their slots as not in use and set the LSN to InvalidXLogRecPtr,
   - release NotifyQueueLock.

2. Maintain a min notify_data LSN array of length max_notify_queue_pages, tracking the minimum LSN and the corresponding xact needed for notify data across the currently active pages of the notify queue.
   - When we advance the tail past a page, set the min LSN for that page to InvalidXLogRecPtr.
   - When we add an entry into the notify queue, check it against the current min for the page, and if it is lower, update the page's min.
   - Remove the current LSN/xid from the uncommitted LSNs list.

3. In RemoveOldXlogFiles (where we recycle WAL segments), call AsyncNotifyOldestRequiredLSN(&notify_oldest), which does the following:
   - grab NotifyQueueLock in exclusive mode,
   - check the min LSN across both committed and uncommitted in-flight notifications,
   - store the min LSN needed in notify_oldest,
   - prune any invalid entries in both lists while iterating through them.
   Then we get the segment of notify_oldest, and if it is > 0 and less than the currently computed cutoff segment, we set cutoff_seg = oldest_notify_seg - 1, so we keep the segment holding the oldest needed notify data LSN.

I've included a TAP test that ensures we pin the necessary WAL: src/test/modules/test_listen_notify/t/003_wal_pin_test.pl

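For illustration, a simplified sketch of that cutoff computation, again with my own stand-in types and names (the real function would take NotifyQueueLock, walk the two shared arrays described above, and prune dead entries as it goes):

    #include <stdint.h>

    typedef uint64_t XLogRecPtrT;             /* stand-in for XLogRecPtr */
    #define INVALID_LSN   ((XLogRecPtrT) 0)   /* stand-in for InvalidXLogRecPtr */
    #define WAL_SEG_SIZE  (16 * 1024 * 1024)  /* default 16MB WAL segment size */

    /*
     * Oldest LSN still needed for notifications: the minimum over the
     * per-page minima (committed entries still in the queue) and the
     * uncommitted in-flight array.
     */
    static XLogRecPtrT
    oldest_required_notify_lsn(const XLogRecPtrT *pageMins, int npages,
                               const XLogRecPtrT *uncommitted, int nslots)
    {
        XLogRecPtrT oldest = INVALID_LSN;

        for (int i = 0; i < npages; i++)
            if (pageMins[i] != INVALID_LSN &&
                (oldest == INVALID_LSN || pageMins[i] < oldest))
                oldest = pageMins[i];

        for (int i = 0; i < nslots; i++)
            if (uncommitted[i] != INVALID_LSN &&
                (oldest == INVALID_LSN || uncommitted[i] < oldest))
                oldest = uncommitted[i];

        return oldest;
    }

    /*
     * Clamp the checkpointer's recycle cutoff so the segment holding the
     * oldest needed notify LSN survives, mirroring step 3 above.
     */
    static uint64_t
    clamp_cutoff_segment(uint64_t cutoff_seg, XLogRecPtrT notify_oldest)
    {
        uint64_t oldest_notify_seg = notify_oldest / WAL_SEG_SIZE;

        if (oldest_notify_seg > 0 && oldest_notify_seg < cutoff_seg)
            cutoff_seg = oldest_notify_seg - 1;   /* keep that segment */
        return cutoff_seg;
    }
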
PERFORMANCE:

Running the following (pgbench_transaction_notify.sql is attached):

    pgbench -c 100 -T 100 -j 8 -f pgbench_transaction_notify.sql -d postgres

I consistently see TPS of 60k - 70k. Without the patch, the TPS is ~30k.

Additionally, with

    pgbench -c 99 -T 100 -j 8 -f pgbench_transaction_notify.sql -d postgres

and 1 listener, we see similar TPS and no missing-WAL errors, so the listen and WAL retention logic seems correct.

> Also, could you add this patch to the next commit fest if not yet?

Will do!

Thanks for all the feedback on this patch - and apologies for the delay in coming up with the new patch, as WAL retention was a bit more complicated than I initially imagined. The next steps for me are to test replication and recovery of notifications. Please let me know what I need to do to get this into a committable state.

Thanks,
Rishu

[1] https://www.postgresql.org/message-id/flat/CAK80%3DjiWjMAfjwJymPoG3%3DFNE%3DhHygn5dXaoax1BWQT1rJrCCw%40mail.gmail.com
[2] https://www.postgresql.org/message-id/1878165.1752858390%40sss.pgh.pa.us