Re: walsender "wakeup storm" on PG16, likely because of bc971f4025c (Optimize walsender wake up logic using condition variables) - Mailing list pgsql-hackers
From: Antonin Houska
Subject: Re: walsender "wakeup storm" on PG16, likely because of bc971f4025c (Optimize walsender wake up logic using condition variables)
Msg-id: 12156.1692275140@antos
In response to: Re: walsender "wakeup storm" on PG16, likely because of bc971f4025c (Optimize walsender wake up logic using condition variables) (Thomas Munro <thomas.munro@gmail.com>)
List: pgsql-hackers
Thomas Munro <thomas.munro@gmail.com> wrote:

> On Wed, Aug 16, 2023 at 11:18 PM Antonin Houska <ah@cybertec.at> wrote:
> > I try to understand this patch (commit 5ffb7c7750) because I use a condition
> > variable in an extension. One particular problem occurred to me; please
> > consider:
> >
> > ConditionVariableSleep() gets interrupted, so AbortTransaction() calls
> > ConditionVariableCancelSleep(), but the signal was sent in between. Shouldn't
> > at least AbortTransaction() and AbortSubTransaction() check the return value
> > of ConditionVariableCancelSleep(), and re-send the signal if needed?
>
> I wondered about that in the context of our only in-tree user of
> ConditionVariableSignal(), in parallel btree index creation, but since
> it's using the parallel executor infrastructure, any error would be
> propagated everywhere so all waits would be aborted.

I see: ConditionVariableSignal() is currently used only to signal other workers running in the same transaction. The other parts use ConditionVariableBroadcast(), so no consumer should miss its signal.

> > Note that I'm just thinking about such a problem; I did not try to reproduce it.
>
> Hmm. I looked for users of ConditionVariableSignal() in the usual web
> tools and didn't find anything, so I guess your extension is not
> released yet or not open source. I'm curious: what does it actually
> do if there is an error in a CV-wakeup-consuming backend? I guess it
> might be some kind of work-queue processing system... it seems
> inevitable that if backends are failing with errors, and you don't
> respond by retrying/respawning, you'll lose or significantly delay
> jobs/events/something anyway (imagine only slightly different timing:
> you consume the signal and start working on a job and then ereport,
> which amounts to the same thing in the end now that your transaction
> is rolled back?), and when you retry you'll see whatever condition was
> waited for anyway.
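[Editorial note: the re-signalling proposed above could be sketched roughly as follows against PostgreSQL's condition-variable API (storage/condition_variable.h). This is an illustrative fragment only, not runnable outside a backend; `my_cv` and `cleanup_cv_wait` are hypothetical names, not code from the thread or from PostgreSQL.]

```c
#include "postgres.h"
#include "storage/condition_variable.h"

/* my_cv: a hypothetical ConditionVariable living in shared memory. */
extern ConditionVariable *my_cv;

/*
 * Abort-path cleanup sketch: ConditionVariableCancelSleep() returns
 * true if this process was signaled around the time its sleep was
 * canceled.  Forwarding that signal to another waiter keeps it from
 * being silently consumed when the transaction aborts.
 */
static void
cleanup_cv_wait(void)
{
	if (ConditionVariableCancelSleep())
		ConditionVariableSignal(my_cv);
}
```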
> But that's just me imagining what some
> hypothetical strawman system might look like... what does it really
> do?

If you're interested, the extension is pg_squeeze [1]. I think the use case is rather special: all the work is done by a background worker, but a user function can be called to submit a "task" for the worker and wait for its completion. So the function sleeps on a CV and the worker uses the CV to wake it up. If this function ends due to ERROR, the user is supposed to find a log message in the worker output sooner or later. It may sound weird, but that function exists primarily for regression tests, so ERROR is a problem anyway.

> (FWIW when I worked on a couple of different work queue-like systems
> and tried to use ConditionVariableSignal() I eventually concluded that
> it was the wrong tool for the job because its wakeup order is
> undefined. It's actually FIFO, but I wanted LIFO so that workers have
> a chance to become idle and reduce the pool size, but I started to
> think that once you want that level of control you really want to
> build a bespoke wait list system, so I never got around to proposing
> that we consider changing that.)
>
> Our condition variables are weird. They're not associated with a
> lock, so we made start-of-wait non-atomic: prepare first, then return
> control and let the caller check its condition, then sleep. Typical
> user space condition variable APIs force you to acquire some kind of
> lock that protects the condition first, then check the condition, then
> atomically release-associated-lock-and-start-sleeping, so there is no
> data race but also no time where control is returned to the caller but
> the thread is on the wait list consuming signals. That choice has
> some pros (you can choose whatever type of lock you want to protect
> your condition, or none at all if you can get away with memory
> barriers and magic) and cons.
> However, as I think Andres was getting
> at, having a non-atomic start-of-wait doesn't seem to require us to
> have a non-atomic end-of-wait and the associated extra contention. So
> maybe we should figure out how to fix that in 17.

Thanks for sharing your point of view. I'm fine with this low-level approach: it's well documented and there are examples in the PG code showing how it should be used :-)

[1] https://github.com/cybertec-postgresql/pg_squeeze/

--
Antonin Houska
Web: https://www.cybertec-postgresql.com