Thread: many sessions suddenly wait on LWLock WALWrite
PG v14.8: during peak time, we suddenly see hundreds of active sessions waiting on LWLock
WALWrite at the same time, but we did not find any issue on storage.
Any suggestions?
Thanks,
James
On Fri, 2025-04-11 at 22:36 +0800, James Pang wrote:
> PG v14.8: during peak time, we suddenly see hundreds of active sessions waiting on
> LWLock WALWrite at the same time, but we did not find any issue on storage.
> Any suggestions?

You should get a reasonably sized (much smaller) connection pool. That will probably take
care of the problem and will probably improve your overall performance.

Yours,
Laurenz Albe
LWLock waits like this typically show up when you have too many concurrent active
connections. Run

  SELECT count(*) FROM pg_stat_activity WHERE state IN ('idle in transaction', 'active');

and then count how many CPUs you have. If the count returned is greater than 2-3 times the
number of CPUs, you probably have a CPU overload problem, and your solution may be to add a
connection pooler between the clients and the DB server. This is all due to the nature of
how PG is architected: every connection is a process, not a thread.
Regards,
Michael Vitale
James Pang wrote on 4/11/2025 10:36 AM:
> PG v14.8: during peak time, we suddenly see hundreds of active sessions waiting on
> LWLock WALWrite at the same time, but we did not find any issue on storage.
> Any suggestions?
> Thanks,
> James
11.04.2025 17:36, James Pang wrote:
> PG v14.8: during peak time, we suddenly see hundreds of active sessions waiting on
> LWLock WALWrite at the same time, but we did not find any issue on storage.
> Any suggestions?

No real suggestions...

There is a single WALWrite lock, so only a single process may actually write WAL to disk;
there is no parallelism there. That process also does the fsync, one file after another,
again with no parallelism. And every other backend that needs to be sure its transactions
are settled on disk waits for this process.

The process is also greedy: it collects the farthest position in the WAL buffers that is
ready to be written, writes and fsyncs all of those buffers, and only after that does it
release the WALWrite lock so the other backends are woken. So when many backends have
written into the WAL buffers and now need to wait until those buffers are settled on disk,
they all wait for this single process, which writes the buffers on behalf of all of them.

So:
- many backends write WAL buffers,
- one backend calculates how many buffers were written,
- that backend then writes and fsyncs those buffers serially,
- all the other backends wait for that backend.

As a result, backends wait for each other, or, in other words, they wait for the latest
of them: all backends wait until the WAL record written by the latest of them has been
written and fsynced to disk.

(Andres, IIUC this looks to be the main bottleneck on the way to increasing
NUM_XLOGINSERT_LOCKS. Right?)

-- 
regards
Yura Sokolov aka funny-falcon
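To make the serialization described above concrete, here is a deliberately tiny standalone
model of a greedy single-writer group flush (plain C with pthreads; it only illustrates the
behaviour, it is not PostgreSQL source, and all names in it are made up):

/* Illustration only, NOT PostgreSQL code: one lock serializes all "WAL" flushes,
 * and whoever holds it flushes up to the farthest position that is ready. */
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t wal_write_lock = PTHREAD_MUTEX_INITIALIZER;
static _Atomic uint64_t insert_lsn = 0;  /* farthest LSN already copied into WAL buffers */
static uint64_t flushed_lsn = 0;         /* farthest LSN written + fsynced (lock-protected) */

static void write_and_fsync_upto(uint64_t upto)
{
    usleep(1000);                        /* stand-in for the serial pwrite() + fdatasync() */
    flushed_lsn = upto;
}

/* What each backend effectively does after copying its record ending at my_lsn. */
static void flush_upto(uint64_t my_lsn)
{
    pthread_mutex_lock(&wal_write_lock); /* <- this is where the "WALWrite" waits pile up */
    if (flushed_lsn < my_lsn)
    {
        /* Greedy group flush: cover everything that is ready, not just my_lsn,
         * so one serial write+fsync satisfies every backend currently waiting. */
        write_and_fsync_upto(atomic_load(&insert_lsn));
    }
    pthread_mutex_unlock(&wal_write_lock);
}

static void *backend(void *arg)
{
    /* "Insert" a 100-byte record: reserve space and treat it as already copied. */
    uint64_t my_lsn = atomic_fetch_add(&insert_lsn, 100) + 100;
    flush_upto(my_lsn);
    printf("backend %ld: flushed up to %llu\n",
           (long) (intptr_t) arg, (unsigned long long) my_lsn);
    return NULL;
}

int main(void)
{
    pthread_t threads[8];
    for (long i = 0; i < 8; i++)
        pthread_create(&threads[i], NULL, backend, (void *) (intptr_t) i);
    for (long i = 0; i < 8; i++)
        pthread_join(threads[i], NULL);
    return 0;
}

Every thread blocks on the same mutex (the analogue of the WALWrite wait), and most of them
find their record already flushed by whoever held the lock before them - which is exactly
why the wait event shows up on so many sessions at once.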
Hi,

On 2025-04-15 12:16:40 +0300, Yura Sokolov wrote:
> 11.04.2025 17:36, James Pang wrote:
> > PG v14.8: during peak time, we suddenly see hundreds of active sessions waiting on
> > LWLock WALWrite at the same time, but we did not find any issue on storage.
> > Any suggestions?
>
> No real suggestions...
>
> There is a single WALWrite lock.

That's true - but it's worth specifically calling out that the reason you'd see a lot of
WALWrite lock wait events isn't typically real lock contention. Very often we'll flush WAL
for many sessions at once, and in those cases the WALWrite lock wait events just indicate
that all those sessions are actually waiting for the WAL IO to complete.

It'd be good if we could report a different wait event for the case of just waiting for WAL
IO to complete, but right now that's not entirely trivial to do reliably. But we could
perhaps do at least the minimal thing and report a different wait event if we reach
XLogFlush() with an LSN that's already in the process of being written out?

> As a result, backends wait for each other, or, in other words, they wait for the latest
> of them: all backends wait until the WAL record written by the latest of them has been
> written and fsynced to disk.

They don't necessarily wait for the *latest* write, they just wait for the latest write
from the time they started waiting.

FWIW, in the v1 AIO prototype I had split up the locking for this so that we'd not
unnecessarily need to wait for previous writes in many cases - unfortunately, for *many*
types of storage that turns out to be a significant loss (most extremely on non-enterprise
Samsung SSDs). The "maximal" group commit behaviour minimizes the number of durable writes
that need to be done, and that is a significant benefit on many forms of storage. On other
storage it's a significant benefit to have multiple concurrent flushes, but it's a really
hard tuning problem - I spent many months trying to get it right, and I never fully got
there.

> (Andres, IIUC this looks to be the main bottleneck on the way to increasing
> NUM_XLOGINSERT_LOCKS. Right?)

I don't think that the "single" WALWriteLock is a blocker to increasing
NUM_XLOGINSERT_LOCKS to a meaningful degree. However, I think there's somewhat of an
*inverse* relationship. To efficiently flush WAL in smaller increments, we need a cheap way
of identifying the number of backends that need to wait up to a certain LSN. For that I
think we may need a refinement of the WALInsertLock infrastructure.

I think the main blockers for increasing NUM_XLOGINSERT_LOCKS are:

1) Increasing NUM_XLOGINSERT_LOCKS allows more contention on insertpos_lck, and spinlocks
   scale really badly under heavy contention.

2) There are common codepaths where we need to iterate over all NUM_XLOGINSERT_LOCKS
   slots; that turns out to become rather expensive, as the relevant cachelines are very
   commonly not going to be in the local CPU cache.

I think we can redesign the mechanism so that there's an LSN-ordered ringbuffer of
in-progress insertions, with the reservation being a single 64bit atomic increment, without
the need for a low limit like NUM_XLOGINSERT_LOCKS (the ring size needs to be limited, but
I didn't see a disadvantage with using something like MaxConnections * 2).

Greetings,

Andres Freund
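For what it's worth, a rough standalone sketch of that ring idea could look like the
following (C11 atomics; the names, the fixed-record-size simplification and the sizing are
mine, not a patch - a real version has to handle variable-length records and must make
reservers wait instead of lapping a slot that is still in use):

/* Sketch only: an LSN-ordered ring of in-progress insertions, where the only
 * shared update on the reservation path is a single 64-bit atomic increment. */
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>

#define REC_SIZE  64                    /* simplification: fixed-size records */
#define RING_SIZE 2048                  /* e.g. something like MaxConnections * 2 */

static _Atomic uint64_t reserved_lsn;   /* next LSN to hand out */

/* ring[i] holds the end LSN of a finished insertion.  A slot counts as finished
 * only if the stored value matches the end LSN expected on this lap, so stale
 * values left over from previous laps are ignored automatically. */
static _Atomic uint64_t ring[RING_SIZE];

/* Reserve space for one record: one atomic increment, no spinlock. */
static uint64_t wal_reserve(void)
{
    return atomic_fetch_add(&reserved_lsn, REC_SIZE);
}

/* Mark the insertion that started at start_lsn as fully copied into buffers. */
static void wal_finish(uint64_t start_lsn)
{
    atomic_store(&ring[(start_lsn / REC_SIZE) % RING_SIZE], start_lsn + REC_SIZE);
}

/* Walk the ring in LSN order from a known contiguously-copied point and return
 * how far the WAL is copied without gaps.  Because the ring is LSN-ordered, a
 * flusher also knows exactly which waiters (targets at or below the result)
 * the next flush will satisfy. */
static uint64_t wal_copied_upto(uint64_t from)
{
    while (from < atomic_load(&reserved_lsn))
    {
        uint64_t end = atomic_load(&ring[(from / REC_SIZE) % RING_SIZE]);
        if (end != from + REC_SIZE)
            break;                      /* this insertion hasn't finished copying yet */
        from = end;
    }
    return from;
}

int main(void)
{
    uint64_t a = wal_reserve();         /* 0  */
    uint64_t b = wal_reserve();         /* 64 */
    wal_finish(b);                      /* finish out of order */
    printf("copied up to %llu\n", (unsigned long long) wal_copied_upto(0)); /* 0: gap at a */
    wal_finish(a);
    printf("copied up to %llu\n", (unsigned long long) wal_copied_upto(0)); /* 128 */
    return 0;
}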
15.04.2025 13:00, Andres Freund wrote:
> 1) Increasing NUM_XLOGINSERT_LOCKS allows more contention on insertpos_lck, and
>    spinlocks scale really badly under heavy contention.
>
> I think we can redesign the mechanism so that there's an LSN-ordered ringbuffer of
> in-progress insertions, with the reservation being a single 64bit atomic increment,
> without the need for a low limit like NUM_XLOGINSERT_LOCKS (the ring size needs to be
> limited, but I didn't see a disadvantage with using something like MaxConnections * 2).

There is such an attempt at [1], and Zhiguo says it really shows promising results.

No, I did it not with a "ring buffer" but rather with a hash table - but it is still
lock-free. But after implementing that I found the WALBufMappingLock [2] (which has since
been removed). And then everything got stuck on the WALWrite lock.

> However, I think there's somewhat of an *inverse* relationship. To efficiently flush WAL
> in smaller increments, we need a cheap way of identifying the number of backends that
> need to wait up to a certain LSN.

I believe LWLockWaitForVar should be redone:
- currently it waits for the variable to change (i.e. to become distinct from the
  provided value);
- but I believe it should wait for the variable to become greater than the provided
  value.

This way:
- a WALInsertLock waiter will not wake up for every change of insertingAt;
- the process which writes and fsyncs WAL will be able to wake waiters on every fsync,
  instead of only at the end of the whole write.

It would reduce the overhead of waiting on WALInsertLock a lot, and would greatly reduce
the time spent waiting on the WALWrite lock.

Btw, insertingAt has to be filled in at the start of copying a WAL record into the WAL
buffers. Yes, we believe copying a small WAL record is fast, but when a lot of WAL
inserters are doing their job, we needlessly sleep on their WALInsertLock even though
they are already ahead of the position we are waiting for.

[1] https://commitfest.postgresql.org/patch/5633/
[2] https://commitfest.postgresql.org/patch/5511/

-- 
regards
Yura Sokolov aka funny-falcon
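To show the "wait for greater than" idea in isolation, here is a minimal standalone sketch
(plain C with pthread condition variables; it is not the real LWLockWaitForVar from
lwlock.c, and all names are made up - it just puts the two waiting policies side by side):

/* Illustration only: a monotonically increasing progress value, e.g. "flushed
 * up to this LSN", with a writer that publishes progress after every fsync. */
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

typedef struct
{
    pthread_mutex_t mutex;
    pthread_cond_t  cond;
    uint64_t        value;
} ProgressVar;

static ProgressVar flushed = { PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, 0 };

/* Writer side: publish progress and wake waiters. */
void progress_advance(ProgressVar *v, uint64_t new_value)
{
    pthread_mutex_lock(&v->mutex);
    if (new_value > v->value)
    {
        v->value = new_value;
        pthread_cond_broadcast(&v->cond);
    }
    pthread_mutex_unlock(&v->mutex);
}

/* Policy A, "wait for greater": after each wakeup the waiter does one cheap
 * comparison and goes straight back to sleep until its target is covered. */
uint64_t progress_wait_greater(ProgressVar *v, uint64_t target)
{
    pthread_mutex_lock(&v->mutex);
    while (v->value <= target)
        pthread_cond_wait(&v->cond, &v->mutex);
    uint64_t result = v->value;
    pthread_mutex_unlock(&v->mutex);
    return result;
}

/* Policy B, "wait for change" (roughly the semantics criticized above): the
 * call returns on the very first update, whether or not it is far enough, so
 * the caller has to re-check and re-enter the wait on every intermediate update. */
uint64_t progress_wait_changed(ProgressVar *v, uint64_t old)
{
    pthread_mutex_lock(&v->mutex);
    while (v->value == old)
        pthread_cond_wait(&v->cond, &v->mutex);
    uint64_t result = v->value;
    pthread_mutex_unlock(&v->mutex);
    return result;
}

static void *waiter(void *arg)
{
    (void) arg;
    /* Only returns once the published value exceeds 300. */
    printf("reached %llu\n", (unsigned long long) progress_wait_greater(&flushed, 300));
    return NULL;
}

int main(void)
{
    pthread_t t;
    pthread_create(&t, NULL, waiter, NULL);
    for (uint64_t lsn = 100; lsn <= 400; lsn += 100)
        progress_advance(&flushed, lsn);   /* e.g. one call per completed fsync */
    pthread_join(t, NULL);
    return 0;
}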
Hi,

On 2025-04-15 13:44:09 +0300, Yura Sokolov wrote:
> 15.04.2025 13:00, Andres Freund wrote:
> > 1) Increasing NUM_XLOGINSERT_LOCKS allows more contention on insertpos_lck, and
> >    spinlocks scale really badly under heavy contention.
> >
> > I think we can redesign the mechanism so that there's an LSN-ordered ringbuffer of
> > in-progress insertions, with the reservation being a single 64bit atomic increment,
> > without the need for a low limit like NUM_XLOGINSERT_LOCKS (the ring size needs to be
> > limited, but I didn't see a disadvantage with using something like MaxConnections * 2).
>
> There is such an attempt at [1], and Zhiguo says it really shows promising results.
>
> No, I did it not with a "ring buffer" but rather with a hash table - but it is still
> lock-free.

I don't find that approach particularly promising - I do think we want this to be an
ordered data structure, not something as fundamentally unordered as a hashtable.

> And then everything got stuck on the WALWrite lock.

That will often, but not always, mean that you're just hitting the IO throughput of the
storage device. Right now it's too hard to tell the difference, hence the suggestion to
make the wait events more informative.

> > However, I think there's somewhat of an *inverse* relationship. To efficiently flush
> > WAL in smaller increments, we need a cheap way of identifying the number of backends
> > that need to wait up to a certain LSN.
>
> I believe LWLockWaitForVar should be redone:
> - currently it waits for the variable to change (i.e. to become distinct from the
>   provided value);
> - but I believe it should wait for the variable to become greater than the provided
>   value.

I think we should simply get rid of the mechanism altogether :)

> This way:
> - a WALInsertLock waiter will not wake up for every change of insertingAt;
> - the process which writes and fsyncs WAL will be able to wake waiters on every fsync,
>   instead of only at the end of the whole write.
>
> It would reduce the overhead of waiting on WALInsertLock a lot, and would greatly reduce
> the time spent waiting on the WALWrite lock.
>
> Btw, insertingAt has to be filled in at the start of copying a WAL record into the WAL
> buffers. Yes, we believe copying a small WAL record is fast, but when a lot of WAL
> inserters are doing their job, we needlessly sleep on their WALInsertLock even though
> they are already ahead of the position we are waiting for.

Yes, that's a problem - but it also adds some overhead. I think we'll be better off going
with the ringbuffer approach, where insertions are naturally ordered and we can wait for
precisely the insertions that we need to.

Greetings,

Andres Freund
15.04.2025 13:53, Andres Freund wrote:
> On 2025-04-15 13:44:09 +0300, Yura Sokolov wrote:
> > [...]
> > No, I did it not with a "ring buffer" but rather with a hash table - but it is still
> > lock-free.
>
> I don't find that approach particularly promising - I do think we want this to be an
> ordered data structure, not something as fundamentally unordered as a hashtable.

I've tried to construct such a thing, but the "Switch WAL" record didn't allow me to
finish the design: a "Switch WAL" record has no fixed size, and it is allowed not to be
inserted at all, which breaks the ordering. Probably I just didn't think hard enough to
find a way around that. And certainly I thought about it only for log reservation, not for
waiting on an insertion to complete, nor for waiting on a write to complete.

[...]

-- 
regards
Yura Sokolov aka funny-falcon