Re: many sessions wait on LWlock WALWrite suddenly - Mailing list pgsql-performance
| From | Andres Freund | 
|---|---|
| Subject | Re: many sessions wait on LWlock WALWrite suddenly | 
| Date | |
| Msg-id | l23ybh777m5u7goxpzznqtobbgxanm65k7yxqvstwh3lze5mn4@3x5dkdtwzpj6 | 
| In response to | Re: many sessions wait on LWlock WALWrite suddenly (Yura Sokolov <y.sokolov@postgrespro.ru>) | 
| Responses | Re: many sessions wait on LWlock WALWrite suddenly | 
| List | pgsql-performance | 
Hi,

On 2025-04-15 12:16:40 +0300, Yura Sokolov wrote:
> 11.04.2025 17:36, James Pang wrote:
> > pgv14.8 , during peak time, we suddenly see hundreds of active sessions
> > waiting on LWlock WALWrite at the same time, but we did not find any issue
> > on storage .
> > any suggestions ?
>
> No real suggestions...
>
> There is single WALWrite lock.

That's true - but it's worth specifically calling out that the reason you'd
see a lot of WALWrite lock wait events isn't typically due to real lock
contention. Very often we'll flush WAL for many sessions at once; in those
cases the WALWrite lock wait events just indicate that all those sessions are
actually waiting for the WAL IO to complete.

It'd be good if we could report a different wait event for the case of just
waiting for WAL IO to complete, but right now that's not entirely trivial to
do reliably. But we could perhaps do at least the minimal thing and report a
different wait event if we reach XLogFlush() with an LSN that's already in the
process of being written out?

> In the results, backends waits each other, or, in other words, they waits
> latest of them!!! All backends waits until WAL record written by latest of
> them will be written and fsynced to disk.

They don't necessarily wait for the *latest* write, they just wait for the
latest write from the time they started writing.

FWIW, in the v1 AIO prototype I had split up the locking for this so that we'd
not unnecessarily need to wait for previous writes in many cases -
unfortunately for *many* types of storage that turns out to be a significant
loss (most extremely on non-enterprise Samsung SSDs). The "maximal" group
commit behaviour minimizes the number of durable writes that need to be done,
and that is a significant benefit on many forms of storage. On other storage
it's a significant benefit to have multiple concurrent flushes, but it's a
very hard tuning problem - I spent many months trying to get it right, and I
never fully got there.
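To make the group commit point concrete, here is a minimal toy simulation. It is not PostgreSQL code; the function name, the integer time units and the fixed flush duration are all assumptions made for illustration. Only one flush runs at a time, and each flush covers every commit record that was inserted before that flush started, so a session arriving during a flush waits for the in-progress flush and then for the next one.

```python
def simulate_group_commit(commit_times, flush_duration):
    """Toy model of "maximal" group commit (illustrative only).

    commit_times: sorted integer times at which each session has inserted
    its commit record and asks for a flush (hypothetical units).
    flush_duration: fixed cost of one durable write (an assumption).

    Returns (number_of_flushes, list_of_per_session_waits).
    """
    n = len(commit_times)
    waits = [None] * n
    flushes = 0
    now = 0   # time at which the single flusher becomes free
    i = 0     # first session not yet covered by a flush
    while i < n:
        # The next flush starts once the flusher is free and work is pending.
        start = max(now, commit_times[i])
        # Every record inserted up to the start of the flush rides along.
        j = i
        while j < n and commit_times[j] <= start:
            j += 1
        end = start + flush_duration
        for k in range(i, j):
            waits[k] = end - commit_times[k]
        flushes += 1
        now = end
        i = j
    return flushes, waits


if __name__ == "__main__":
    flushes, waits = simulate_group_commit([0, 1, 2, 3, 15], flush_duration=10)
    print(f"{flushes} flushes for 5 commits, waits: {waits}")
```

In this model the sessions arriving at times 1, 2 and 3 all show up as waiting while a single flush is in progress, which is the situation where many simultaneous WALWrite waits reflect WAL IO rather than lock contention.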
> (Andres, iiuc it looks to be main bottleneck on the way of increasing
> NUM_XLOGINSERT_LOCKS. Right?)

I don't think that the "single" WALWriteLock is a blocker to increasing
NUM_XLOGINSERT_LOCKS to a meaningful degree.

However, I think there's somewhat of an *inverse* relationship. To efficiently
flush WAL in smaller increments, we need a cheap way of identifying the number
of backends that need to wait up to a certain LSN. For that I think we may
need a refinement of the WALInsertLock infrastructure.

I think the main blockers for increasing NUM_XLOGINSERT_LOCKS are:

1) Increasing NUM_XLOGINSERT_LOCKS allows more contention on insertpos_lck,
   and spinlocks scale really badly under heavy contention.

2) There are common codepaths where we need to iterate over all
   NUM_XLOGINSERT_LOCKS slots. That turns out to become rather expensive, as
   the relevant cachelines are very commonly not going to be in the local CPU
   cache.

I think we can redesign the mechanism so that there's an LSN ordered
ringbuffer of in-progress insertions, with the reservation being a single
64bit atomic increment, without the need for a low limit like
NUM_XLOGINSERT_LOCKS (the ring size needs to be limited, but I didn't see a
disadvantage with using something like MaxConnections * 2).

Greetings,

Andres Freund
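The ring buffer idea can be sketched as a small single-threaded model. This is only an illustration of the data structure's shape, not the proposed design: the class and method names are hypothetical, and the sketch uses two plain counters rather than showing how the reservation would collapse into one 64-bit atomic increment. What it does show is that, with entries ordered by LSN, finding the position up to which all insertions are complete means advancing a tail pointer instead of scanning every slot.

```python
class InsertionRing:
    """Toy, single-threaded model of an LSN-ordered ring of in-progress
    WAL insertions. All names here are hypothetical."""

    def __init__(self, size):
        self.size = size
        self.slots = [None] * size  # each slot: [start_lsn, end_lsn, done]
        self.head = 0               # next ticket to hand out
        self.tail = 0               # oldest ticket not yet known to be done
        self.next_lsn = 0           # next free byte position in the WAL

    def reserve(self, nbytes):
        """Reserve WAL space and a ring slot; returns a ticket number.

        A real implementation would do this with an atomic operation;
        plain counters stand in for that here.
        """
        self.inserted_up_to()  # reclaim slots of finished insertions
        if self.head - self.tail >= self.size:
            raise RuntimeError("ring full; a real implementation would wait")
        ticket = self.head
        self.head += 1
        start = self.next_lsn
        self.next_lsn += nbytes
        self.slots[ticket % self.size] = [start, start + nbytes, False]
        return ticket

    def finish(self, ticket):
        """Mark the insertion identified by ticket as fully copied."""
        self.slots[ticket % self.size][2] = True

    def inserted_up_to(self):
        """Return the LSN up to which all insertions are complete.

        Because entries are in LSN order, only the tail has to advance;
        there is no scan over every slot.
        """
        while self.tail < self.head and self.slots[self.tail % self.size][2]:
            self.tail += 1
        if self.tail == self.head:
            return self.next_lsn
        return self.slots[self.tail % self.size][0]


if __name__ == "__main__":
    ring = InsertionRing(size=4)
    a = ring.reserve(100)
    b = ring.reserve(50)
    ring.finish(b)                 # b finished, but the older a is still pending
    print(ring.inserted_up_to())
    ring.finish(a)
    print(ring.inserted_up_to())
```

In this model an insertion that finishes out of order does not move the completed position until all older insertions have finished too, which is the property a flusher would rely on.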