Thread: many sessions wait on LWlock WALWrite suddenly

many sessions wait on LWlock WALWrite suddenly

From: James Pang
PostgreSQL v14.8: during peak time, we suddenly see hundreds of active sessions waiting on LWLock WALWrite at the same time, but we did not find any issue with the storage.
Any suggestions?

Thanks,

James

Re: many sessions wait on LWlock WALWrite suddenly

From: Laurenz Albe
On Fri, 2025-04-11 at 22:36 +0800, James Pang wrote:
> PostgreSQL v14.8: during peak time, we suddenly see hundreds of active sessions
> waiting on LWLock WALWrite at the same time, but we did not find any issue
> with the storage.
> Any suggestions?

You should get a reasonably sized (much smaller) connection pool.
That will probably take care of the problem, and it is likely to
improve your overall performance as well.

Yours,
Laurenz Albe



Re: many sessions wait on LWlock WALWrite suddenly

From: MichaelDBA
LWLock contention typically shows up when you have too many concurrent active connections.  Count the busy sessions with: select count(*) from pg_stat_activity where state in ('idle in transaction', 'active');  Then count how many CPUs you have.  If the query returns more than 2-3 times the number of CPUs, you probably have a CPU overload problem, and your solution may be to add a connection pooler between the clients and the DB server.  This is all due to how PG is architected: every connection is a process, not a thread.

Regards,
Michael Vitale

James Pang wrote on 4/11/2025 10:36 AM:
PostgreSQL v14.8: during peak time, we suddenly see hundreds of active sessions waiting on LWLock WALWrite at the same time, but we did not find any issue with the storage.
Any suggestions?

Thanks,

James





Re: many sessions wait on LWlock WALWrite suddenly

From: Yura Sokolov
On 11.04.2025 17:36, James Pang wrote:
> PostgreSQL v14.8: during peak time, we suddenly see hundreds of active
> sessions waiting on LWLock WALWrite at the same time, but we did not find
> any issue with the storage.
> Any suggestions?

No real suggestions...

There is a single WALWrite lock. So only a single process may actually
write WAL to disk. There is no parallelism there.

And that process also does the fsync, one file after another. Again, no
parallelism.

And every other backend that needs to be sure its transactions are settled
on disk waits for this process.

And this process is greedy: it collects the farthest position in the WAL
buffers that is ready to be written, and writes and fsyncs all of those
buffers. Only after that does it release the WALWrite lock and wake the
other backends.

So when many backends have written to the WAL buffers and now need to wait
until those buffers are settled on disk, they all wait for this single
process, which will write the buffers for all of them.

So:
- many backends write WAL buffers,
- one backend calculates how many buffers were written,
- that backend then writes and fsyncs those buffers serially,
- all other backends wait for it.
As a result, the backends wait for each other, or, in other words, they
wait for the latest of them!  All backends wait until the WAL record
written by the latest of them has been written and fsynced to disk.
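
To make the pattern concrete, here is a minimal standalone sketch of this
group-flush behaviour (plain C with pthreads standing in for backends;
flush_wal_upto(), flush_request and flushed_upto are invented names, not
the actual XLogWrite code):

    #include <pthread.h>
    #include <stdint.h>

    static pthread_mutex_t walwrite_lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  flushed_cond  = PTHREAD_COND_INITIALIZER;
    static uint64_t flush_request = 0;  /* farthest LSN anyone asked to flush */
    static uint64_t flushed_upto  = 0;  /* LSN known to be durable on disk */
    static int      io_in_progress = 0; /* only one writer at a time */

    /* stand-in for writing and fsyncing the WAL buffers up to "upto" */
    static void write_and_fsync(uint64_t upto) { (void) upto; }

    /* Called by each backend after copying its record into the WAL buffers. */
    void flush_wal_upto(uint64_t my_lsn)
    {
        pthread_mutex_lock(&walwrite_lock);
        if (flush_request < my_lsn)
            flush_request = my_lsn;
        while (flushed_upto < my_lsn)
        {
            if (io_in_progress)
            {
                /* Another backend is doing the write for everyone; waiting
                 * here is what shows up as the WALWrite wait event. */
                pthread_cond_wait(&flushed_cond, &walwrite_lock);
                continue;
            }
            /* Become the single writer: greedily flush up to the farthest
             * requested position, covering all current waiters at once. */
            io_in_progress = 1;
            uint64_t target = flush_request;
            pthread_mutex_unlock(&walwrite_lock);
            write_and_fsync(target);        /* serial write + fsync */
            pthread_mutex_lock(&walwrite_lock);
            flushed_upto = target;
            io_in_progress = 0;
            pthread_cond_broadcast(&flushed_cond);
        }
        pthread_mutex_unlock(&walwrite_lock);
    }

Note how a backend whose request arrives during an ongoing write has to
wait for that write *and* for the next round that covers its own LSN: the
group commit keeps throughput up at the cost of latency.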

(Andres, IIUC this looks to be the main bottleneck on the way to
increasing NUM_XLOGINSERT_LOCKS.  Right?)

-- 
regards
Yura Sokolov aka funny-falcon



Re: many sessions wait on LWlock WALWrite suddenly

From: Andres Freund
Hi,

On 2025-04-15 12:16:40 +0300, Yura Sokolov wrote:
> On 11.04.2025 17:36, James Pang wrote:
> > PostgreSQL v14.8: during peak time, we suddenly see hundreds of active
> > sessions waiting on LWLock WALWrite at the same time, but we did not find
> > any issue with the storage.
> > Any suggestions?
> 
> No real suggestions...
> 
> There is a single WALWrite lock.

That's true - but it's worth specifically calling out that the reason you'd
see a lot of WALWrite lock wait events isn't typically real lock
contention.  Very often we'll flush WAL for many sessions at once; in those
cases the WALWrite lock wait events just indicate that all those sessions are
actually waiting for the WAL IO to complete.

It'd be good if we could report a different wait event for the case of just
waiting for WAL IO to complete, but right now that's not entirely trivial to
do reliably. But we could perhaps do at least the minimal thing and report a
different wait event if we reach XLogFlush() with an LSN that's already in the
process of being written out?
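
A hypothetical shape for that minimal version (all names here are invented,
and this is not how xlog.c is actually structured):

    #include <stdint.h>

    /* invented stand-ins for the real wait-event reporting */
    typedef enum { WAIT_EVENT_WAL_WRITE_LOCK, WAIT_EVENT_WAL_FLUSH_IO } WaitEvent;
    static void report_wait_event(WaitEvent ev) { (void) ev; }

    static uint64_t write_in_progress_upto; /* LSN the current writer covers */

    /* Decide which wait event to report when flushing up to request_lsn. */
    static void report_flush_wait(uint64_t request_lsn)
    {
        if (request_lsn <= write_in_progress_upto)
            /* a write covering our LSN is already underway: we are only
             * waiting for its IO to finish, not contending for the lock */
            report_wait_event(WAIT_EVENT_WAL_FLUSH_IO);
        else
            report_wait_event(WAIT_EVENT_WAL_WRITE_LOCK);
    }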


> As a result, the backends wait for each other, or, in other words, they
> wait for the latest of them!  All backends wait until the WAL record
> written by the latest of them has been written and fsynced to disk.

They don't necessarily wait for the *latest* write, they just wait for the
latest write as of the time they started waiting.


FWIW, in the v1 AIO prototype I had split up the locking for this so that we'd
not unnecessarily need to wait for previous writes in many cases - unfortunately
for *many* types of storage that turns out to be a significant loss (most
extremely on non-enterprise Samsung SSDs). The "maximal" group commit
behaviour minimizes the number of durable writes that need to be done, and
that is a significant benefit on many forms of storage. On other storage it's
a significant benefit to have multiple concurrent flushes, but it's a hard
tuning problem - I spent many months trying to get it right, and I never
fully got there.


> (Andres, IIUC this looks to be the main bottleneck on the way to
> increasing NUM_XLOGINSERT_LOCKS.  Right?)

I don't think that the "single" WALWriteLock is a blocker to increasing
NUM_XLOGINSERT_LOCKS to a meaningful degree.

However, I think there's somewhat of an *inverse* relationship.  To
efficiently flush WAL in smaller increments, we need a cheap way of
identifying the number of backends that need to wait up to a certain LSN. For
that I think we may need a refinement of the WALInsertLock infrastructure.


I think the main blockers for increasing NUM_XLOGINSERT_LOCKS are:

1) Increasing NUM_XLOGINSERT_LOCKS allows more contention on insertpos_lck,
   and spinlocks scale really badly under heavy contention.

2) There are common code paths where we need to iterate over all
   NUM_XLOGINSERT_LOCKS slots; that turns out to become rather expensive,
   since the relevant cache lines are very commonly not going to be in the
   local CPU cache.

I think we can redesign the mechanism so that there's an LSN-ordered
ring buffer of in-progress insertions, with the reservation being a single
64-bit atomic increment, without the need for a low limit like
NUM_XLOGINSERT_LOCKS (the ring size needs to be limited, but I didn't see a
disadvantage with using something like MaxConnections * 2).
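
A deliberately tiny sketch of just the reservation step (standalone C11;
reserve_insertion() and InsertionSlot are invented names, and it assumes
fixed-size records for simplicity - real WAL records vary in size, which is
part of what makes the real design harder; slot recycling and the waiting
machinery are omitted):

    #include <stdatomic.h>
    #include <stdint.h>

    #define RING_SIZE   2048 /* e.g. MaxConnections * 2 */
    #define RECORD_SIZE 128  /* fixed-size records: a simplifying assumption */

    typedef struct
    {
        _Atomic uint64_t start_lsn; /* start position of this insertion */
        _Atomic int      done;      /* copy into the WAL buffers finished */
    } InsertionSlot;

    static _Atomic uint64_t next_seq; /* the single reservation counter */
    static InsertionSlot    ring[RING_SIZE];

    /* Reserve space for one record with a single atomic increment - no
     * insertpos_lck-style spinlock - and get a ring slot that is
     * LSN-ordered by construction. */
    InsertionSlot *
    reserve_insertion(uint64_t *start_lsn)
    {
        uint64_t seq = atomic_fetch_add(&next_seq, 1);
        InsertionSlot *slot = &ring[seq % RING_SIZE];

        *start_lsn = seq * RECORD_SIZE;
        atomic_store(&slot->done, 0);
        atomic_store(&slot->start_lsn, *start_lsn);
        return slot;
    }

A flusher that must wait for all insertions below some LSN could then scan
the ring in sequence order and wait for exactly the slots with
start_lsn < lsn && !done, instead of polling all NUM_XLOGINSERT_LOCKS slots.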

Greetings,

Andres Freund



Re: many sessions wait on LWlock WALWrite suddenly

From: Yura Sokolov
On 15.04.2025 13:00, Andres Freund wrote:
> 1) Increasing NUM_XLOGINSERT_LOCKS allows more contention on insertpos_lck,
>    and spinlocks scale really badly under heavy contention.
> 
> I think we can redesign the mechanism so that there's an LSN-ordered
> ring buffer of in-progress insertions, with the reservation being a single
> 64-bit atomic increment, without the need for a low limit like
> NUM_XLOGINSERT_LOCKS (the ring size needs to be limited, but I didn't see a
> disadvantage with using something like MaxConnections * 2).

There is such an attempt at [1], and Zhiguo says it shows really promising
results.

Well, I did it not with a ring buffer but with a hash table.  It is still
lock-free, though.

But after implementing that, I ran into WALBufMappingLock [2] (which has
since been removed).

And then everything got stuck on the WALWrite lock.

> However, I think there's somewhat of an *inverse* relationship.  To
> efficiently flush WAL in smaller increments, we need a cheap way of
> identifying the number of backends that need to wait up to a certain LSN.

I believe LWLockWaitForVar should be redone:
- currently it waits for the variable to change (i.e. to become distinct
from the provided value);
- but I believe it should wait for the variable to become greater than the
provided value.

This way:
- a WALInsertLock waiter will not wake up for every change of insertingAt,
- the process that writes and fsyncs WAL will be able to wake waiters on
every fsync, instead of at the end of the whole write.

It will reduce the overhead of waiting on WALInsertLock a lot, and will
greatly reduce the time spent waiting on the WALWrite lock.
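
In other words (a tiny illustrative sketch, not the actual LWLockWaitForVar
code; wait_on_lock() is a stand-in for the real sleep/wakeup machinery):

    #include <stdatomic.h>
    #include <stdint.h>
    #include <sched.h>

    /* stand-in for queueing on the lock's wait list */
    static void wait_on_lock(void) { sched_yield(); }

    /* current behaviour: wake whenever the variable changes at all */
    void wait_for_var_to_change(_Atomic uint64_t *var, uint64_t old_value)
    {
        while (atomic_load(var) == old_value)
            wait_on_lock();
    }

    /* proposed: wake only once the variable has advanced past our LSN,
     * so a waiter is not woken for every intermediate update of
     * insertingAt */
    void wait_for_var_greater_than(_Atomic uint64_t *var, uint64_t wait_for_lsn)
    {
        while (atomic_load(var) <= wait_for_lsn)
            wait_on_lock();
    }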

Btw, insertingAt has to be filled in at the start of copying a WAL record
into the WAL buffers.  Yes, we believe copying a small WAL record is fast,
but when a lot of WAL inserters are doing their job, we needlessly sleep on
their WALInsertLock even though they are already in the future.

[1] https://commitfest.postgresql.org/patch/5633/
[2] https://commitfest.postgresql.org/patch/5511/

-- 
regards
Yura Sokolov aka funny-falcon



Re: many sessions wait on LWlock WALWrite suddenly

From: Andres Freund
Hi,

On 2025-04-15 13:44:09 +0300, Yura Sokolov wrote:
> On 15.04.2025 13:00, Andres Freund wrote:
> > 1) Increasing NUM_XLOGINSERT_LOCKS allows more contention on insertpos_lck,
> >    and spinlocks scale really badly under heavy contention.
> > 
> > I think we can redesign the mechanism so that there's an LSN-ordered
> > ring buffer of in-progress insertions, with the reservation being a single
> > 64-bit atomic increment, without the need for a low limit like
> > NUM_XLOGINSERT_LOCKS (the ring size needs to be limited, but I didn't see a
> > disadvantage with using something like MaxConnections * 2).
> 
> There is such an attempt at [1], and Zhiguo says it shows really promising
> results.
> 
> Well, I did it not with a ring buffer but with a hash table.  It is still
> lock-free, though.

I don't find that approach particularly promising - I do think we want this
to be an ordered data structure, not something as fundamentally unordered as
a hash table.


> And then everything got stuck on the WALWrite lock.

That will often, but not always, mean that you're just hitting the IO
throughput of the storage device.  Right now it's too hard to tell the
difference, hence the suggestion to make the wait events more informative.


> > However, I think there's somewhat of an *inverse* relationship.  To
> > efficiently flush WAL in smaller increments, we need a cheap way of
> > identifying the number of backends that need to wait up to a certain LSN.
> 
> I believe LWLockWaitForVar should be redone:
> - currently it waits for the variable to change (i.e. to become distinct
> from the provided value);
> - but I believe it should wait for the variable to become greater than the
> provided value.

I think we should simply get rid of the mechanism altogether :)


> This way:
> - a WALInsertLock waiter will not wake up for every change of insertingAt,
> - the process that writes and fsyncs WAL will be able to wake waiters on
> every fsync, instead of at the end of the whole write.
> 
> It will reduce the overhead of waiting on WALInsertLock a lot, and will
> greatly reduce the time spent waiting on the WALWrite lock.

> Btw, insertingAt has to be filled in at the start of copying a WAL record
> into the WAL buffers.  Yes, we believe copying a small WAL record is fast,
> but when a lot of WAL inserters are doing their job, we needlessly sleep on
> their WALInsertLock even though they are already in the future.

Yes, that's a problem - but it also adds some overhead.  I think we'll be
better off going with the ring buffer approach, where insertions are
naturally ordered and we can wait for precisely the insertions that we need
to.

Greetings,

Andres Freund



Re: many sessions wait on LWlock WALWrite suddenly

From: Yura Sokolov
On 15.04.2025 13:53, Andres Freund wrote:
> Hi,
> 
> On 2025-04-15 13:44:09 +0300, Yura Sokolov wrote:
>> On 15.04.2025 13:00, Andres Freund wrote:
>>> 1) Increasing NUM_XLOGINSERT_LOCKS allows more contention on insertpos_lck,
>>>    and spinlocks scale really badly under heavy contention.
>>>
>>> I think we can redesign the mechanism so that there's an LSN-ordered
>>> ring buffer of in-progress insertions, with the reservation being a single
>>> 64-bit atomic increment, without the need for a low limit like
>>> NUM_XLOGINSERT_LOCKS (the ring size needs to be limited, but I didn't see a
>>> disadvantage with using something like MaxConnections * 2).
>>
>> There is such an attempt at [1], and Zhiguo says it shows really promising
>> results.
>>
>> Well, I did it not with a ring buffer but with a hash table.  It is still
>> lock-free, though.
> 
> I don't find that approach particularly promising - I do think we want this
> to be an ordered data structure, not something as fundamentally unordered as
> a hash table.

I tried to construct such a thing, but the "Switch WAL" record didn't let
me finish the design: "Switch WAL" has no fixed size, and it is allowed to
not be inserted, which breaks the ordering.

Probably I just didn't think hard enough to work around it.

And I certainly thought about it only for log reservation, not for waiting
on an insertion to complete, nor for waiting on a write to complete.

>> And then everything got stuck on the WALWrite lock.
> 
> That will often, but not always, mean that you're just hitting the IO
> throughput of the storage device.  Right now it's too hard to tell the
> difference, hence the suggestion to make the wait events more informative.
> 
> 
>>> However, I think there's somewhat of an *inverse* relationship.  To
>>> efficiently flush WAL in smaller increments, we need a cheap way of
>>> identifying the number of backends that need to wait up to a certain LSN.
>>
>> I believe LWLockWaitForVar should be redone:
>> - currently it waits for the variable to change (i.e. to become distinct
>> from the provided value);
>> - but I believe it should wait for the variable to become greater than the
>> provided value.
> 
> I think we should simply get rid of the mechanism altogether :)
> 
> 
>> This way:
>> - a WALInsertLock waiter will not wake up for every change of insertingAt,
>> - the process that writes and fsyncs WAL will be able to wake waiters on
>> every fsync, instead of at the end of the whole write.
>>
>> It will reduce the overhead of waiting on WALInsertLock a lot, and will
>> greatly reduce the time spent waiting on the WALWrite lock.
> 
>> Btw, insertingAt has to be filled in at the start of copying a WAL record
>> into the WAL buffers.  Yes, we believe copying a small WAL record is fast,
>> but when a lot of WAL inserters are doing their job, we needlessly sleep on
>> their WALInsertLock even though they are already in the future.
> 
> Yes, that's a problem - but it also adds some overhead.  I think we'll be
> better off going with the ring buffer approach, where insertions are
> naturally ordered and we can wait for precisely the insertions that we need
> to.

-- 
regards
Yura Sokolov aka funny-falcon