Thread: Re: sinvaladt.c: remove msgnumLock, use atomic operations on maxMsgNum

Just rebased the patch.

-------
regards
Yura Sokolov aka funny-falcon
Hi,

On 2025-03-21 14:35:16 +0300, Yura Sokolov wrote:
> From 080c9e0de5e6e10751347e1ff50b65df424744cb Mon Sep 17 00:00:00 2001
> From: Yura Sokolov <y.sokolov@postgrespro.ru>
> Date: Mon, 3 Feb 2025 11:58:33 +0300
> Subject: [PATCH v2] sinvaladt.c: use atomic operations on maxMsgNum
>
> The msgnumLock spinlock could be highly contended.
> The comment states it was used as a memory barrier.
> Let's use atomic ops with memory barriers directly instead.
>
> Note: the patch uses pg_read_barrier()/pg_write_barrier() instead of
> pg_atomic_read_membarrier_u32()/pg_atomic_write_membarrier_u32() since
> no full barrier semantics are required, and explicit read/write barriers
> are cheaper, at least on x86_64.

Is it actually true that full barriers aren't required?  I think we might
actually rely on a stronger ordering.


> @@ -506,10 +493,9 @@ SIGetDataEntries(SharedInvalidationMessage *data, int datasize)
>       */
>      stateP->hasMessages = false;
>
> -    /* Fetch current value of maxMsgNum using spinlock */
> -    SpinLockAcquire(&segP->msgnumLock);
> -    max = segP->maxMsgNum;
> -    SpinLockRelease(&segP->msgnumLock);
> +    /* Fetch current value of maxMsgNum using atomic with memory barrier */
> +    max = pg_atomic_read_u32(&segP->maxMsgNum);
> +    pg_read_barrier();
>
>      if (stateP->resetState)
>      {
>        /*
>         * Force reset.  We can say we have dealt with any messages added
>         * since the reset, as well; and that means we should clear the
>         * signaled flag, too.
>         */
>        stateP->nextMsgNum = max;
>        stateP->resetState = false;
>        stateP->signaled = false;
>        LWLockRelease(SInvalReadLock);
>        return -1;
>    }

vs

> @@ -410,17 +398,16 @@ SIInsertDataEntries(const SharedInvalidationMessage *data, int n)
>          /*
>           * Insert new message(s) into proper slot of circular buffer
>           */
> -        max = segP->maxMsgNum;
> +        max = pg_atomic_read_u32(&segP->maxMsgNum);
>          while (nthistime-- > 0)
>          {
>              segP->buffer[max % MAXNUMMESSAGES] = *data++;
>              max++;
>          }
>
> -        /* Update current value of maxMsgNum using spinlock */
> -        SpinLockAcquire(&segP->msgnumLock);
> -        segP->maxMsgNum = max;
> -        SpinLockRelease(&segP->msgnumLock);
> +        /* Update current value of maxMsgNum using atomic with memory barrier */
> +        pg_write_barrier();
> +        pg_atomic_write_u32(&segP->maxMsgNum, max);
>
>          /*
>           * Now that the maxMsgNum change is globally visible, we give everyone
>         * a swift kick to make sure they read the newly added messages.
>         * Releasing SInvalWriteLock will enforce a full memory barrier, so
>         * these (unlocked) changes will be committed to memory before we exit
>         * the function.
>         */
>        for (i = 0; i < segP->numProcs; i++)
>        {
>            ProcState  *stateP = &segP->procState[segP->pgprocnos[i]];
>
>            stateP->hasMessages = true;
>        }

On a loosely ordered architecture, the hasMessages=false in SIGetDataEntries()
could be reordered with the read of maxMsgNum. Which, afaict, would lead to
missing messages. That's not prevented by the pg_write_barrier() in
SIInsertDataEntries().  I think there may be other similar dangers.
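
To illustrate (a sketch only, not a tested fix): the problematic pair in
SIGetDataEntries() is the store of hasMessages and the following load of
maxMsgNum, which would need a full barrier between them, e.g.:

    stateP->hasMessages = false;

    /*
     * Full barrier: the store above must not be reordered past the load
     * below, or we could clear a flag set by a concurrent
     * SIInsertDataEntries() for messages our stale maxMsgNum doesn't cover.
     */
    pg_memory_barrier();

    max = pg_atomic_read_u32(&segP->maxMsgNum);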

This could be solved by adding full memory barriers in a few places. But:

I'd also like to know a bit more about the motivation here - I can easily
believe that you hit contention around the shared inval queue, but I find it
somewhat hard to believe that a spinlock that's acquired *once* per batch
("quantum"), covering a single read/write, is going to be the bottleneck,
rather than the much longer held LWLock, that protects iterating over all
procs.

Have you verified that this actually addresses the performance issue?

Greetings,

Andres Freund



Hi, Andres

21.03.2025 19:33, Andres Freund wrote:
> Hi,
> 
> On 2025-03-21 14:35:16 +0300, Yura Sokolov wrote:
>> From 080c9e0de5e6e10751347e1ff50b65df424744cb Mon Sep 17 00:00:00 2001
>> From: Yura Sokolov <y.sokolov@postgrespro.ru>
>> Date: Mon, 3 Feb 2025 11:58:33 +0300
>> Subject: [PATCH v2] sinvaladt.c: use atomic operations on maxMsgNum
>>
>> The msgnumLock spinlock could be highly contended.
>> The comment states it was used as a memory barrier.
>> Let's use atomic ops with memory barriers directly instead.
>>
>> Note: the patch uses pg_read_barrier()/pg_write_barrier() instead of
>> pg_atomic_read_membarrier_u32()/pg_atomic_write_membarrier_u32() since
>> no full barrier semantics are required, and explicit read/write barriers
>> are cheaper, at least on x86_64.
> 
> Is it actually true that full barriers aren't required?  I think we might
> actually rely on a stronger ordering.
> 
> 
>> @@ -506,10 +493,9 @@ SIGetDataEntries(SharedInvalidationMessage *data, int datasize)
>>       */
>>      stateP->hasMessages = false;
>>
>> -    /* Fetch current value of maxMsgNum using spinlock */
>> -    SpinLockAcquire(&segP->msgnumLock);
>> -    max = segP->maxMsgNum;
>> -    SpinLockRelease(&segP->msgnumLock);
>> +    /* Fetch current value of maxMsgNum using atomic with memory barrier */
>> +    max = pg_atomic_read_u32(&segP->maxMsgNum);
>> +    pg_read_barrier();
>>
>>      if (stateP->resetState)
>>      {
>>         /*
>>          * Force reset.  We can say we have dealt with any messages added
>>          * since the reset, as well; and that means we should clear the
>>          * signaled flag, too.
>>          */
>>         stateP->nextMsgNum = max;
>>         stateP->resetState = false;
>>         stateP->signaled = false;
>>         LWLockRelease(SInvalReadLock);
>>         return -1;
>>     }
> 
> vs
> 
>> @@ -410,17 +398,16 @@ SIInsertDataEntries(const SharedInvalidationMessage *data, int n)
>>          /*
>>           * Insert new message(s) into proper slot of circular buffer
>>           */
>> -        max = segP->maxMsgNum;
>> +        max = pg_atomic_read_u32(&segP->maxMsgNum);
>>          while (nthistime-- > 0)
>>          {
>>              segP->buffer[max % MAXNUMMESSAGES] = *data++;
>>              max++;
>>          }
>>
>> -        /* Update current value of maxMsgNum using spinlock */
>> -        SpinLockAcquire(&segP->msgnumLock);
>> -        segP->maxMsgNum = max;
>> -        SpinLockRelease(&segP->msgnumLock);
>> +        /* Update current value of maxMsgNum using atomic with memory barrier */
>> +        pg_write_barrier();
>> +        pg_atomic_write_u32(&segP->maxMsgNum, max);
>>
>>          /*
>>           * Now that the maxMsgNum change is globally visible, we give everyone
>>          * a swift kick to make sure they read the newly added messages.
>>          * Releasing SInvalWriteLock will enforce a full memory barrier, so
>>          * these (unlocked) changes will be committed to memory before we exit
>>          * the function.
>>          */
>>         for (i = 0; i < segP->numProcs; i++)
>>         {
>>             ProcState  *stateP = &segP->procState[segP->pgprocnos[i]];
>>
>>             stateP->hasMessages = true;
>>         }
> 
> On a loosely ordered architecture, the hasMessages=false in SIGetDataEntries()
> could be reordered with the read of maxMsgNum. Which, afaict, would lead to
> missing messages. That's not prevented by the pg_write_barrier() in
> SIInsertDataEntries().  I think there may be other similar dangers.
> 
> This could be solved by adding full memory barriers in a few places.

Many thanks for the review and suggestion!

I agree that a pg_memory_barrier should be added before the read of
segP->maxMsgNum. I think changing stateP->hasMessages to an atomic variable
would be a better way, but it would change sizeof(ProcState).
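
(For illustration only, roughly what I mean by that alternative; this is not
what the attached patch does, and the surrounding code is omitted:)

    /* writer side (SIInsertDataEntries): full-barrier store of the flag */
    pg_atomic_write_membarrier_u32(&stateP->hasMessages, 1);

    /*
     * reader side (SIGetDataEntries): atomically test-and-clear the flag;
     * the read-modify-write has full barrier semantics, so it cannot be
     * reordered with the later read of maxMsgNum.
     */
    if (pg_atomic_exchange_u32(&stateP->hasMessages, 0) == 0)
        return 0;           /* nothing new for this backend */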

I don't see the need for a full barrier after the read of maxMsgNum, since the
other ProcState fields are protected by SInvalReadLock, aren't they? So I left
the read_barrier there.

I still avoid using read_membarrier since it is actually a write operation.
pg_memory_barrier is implemented as a write operation on x86_64 as well, but it
writes to a cell on the process's own stack, so it will not be contended.

And atomic_write_membarrier is used to write maxMsgNum just to simplify the
code. If a backport is considered, write_barriers before and after could be
used instead.

Fixed version is attached.
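
Putting it together, the intended ordering is roughly this (a sketch; the
attached v3 patch is authoritative):

    /* SIInsertDataEntries(), writer side */
    segP->buffer[max % MAXNUMMESSAGES] = *data++;   /* ... fill slots ... */
    /* full-barrier store: publish buffer[] contents before maxMsgNum */
    pg_atomic_write_membarrier_u32(&segP->maxMsgNum, max);
    stateP->hasMessages = true;                     /* for each backend: kick readers */

    /* SIGetDataEntries(), reader side */
    stateP->hasMessages = false;
    pg_memory_barrier();    /* don't let the load below pass the store above */
    max = pg_atomic_read_u32(&segP->maxMsgNum);
    pg_read_barrier();      /* don't let buffer[] reads pass the load above */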

> But:
> 
> I'd also like to know a bit more about the motivation here - I can easily
> believe that you hit contention around the shared inval queue, but I find it
> somewhat hard to believe that a spinlock that's acquired *once* per batch
> ("quantum"), covering a single read/write, is going to be the bottleneck,
> rather than the much longer held LWLock, that protects iterating over all
> procs.
> 
> Have you verified that this actually addresses the performance issue?

Problems with this spinlock were observed by at least two independent technical
support teams. First, a friendly vendor company shared the idea of removing it.
We don't know their exact situation, but I suppose it was quite similar to the
one our tech support investigated at a client of ours some months later:

(Quote from the tech support report:)
> Almost 20% of CPU time is spent in backtraces like:
    4b0d2d s_lock (/opt/pgpro/ent-15/bin/postgres)
    49c847 SIGetDataEntries
    49bf94 ReceiveSharedInvalidMessages
    4a14ba LockRelationOid
    1671f4 relation_open
    1de1cd table_open
    5e82aa RelationGetStatExtList
    402a01 get_relation_statistics (inlined)
    402a01 get_relation_info
    407a9e build_simple_rel
    3daa1d add_base_rels_to_query
    3daa1d add_base_rels_to_query
    3dd92b query_planner


The client has many NUMA nodes in a single machine, and the software actively
generates invalidation messages (probably due to active usage of temporary
tables).

Since the backtrace is quite clear and ends at s_lock, the patch should help.

-- 
regards
Yura Sokolov aka funny-falcon
Hi,

On 2025-03-24 13:41:17 +0300, Yura Sokolov wrote:
> 21.03.2025 19:33, Andres Freund wrote:
> > I'd also like to know a bit more about the motivation here - I can easily
> > believe that you hit contention around the shared inval queue, but I find it
> > somewhat hard to believe that a spinlock that's acquired *once* per batch
> > ("quantum"), covering a single read/write, is going to be the bottleneck,
> > rather than the much longer held LWLock, that protects iterating over all
> > procs.
> > 
> > Have you verified that this actually addresses the performance issue?
> 
> Problems with this spinlock were observed by at least two independent technical
> support teams. First, a friendly vendor company shared the idea of removing it.
> We don't know their exact situation, but I suppose it was quite similar to the
> one our tech support investigated at a client of ours some months later:
> 
> (Quote from the tech support report:)
> > Almost 20% of CPU time is spent in backtraces like:
>     4b0d2d s_lock (/opt/pgpro/ent-15/bin/postgres)
>     49c847 SIGetDataEntries
>     49bf94 ReceiveSharedInvalidMessages
>     4a14ba LockRelationOid
>     1671f4 relation_open
>     1de1cd table_open
>     5e82aa RelationGetStatExtList
>     402a01 get_relation_statistics (inlined)
>     402a01 get_relation_info
>     407a9e build_simple_rel
>     3daa1d add_base_rels_to_query
>     3daa1d add_base_rels_to_query
>     3dd92b query_planner
> 
> 
> The client has many NUMA nodes in a single machine, and the software actively
> generates invalidation messages (probably due to active usage of temporary
> tables).
> 
> Since the backtrace is quite clear and ends at s_lock, the patch should help.

I don't believe we have the whole story here. It just doesn't seem plausible
that, with the current code, the spinlock in SIGetDataEntries() would be the
bottleneck, rather than contention on the lwlock. The spinlock just covers a
*single read*.  Unless pgpro has modified the relevant code?

One possible explanation for why the spinlock shows up so badly is that it is
due to false sharing. Note that SISeg->msgnumLock and the start of
SISeg->buffer[] are on the same cache line.
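
(To illustrate, a quick way to test that hypothesis would be to pad the lock
onto its own cache line; sketch only, not a proposal:)

    /* hypothetical SISeg layout experiment: separate msgnumLock from buffer[] */
    slock_t     msgnumLock;
    char        msgnumLockPad[PG_CACHE_LINE_SIZE];  /* from pg_config_manual.h */

    SharedInvalidationMessage buffer[MAXNUMMESSAGES];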

How was this "Almost 20% of CPU time is spent in backtraces like" determined?

Greetings,

Andres Freund



Good day, Andres

24.03.2025 16:08, Andres Freund wrote:
> On 2025-03-24 13:41:17 +0300, Yura Sokolov wrote:
>> 21.03.2025 19:33, Andres Freund wrote:
>>> I'd also like to know a bit more about the motivation here - I can easily
>>> believe that you hit contention around the shared inval queue, but I find it
>>> somewhat hard to believe that a spinlock that's acquired *once* per batch
>>> ("quantum"), covering a single read/write, is going to be the bottleneck,
>>> rather than the much longer held LWLock, that protects iterating over all
>>> procs.
>>>
>>> Have you verified that this actually addresses the performance issue?
>>
>> Problems with this spinlock were observed by at least two independent technical
>> support teams. First, a friendly vendor company shared the idea of removing it.
>> We don't know their exact situation, but I suppose it was quite similar to the
>> one our tech support investigated at a client of ours some months later:
>>
>> (Quote from the tech support report:)
>>> Almost 20% of CPU time is spent in backtraces like:
>>     4b0d2d s_lock (/opt/pgpro/ent-15/bin/postgres)
>>     49c847 SIGetDataEntries
>>     49bf94 ReceiveSharedInvalidMessages
>>     4a14ba LockRelationOid
>>     1671f4 relation_open
>>     1de1cd table_open
>>     5e82aa RelationGetStatExtList
>>     402a01 get_relation_statistics (inlined)
>>     402a01 get_relation_info
>>     407a9e build_simple_rel
>>     3daa1d add_base_rels_to_query
>>     3daa1d add_base_rels_to_query
>>     3dd92b query_planner
>>
>>
>> The client has many NUMA nodes in a single machine, and the software actively
>> generates invalidation messages (probably due to active usage of temporary
>> tables).
>>
>> Since the backtrace is quite clear and ends at s_lock, the patch should help.
>
> I don't believe we have the whole story here. It just doesn't seem plausible
> that, with the current code, the spinlock in SIGetDataEntries() would be the
> bottleneck, rather than contention on the lwlock. The spinlock just covers a
> *single read*.  Unless pgpro has modified the relevant code?
>
> One possible explanation for why the spinlock shows up so badly is that it is
> due to false sharing. Note that SISeg->msgnumLock and the start of
> SISeg->buffer[] are on the same cache line.
>
> How was this "Almost 20% of CPU time is spent in backtraces like" determined?

Excuse me, I didn't attach the flamegraph collected by our tech support from
the client's server during the peak of the problem, so I attach it now.

If you open it in a browser and search for "SIGetDataEntries", you'll see it
consumes 18.4%. It is not a single large bar; instead there are dozens of calls
to SIGetDataEntries, and every one spends almost all its time in s_lock. If you
search for s_lock, it consumes 16.9%, and almost every call to s_lock comes
from SIGetDataEntries.

It looks like we call ReceiveSharedInvalidMessages (AcceptInvalidationMessages,
actually) too frequently during planning. And if there is a large stream of
invalidation messages, SIGetDataEntries has work to do very frequently.
Therefore many backends that are planning their queries at that moment start to
fight over msgnumLock.

If ReceiveSharedInvalidMessages (and SIGetDataEntries through it) were called
rarely, then your conclusion would be right: taking a spinlock around just the
read of one variable before processing a large batch of messages does not look
like a source of problems. But since it is called very frequently, and the
stream of messages is high, "there are always a few new messages".

As I've said, it is most probably due to the use of the famous 1C software,
which uses a lot of temporary tables, so it generates a high volume of
invalidation messages. We've patched pgpro postgres to NOT SEND most of the
invalidation messages generated by temporary tables, but it is difficult to
avoid sending all of them.


-- 
regards
Yura Sokolov aka funny-falcon
25.03.2025 13:52, Yura Sokolov wrote:
> Good day, Andres
> 
> 24.03.2025 16:08, Andres Freund wrote:
>> On 2025-03-24 13:41:17 +0300, Yura Sokolov wrote:
>>> 21.03.2025 19:33, Andres Freund wrote:
>>>> I'd also like to know a bit more about the motivation here - I can easily
>>>> believe that you hit contention around the shared inval queue, but I find it
>>>> somewhat hard to believe that a spinlock that's acquired *once* per batch
>>>> ("quantum"), covering a single read/write, is going to be the bottleneck,
>>>> rather than the much longer held LWLock, that protects iterating over all
>>>> procs.
>>>>
>>>> Have you verified that this actually addresses the performance issue?
>>>
>>> Problems with this spinlock were observed by at least two independent technical
>>> support teams. First, a friendly vendor company shared the idea of removing it.
>>> We don't know their exact situation, but I suppose it was quite similar to the
>>> one our tech support investigated at a client of ours some months later:
>>>
>>> (Quote from the tech support report:)
>>>> Almost 20% of CPU time is spent in backtraces like:
>>>     4b0d2d s_lock (/opt/pgpro/ent-15/bin/postgres)
>>>     49c847 SIGetDataEntries
>>>     49bf94 ReceiveSharedInvalidMessages
>>>     4a14ba LockRelationOid
>>>     1671f4 relation_open
>>>     1de1cd table_open
>>>     5e82aa RelationGetStatExtList
>>>     402a01 get_relation_statistics (inlined)
>>>     402a01 get_relation_info
>>>     407a9e build_simple_rel
>>>     3daa1d add_base_rels_to_query
>>>     3daa1d add_base_rels_to_query
>>>     3dd92b query_planner
>>>
>>>
>>> The client has many NUMA nodes in a single machine, and the software actively
>>> generates invalidation messages (probably due to active usage of temporary
>>> tables).
>>>
>>> Since the backtrace is quite clear and ends at s_lock, the patch should help.
>>
>> I don't believe we have the whole story here. It just doesn't seem plausible
>> that, with the current code, the spinlock in SIGetDataEntries() would be the
>> bottleneck, rather than contention on the lwlock. The spinlock just covers a
>> *single read*.  Unless pgpro has modified the relevant code?
>>
>> One possible explanation for why the spinlock shows up so badly is that it is
>> due to false sharing. Note that SISeg->msgnumLock and the start of
>> SISeg->buffer[] are on the same cache line.
>>
>> How was this "Almost 20% of CPU time is spent in backtraces like" determined?
> 
> Excuse me, I didn't attach the flamegraph collected by our tech support from
> the client's server during the peak of the problem, so I attach it now.
> 
> If you open it in a browser and search for "SIGetDataEntries", you'll see it
> consumes 18.4%. It is not a single large bar; instead there are dozens of calls
> to SIGetDataEntries, and every one spends almost all its time in s_lock. If you
> search for s_lock, it consumes 16.9%, and almost every call to s_lock comes
> from SIGetDataEntries.
> 
> It looks like we call ReceiveSharedInvalidMessages (AcceptInvalidationMessages,
> actually) too frequently during planning. And if there is a large stream of
> invalidation messages, SIGetDataEntries has work to do very frequently.
> Therefore many backends that are planning their queries at that moment start to
> fight over msgnumLock.
>
> If ReceiveSharedInvalidMessages (and SIGetDataEntries through it) were called
> rarely, then your conclusion would be right: taking a spinlock around just the
> read of one variable before processing a large batch of messages does not look
> like a source of problems. But since it is called very frequently, and the
> stream of messages is high, "there are always a few new messages".

And one more reason for contention: ReceiveSharedInvalidMessages calls
SIGetDataEntries in batches of MAXINVALMSGS (=32) messages. It calls it again
and again until the queue is empty, and there can be a lot of messages pushed
into the queue.

In fact, we had to increase MAXNUMMESSAGES up to 16k to avoid falling back to
InvalidateSystemCaches (which costs much more), so theoretically
SIGetDataEntries could be called up to 512 times per
ReceiveSharedInvalidMessages call.
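
To make the worst case concrete, the consumption pattern is roughly this (a
paraphrase for illustration, not the actual source, with invented local names;
with a 16k queue that is 16384 / 32 = 512 iterations, each taking msgnumLock
once in the unpatched code):

    SharedInvalidationMessage msgs[MAXINVALMSGS];   /* MAXINVALMSGS == 32 */
    int         n;

    do
    {
        /* each call took msgnumLock once before this patch */
        n = SIGetDataEntries(msgs, MAXINVALMSGS);
        /* ... apply the n messages, or reset all caches if n < 0 ... */
    } while (n == MAXINVALMSGS);    /* a full batch means there may be more */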

And it is not just theoretical, since a COMMIT of a transaction may send
thousands of invalidation messages at once.

Probably, increasing MAXINVALMSGS is a good idea as well.

-- 
regards
Yura Sokolov aka funny-falcon