Re: futex results with dbt-3 - Mailing list pgsql-performance

From Manfred Spraul
Subject Re: futex results with dbt-3
Msg-id 4176A2C1.3070205@colorfullife.com
In response to Re: futex results with dbt-3  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: futex results with dbt-3
List pgsql-performance
Tom Lane wrote:

>Manfred Spraul <manfred@colorfullife.com> writes:
>
>
>>Tom Lane wrote:
>>
>>
>>>The bigger problem here is that the SMP locking bottlenecks we are
>>>currently seeing are *hardware* issues (AFAICT anyway).  The only way
>>>that futexes can offer a performance win is if they have a smarter way
>>>of executing the basic atomic-test-and-set sequence than we do;
>>>
>>>
>>>
>>lwlocks operations are not a basic atomic-test-and-set sequence. They
>>are spinlock, several nonatomic operations, spin_unlock.
>>
>>
>
>Right, and it is the spinlock that is the problem.  See discussions a
>few months back: at least on Intel SMP machines, most of the problem
>seems to have to do with trading the spinlock's cache line back and
>forth between CPUs.
>
I'd disagree: cache line bouncing is only one problem. If it happens,
there is only one solution: the number of changes to that cache line
must be reduced. The tools used for this in the Linux kernel are:
- hashing. An emergency approach if there is no other solution. I think
Red Hat used it for the buffer cache in RH AS: instead of one buffer
cache, there were lots of smaller buffer caches with individual locks.
The cache was chosen based on the file position (probably mixed with
some pointers to avoid overloading cache 0).
- For read-heavy loads: sequence locks. A reader reads a counter value
and then accesses the data structure. At the end it checks if the
counter was modified. If it's still the same value then it can continue,
otherwise it must retry. Writers acquire a normal spinlock and then
modify the counter value. RCU is the second option, but there are
patents - please be careful before using that tool.
- complete rewrites that avoid the global lock. I think the global
buffer cache is now gone and everything is handled per-file. I think
there is a global list for buffer replacement, but at the top of the
buffer replacement strategy is a simple clock algorithm. That means that
simple lookups/accesses just set a (local) referenced bit and don't have
to acquire a global lock. I know that this is the total opposite of ARC,
but perhaps it's the only scalable solution. ARC could be used as the
second-level strategy.
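
To make the sequence lock item above concrete, here is a minimal sketch
in C11. It is not taken from the Linux kernel or from PostgreSQL: the
names (seq_pair, seq_pair_read, seq_pair_write) and the two-field
payload are hypothetical, and the odd/even counter convention is an
assumption borrowed from how such locks are commonly built. Readers
never write to shared memory, so a read-mostly structure does not bounce
its cache line between CPUs.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical read-mostly structure protected by a sequence counter. */
typedef struct {
    atomic_bool      writer_locked; /* plain spinlock, serializes writers */
    atomic_uint      seq;           /* even: stable, odd: write in progress */
    _Atomic uint64_t a;             /* payload fields; atomics with relaxed */
    _Atomic uint64_t b;             /* ordering avoid formal data races     */
} seq_pair;

static void seq_pair_init(seq_pair *p)
{
    atomic_init(&p->writer_locked, false);
    atomic_init(&p->seq, 0u);
    atomic_init(&p->a, 0);
    atomic_init(&p->b, 0);
}

/* Writer: take the spinlock, bump the counter to odd, update, bump to even. */
static void seq_pair_write(seq_pair *p, uint64_t a, uint64_t b)
{
    while (atomic_exchange_explicit(&p->writer_locked, true,
                                    memory_order_acquire))
        ; /* spin until the previous writer is done */

    unsigned s = atomic_load_explicit(&p->seq, memory_order_relaxed);
    atomic_store_explicit(&p->seq, s + 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_release); /* odd seq before the data */
    atomic_store_explicit(&p->a, a, memory_order_relaxed);
    atomic_store_explicit(&p->b, b, memory_order_relaxed);
    atomic_store_explicit(&p->seq, s + 2, memory_order_release);

    atomic_store_explicit(&p->writer_locked, false, memory_order_release);
}

/* Reader: no stores to shared memory; retry if a writer interfered. */
static void seq_pair_read(seq_pair *p, uint64_t *a, uint64_t *b)
{
    unsigned s1, s2;
    do {
        s1 = atomic_load_explicit(&p->seq, memory_order_acquire);
        *a = atomic_load_explicit(&p->a, memory_order_relaxed);
        *b = atomic_load_explicit(&p->b, memory_order_relaxed);
        atomic_thread_fence(memory_order_acquire); /* data before seq recheck */
        s2 = atomic_load_explicit(&p->seq, memory_order_relaxed);
    } while ((s1 & 1u) || s1 != s2);
}
```

The trade-off is that a reader may have to repeat its work, so this
only pays off when writes are rare and the read section is short.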

But: according to the descriptions, the problem is a context switch
storm. I don't see how cache line bouncing can cause a context switch
storm. What causes it? If it's the pg_usleep in s_lock, then my patch
should help a lot: with pthread_rwlock locks, that call doesn't exist
anymore.

--
    Manfred
