Re: Spinlock performance improvement proposal - Mailing list pgsql-hackers

From Neil Padgett
Subject Re: Spinlock performance improvement proposal
Date
Msg-id 3BB22278.5F5F37DF@redhat.com
In response to Spinlock performance improvement proposal  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
Tom Lane wrote:
> 
> At the just-past OSDN database conference, Bruce and I were annoyed by
> some benchmark results showing that Postgres performed poorly on an
> 8-way SMP machine.  Based on past discussion, it seems likely that the
> culprit is the known inefficiency in our spinlock implementation.
> After chewing on it for awhile, we came up with an idea for a solution.
> 
> The following proposal should improve performance substantially when
> there is contention for a lock, but it creates no portability risks
> because it uses the same system facilities (TAS and SysV semaphores)
> that we have always relied on.  Also, I think it'd be fairly easy to
> implement --- I could probably get it done in a day.
> 
> Comments anyone?


We have been doing some scalability testing just recently here at Red
Hat. The machine I was using was a 4-way 550 MHz Xeon SMP machine; I
also ran it in uniprocessor mode to make some comparisons. All runs
were made on Red Hat Linux running 2.4.x series kernels. I've examined
a number of potentially interesting cases -- I'm still analyzing the
results, but some of the initial findings might be of interest:

- We benchmarked the following: TAS spinlocks (existing
implementation), SysV semaphores (existing implementation), and
Pthread mutexes. Pgbench runs were conducted for 1 to 512 simultaneous
backends. For these three cases we found:
  - TAS spinlocks fared the best of the three lock types; however,
    above 100 clients the Pthread mutexes were in lock step with them
    in performance. I expect this is because the cost of any system
    calls is negligible relative to the lock wait time.
  - The SysV semaphore implementation fared terribly, as expected.
    However, it does even worse relative to the TAS spinlocks on SMP
    than on uniprocessor.
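
For concreteness, here is a minimal C sketch of the three acquisition
styles compared above. This is illustrative only, not the actual
Postgres code: the GCC __sync builtins stand in for the per-platform
TAS assembly, and the spin count and sleep delay are made-up constants.

    #include <pthread.h>
    #include <sys/ipc.h>
    #include <sys/sem.h>
    #include <sys/types.h>
    #include <unistd.h>

    /* 1. TAS spinlock with sleep backoff -- roughly the shape of
     *    s_lock()/s_lock_sleep(): spin on an atomic test-and-set,
     *    and if the lock stays busy, sleep so we stop burning CPU
     *    the lock holder could be using. */
    typedef volatile int slock_t;

    static void
    tas_acquire(slock_t *lock)
    {
        int spins = 0;

        while (__sync_lock_test_and_set(lock, 1))
        {
            if (++spins >= 100)     /* spin limit is illustrative */
            {
                usleep(10000);      /* back off; delay is illustrative */
                spins = 0;
            }
        }
    }

    static void
    tas_release(slock_t *lock)
    {
        __sync_lock_release(lock);
    }

    /* 2. Pthread mutex -- the library chooses when to spin and when
     *    to block in the kernel, so under heavy contention it behaves
     *    much like the spin-then-sleep loop above. */
    static pthread_mutex_t mtx = PTHREAD_MUTEX_INITIALIZER;

    static void
    mutex_acquire(void)
    {
        pthread_mutex_lock(&mtx);
    }

    /* 3. SysV semaphore -- every acquire is a system call even when
     *    the lock is free. */
    static void
    sema_acquire(int semid)
    {
        struct sembuf op = { 0, -1, 0 };    /* decrement semaphore 0 */

        semop(semid, &op, 1);
    }

The relevant difference is the slow path: the first two only enter the
kernel when the lock is contended, while the SysV case pays a system
call on every acquire -- which lines up with the results above.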

- Since the above seemed to indicate that the lock implementation may
not be the problem (Pthread mutexes are supposed to be less bang-bang
than the Postgres TAS spinlocks, IIRC), I decided to profile Postgres.
After much trouble, I got results using oprofile, a kernel profiler
for Linux. Unfortunately, I can only profile on uniprocessor right
now, as oprofile doesn't support SMP boxes yet. (Soon, I hope.)

Initial results (the top of the profile -- if you would like the
complete profile, let me know):
Each sample counts as 1 samples.
  %   cumulative    self
 time    samples   samples   calls  T1/call  T1/call  name
26.57   42255.02  42255.02                            FindLockCycleRecurse
 5.55   51081.02   8826.00                            s_lock_sleep
 5.07   59145.03   8064.00                            heapgettup
 4.48   66274.03   7129.00                            hash_search
 4.48   73397.03   7123.00                            s_lock
 2.85   77926.03   4529.00                            HeapTupleSatisfiesSnapshot
 2.07   81217.04   3291.00                            SHMQueueNext
 1.85   84154.04   2937.00                            AllocSetAlloc
 1.84   87085.04   2931.00                            fmgr_isbuiltin
 1.64   89696.04   2611.00                            set_ps_display
 1.51   92101.04   2405.00                            FunctionCall2
 1.47   94442.04   2341.00                            XLogInsert
 1.39   96649.04   2207.00                            _bt_compare
 1.22   98597.04   1948.00                            SpinAcquire
 1.22  100544.04   1947.00                            LockBuffer
 1.21  102469.04   1925.00                            tag_hash
 1.01  104078.05   1609.00                            LockAcquire
.
.
.

(The samples are proportional to execution time.)

This would seem to point to the deadlock detector. (Which some have
fingered as a possible culprit before, IIRC.)

However, this seems to be a red herring. Removing the deadlock
detector had no effect: benchmarking showed no improvement in
transaction processing rate on either uniprocessor or SMP systems.
Instead, it seems that the deadlock detector simply amounts to
"something to do" for a blocked backend while it waits for lock
acquisition.

Profiling bears this out:

Flat profile:

Each sample counts as 1 samples.
  %   cumulative    self
 time    samples   samples   calls  T1/call  T1/call  name
12.38   14112.01  14112.01                            s_lock_sleep
10.18   25710.01  11598.01                            s_lock
 6.47   33079.01   7369.00                            hash_search
 5.88   39784.02   6705.00                            heapgettup
 5.32   45843.02   6059.00                            HeapTupleSatisfiesSnapshot
 2.62   48830.02   2987.00                            AllocSetAlloc
 2.48   51654.02   2824.00                            fmgr_isbuiltin
 1.89   53813.02   2159.00                            XLogInsert
 1.86   55938.02   2125.00                            _bt_compare
 1.72   57893.03   1955.00                            SpinAcquire
 1.61   59733.03   1840.00                            LockBuffer
 1.60   61560.03   1827.00                            FunctionCall2
 1.56   63339.03   1779.00                            tag_hash
 1.46   65007.03   1668.00                            set_ps_display
 1.20   66372.03   1365.00                            SearchCatCache
 1.14   67666.03   1294.00                            LockAcquire
.
.
.
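
To make the "something to do" point concrete, here is a compilable
sketch of the shape of that wait path (hypothetical stand-ins for the
real lock-manager logic, not the actual ProcSleep() code): the blocked
backend sleeps, and only if a timeout expires before the lock is
granted does it run the expensive deadlock search.

    #include <errno.h>
    #include <pthread.h>
    #include <stdbool.h>
    #include <time.h>

    /* Illustrative stand-ins for the lock manager's shared state.
     * In the real system another backend would set lock_is_granted
     * and signal the condition when it releases the lock. */
    static pthread_mutex_t lock_mgr = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  granted  = PTHREAD_COND_INITIALIZER;
    static bool lock_is_granted = false;

    /* Stub for the expensive wait-for-graph search (the
     * FindLockCycleRecurse time in the first profile above).
     * Returns true if a deadlock is found. */
    static bool
    DeadLockCheck(void)
    {
        return false;
    }

    /* Wait until the lock is granted; run the deadlock search only
     * if a timeout expires first.  All of the search's CPU time is
     * spent while this backend is blocked anyway, so removing it
     * cannot raise the transaction rate -- it only takes away the
     * backend's "something to do" while it waits. */
    static void
    wait_for_lock(int deadlock_timeout_secs)
    {
        bool checked = false;

        pthread_mutex_lock(&lock_mgr);
        while (!lock_is_granted)
        {
            struct timespec deadline;

            clock_gettime(CLOCK_REALTIME, &deadline);
            deadline.tv_sec += deadlock_timeout_secs;

            if (pthread_cond_timedwait(&granted, &lock_mgr,
                                       &deadline) == ETIMEDOUT
                && !checked)
            {
                checked = true;
                if (DeadLockCheck())
                    break;  /* real code would report a deadlock error */
            }
        }
        pthread_mutex_unlock(&lock_mgr);
    }

That is why the second profile shows the time flowing into
s_lock_sleep and s_lock instead: the waiting didn't go away, it just
moved.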

Our current suspicion is that the lock implementation isn't the only
problem (though there is certainly room for improvement), and perhaps
isn't even the main problem. For example, there has been some
suggestion that some component of the database may be causing heavy
lock contention. My opinion is that rather than guessing and taking
stabs in the dark, we need to take a more reasoned approach to these
things. IMHO, the next step should be to apply instrumentation (likely
via some neat macros) to all lock acquires / releases. Then it will be
possible to determine which components are the greatest consumers of
locks, and whether we have a component problem or a systemic one (i.e.
a particular component vs. simply the lock implementation itself).
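
As a sketch of what such instrumentation might look like (hypothetical
throughout -- tas_try() and lock_wait() below are made-up stand-ins
for the real acquire path, not actual Postgres functions): each
acquire site gets a static per-call-site counter keyed by file and
line, plus a report at backend exit.

    #include <stdio.h>

    /* Made-up stand-ins for the real fast and slow acquire paths. */
    static int
    tas_try(volatile int *lock)
    {
        return __sync_lock_test_and_set(lock, 1) == 0;  /* nonblocking */
    }

    static void
    lock_wait(volatile int *lock)
    {
        while (__sync_lock_test_and_set(lock, 1))
            ;                                   /* full blocking acquire */
    }

    /* Per-call-site statistics.  Backends are separate processes, so
     * these counters are per-backend; a real patch would dump them
     * somewhere central at exit. */
    typedef struct LockStats
    {
        const char       *file;
        int               line;
        long              acquires;
        long              contended;    /* acquires that had to wait */
        struct LockStats *next;
    } LockStats;

    static LockStats *lock_stats_head = NULL;

    #define INSTRUMENTED_ACQUIRE(lockptr)                                \
        do {                                                             \
            static LockStats stats = { __FILE__, __LINE__, 0, 0, NULL }; \
            if (stats.acquires++ == 0)                                   \
            {                                                            \
                stats.next = lock_stats_head;  /* register on 1st use */ \
                lock_stats_head = &stats;                                \
            }                                                            \
            if (!tas_try(lockptr))                                       \
            {                                                            \
                stats.contended++;                                       \
                lock_wait(lockptr);                                      \
            }                                                            \
        } while (0)

    /* Dump one line per call site at backend exit. */
    static void
    report_lock_stats(void)
    {
        LockStats *s;

        for (s = lock_stats_head; s != NULL; s = s->next)
            fprintf(stderr, "%s:%d acquires=%ld contended=%ld\n",
                    s->file, s->line, s->acquires, s->contended);
    }

Sorting that report by the contended count would show directly whether
the waiting is concentrated in one component (say, buffer locks) or
spread systemically across the lock implementation.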

Neil

-- 
Neil Padgett
Red Hat Canada Ltd.                       E-Mail:  npadgett@redhat.com
2323 Yonge Street, Suite #300, 
Toronto, ON  M4P 2C9

