Re: Spinlock performance improvement proposal - Mailing list pgsql-hackers
From: Neil Padgett
Subject: Re: Spinlock performance improvement proposal
Date:
Msg-id: 3BB22278.5F5F37DF@redhat.com
In response to: Spinlock performance improvement proposal (Tom Lane <tgl@sss.pgh.pa.us>)
Responses: Re: Spinlock performance improvement proposal
           Re: Spinlock performance improvement proposal
List: pgsql-hackers
Tom Lane wrote:
>
> At the just-past OSDN database conference, Bruce and I were annoyed by
> some benchmark results showing that Postgres performed poorly on an
> 8-way SMP machine. Based on past discussion, it seems likely that the
> culprit is the known inefficiency in our spinlock implementation.
> After chewing on it for awhile, we came up with an idea for a solution.
>
> The following proposal should improve performance substantially when
> there is contention for a lock, but it creates no portability risks
> because it uses the same system facilities (TAS and SysV semaphores)
> that we have always relied on. Also, I think it'd be fairly easy to
> implement --- I could probably get it done in a day.
>
> Comments anyone?

We have been doing some scalability testing just recently here at Red Hat.
The machine I was using was a 4-way 550 MHz Xeon SMP machine; I also ran it
in uniprocessor mode to make some comparisons. All runs were made on Red
Hat Linux running 2.4.x series kernels. I've examined a number of
potentially interesting cases -- I'm still analyzing the results, but some
of the initial results might be interesting.

We benchmarked three lock implementations: TAS spinlocks (existing
implementation), SysV semaphores (existing implementation), and Pthread
mutexes. Pgbench runs were conducted for 1 to 512 simultaneous backends.
For these three cases we found:

- TAS spinlocks fared the best of the three lock types; however, above 100
  clients the Pthread mutexes were in lock step with them in performance.
  I expect this is due to the cost of any system calls being negligible
  relative to lock wait time.

- The SysV semaphore implementation fared terribly, as expected. Moreover,
  it is worse relative to the TAS spinlocks on SMP than on uniprocessor.
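(For concreteness, a pthread-mutex-based stand-in for the spinlock acquire
and release primitives looks roughly like the sketch below. This is an
illustration only, not the code used in these tests: the names are
hypothetical, and it assumes a process-shared mutex placed in the shared
memory segment, since the backends are separate processes.)

/*
 * Sketch of a pthread-mutex replacement for the spinlock primitives.
 * Hypothetical names (shm_lock_t, ShmLockInit, ...); not actual
 * Postgres code.  The mutex must be marked process-shared and must
 * live in shared memory so that all backends can use it.
 */
#include <pthread.h>

typedef struct
{
    pthread_mutex_t mutex;      /* must reside in the shared memory segment */
} shm_lock_t;

/* One-time initialization, e.g. by the postmaster at shmem setup. */
int
ShmLockInit(shm_lock_t *lock)
{
    pthread_mutexattr_t attr;
    int rc;

    pthread_mutexattr_init(&attr);
    /* allow the mutex to be used across processes, not just threads */
    rc = pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
    if (rc == 0)
        rc = pthread_mutex_init(&lock->mutex, &attr);
    pthread_mutexattr_destroy(&attr);
    return rc;
}

/* Stand-ins for the spinlock acquire/release operations. */
void
ShmLockAcquire(shm_lock_t *lock)
{
    /* blocks in the kernel rather than spinning and sleeping */
    pthread_mutex_lock(&lock->mutex);
}

void
ShmLockRelease(shm_lock_t *lock)
{
    pthread_mutex_unlock(&lock->mutex);
}

The interesting behavioural difference is under contention:
pthread_mutex_lock() blocks in the kernel until the holder releases,
whereas the TAS path spins and then sleeps, which is presumably where the
s_lock and s_lock_sleep time in the profiles below is going.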
Since the above seemed to indicate that the lock implementation may not be
the problem (Pthread mutexes are supposed to be implemented to be less
bang-bang than the Postgres TAS spinlocks, IIRC), I decided to profile
Postgres. After much trouble, I got results using oprofile, a kernel
profiler for Linux. Unfortunately, I can only profile for uniprocessor
right now, as oprofile doesn't support SMP boxes yet. (Soon, I hope.)

Initial results (the top of the profile -- if you would like a complete
profile, let me know):

Each sample counts as 1 samples.
  %   cumulative      self            self     total
 time    samples    samples   calls  T1/call  T1/call  name
26.57   42255.02  42255.02   FindLockCycleRecurse
 5.55   51081.02   8826.00   s_lock_sleep
 5.07   59145.03   8064.00   heapgettup
 4.48   66274.03   7129.00   hash_search
 4.48   73397.03   7123.00   s_lock
 2.85   77926.03   4529.00   HeapTupleSatisfiesSnapshot
 2.07   81217.04   3291.00   SHMQueueNext
 1.85   84154.04   2937.00   AllocSetAlloc
 1.84   87085.04   2931.00   fmgr_isbuiltin
 1.64   89696.04   2611.00   set_ps_display
 1.51   92101.04   2405.00   FunctionCall2
 1.47   94442.04   2341.00   XLogInsert
 1.39   96649.04   2207.00   _bt_compare
 1.22   98597.04   1948.00   SpinAcquire
 1.22  100544.04   1947.00   LockBuffer
 1.21  102469.04   1925.00   tag_hash
 1.01  104078.05   1609.00   LockAcquire
  .
  .
  .

(The samples are proportional to execution time.)

This would seem to point to the deadlock detector. (Which some have
fingered as a possible culprit before, IIRC.) However, this seems to be a
red herring. Removing the deadlock detector had no effect: benchmarking
showed that removing it yielded no improvement in transaction processing
rate on uniprocessor or SMP systems. Instead, it seems that the deadlock
detector simply amounts to "something to do" for a blocked backend while
it waits for lock acquisition.

Profiling bears this out:

Flat profile:

Each sample counts as 1 samples.
  %   cumulative      self            self     total
 time    samples    samples   calls  T1/call  T1/call  name
12.38   14112.01  14112.01   s_lock_sleep
10.18   25710.01  11598.01   s_lock
 6.47   33079.01   7369.00   hash_search
 5.88   39784.02   6705.00   heapgettup
 5.32   45843.02   6059.00   HeapTupleSatisfiesSnapshot
 2.62   48830.02   2987.00   AllocSetAlloc
 2.48   51654.02   2824.00   fmgr_isbuiltin
 1.89   53813.02   2159.00   XLogInsert
 1.86   55938.02   2125.00   _bt_compare
 1.72   57893.03   1955.00   SpinAcquire
 1.61   59733.03   1840.00   LockBuffer
 1.60   61560.03   1827.00   FunctionCall2
 1.56   63339.03   1779.00   tag_hash
 1.46   65007.03   1668.00   set_ps_display
 1.20   66372.03   1365.00   SearchCatCache
 1.14   67666.03   1294.00   LockAcquire
  .
  .
  .

Our current suspicion is that the lock implementation isn't the only
problem (though there is certainly room for improvement), and perhaps
isn't even the main problem. For example, there has been some suggestion
that some component of the database may be causing heavy lock contention.
My opinion is that rather than guessing and taking stabs in the dark, we
need to take a more reasoned approach to these things. IMHO, the next step
should be to apply instrumentation (likely via some neat macros) to all
lock acquires / releases. Then it will be possible to determine which
components are the greatest consumers of locks, and whether it is a
component problem or a systemic problem (i.e., some component vs. simply
the lock implementation).
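To make the macro idea concrete, here is a rough sketch of the sort of
thing I mean. It is purely illustrative: the names are hypothetical, this
is not a patch, and a real version would hook the actual acquire/release
routines rather than this generic wrapper.

/*
 * Sketch of per-call-site lock accounting.  Each backend keeps a small
 * table keyed by __FILE__/__LINE__ of the acquiring call site, so that
 * the biggest lock consumers can be identified afterwards.
 */
#include <stdio.h>
#include <string.h>

#define MAX_LOCK_SITES 256

typedef struct
{
    const char *file;           /* call site performing the acquire */
    int         line;
    unsigned long count;        /* acquisitions from this site */
} LockSiteCount;

static LockSiteCount lock_counts[MAX_LOCK_SITES];
static int  n_lock_sites = 0;

static void
count_lock_site(const char *file, int line)
{
    int         i;

    for (i = 0; i < n_lock_sites; i++)
    {
        if (lock_counts[i].line == line &&
            strcmp(lock_counts[i].file, file) == 0)
        {
            lock_counts[i].count++;
            return;
        }
    }
    if (n_lock_sites < MAX_LOCK_SITES)
    {
        lock_counts[n_lock_sites].file = file;
        lock_counts[n_lock_sites].line = line;
        lock_counts[n_lock_sites].count = 1;
        n_lock_sites++;
    }
}

/*
 * Wrap whatever the real acquire call is; __FILE__/__LINE__ record
 * which component is doing the acquiring.  For example, an existing
 * acquire call X would become LOCK_ACQUIRE_COUNTED(X).
 */
#define LOCK_ACQUIRE_COUNTED(acquire_call) \
    do { \
        count_lock_site(__FILE__, __LINE__); \
        acquire_call; \
    } while (0)

/* Dump the per-site counters, e.g. when the backend exits. */
static void
report_lock_counts(void)
{
    int         i;

    for (i = 0; i < n_lock_sites; i++)
        fprintf(stderr, "%s:%d acquired %lu locks\n",
                lock_counts[i].file, lock_counts[i].line,
                lock_counts[i].count);
}

The counters above are per-process, so a real version would also need some
way to dump or aggregate them across backends at exit.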
Neil

--
Neil Padgett
Red Hat Canada Ltd.
E-Mail: npadgett@redhat.com
2323 Yonge Street, Suite #300, Toronto, ON M4P 2C9