why roll-your-own s_lock? / improving scalability - Mailing list pgsql-hackers

From Nils Goroll
Subject why roll-your-own s_lock? / improving scalability
Msg-id 4FE9EB27.9020502@schokola.de
List pgsql-hackers
Hi,

I am currently trying to understand what looks like really bad scalability of
9.1.3 on a 64-core, 512 GB RAM system: the system runs OK at 30% usr, but only
marginal amounts of additional load seem to push it to 70%, at which point the
application becomes highly unresponsive.

My current understanding basically matches the issues being addressed by various
9.2 improvements, well summarized in
http://wiki.postgresql.org/images/e/e8/FOSDEM2012-Multi-CPU-performance-in-9.2.pdf

An additional aspect is that, in order to address the latent risk of data loss &
corruption with write-back caches (WBCs) and async replication, we have
deliberately moved the db from a similar system with WB-cached storage to
SSD-based storage without a WBC. By design, the new storage has approx. 100x
higher latencies (compared with the best WBC case), but much higher sustained
throughput.


On the new system, even at an "acceptable" load of 30% user CPU, oprofile shows
significant lock contention:

opreport --symbols --merge tgid -l /mnt/db1/hdd/pgsql-9.1/bin/postgres


Profiling through timer interrupt
samples  %        image name               symbol name
30240    27.9720  postgres                 s_lock
5069      4.6888  postgres                 GetSnapshotData
3743      3.4623  postgres                 AllocSetAlloc
3167      2.9295  libc-2.12.so             strcoll_l
2662      2.4624  postgres                 SearchCatCache
2495      2.3079  postgres                 hash_search_with_hash_value
2143      1.9823  postgres                 nocachegetattr
1860      1.7205  postgres                 LWLockAcquire
1642      1.5189  postgres                 base_yyparse
1604      1.4837  libc-2.12.so             __strcmp_sse42
1543      1.4273  libc-2.12.so             __strlen_sse42
1156      1.0693  libc-2.12.so             memcpy

Unfortunately I don't have profiling data for the high-load / contention
condition yet, but I fear the picture will be worse and point in the same
direction.

<pure speculation>
In particular, the _impression_ is that lock contention could also be related to
I/O latencies, making me fear that cases could exist where spin locks are being
held while blocking on I/O.
</pure speculation>


Looking at the code, it appears to me that the roll-your-own s_lock code cannot
handle a couple of cases. For instance, it will also spin when the lock holder is
not running at all or is blocked on I/O (which could even be implicit, e.g. for a
page flush). These issues have long been addressed by adaptive mutexes and futexes.

Also, the s_lock code tries to be somewhat adaptive using spins_per_delay (after
having spun for long but not blocked, spin even longer in future), which
appears to me to have the potential of becoming highly counter-productive.
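
To make the concern concrete, the adaptation described above can be sketched
roughly as follows. This is a simplified illustration, not the actual s_lock
code: the names (sketch_lock, adapt_spins_per_delay), the bounds and the step
sizes are all assumptions chosen for the example.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <time.h>

/* Illustrative bounds and delay, assumed for this sketch only. */
#define MIN_SPINS_PER_DELAY 10
#define MAX_SPINS_PER_DELAY 1000
#define DELAY_NSEC          1000000L    /* sleep 1 ms between spin rounds */

typedef struct {
    atomic_flag flag;                   /* set while the lock is held */
} sketch_lock;

/* Per-process tuning knob, adjusted after every acquisition. */
static int spins_per_delay = 100;

/*
 * The adaptation rule in isolation: if the lock was obtained by spinning
 * alone, spin (much) longer next time; if we had to sleep, back off slightly.
 */
int adapt_spins_per_delay(int current, bool had_to_sleep)
{
    assert(current >= MIN_SPINS_PER_DELAY && current <= MAX_SPINS_PER_DELAY);
    if (!had_to_sleep) {
        int next = current + 100;
        return next > MAX_SPINS_PER_DELAY ? MAX_SPINS_PER_DELAY : next;
    }
    int next = current - 1;
    return next < MIN_SPINS_PER_DELAY ? MIN_SPINS_PER_DELAY : next;
}

void sketch_lock_init(sketch_lock *l)
{
    atomic_flag_clear(&l->flag);
}

/* Acquire the lock; returns how many times we had to sleep. */
int sketch_lock_acquire(sketch_lock *l)
{
    int spins = 0;
    int sleeps = 0;

    /*
     * Test-and-set loop. Note that the waiter has no idea whether the
     * holder is running, descheduled, or blocked on I/O: it spins anyway.
     */
    while (atomic_flag_test_and_set_explicit(&l->flag, memory_order_acquire)) {
        if (++spins >= spins_per_delay) {
            struct timespec ts = { 0, DELAY_NSEC };
            nanosleep(&ts, NULL);
            sleeps++;
            spins = 0;
        }
    }
    spins_per_delay = adapt_spins_per_delay(spins_per_delay, sleeps > 0);
    return sleeps;
}

void sketch_lock_release(sketch_lock *l)
{
    atomic_flag_clear_explicit(&l->flag, memory_order_release);
}
```

With these assumed step sizes, a single acquisition that succeeded by spinning
raises the budget far more than a single acquisition that had to sleep lowers
it, which is the asymmetry I am worried about.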


Now that the scene is set, here's the simple question: Why all this? Why not
simply use POSIX mutexes which, on modern platforms, will map to efficient
implementations like adaptive mutexes or futexes?
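
For reference, since the backends are separate processes, a POSIX mutex would
have to live in shared memory and be marked process-shared. A minimal sketch of
what I have in mind, assuming a platform that supports PTHREAD_PROCESS_SHARED
and anonymous shared mappings; the struct and function names are hypothetical
and this is not a proposal for the actual shared memory layout:

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical shared-memory layout: one mutex guarding one counter. */
typedef struct {
    pthread_mutex_t lock;
    long            counter;
} shared_area;

/* Map anonymous shared memory and initialise a process-shared mutex in it. */
static shared_area *shared_area_create(void)
{
    shared_area *area = mmap(NULL, sizeof(*area), PROT_READ | PROT_WRITE,
                             MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (area == MAP_FAILED)
        return NULL;

    pthread_mutexattr_t attr;
    int rc = pthread_mutexattr_init(&attr);
    if (rc == 0) {
        /* Without this attribute the mutex is only valid within one process. */
        rc = pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
        if (rc == 0)
            rc = pthread_mutex_init(&area->lock, &attr);
        pthread_mutexattr_destroy(&attr);
    }
    if (rc != 0) {
        munmap(area, sizeof(*area));
        return NULL;
    }
    area->counter = 0;
    return area;
}

/* Increment the shared counter n times, taking the mutex each time. */
static void add_locked(shared_area *area, long n)
{
    for (long i = 0; i < n; i++) {
        pthread_mutex_lock(&area->lock);    /* waiting is left to the library */
        area->counter++;
        pthread_mutex_unlock(&area->lock);
    }
}

/*
 * Parent and one forked child each add n to the counter.
 * Returns the final counter value, or -1 on any error.
 */
long run_two_process_demo(long n)
{
    assert(n >= 0);

    shared_area *area = shared_area_create();
    if (area == NULL)
        return -1;

    pid_t pid = fork();
    if (pid < 0) {
        munmap(area, sizeof(*area));
        return -1;
    }
    if (pid == 0) {                         /* child */
        add_locked(area, n);
        _exit(0);
    }

    add_locked(area, n);                    /* parent */

    int status = 0;
    long result = -1;
    if (waitpid(pid, &status, 0) == pid &&
        WIFEXITED(status) && WEXITSTATUS(status) == 0)
        result = area->counter;

    pthread_mutex_destroy(&area->lock);
    munmap(area, sizeof(*area));
    return result;
}
```

Calling run_two_process_demo(n) should yield 2 * n if the mutex works across
the fork. How the waiting is implemented (spinning, sleeping or a mix) is then
up to the platform's mutex implementation rather than to our own code.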

Thanks, Nils

