Re: spinlocks on powerpc - Mailing list pgsql-hackers

From: Robert Haas
Subject: Re: spinlocks on powerpc
Date:
Msg-id: CA+TgmoaO-L9rABu9y43=4zrNTtbyVCzmkUcZBc1hiP-Esc08NQ@mail.gmail.com
In response to: Re: spinlocks on powerpc (Tom Lane <tgl@sss.pgh.pa.us>)
Responses: Re: spinlocks on powerpc; Re: spinlocks on powerpc
List: pgsql-hackers
On Mon, Jan 2, 2012 at 12:03 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> (It's depressing that these numbers have hardly moved since August ---
> at least on this test, the work that Robert's done has not made any
> difference.)

Most of the scalability work that's been committed since August has
really been about ProcArrayLock, which does have an impact on read
scalability, but is a much more serious problem on write workloads.
On read-only workloads, you get spinlock contention, because everyone
who wants a snapshot has to take the LWLock mutex to increment the
shared lock count and again (just a moment later) to decrement it.
But on write workloads, transactions must take ProcArrayLock in
exclusive mode to commit, so you have the additional problem of
snapshot-taking forcing committers to wait and (probably to a lesser
degree) vice versa.  Most of the benefit we've gotten so far has come
from shortening the time for which ProcArrayLock is held in shared
mode while taking snapshots, which is going to primarily benefit write
workloads.  I'm a bit surprised that you haven't seen any benefit at
all on read workloads - I would have expected a small but measurable
gain - but I'm not totally shocked if there isn't one.
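
To make that contention pattern concrete, here's a minimal sketch
(purely illustrative names, none of lwlock.c's real queueing or sleep
logic, and gcc __sync builtins standing in for our spinlock macros):

#include <sched.h>

/* Illustrative sketch only -- not the actual lwlock.c structures. */
typedef struct
{
    volatile int mutex;         /* spinlock guarding the counts below */
    int exclusive;              /* 1 if held exclusively, else 0 */
    int shared;                 /* number of shared holders */
} LWLockSketch;

static void
spin_acquire(volatile int *m)
{
    /* atomically store 1 and get the old value, with acquire semantics */
    while (__sync_lock_test_and_set(m, 1))
        sched_yield();          /* real code spins and backs off instead */
}

static void
spin_release(volatile int *m)
{
    __sync_lock_release(m);     /* store 0 with release semantics */
}

static void
lwlock_acquire_shared(LWLockSketch *lock)
{
    for (;;)
    {
        spin_acquire(&lock->mutex);     /* every snapshot-taker hits this */
        if (!lock->exclusive)
        {
            lock->shared++;
            spin_release(&lock->mutex);
            return;
        }
        spin_release(&lock->mutex);
        /* real code enqueues and sleeps here; we just retry */
    }
}

static void
lwlock_release_shared(LWLockSketch *lock)
{
    spin_acquire(&lock->mutex);         /* ...and hits it again here */
    lock->shared--;
    spin_release(&lock->mutex);
}

Even though the heavyweight lock is only taken in shared mode, every
acquire and release serializes on the same mutex word, which is where
the read-only contention comes from.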

The architecture may play into it, too.  Most of the testing that I
have done has been on AMD64 or Itanium, and those have significantly
different performance characteristics.  The Itanium machine I've used
for testing is faster in absolute terms than the AMD64 box, but it
seems to also suffer more severely in the presence of spinlock
contention.  This means that, on current sources, ProcArrayLock is a
bigger problem on Itanium than it is on AMD64.  I don't have a PPC64
box to play with ATM, so I can't speculate on what the situation is
there.  It's occurred to me to wonder whether the Itanium vs. AMD64
effects are specific to those architectures or general characteristics
of strong memory ordering architectures vs. weak memory architectures,
but I don't really have enough data to know.  I'm concerned by this
whole issue of spinlocks, since the previous round of testing on
Itanium pretty much proves that getting the spinlock implementation
wrong is a death sentence.  If PPC64 is going to require specific
tweaks for every subarchitecture, that's going to be a colossal
nuisance, but probably a necessary one if we don't want to suck there.

For Itanium, I was able to find some fairly official-looking
documentation that said "this is how you should do it".  It would be
nice to find something similar for PPC64, instead of testing every
machine and reinventing the wheel ourselves.  I wonder whether the gcc
folks have done any meaningful thinking about this in their builtin
atomics; if so, that might be an argument for using that as more than
just a fallback.  If not, it's a pretty good argument against it, at
least IMHO.
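
For concreteness, the subarchitecture-sensitive code at stake is the
lwarx/stwcx. test-and-set loop.  This is only a sketch modeled on
what s_lock.h does, not the committed code; the tweaks under
discussion are things like the lock-hint form of the load-reserve
("lwarx %0,0,%3,1") and which barrier to use on the acquire side:

typedef unsigned int slock_t;

static __inline__ int
tas(volatile slock_t *lock)
{
    slock_t _t;
    int _res;

    __asm__ __volatile__(
"   lwarx   %0,0,%3     \n"     /* load-reserve the lock word */
"   cmpwi   %0,0        \n"     /* already held? */
"   bne     1f          \n"     /* yes: fail */
"   addi    %0,%0,1     \n"     /* no: try to store 1 */
"   stwcx.  %0,0,%3     \n"
"   beq     2f          \n"     /* store-conditional succeeded */
"1: li      %1,1        \n"     /* failure: return nonzero */
"   b       3f          \n"
"2: isync               \n"     /* acquire barrier on success */
"   li      %1,0        \n"     /* success: return 0 */
"3:                     \n"
:   "=&r"(_t), "=r"(_res), "+m"(*lock)
:   "r"(lock)
:   "memory", "cc");
    return _res;
}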

All that having been said...

> That last is clearly a winner for reasonable numbers of processes,
> so I committed it that way, but I'm a little worried by the fact that it
> looks like it might be a marginal loss when the system is overloaded.
> I would like to see results from your machine.

I'm unconvinced by these numbers.  There is a measurable change but it
is pretty small.  The Itanium changes resulted in an enormous gain at
higher concurrency levels.  I've seen several cases where improving
one part of the code actually makes performance worse, because of
things like: once lock A is less contended, lock B becomes more
contended, and for some reason the effect on lock B is greater than
the effect on lock A.  It was precisely this sort of effect that led
to the sinval optimizations committed as
b4fbe392f8ff6ff1a66b488eb7197eef9e1770a4; the lock manager
optimizations improved things with moderate numbers of processes but
were much worse at high numbers of processes precisely because the
lock manager (which is partitioned) wasn't there to throttle the
beating on SInvalReadLock (which isn't).  I'd be inclined to say we
should optimize for architectures where either or both of these
techniques make the sort of big splash Manabu Ori is seeing on his
machine, and assume that the much smaller changes you're seeing on
your machines are as likely to be artifacts as real effects.  When and
if enough evidence emerges to say otherwise, we can decide whether to
rethink.
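
For what it's worth, the shape of those sinval optimizations was to
give readers a lock-free fast path, so that backends with nothing to
read don't touch SInvalReadLock at all.  Very roughly (illustrative
names and a pthread rwlock standing in for the real lock -- not the
actual sinvaladt.c code):

#include <pthread.h>
#include <stdatomic.h>

static pthread_rwlock_t si_read_lock = PTHREAD_RWLOCK_INITIALIZER;
static atomic_int si_max_msg_num;   /* next message number to assign */

typedef struct
{
    int next_msg_num;               /* how far this backend has read */
} BackendState;

static int
get_pending_messages(BackendState *me)
{
    /* Fast path: a caught-up backend exits without taking the lock. */
    if (atomic_load(&si_max_msg_num) == me->next_msg_num)
        return 0;

    /* Slow path: only backends with messages pending pay for the lock. */
    pthread_rwlock_rdlock(&si_read_lock);
    /* ... copy out messages and advance me->next_msg_num ... */
    pthread_rwlock_unlock(&si_read_lock);
    return 1;
}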

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

