Re: spinlocks on powerpc - Mailing list pgsql-hackers
From | Robert Haas
---|---
Subject | Re: spinlocks on powerpc
Date |
Msg-id | CA+TgmoaO-L9rABu9y43=4zrNTtbyVCzmkUcZBc1hiP-Esc08NQ@mail.gmail.com
In response to | Re: spinlocks on powerpc (Tom Lane <tgl@sss.pgh.pa.us>)
Responses | Re: spinlocks on powerpc; Re: spinlocks on powerpc
List | pgsql-hackers
On Mon, Jan 2, 2012 at 12:03 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> (It's depressing that these numbers have hardly moved since August ---
> at least on this test, the work that Robert's done has not made any
> difference.)

Most of the scalability work that's been committed since August has really been about ProcArrayLock, which does have an impact on read scalability, but is a much more serious problem on write workloads. On read-only workloads, you get spinlock contention, because everyone who wants a snapshot has to take the LWLock mutex to increment the shared lock count and again (just a moment later) to decrement it. But on write workloads, transactions need to take ProcArrayLock in exclusive mode to commit, so you have the additional problem of snapshot-taking forcing committers to wait and (probably to a lesser degree) vice versa. Most of the benefit we've gotten so far has come from shortening the time for which ProcArrayLock is held in shared mode while taking snapshots, which is going to primarily benefit write workloads. I'm a bit surprised that you haven't seen any benefit at all on read workloads - I would have expected a small but measurable gain - but I'm not totally shocked if there isn't one.

The architecture may play into it, too. Most of the testing that I have done has been on AMD64 or Itanium, and those have significantly different performance characteristics. The Itanium machine I've used for testing is faster in absolute terms than the AMD64 box, but it also seems to suffer more severely in the presence of spinlock contention. This means that, on current sources, ProcArrayLock is a bigger problem on Itanium than it is on AMD64. I don't have a PPC64 box to play with ATM, so I can't speculate on what the situation is there. It's occurred to me to wonder whether the Itanium vs. AMD64 effects are specific to those architectures or are general characteristics of strong vs. weak memory-ordering architectures, but I don't really have enough data to know.

I'm concerned by this whole issue of spinlocks, since the previous round of testing on Itanium pretty much proves that getting the spinlock implementation wrong is a death sentence. If PPC64 is going to require specific tweaks for every subarchitecture, that's going to be a colossal nuisance, but probably a necessary one if we don't want to suck there. For Itanium, I was able to find some fairly official-looking documentation that said "this is how you should do it". It would be nice to find something similar for PPC64, instead of testing every machine and reinventing the wheel ourselves. I wonder whether the gcc folks have done any meaningful thinking about this in their builtin atomics; if so, that might be an argument for using that as more than just a fallback. If not, it's a pretty good argument against it, at least IMHO.

All that having been said...

> That last is clearly a winner for reasonable numbers of processes,
> so I committed it that way, but I'm a little worried by the fact that it
> looks like it might be a marginal loss when the system is overloaded.
> I would like to see results from your machine.

I'm unconvinced by these numbers. There is a measurable change, but it is pretty small; the Itanium changes resulted in an enormous gain at higher concurrency levels.
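To make the builtin-atomics idea above concrete, here is a minimal sketch of a TAS spinlock built on gcc's __atomic builtins (gcc 4.7 or later). The function names here are invented for illustration and this is not PostgreSQL's actual s_lock.h code; it just shows how we could let the compiler pick the per-architecture instructions and barriers:

```c
/*
 * Hypothetical sketch only: a test-and-set spinlock on top of gcc's
 * __atomic builtins.  All function names are invented; this is not
 * the real s_lock.h implementation.
 */
#include <stdbool.h>

typedef volatile unsigned char slock_t;

/* Returns true if the lock was already held, i.e. the TAS failed. */
static inline bool
tas_builtin(slock_t *lock)
{
	/* __ATOMIC_ACQUIRE supplies the barrier a lock acquisition needs. */
	return __atomic_test_and_set(lock, __ATOMIC_ACQUIRE);
}

static inline void
s_unlock_builtin(slock_t *lock)
{
	/* __ATOMIC_RELEASE makes prior stores visible before the lock drops. */
	__atomic_clear(lock, __ATOMIC_RELEASE);
}

static inline void
s_lock_builtin(slock_t *lock)
{
	while (tas_builtin(lock))
	{
		/*
		 * Test-and-test-and-set: spin on a plain load until the lock
		 * looks free, rather than hammering the bus with atomic ops.
		 */
		while (*lock)
			;
	}
}
```

Of course, whether gcc actually emits well-tuned lwarx/stwcx. sequences on every PPC subarchitecture is precisely the open question.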
I've seen several cases where improving one part of the code actually makes performance worse, because of things like: once lock A is less contended, lock B becomes more contended, and for some reason the effect on lock B is greater than the effect on lock A. It was precisely this sort of effect that led to the sinval optimizations committed as b4fbe392f8ff6ff1a66b488eb7197eef9e1770a4; the lock manager optimizations improved things with moderate numbers of processes but were much worse at high numbers of processes, precisely because the lock manager (which is partitioned) wasn't there to throttle the beating on SInvalReadLock (which isn't).

I'd be inclined to say we should optimize for architectures where either or both of these techniques make the sort of big splash Manabu Ori is seeing on his machine, and assume that the much smaller changes you're seeing on your machines are as likely to be artifacts as real effects. When and if enough evidence emerges to say otherwise, we can decide whether to rethink.
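For concreteness, the partitioned-versus-unpartitioned distinction amounts to something like the following sketch. All the names are invented, and the real lock manager uses LWLocks and its own hashing, so treat this as the general shape of the technique rather than our code:

```c
/*
 * Hypothetical illustration of lock partitioning, the property that
 * lets the lock manager spread out contention that a single lock
 * such as SInvalReadLock absorbs alone.  All names are invented.
 */
#include <pthread.h>
#include <stdint.h>

#define NUM_PARTITIONS 16		/* must be a power of two */

static pthread_mutex_t partition_locks[NUM_PARTITIONS];

/* One-time setup of the partition mutexes. */
void
init_partition_locks(void)
{
	for (int i = 0; i < NUM_PARTITIONS; i++)
		pthread_mutex_init(&partition_locks[i], NULL);
}

/* Map a key's hash to one of the partitions. */
static inline pthread_mutex_t *
partition_for(uint32_t hash)
{
	return &partition_locks[hash & (NUM_PARTITIONS - 1)];
}

/*
 * Callers working on unrelated keys land on different mutexes, so N
 * backends pounding on the structure spread across up to
 * NUM_PARTITIONS locks instead of serializing on a single one.
 */
void
with_partition_lock(uint32_t hash, void (*fn)(void *), void *arg)
{
	pthread_mutex_t *m = partition_for(hash);

	pthread_mutex_lock(m);
	fn(arg);
	pthread_mutex_unlock(m);
}
```

With sixteen partitions, sixteen backends touching different keys can proceed in parallel; with a single unpartitioned lock, they all line up behind one another.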
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company