Thread: Minor optimisation of XLogInsert()

Minor optimisation of XLogInsert()

From: Simon Riggs
Patch adds cacheline padding around parts of XLogCtl.

Idea from way back, but never got performance tested in a situation
where WALInsertLock was the bottleneck.

So this is speculative, in need of testing to investigate value.
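
For anyone who doesn't want to open the patch, the shape of the idea
is roughly this (a compilable sketch, not the actual XLogCtlData
layout; the 64-byte line size and the field names are placeholders):

#include <stddef.h>

#define CACHE_LINE_SIZE 64      /* assumed width; varies by platform */

/*
 * Illustrative only.  The point is to pad between fields protected by
 * different locks, so that a store into one field doesn't bounce the
 * cache line holding its neighbour between CPUs.
 */
typedef struct PaddedCtl
{
    long    insert_pos;         /* hot: updated under one lock */
    char    pad1[CACHE_LINE_SIZE - sizeof(long)];
    long    write_pos;          /* hot: updated under another lock */
    char    pad2[CACHE_LINE_SIZE - sizeof(long)];
} PaddedCtl;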

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Minor optimisation of XLogInsert()

From: Robert Haas
On Tue, Nov 15, 2011 at 6:24 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> Patch adds cacheline padding around parts of XLogCtl.
>
> Idea from way back, but never got performance tested in a situation
> where WALInsertLock was the bottleneck.
>
> So this is speculative, in need of testing to investigate value.

I kicked off a round of benchmarking around this.  I'll post results
in a few hours when the run finishes.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Minor optimisation of XLogInsert()

From: Robert Haas
On Wed, Nov 16, 2011 at 9:33 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Nov 15, 2011 at 6:24 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> Patch adds cacheline padding around parts of XLogCtl.
>>
>> Idea from way back, but never got performance tested in a situation
>> where WALInsertLock was the bottleneck.
>>
>> So this is speculative, in need of testing to investigate value.
>
> I kicked off a round of benchmarking around this.  I'll post results
> in a few hours when the run finishes.

I thought that the best way to see any possible benefit of this patch
would be to use permanent tables (since unlogged tables won't
generate enough XLogInsert traffic to throttle the system) and lots
of clients.  So I ran tests at 32 clients and at 80 clients.  There
appeared to be a very small speedup.

resultswp.master.32:tps = 10275.238580 (including connections establishing)
resultswp.master.32:tps = 10231.634195 (including connections establishing)
resultswp.master.32:tps = 10220.907852 (including connections establishing)
resultswp.master.32:tps = 10115.534399 (including connections establishing)
resultswp.master.32:tps = 10048.268430 (including connections establishing)
resultswp.xloginsert-padding.32:tps = 10310.046371 (including connections establishing)
resultswp.xloginsert-padding.32:tps = 10269.839056 (including connections establishing)
resultswp.xloginsert-padding.32:tps = 10259.268030 (including connections establishing)
resultswp.xloginsert-padding.32:tps = 10242.861923 (including connections establishing)
resultswp.xloginsert-padding.32:tps = 10167.947496 (including connections establishing)

Taking the median of those five results, the patch seems to have sped
things up by 0.3%.  At 80 clients:

resultswp.master.80:tps = 7540.388586 (including connections establishing)
resultswp.master.80:tps = 7502.803369 (including connections establishing)
resultswp.master.80:tps = 7469.338925 (including connections establishing)
resultswp.master.80:tps = 7505.377256 (including connections establishing)
resultswp.master.80:tps = 7509.402704 (including connections establishing)
resultswp.xloginsert-padding.80:tps = 7541.305973 (including connections establishing)
resultswp.xloginsert-padding.80:tps = 7568.942936 (including connections establishing)
resultswp.xloginsert-padding.80:tps = 7533.128618 (including connections establishing)
resultswp.xloginsert-padding.80:tps = 7558.258839 (including connections establishing)
resultswp.xloginsert-padding.80:tps = 7562.585635 (including connections establishing)

Again taking the median of the five results, that's about a 0.7% speedup.

I also tried this over top of my flexlock patch.  I figured that the
benefit ought to be greater there, because with ProcArrayLock
contention reduced, WALInsertLock contention should be the whole
banana.  But:

resultswp.flexlock.32:tps = 11628.279679 (including connections establishing)
resultswp.flexlock.32:tps = 11556.254524 (including connections establishing)
resultswp.flexlock.32:tps = 11606.840931 (including connections establishing)
resultswp.flexlock-xloginsert-padding.32:tps = 11357.904459 (including connections establishing)
resultswp.flexlock-xloginsert-padding.32:tps = 11226.538104 (including connections establishing)
resultswp.flexlock-xloginsert-padding.32:tps = 11187.932415 (including connections establishing)

That's about a 3.3% slowdown.

resultswp.flexlock.80:tps = 11560.508772 (including connections establishing)
resultswp.flexlock.80:tps = 11673.255674 (including connections establishing)
resultswp.flexlock.80:tps = 11597.229252 (including connections establishing)
resultswp.flexlock-xloginsert-padding.80:tps = 11092.549726 (including connections establishing)
resultswp.flexlock-xloginsert-padding.80:tps = 11179.927207 (including connections establishing)
resultswp.flexlock-xloginsert-padding.80:tps = 11112.771332 (including connections establishing)

That's even a little worse.

One possible explanation is that I don't believe what you've done
here is actually sufficient to guarantee alignment on a cache-line
boundary - ShmemAlloc() only guarantees MAXALIGN alignment unless the
allocation is at least BLCKSZ bytes.  So depending on exactly how much
shared memory gets allocated before XLogCtlData gets allocated, it's
possible that the change could actually end up splitting all of the
things you're trying to put on their own cache line across two cache
lines.  Might be worth fixing that and running this through its paces
again.

(I wonder if we shouldn't just align every shared memory allocation to
64 or 128 bytes.  It wouldn't waste much memory and it would make us
much more resistant to performance changes caused by minor
modifications to the shared memory layout.)
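
Something like this over-allocate-and-round-up helper is what I have
in mind (a standalone sketch, not proposed code; the 64-byte constant
is a guess, and in the server the rounding would have to happen in or
around ShmemAlloc() itself):

#include <stdint.h>

#define CACHE_LINE_SIZE 64      /* assumed */

/*
 * Round ptr up to the next cache-line boundary.  Allocate
 * size + CACHE_LINE_SIZE bytes, then apply this to the result: you
 * get at least size usable bytes starting at an aligned address.
 */
static void *
cacheline_align(void *ptr)
{
    uintptr_t   p = (uintptr_t) ptr;

    return (void *) ((p + CACHE_LINE_SIZE - 1) &
                     ~((uintptr_t) (CACHE_LINE_SIZE - 1)));
}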

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Minor optimisation of XLogInsert()

From: Simon Riggs
On Wed, Nov 16, 2011 at 8:42 PM, Robert Haas <robertmhaas@gmail.com> wrote:

> Taking the median of those five results, the patch seems to have sped
> things up by 0.3%.  At 80 clients:

Thanks for doing that. Even if we fix it as you suggest, the results
seem to indicate that the WALInsertLock contention is real rather than
false contention, which gives us some clues as to how to proceed.

> One possible explanation is that I don't believe what you've done
> here is actually sufficient to guarantee alignment on a cache-line
> boundary - ShmemAlloc() only guarantees MAXALIGN alignment unless the
> allocation is at least BLCKSZ bytes.  So depending on exactly how much
> shared memory gets allocated before XLogCtlData gets allocated, it's
> possible that the change could actually end up splitting all of the
> things you're trying to put on their own cache line across two cache
> lines.  Might be worth fixing that and running this through its paces
> again.
>
> (I wonder if we shouldn't just align every shared memory allocation to
> 64 or 128 bytes.  It wouldn't waste much memory and it would make us
> much more resistant to performance changes caused by minor
> modifications to the shared memory layout.)

I'll look at that, thanks for the suggestion.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Minor optimisation of XLogInsert()

From: Robert Haas
On Wed, Nov 16, 2011 at 5:15 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On Wed, Nov 16, 2011 at 8:42 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> Taking the median of those five results, the patch seems to have sped
>> things up by 0.3%.  At 80 clients:
>
> Thanks for doing that. Even if we fix it as you suggest it seems to
> indicate that the WALInsertLock contention is real rather than false
> contention. Which lends some clues as to how to proceed.

Sure thing.

My general impression from playing around with this over the last 6-7
months is that cache line sharing is just not that big a problem for
us.  In a case like WALInsertLock, I'm fairly certain that we're
actually putting processes to sleep on a regular basis - and the
overhead of a system call to go to sleep and another one to wake up
and the associated context switches dwarfs the cost of passing the
cache line around.  It's far more important to get rid of the sleeping
(which, sadly, is far harder than padding out the data structures).

There are some cases where the cache line really does seem to matter -
e.g. Pavan's PGPROC_MINIMAL patch delivers excellent results on
Itanium - but those seem to be fairly rare, FWICT.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Minor optimisation of XLogInsert()

From: Tom Lane
Robert Haas <robertmhaas@gmail.com> writes:
> (I wonder if we shouldn't just align every shared memory allocation to
> 64 or 128 bytes.  It wouldn't waste much memory and it would make us
> much more resistant to performance changes caused by minor
> modifications to the shared memory layout.)

I could get behind this idea if we had a reasonably clear idea of the
hardware's cache line width, but AFAIK there is no portable way to
identify that.  (This is a pretty fatal objection to Simon's original
patch as well...)
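
The closest thing I know of is platform-specific probing, e.g.
glibc's sysconf(), which does nothing for us on the other systems we
support:

#include <stdio.h>
#include <unistd.h>

int
main(void)
{
    /* glibc-specific; returns 0 or -1 where the value is unknown */
    long        line = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);

    printf("L1 dcache line size: %ld\n", line);
    return 0;
}
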
        regards, tom lane


Re: Minor optimisation of XLogInsert()

From: Robert Haas
On Wed, Nov 16, 2011 at 11:11 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> (I wonder if we shouldn't just align every shared memory allocation to
>> 64 or 128 bytes.  It wouldn't waste much memory and it would make us
>> much more resistant to performance changes caused by minor
>> modifications to the shared memory layout.)
>
> I could get behind this idea if we had a reasonably clear idea of the
> hardware's cache line width, but AFAIK there is no portable way to
> identify that.  (This is a pretty fatal objection to Simon's original
> patch as well...)

I don't think it's very important to know the exact size.  As far as I
can tell, x64 is 64 bytes and Itanium and Power are 128 bytes.  If you
optimize for those, you'll also handle any smaller size (that's a
power of two, without which a lot of things we do are wrongheaded)
without wasting much memory.  If you run into hardware with a giant
256-byte or larger cache line, you might have some sharing, but you
can't win 'em all.
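
The power-of-two property is what makes that safe: an address aligned
to 128 bytes is automatically aligned to 64 and 32 as well.  Trivially
(illustrative, not proposed code):

#include <assert.h>
#include <stdint.h>

/* round x up to the next multiple of align (a power of two) */
#define ALIGN_UP(x, align) \
    (((x) + (align) - 1) & ~((uintptr_t) (align) - 1))

int
main(void)
{
    uintptr_t   p = ALIGN_UP((uintptr_t) 1000, 128);

    assert(p % 128 == 0);
    assert(p % 64 == 0);        /* free: 128 is a multiple of 64 */
    assert(p % 32 == 0);        /* likewise */
    return 0;
}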

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company