Thread: Striping CLOG and Subtrans

Striping CLOG and Subtrans

From
Simon Riggs
Date:
Transaction Ids are assigned consecutively. As a result, access to the
bitmaps for CLOG and Subtrans tends to concentrate on the just-recently
allocated end of those data structures, which are updated on successful
transaction creation/completion (high level view).

Earlier we found that CPU false sharing reduces performance and
scalability. Any cache line that must be shared across multiple
backends causes additional memory accesses and cache line bouncing.

Currently, the layout of CLOG and Subtrans is also consecutive, i.e. an
XId and XId+1 are adjacent in memory.

Within each cache line for CLOG we will have about 256 transactions,
assuming the smallest cache line of any modern CPU, 64 bytes. If all
transactions are of roughly the same duration, then in the worst case
nearly all of those transactions require nearly simultaneous access. We
can imagine many situations where we are far from the worst case, but
there will be times when contention for those cache lines is high. The
actual situation seems likely to be multiple overlaid Normal
distributions, but in any case there will be "hot spots". These hot
spots could be responsible for extended locking times, which then cause
contention for the LWLocks protecting CLOG and Subtrans. That contention
affects everybody.
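
For concreteness, the 256-per-line figure follows from CLOG's two status
bits per transaction; a minimal sketch of the arithmetic, using the
constants from clog.c plus an assumed 64-byte cache line:

/* Constants as in access/transam/clog.c */
#define CLOG_BITS_PER_XACT   2    /* two status bits per transaction */
#define CLOG_XACTS_PER_BYTE  4    /* 8 bits / 2 bits per xact */

/* Assumption for this discussion only, not a PostgreSQL constant */
#define CACHE_LINE_SIZE      64   /* smallest line size on modern CPUs */

/* 64 bytes * 4 xacts/byte = 256 transaction status entries per line */
#define CLOG_XACTS_PER_CACHE_LINE  (CACHE_LINE_SIZE * CLOG_XACTS_PER_BYTE)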

I propose that we lay the XIds out differently within CLOG and
Subtrans, so that XId and XId+1 are at least one cache line apart. In
this way, access to a run of adjacent XId numbers would be spread across
all of the cache lines in a CLOG page (128 lines for an 8 KB page with
64-byte lines), rather than packed into a single line.

The layout implied by TransactionIdToPage() would remain: we would still
assign XIds to consecutive pages, but make the on-page layout
significantly different, while retaining the same size and other
characteristics.

Thus, it would seem that this could be almost a 3 line change:

clog.c:     TransactionIdToByte(), TransactionIdToBIndex()
subtrans.c: TransactionIdToEntry()

These would be changed so that we stripe the XIds across the page,
rather than assuming they are sequential. The exact striping mechanism
is just some simple modulo arithmetic based upon a multiple of
ALIGNOF_BUFFER.
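
To make that concrete, here is a rough sketch of what the striped
address arithmetic might look like in clog.c. TransactionIdToPage() and
TransactionIdToPgIndex() are unchanged; only the within-page mapping
changes. The stripe size and the transpose-style modulo scheme shown
here are illustrative assumptions, not a settled design:

/* Existing clog.c definitions (unchanged) */
#define CLOG_BITS_PER_XACT    2
#define CLOG_XACTS_PER_BYTE   4
#define CLOG_XACTS_PER_PAGE   (BLCKSZ * CLOG_XACTS_PER_BYTE)
#define TransactionIdToPage(xid)    ((xid) / (TransactionId) CLOG_XACTS_PER_PAGE)
#define TransactionIdToPgIndex(xid) ((xid) % (TransactionId) CLOG_XACTS_PER_PAGE)

/* Hypothetical striping parameters: a multiple of ALIGNOF_BUFFER */
#define CLOG_STRIPE_SIZE      64                                /* bytes */
#define CLOG_STRIPES_PER_PAGE (BLCKSZ / CLOG_STRIPE_SIZE)       /* 128 with 8 KB pages */
#define CLOG_XACTS_PER_STRIPE (CLOG_STRIPE_SIZE * CLOG_XACTS_PER_BYTE)  /* 256 */

/*
 * Striped within-page entry number: consecutive XIds rotate across the
 * page's stripes, so xid and xid+1 land CLOG_STRIPE_SIZE bytes apart.
 * This is a bijection on [0, CLOG_XACTS_PER_PAGE), effectively a
 * transpose of the usual row-major layout.
 */
#define TransactionIdToStripedEntry(xid) \
	((TransactionIdToPgIndex(xid) % CLOG_STRIPES_PER_PAGE) * CLOG_XACTS_PER_STRIPE + \
	 TransactionIdToPgIndex(xid) / CLOG_STRIPES_PER_PAGE)

/* The macros that would actually change */
#define TransactionIdToByte(xid)   (TransactionIdToStripedEntry(xid) / CLOG_XACTS_PER_BYTE)
#define TransactionIdToBIndex(xid) (TransactionIdToStripedEntry(xid) % CLOG_XACTS_PER_BYTE)

subtrans.c's TransactionIdToEntry() could be striped the same way, with
the stripe width expressed in 4-byte TransactionId entries rather than
2-bit status entries.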

The Subtrans code makes no assumption about the way in which XIds are
located, but there is an assumption within the CLOG code: at startup
time, all XIds on the current page higher than the current XId are
zeroed by writing zeroes to the higher bytes. I propose that we still
perform the logical operation of zeroing later XIds, but in a slightly
different physical form. The way we do that might affect startup time,
so I have avoided specifying a particular stripe algorithm in case we
need to change it to suit the CLOG startup code.
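
As an illustration of that wrinkle (a sketch only, not the actual
StartupCLOG code): with a striped layout the logically-later XIds on the
current page are no longer one contiguous run of bytes, so the single
MemSet over the tail of the page would become a loop through the same
striped addressing macros:

/*
 * Sketch: clear the status of every XId logically later than "xid" on
 * the same CLOG page, going through the (striped) addressing macros
 * rather than a contiguous MemSet.  CLOG_XACT_BITMASK is
 * ((1 << CLOG_BITS_PER_XACT) - 1), as in clog.c; "page" is the buffer
 * holding xid's CLOG page.
 */
static void
ClogZeroLaterXidsOnPage(char *page, TransactionId xid)
{
	TransactionId next = xid + 1;

	while (TransactionIdToPage(next) == TransactionIdToPage(xid))
	{
		int		byteno = TransactionIdToByte(next);
		int		bshift = TransactionIdToBIndex(next) * CLOG_BITS_PER_XACT;

		page[byteno] &= ~(CLOG_XACT_BITMASK << bshift);
		next++;
	}
}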

These changes have almost no negative impact on run-time performance
and can be implemented with minimal change. We can discuss whether the
false sharing phenomenon actually occurs, but the bottom line ISTM is
that if we can avoid it ever occurring, almost for free, then why not?
If it does occur then we know it can destroy SMP performance and on
some CPUs can lead to excessive context switching.

Comments?

Best Regards, Simon Riggs


Re: Striping CLOG and Subtrans

From
Tom Lane
Date:
Simon Riggs <simon@2ndquadrant.com> writes:
> These changes have almost no negative impact on run-time performance
> and can be implemented with minimal change. We can discuss whether the
> false sharing phenomenon actually occurs, but the bottom line ISTM is
> that if we can avoid it ever occurring, almost for free, then why not?

No, you've put the burden of proof in the wrong place.  You are
proposing a significant logical complication in the code for a
completely hypothetical improvement --- there is *no* evidence on
the table that cache contention within clog pages is even measurable.
Show us some experimental numbers first.
        regards, tom lane


Re: Striping CLOG and Subtrans

From
Simon Riggs
Date:
On Sat, 2005-12-03 at 11:49 -0500, Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> > These changes have almost no negative impact on run-time performance
> > and can be implemented with minimal change. We can discuss whether the
> > false sharing phenomenon actually occurs, but the bottom line ISTM is
> > that if we can avoid it ever occurring, almost for free, then why not?
> 
> No, you've put the burden of proof in the wrong place.  You are
> proposing a significant logical complication in the code for a
> completely hypothetical improvement --- there is *no* evidence on
> the table that cache contention within clog pages is even measurable.
> Show us some experimental numbers first.

In a way, I agree with you on the burden of proof. 

Code-wise: I'm not sure this represents a significant logical
complication. There would be no more code than there is now, and the
changes would be isolated to about three places in two files.

There is no evidence either way, is all I would add, though we do have
strong indications that it is likely. It's going to be hard to come up
with a smoking gun; we'll have to rethink our performance testing regime
to include some larger-scale testing with instrumentation.

Shelved until measurements indicate a requirement.

Best Regards, Simon Riggs