Re: CLOG contention, part 2 - Mailing list pgsql-hackers

From: Simon Riggs
Subject: Re: CLOG contention, part 2
Msg-id: CA+U5nMLdT3ypF9orYAApBS-fW_mDA5zANniq8eDxVQZwQ-AOFA@mail.gmail.com
In response to: Re: CLOG contention, part 2 (Robert Haas <robertmhaas@gmail.com>)
List: pgsql-hackers
On Wed, Feb 8, 2012 at 11:26 PM, Robert Haas <robertmhaas@gmail.com> wrote:

> Given that, I obviously cannot test this at this point,

Patch with minor corrections attached here for further review.

> but let me go
> ahead and theorize about how well it's likely to work.  What Tom
> suggested before (and after some reflection I think I believe it) is
> that the frequency of access will be highest for the newest CLOG page
> and then drop off for each page further back you go.  Clearly, if that
> drop-off is fast - e.g. each buffer further backward is half as likely
> to be accessed as the next newer one - then the fraction of accesses
> that will hit pages that are far enough back to benefit from this
> optimization will be infinitesimal; 1023 out of every 1024 accesses
> will hit the first ten pages, and on a high-velocity system those all
> figure to have been populated since the last checkpoint.

That's just making up numbers, so it's not much help. The "theory"
would apply to one workload but not another, so it may well be true for
some workloads, but I doubt whether all databases work that way. I do
accept the "long tail" distribution as being very common; we just don't
know how long that tail is "typically", or even whether there is a
single dominant use case.

> The best
> case for this patch should be an access pattern that involves a very
> long tail;

Agreed


> actually, pgbench is a pretty good fit for that

Completely disagree, as described in detail in the other patch about
creating a realistic test environment for this patch.

pgbench is *not* a real world test.

pgbench loads all the data in one go, then pretends the data got there
one transaction at a time. So pgbench with no modifications is actually
about the most unrealistic test imaginable here. You have to run
pgbench for a million transactions before you even theoretically show
any gain from this patch, and it would need to be a long test indeed
before the averaged effect of the patch was large enough to outweigh
the zero contribution from the first million transactions.
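
To put rough numbers on that claim, using the figures from Robert's
mail (32 CLOG buffer pages at 32K transaction statuses per page):

    32 pages * 32,768 xids/page = 1,048,576 xids

so the "recent" buffers cover roughly the first million transactions of
the run, and until pgbench itself has generated that many, virtually no
status lookup falls through to the history buffers this patch separates
out.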

The only real-world way to test this patch is to pre-create the
database at a scale factor of >100 with the modified pgbench, then run
the test. That correctly simulates the real-world situation where all
the data arrived in single transactions.


> assuming
> the scale factor is large enough.  For example, at scale factor 100,
> we've got 10,000,000 tuples: choosing one at random, we're almost
> exactly 90% likely to find one that hasn't been chosen in the last
> 1,048,576 tuples (i.e. 32 CLOG pages @ 32K txns/page).  In terms of
> reducing contention on the main CLOG SLRU, that sounds pretty
> promising, but depends somewhat on the rate at which transactions are
> processed relative to the frequency of checkpoints, since that will
> affect how many pages back you have to go to use the history path.

> However, there is a potential fly in the ointment: in other cases in
> which we've reduced contention at the LWLock layer, we've ended up
> with very nasty contention at the spinlock layer that can sometimes
> eat more CPU time than the LWLock contention did.   In that light, it
> strikes me that it would be nice to be able to partition the
> contention N ways rather than just 2 ways.  I think we could do that
> as follows.  Instead of having one control lock per SLRU, have N
> locks, where N is probably a power of 2.  Divide the buffer pool for
> the SLRU N ways, and decree that each slice of the buffer pool is
> controlled by one of the N locks.  Route all requests for a page P to
> slice P mod N.  Unlike this approach, that wouldn't completely
> eliminate contention at the LWLock level, but it would reduce it
> proportional to the number of partitions, and it would reduce spinlock
> contention according to the number of partitions as well.  A down side
> is that you'll need more buffers to get the same hit rate, but this
> proposal has the same problem: it doubles the amount of memory
> allocated for CLOG.  Of course, this approach is all vaporware right
> now, so it's anybody's guess whether it would be better than this if
> we had code for it.  I'm just throwing it out there.

We've already discussed that, and my patch for it has already been
ruled out by us for this CF.

A much better take is to list what options for scaling we have:
* separate out the history
* partition access to the most active parts

For me, any loss of performance comes from two areas:
(1) concurrent access to pages
(2) clog LRU is dirty and delays reading in new pages

For the most active parts, (1) is significant. Partitioning at the
page level will be ineffective at reducing contention because almost
all of the contention is on the first 1-2 pages. If we do partitioning,
it should be done by *striping* the most recent pages across many
locks, as I already suggested. Reducing the page size would reduce page
contention but increase the number of new-page events, and so make (2)
more important. Increasing the page size would amplify (1).
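
As a rough sketch of the difference (hypothetical names and constants,
not code from either patch): page-level partitioning derives the lock
from the page number, so a workload concentrated on the newest one or
two pages still funnels through one or two locks, whereas striping
derives the lock from the low-order bits of the xid, spreading accesses
to a single hot page across all partitions.

    /* Hypothetical sketch only; not taken from the attached patch. */
    #include <stdint.h>

    #define CLOG_XACTS_PER_PAGE 32768   /* 8 kB page, 2 status bits per xact */
    #define NUM_CLOG_PARTITIONS 8       /* assumed power of two */

    /* Page-level partitioning: every xid on a page maps to the same lock,
     * so the hot newest pages still contend on only 1-2 locks. */
    static int
    clog_lock_by_page(uint32_t xid)
    {
        uint32_t pageno = xid / CLOG_XACTS_PER_PAGE;
        return pageno % NUM_CLOG_PARTITIONS;
    }

    /* Striping: adjacent xids map to different locks, so accesses to a
     * single hot page are spread across all partitions. */
    static int
    clog_lock_by_stripe(uint32_t xid)
    {
        return xid % NUM_CLOG_PARTITIONS;
    }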

(2) is less significant but much more easily removed, which is why it
is proposed for this release. Access to the history need not conflict
at all, so doing this is free.
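
To make the "free" part concrete, the idea is simply that a status
lookup for an xid older than the recent window never needs the main
CLOG control lock at all. A sketch with made-up names (the attached
patch may arrange this differently):

    /* Illustrative sketch only; names are hypothetical. */
    #include <stdint.h>

    typedef uint32_t TransactionId;

    /* Oldest xid still covered by the active CLOG buffers; maintained
     * elsewhere (real code must also handle xid wraparound). */
    static TransactionId clog_history_boundary;

    extern int clog_read_recent(TransactionId xid);   /* main CLOG lock */
    extern int clog_read_history(TransactionId xid);  /* separate history lock */

    static int
    clog_get_xid_status(TransactionId xid)
    {
        if (xid < clog_history_boundary)
            /* History pages are never written again, so these reads never
             * wait behind the dirty-LRU writes that delay the hot buffers. */
            return clog_read_history(xid);

        return clog_read_recent(xid);
    }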

I agree with you that we should analyse CLOG contention further in
future releases, but that is not an argument against making this change
now.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

