Re: CLOG contention, part 2 - Mailing list pgsql-hackers
From: Simon Riggs
Subject: Re: CLOG contention, part 2
Msg-id: CA+U5nMLdT3ypF9orYAApBS-fW_mDA5zANniq8eDxVQZwQ-AOFA@mail.gmail.com
In response to: Re: CLOG contention, part 2 (Robert Haas <robertmhaas@gmail.com>)
Responses: Re: CLOG contention, part 2
List: pgsql-hackers
On Wed, Feb 8, 2012 at 11:26 PM, Robert Haas <robertmhaas@gmail.com> wrote:

> Given that, I obviously cannot test this at this point,

Patch with minor corrections attached here for further review.

> but let me go ahead and theorize about how well it's likely to work.
> What Tom suggested before (and after some reflection I think I believe
> it) is that the frequency of access will be highest for the newest CLOG
> page and then drop off for each page further back you go. Clearly, if
> that drop-off is fast - e.g. each buffer further backward is half as
> likely to be accessed as the next newer one - then the fraction of
> accesses that will hit pages that are far enough back to benefit from
> this optimization will be infinitesimal; 1023 out of every 1024
> accesses will hit the first ten pages, and on a high-velocity system
> those all figure to have been populated since the last checkpoint.

That's just making up numbers, so it's not much help. The "theory" would
apply to one workload but not another, so it may well be true for some
workloads, but I doubt whether all databases work that way. I can accept
the "long tail" distribution as being very common; we just don't know how
long that tail typically is, or even whether there is a dominant single
use case.

> The best case for this patch should be an access pattern that involves
> a very long tail;

Agreed.

> actually, pgbench is a pretty good fit for that

Completely disagree, as described in detail in the other patch about
creating a realistic test environment for this patch.

pgbench is *not* a real-world test. pgbench loads all the data in one go,
then pretends the data got there one transaction at a time. So pgbench
with no modifications is, in theory, just about the most unrealistic test
imaginable.

You have to run pgbench for 1 million transactions before you even
theoretically show any gain from this patch, and it would need to be a
long test indeed before the averaged effect of the patch was large enough
to outweigh the zero contribution from the first million transactions.

The only realistic way to test this patch is to pre-create the database
at a scale factor of >100 using the modified pgbench, then run the test.
That correctly simulates the real-world situation where all the data
arrived in individual transactions.

> assuming the scale factor is large enough. For example, at scale factor
> 100, we've got 10,000,000 tuples: choosing one at random, we're almost
> exactly 90% likely to find one that hasn't been chosen in the last
> 1,048,576 tuples (i.e. 32 CLOG pages @ 32K txns/page). In terms of
> reducing contention on the main CLOG SLRU, that sounds pretty
> promising, but depends somewhat on the rate at which transactions are
> processed relative to the frequency of checkpoints, since that will
> affect how many pages back you have to go to use the history path.
>
> However, there is a potential fly in the ointment: in other cases in
> which we've reduced contention at the LWLock layer, we've ended up with
> very nasty contention at the spinlock layer that can sometimes eat more
> CPU time than the LWLock contention did. In that light, it strikes me
> that it would be nice to be able to partition the contention N ways
> rather than just 2 ways. I think we could do that as follows. Instead
> of having one control lock per SLRU, have N locks, where N is probably
> a power of 2. Divide the buffer pool for the SLRU N ways, and decree
> that each slice of the buffer pool is controlled by one of the N locks.
> Route all requests for a page P to slice P mod N. Unlike this approach,
> that wouldn't completely eliminate contention at the LWLock level, but
> it would reduce it in proportion to the number of partitions, and it
> would reduce spinlock contention according to the number of partitions
> as well. A downside is that you'll need more buffers to get the same
> hit rate, but this proposal has the same problem: it doubles the amount
> of memory allocated for CLOG. Of course, this approach is all vaporware
> right now, so it's anybody's guess whether it would be better than this
> if we had code for it. I'm just throwing it out there.

We've already discussed that, and my patch for that has already been
ruled out by us for this CF.

A much better take is to list what options for scaling we have:

* separate out the history
* partition access to the most active parts

For me, any loss of performance comes from two areas:

(1) concurrent access to pages
(2) the clog LRU is dirty and delays reading in new pages

For the most active parts, (1) is significant. Using partitioning at the
page level will be ineffective in reducing contention because almost all
of the contention is on the first 1-2 pages. If we do partitioning, it
should be done by *striping* the most recent pages across many locks, as
I already suggested.
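To make the distinction concrete, here is a rough sketch of the two lock
mappings. This is hypothetical code only, not the actual slru.c
machinery: the names, the stripe size, and the pthread mutexes standing
in for LWLocks are all made up for illustration.

/*
 * Hypothetical sketch -- not PostgreSQL code.  Locks would be
 * initialized once at startup with pthread_mutex_init().
 */
#include <pthread.h>
#include <stdint.h>

#define CLOG_XACTS_PER_PAGE  32768   /* "32K txns/page", as above */
#define N_CLOG_LOCKS         8       /* number of locks, a power of 2 */
#define STRIPE_SIZE          1024    /* xids per stripe, much smaller than a page */

static pthread_mutex_t clog_lock[N_CLOG_LOCKS];

static void
init_clog_locks(void)
{
    for (int i = 0; i < N_CLOG_LOCKS; i++)
        pthread_mutex_init(&clog_lock[i], NULL);
}

/*
 * Page-level partitioning ("route page P to slice P mod N"): every xid
 * on a given page maps to the same lock, so the newest one or two hot
 * pages still funnel all their traffic through one or two locks.
 */
static pthread_mutex_t *
page_partition_lock(uint32_t xid)
{
    uint32_t page = xid / CLOG_XACTS_PER_PAGE;

    return &clog_lock[page % N_CLOG_LOCKS];
}

/*
 * Striping: consecutive small ranges of xids map to different locks,
 * so accesses within the most recent pages are spread across all N
 * locks rather than piling up on one.
 */
static pthread_mutex_t *
stripe_lock(uint32_t xid)
{
    uint32_t stripe = xid / STRIPE_SIZE;

    return &clog_lock[stripe % N_CLOG_LOCKS];
}

With the page-level mapping the hottest page gains nothing, whereas with
the striped mapping the same xids are spread across all N locks.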
Reducing the page size would reduce page contention but increase the
number of new-page events, and so make (2) more important. Increasing
the page size would amplify (1).

(2) is less significant but much more easily removed - and this is why
it is proposed in this release. Access to the history need not conflict
at all, so doing this is free.

I agree with you that we should further analyse CLOG contention in
following releases, but that is not an argument against making this
change now.

-- 
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services