Re: Speed up Clog Access by increasing CLOG buffers - Mailing list pgsql-hackers

From: Tomas Vondra
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date:
Msg-id: 4a52a34f-57fa-7bcf-d34c-c15db40f0361@2ndquadrant.com
In response to: Re: Speed up Clog Access by increasing CLOG buffers  (Amit Kapila <amit.kapila16@gmail.com>)
Responses: Re: Speed up Clog Access by increasing CLOG buffers
List: pgsql-hackers
On 10/25/2016 06:10 AM, Amit Kapila wrote:
> On Mon, Oct 24, 2016 at 2:48 PM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
>> On Fri, Oct 21, 2016 at 7:57 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
>>> On Thu, Oct 20, 2016 at 9:03 PM, Tomas Vondra
>>> <tomas.vondra@2ndquadrant.com> wrote:
>>>
>>>> In the results you've posted on 10/12, you've mentioned a regression with 32
>>>> clients, where you got 52k tps on master but only 48k tps with the patch (so
>>>> ~10% difference). I have no idea what scale was used for those tests,
>>>
>>> That test was with scale factor 300 on a 4-socket POWER machine. I think
>>> I need to repeat this test with multiple readings to confirm whether it
>>> was a regression or just run-to-run variation. I will do that soon and
>>> post the results.
>>
>> As promised, I have rerun my test (3 times), and I did not see any regression.
>>
>
> Thanks Tomas and Dilip for doing detailed performance tests for this
> patch.  I would like to summarise the performance testing results.
>
> 1. With an update-intensive workload, we are seeing gains of 23% to ~192%
> at client counts >= 64 with the group_update patch [1].
> 2. With the tpc-b pgbench workload (at 1000 scale factor), we are seeing
> gains from 12% to ~70% at client counts >= 64 [2].  Tests were done on an
> 8-socket Intel m/c.
> 3. With the pgbench workload (both simple-update and tpc-b at 300 scale
> factor), we are seeing gains of 10% to >50% at client counts >= 64 [3].
> Tests were done on an 8-socket Intel m/c.
> 4. To see why the patch only helps at higher client counts, we have
> done wait event testing for various workloads [4], [5], and the results
> indicate that at lower client counts the waits are mostly on
> transactionid or ClientRead.  At client counts where contention on
> CLOGControlLock is significant, this patch helps a lot to reduce that
> contention.  These tests were done on an 8-socket Intel m/c and a
> 4-socket POWER m/c.
> 5. With the pgbench workload (unlogged tables), we are seeing gains from
> 15% to >300% at client counts >= 72 [6].
>

It's not entirely clear which of the above tests were done on unlogged 
tables, and I don't see that in the referenced e-mails. That would be an 
interesting thing to mention in the summary, I think.
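
As an aside, for anyone who wants to reproduce the wait event numbers 
mentioned in item 4: they presumably come from repeatedly sampling 
pg_stat_activity (the wait_event_type / wait_event columns added in 9.6). 
A crude libpq sampler along the following lines is enough - this is just 
a sketch with an arbitrary connection string, interval and output format, 
not whatever script was actually used.

/*
 * waitsample.c -- crude pg_stat_activity wait event sampler (sketch only).
 *
 * Build: cc waitsample.c -o waitsample \
 *           -I$(pg_config --includedir) -L$(pg_config --libdir) -lpq
 */
#include <stdio.h>
#include <unistd.h>
#include <libpq-fe.h>

int
main(void)
{
    PGconn *conn = PQconnectdb("dbname=postgres");   /* adjust as needed */

    if (PQstatus(conn) != CONNECTION_OK)
    {
        fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
        return 1;
    }

    for (;;)
    {
        PGresult *res = PQexec(conn,
                               "SELECT wait_event_type, wait_event, count(*) "
                               "FROM pg_stat_activity "
                               "WHERE wait_event IS NOT NULL "
                               "GROUP BY 1, 2 ORDER BY 3 DESC");

        if (PQresultStatus(res) == PGRES_TUPLES_OK)
        {
            for (int i = 0; i < PQntuples(res); i++)
                printf("%-12s %-22s %s\n",
                       PQgetvalue(res, i, 0),   /* wait_event_type */
                       PQgetvalue(res, i, 1),   /* wait_event */
                       PQgetvalue(res, i, 2));  /* number of backends */
            printf("----\n");
        }
        PQclear(res);
        sleep(1);                /* one sample per second */
    }

    PQfinish(conn);              /* not reached */
    return 0;
}

Running something like that alongside pgbench shows the dominant waits 
moving from ClientRead / transactionid at low client counts towards 
CLOGControlLock (and WALWriteLock) as the client count grows, which is 
essentially what the summaries above describe.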

> There are many more tests done for the proposed patches where the gains
> are either along similar lines as above or are neutral.  We do see
> regressions in some cases.
>
> 1. When the data doesn't fit in shared buffers, there is a regression at
> some client counts [7], but on analysis it has been found that it is
> mainly due to the shift in contention from CLOGControlLock to
> WALWriteLock and/or other locks.

The question is why shifting the lock contention to WALWriteLock should 
cause such a significant performance drop, particularly when the test was 
done on unlogged tables. Or, if that is indeed the cause, how that makes 
the performance drop any less problematic / acceptable.

FWIW I plan to run the same test with logged tables - if it shows a 
similar regression, I'll be much more worried, because that's a fairly 
typical scenario (logged tables, data set > shared buffers), and we 
surely can't just go and break that.

> 2. We do see in some cases that the granular_locking and no_content_lock
> patches have shown a significant increase in contention on
> CLOGControlLock.  I have already shared my analysis of the same upthread
> [8].

I do agree that in some cases this significantly reduces contention on 
CLogControlLock. I do however think that currently the performance gains 
are limited almost exclusively to unlogged tables, plus some logged+async 
cases.

On logged tables it usually looks like this (i.e. a modest increase for 
high client counts at the expense of significantly higher variability):
  http://tvondra.bitbucket.org/#pgbench-3000-logged-sync-skip-64

or like this (i.e. only a partial recovery from the drop above 36 clients):
  http://tvondra.bitbucket.org/#pgbench-3000-logged-async-skip-64

And of course, there are cases like this:
  http://tvondra.bitbucket.org/#dilip-300-logged-async

I'd really like to understand why the patched results behave so 
differently depending on client count.

>> Attached is the latest group update clog patch.
>

How is that different from the previous versions?

>
> In the last commitfest, the patch was returned with feedback to evaluate
> the cases where it can show a win, and I think the above results indicate
> that the patch has a significant benefit on various workloads.  What I
> think is pending at this stage is that either one of the committers or
> the reviewers of this patch needs to provide feedback on my analysis
> [8] for the cases where the patches are not showing a win.
>
> Thoughts?
>

I do agree the patch(es) significantly reduce CLogControlLock contention, 
although with WAL logging enabled (which is what matters for most 
production deployments) it pretty much only shifts the contention to a 
different lock (so the immediate performance benefit is 0).
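
(A side note for anyone skimming the thread: the group_update approach, as 
I understand it, boils down to "collect pending clog status updates on a 
lock-free list and let a single leader apply the whole batch while holding 
the lock", so the lock changes hands once per group rather than once per 
transaction. Below is a standalone toy model of that pattern - made-up 
names, pthreads and a mutex standing in for backends and CLogControlLock - 
emphatically not the patch code itself.)

/*
 * group_update_demo.c -- toy model of leader-based group status updates
 * (made-up names, NOT the actual patch code).
 *
 * Build: cc -std=c11 -pthread group_update_demo.c
 */
#include <stdatomic.h>
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

#define NWORKERS   8
#define XIDS_EACH  10000
#define NO_WAITER  (-1)

typedef struct
{
    int         xid;    /* "transaction" this worker wants marked committed */
    int         next;   /* next waiter in the pending list */
    _Atomic int done;   /* set by the leader once our update has been made */
} Waiter;

static Waiter          waiters[NWORKERS];
static _Atomic int     pending_head = NO_WAITER;
static pthread_mutex_t status_lock = PTHREAD_MUTEX_INITIALIZER; /* "CLogControlLock" */
static char            status[NWORKERS * XIDS_EACH];            /* "the clog" */

static void
group_set_status(int me, int xid)
{
    waiters[me].xid = xid;
    atomic_store(&waiters[me].done, 0);

    /* push ourselves onto the pending list */
    int head = atomic_load(&pending_head);
    do
        waiters[me].next = head;
    while (!atomic_compare_exchange_weak(&pending_head, &head, me));

    if (head != NO_WAITER)
    {
        /* someone ahead of us will act as leader; wait to be serviced */
        while (!atomic_load(&waiters[me].done))
            sched_yield();
        return;
    }

    /* list was empty, so we lead: apply the whole group under the lock */
    pthread_mutex_lock(&status_lock);
    int cur = atomic_exchange(&pending_head, NO_WAITER);
    while (cur != NO_WAITER)
    {
        int next = waiters[cur].next;

        status[waiters[cur].xid] = 1;              /* "commit" it */
        if (cur != me)
            atomic_store(&waiters[cur].done, 1);   /* release the follower */
        cur = next;
    }
    pthread_mutex_unlock(&status_lock);
}

static void *
worker(void *arg)
{
    int me = (int) (long) arg;

    for (int i = 0; i < XIDS_EACH; i++)
        group_set_status(me, me * XIDS_EACH + i);
    return NULL;
}

int
main(void)
{
    pthread_t th[NWORKERS];
    long      n = 0;

    for (long i = 0; i < NWORKERS; i++)
        pthread_create(&th[i], NULL, worker, (void *) i);
    for (int i = 0; i < NWORKERS; i++)
        pthread_join(th[i], NULL);
    for (int i = 0; i < NWORKERS * XIDS_EACH; i++)
        n += status[i];
    printf("committed %ld of %d\n", n, NWORKERS * XIDS_EACH);
    return 0;
}

The interesting property is that the frequency of lock hand-offs scales 
with the number of groups rather than with the number of transactions, 
which only pays off once many backends pile up on the lock - consistent 
with the gains showing up only at higher client counts.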

Which raises the question of why to commit this patch now, before we have 
a patch addressing the WAL locks. I realize this is a chicken-and-egg 
problem, but my worry is that the increased WALWriteLock contention will 
cause regressions in current workloads.

BTW I've run some tests with the number of clog buffers increased to 512, 
and the effect seems fairly positive. Compare for example these two 
results:
  http://tvondra.bitbucket.org/#pgbench-300-unlogged-sync-skip
  http://tvondra.bitbucket.org/#pgbench-300-unlogged-sync-skip-clog-512

The first one is with the default 128 buffers, the other one is with 512 
buffers. The impact on master is pretty obvious - for 72 clients the tps 
jumps from 160k to 197k, and for higher client counts it gives us about 
+50k tps (typically an increase from ~80k to ~130k tps). And the tps 
variability is significantly reduced.
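
For reference, the clog buffer count is not a GUC - on master it comes 
from CLOGShmemBuffers() in src/backend/access/transam/clog.c, which IIRC 
is Min(128, Max(4, NBuffers / 512)), so getting 512 buffers means raising 
that cap and rebuilding. A trivial standalone helper (my own throwaway 
code, not PostgreSQL's) makes the sizes concrete:

/* clog_buffers.c -- mirrors the CLOGShmemBuffers() formula as I read it:
 * Min(cap, Max(4, NBuffers / 512)), with cap = 128 on master. */
#include <stdio.h>

static int
clog_buffers(long nbuffers, int cap)
{
    long n = nbuffers / 512;

    if (n < 4)
        n = 4;
    if (n > cap)
        n = cap;
    return (int) n;
}

int
main(void)
{
    /* shared_buffers = 8GB -> 8GB / 8kB = 1048576 buffers of 8kB each */
    long nbuffers = 1048576L;

    printf("cap 128 (master): %d clog buffers (%d kB)\n",
           clog_buffers(nbuffers, 128), clog_buffers(nbuffers, 128) * 8);
    printf("cap 512 (tested): %d clog buffers (%d kB)\n",
           clog_buffers(nbuffers, 512), clog_buffers(nbuffers, 512) * 8);
    return 0;
}

So even with the cap raised to 512 we're only talking about 4MB of shared 
memory, which makes the experiment cheap; the usual concern with larger 
SLRUs is rather the linear victim-buffer search on replacement, IIRC.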

For the other workload, the results are less convincing though:
  http://tvondra.bitbucket.org/#dilip-300-unlogged-sync
  http://tvondra.bitbucket.org/#dilip-300-unlogged-sync-clog-512

Interesting that master adopts the same zig-zag pattern, but shifted.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


