Re: [WIP] Performance Improvement by reducing WAL for Update Operation - Mailing list pgsql-hackers

From: Noah Misch
Subject: Re: [WIP] Performance Improvement by reducing WAL for Update Operation
Date:
Msg-id: 20121024152748.GC22334@tornado.leadboat.com
In response to: Re: [WIP] Performance Improvement by reducing WAL for Update Operation (Amit kapila <amit.kapila@huawei.com>)
List: pgsql-hackers
On Wed, Oct 24, 2012 at 05:55:56AM +0000, Amit kapila wrote:
> Wednesday, October 24, 2012 5:51 AM Noah Misch wrote:
> > Stepping back a moment, I would expect this patch to change performance in at
> > least four ways (Heikki largely covered this upthread):
> 
> > a) High-concurrency workloads will improve thanks to reduced WAL insert
> > contention.
> > b) All workloads will degrade due to the CPU cost of identifying and
> > implementing the optimization.
> > c) Workloads starved for bulk WAL I/O will improve due to reduced WAL volume.
> > d) Workloads composed primarily of long transactions with high WAL volume will
> >    improve due to having fewer end-of-WAL-segment fsync requests.
> 
> All your points are a very good summarization of the work, but I think one point can be added:
> e) Reduced cost of computing the CRC and of copying data into the xlog buffer in XLogInsert(), due to the reduced
> size of the xlog record.

True.

> > Your benchmark numbers show small gains and losses for single-client
> > workloads, moving to moderate gains for 2-client workloads.  This suggests
> > strong influence from (a), some influence from (b), and little influence from
> > (c) and (d).  Actually, the response to scale evident in your numbers seems
> > too good to be true; why would (a) have such a large effect over the
> > transition from one client to two clients? 
> 
> I think, looking at it just from the point of view of LZ compression, there are predominantly 2 things: your point
> (b) and point (e) mentioned by me.
 
> For single threads, the cost of doing compression exceeds the savings from the CRC and other improvements in
> XLogInsert().  However, when it comes to multiple threads, the cost reduction due to point (e) will reduce the time
> spent under the lock, and hence we see such an effect from 1 client to 2 clients.

Note that the CRC calculation over variable-size data in the WAL record
happens before taking WALInsertLock.
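
As a simplified, C-style sketch of that ordering in XLogInsert() (names approximate and the copy itself elided; this
illustrates the locking boundary, not the actual source):

    XLogRecData *rdt;
    pg_crc32     rdata_crc;

    /* CRC over the variable-size payload: computed with no lock held */
    INIT_CRC32(rdata_crc);
    for (rdt = rdata; rdt != NULL; rdt = rdt->next)
        COMP_CRC32(rdata_crc, rdt->data, rdt->len);

    LWLockAcquire(WALInsertLock, LW_EXCLUSIVE);
    /*
     * Only now is xl_prev known, so the fixed-size record header is folded
     * into the CRC and the whole record is copied into the shared WAL
     * buffers while the lock is held.  A smaller payload (point (e))
     * mainly shortens that copy, not the CRC loop above.
     */
    LWLockRelease(WALInsertLock);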

> > Also, for whatever reason, all
> > your numbers show fairly bad scaling.  With the XLOG scale and LZ patches,
> > synchronous_commit=off, -F 80, and rec length 250, 8-client average
> > performance is only 2x that of 1-client average performance.

Correction: with the XLOG scale patch only, your benchmark runs show 8-client
average performance as 2x that of 1-client average performance.  With both the
XLOG scale and LZ patches, it grows to almost 4x.  However, both ought to be
closer to 8x.

> > -Patch-             -tps@-c1-   -tps@-c2-   -tps@-c8-   -WAL@-c8-
> > HEAD,-F80           816         1644        6528        1821 MiB
> > xlogscale,-F80      824         1643        6551        1826 MiB
> > xlogscale+lz,-F80   717         1466        5924        1137 MiB
> > xlogscale+lz,-F100  753         1508        5948        1548 MiB
> 
> > Those are short runs with no averaging of multiple iterations; don't put too
> > much faith in the absolute numbers.  Still, I consistently get linear scaling
> > from 1 client to 8 clients.  Why might your results have been so different in
> > this regard?
> 
> 1. The only reason you are seeing the difference in linear scalability could be that the numbers I posted for 8
> threads were from a run with -c16 -j8.  I shall run with -c8 and post the performance numbers; I am hoping they will
> match the way you see the numbers.

I doubt that.  Your 2-client numbers also show scaling well below linear.
With 8 cores, 16-client performance should not fall off compared to 8 clients.

Perhaps 2 clients saturate your I/O under this workload, but 1 client does
not.  Granted, that theory doesn't explain all your numbers, such as the
improvement for record length 50 @ -c1.

> 2. Now, if we look at the results you have posted:
>     a) there is not much performance difference between HEAD and xlog scale

Note that the xlog scale patch addresses a different workload:
http://archives.postgresql.org/message-id/505B3648.1040801@vmware.com

>     b) with the LZ patch, there is a decrease in performance.
>    I think this can be because it ran for a very short time, as you have also mentioned.

Yes, that's possible.

> > It's also odd that your -F100 numbers tend to follow your -F80 numbers despite
> > the optimization kicking in far more frequently for the latter.
> 
> The results, averaged over three 15-minute runs with the LZ patch, are:
>  -Patch-              -tps@-c1-   -tps@-c2-   -tps@-c16-j8-
>  xlogscale+lz,-F80    663         1232        2498
>  xlogscale+lz,-F100   660         1221        2361
> 
> The results show that avg. tps is better with -F80, which I think is what is expected.

Yes.  Let me elaborate on the point I hoped to make.  Based on my test above,
-F80 more than doubles the bulk WAL savings compared to -F100.  Your benchmark
runs showed a 61.8% performance improvement at -F100 and a 62.5% performance
improvement at -F80.  If shrinking WAL increases performance, shrinking it
more should increase performance more.  Instead, you observed similar levels
of improvement at both fill factors.  Why?
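
To spell out the arithmetic from the table above (taking the xlogscale,-F80 volume of 1826 MiB as the baseline for
both fill factors, since that table has no plain -F100 run): the LZ patch saves 1826 - 1137 = 689 MiB of WAL at -F80,
but only 1826 - 1548 = 278 MiB at -F100, roughly 2.5x less.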

> So to conclude, according to me, the following needs to be done.
> 
> 1. To check the major discrepancy in the data about linear scaling, I shall take the data with the -c8 configuration
> rather than with -c16 -j8.

With unpatched HEAD, synchronous_commit=off, and sufficient I/O bandwidth, you
should be able to get pgbench to scale linearly to 8 clients.  You can then
benchmark for effects (a), (b) and (e).  With insufficient I/O bandwidth,
you're benchmarking (c) and (d).  (And/or other effects I haven't considered.)
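
For instance, something along these lines (a sketch only; the database name and run length are placeholders, and your
actual tests use a custom update workload rather than the default pgbench script):

    createdb bench
    pgbench -i -s 100 -F 80 bench
    psql -d bench -c "ALTER DATABASE bench SET synchronous_commit = off"
    for c in 1 2 4 8; do
        pgbench -n -c $c -j $c -T 900 bench
    done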

> 2. To conclude whether the LZ patch gives better performance, I think it needs to be run for a longer time.

Agreed.

> Please let me know your opinion on the above; do we need to do anything more than what is mentioned?

I think the next step is to figure out what limits your scaling.  Then we can
form a theory about the meaning of your benchmark numbers.
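
As a generic first check (a suggestion on my part, not something measured in this thread), run iostat -x 1 alongside a
2-client run; if %util sits near 100, the disks are the limit and you are mostly benchmarking (c) and (d).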

nm


