Re: [WIP] Performance Improvement by reducing WAL for Update Operation - Mailing list pgsql-hackers

From Amit kapila
Subject Re: [WIP] Performance Improvement by reducing WAL for Update Operation
Date
Msg-id 6C0B27F7206C9E4CA54AE035729E9C3828542FB9@szxeml509-mbx
Whole thread Raw
In response to Re: [WIP] Performance Improvement by reducing WAL for Update Operation  (Noah Misch <noah@leadboat.com>)
Responses Re: [WIP] Performance Improvement by reducing WAL for Update Operation
List pgsql-hackers
Wednesday, October 24, 2012 5:51 AM Noah Misch wrote:
>Hi Amit,

Noah, thank you for taking the performance data.

>On Tue, Oct 16, 2012 at 09:22:39AM +0000, Amit kapila wrote:
> On Saturday, October 06, 2012 7:34 PM Amit Kapila wrote:
>> > Please find the readings of LZ patch along with Xlog-Scale patch.
>> > The comparison is between for Update operations
>> > base code + Xlog Scale Patch
>> > base code + Xlog Scale Patch + Update WAL Optimization (LZ compression)
>
>> This contains all the consolidated data and comparison for both the approaches:
>
>> The difference of this testcase, as compared to the previous one, is that it uses the default value of wal_page_size (8K), whereas the previous one configured wal_page_size as 1K.

> What is "wal_page_size"?  Is that ./configure --with-wal-blocksize?
Yes.
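For anyone reproducing this, the WAL block size is fixed at build time via that configure option; a minimal sketch of the two builds being compared (install paths and make parallelism are illustrative):

```shell
# Build used in the earlier runs: 1 KB WAL blocks
./configure --with-wal-blocksize=1 --prefix=$HOME/pg-wal1k
make -j4 && make install

# Build used for this testcase: default 8 KB WAL blocks
./configure --prefix=$HOME/pg-wal8k
make -j4 && make install
```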

> Observations From Performance Data
> ----------------------------------------------
> 1. With both the approaches, the performance data is good.
>     LZ compression - up to 100% performance improvement.
>     Offset Approach - up to 160% performance improvement.
> 2. The performance data is better for the LZ compression approach when the changed value of the tuple is large (refer to the 500-length changed value).
> 3. The performance data is better for the Offset Approach with 1 thread for any size of data (it dips for the LZ compression approach).

> Stepping back a moment, I would expect this patch to change performance in at
> least four ways (Heikki largely covered this upthread):

> a) High-concurrency workloads will improve thanks to reduced WAL insert
> contention.
> b) All workloads will degrade due to the CPU cost of identifying and
> implementing the optimization.
> c) Workloads starved for bulk WAL I/O will improve due to reduced WAL volume.
> d) Workloads composed primarily of long transactions with high WAL volume will
>    improve due to having fewer end-of-WAL-segment fsync requests.

All your points are a very good summarization of the work, but I think one point can be added:
e) Reduced cost in XLogInsert() of computing the CRC and of copying data into the XLOG buffer, due to the reduced size of the xlog record.

> Your benchmark numbers show small gains and losses for single-client
> workloads, moving to moderate gains for 2-client workloads.  This suggests
> strong influence from (a), some influence from (b), and little influence from
> (c) and (d).  Actually, the response to scale evident in your numbers seems
> too good to be true; why would (a) have such a large effect over the
> transition from one client to two clients?

I think, if we look at it just from the point of view of LZ compression, there are predominantly two factors: your point (b) and the point (e) mentioned by me.
For a single thread, the cost of doing compression supersedes the savings in CRC computation and the other improvements in XLogInsert().
However, with multiple threads, the cost reduction due to point (e) reduces the time spent under the lock, and hence we see such an effect from 1 client to 2 clients.

> Also, for whatever reason, all
> your numbers show fairly bad scaling.  With the XLOG scale and LZ patches,
> synchronous_commit=off, -F 80, and rec length 250, 8-client average
> performance is only 2x that of 1-client average performance.

I am really sorry, this is my mistake in reporting the numbers; the 8-thread number was actually taken with -c16 -j8, which means 16 clients and 8 threads. That can be the reason it shows only 2x; otherwise it would have shown numbers similar to what you are seeing.
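To make the distinction concrete, a sketch of the two invocations (the duration and database name are illustrative; -c sets the client count, while -j sets only the number of pgbench worker threads):

```shell
# What I actually ran: 16 clients driven by 8 pgbench threads
pgbench -c 16 -j 8 -T 900 bench

# What the comparison should have used: 8 clients
pgbench -c 8 -j 8 -T 900 bench
```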


> Benchmark results:

> -Patch-             -tps@-c1-   -tps@-c2-   -tps@-c8-   -WAL@-c8-
> HEAD,-F80           816         1644        6528        1821 MiB
> xlogscale,-F80      824         1643        6551        1826 MiB
> xlogscale+lz,-F80   717         1466        5924        1137 MiB
> xlogscale+lz,-F100  753         1508        5948        1548 MiB

> Those are short runs with no averaging of multiple iterations; don't put too
> much faith in the absolute numbers.  Still, I consistently get linear scaling
> from 1 client to 8 clients.  Why might your results have been so different in
> this regard?

1. The only reason you see the difference in linear scalability can be that the numbers I posted for 8 threads are from a run with -c16 -j8. I shall run with -c8 and post the performance numbers; I am hoping they will match the way you see the numbers.
2. Now, regarding the results you have posted:
   a) there is not much performance difference between head and xlogscale;
   b) with the LZ patch there is a decrease in performance.
   I think this can be because it ran for very little time, as you have also mentioned.



> It's also odd that your -F100 numbers tend to follow your -F80 numbers despite
> the optimization kicking in far more frequently for the latter.

The results, averaged over three 15-minute runs for the LZ patch, are:

-Patch-             -tps@-c1-   -tps@-c2-   -tps@-c16-j8-
xlogscale+lz,-F80   663         1232        2498
xlogscale+lz,-F100  660         1221        2361

The result shows that average tps is better with -F80, which is, I think, what is expected.
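For context, -F here is pgbench's fillfactor option at initialization time; -F 80 leaves 20% free space in each heap page, so more updated tuples find room on the same page, which is where this optimization kicks in more frequently. A sketch (the database name is illustrative):

```shell
# Initialize pgbench tables with fillfactor 80 vs. the default 100
pgbench -i -F 80 bench
pgbench -i -F 100 bench
```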

So to conclude, in my opinion, the following needs to be done:

1. To check the major discrepancy in the linear-scaling data, I shall take the data with the -c8 configuration rather than with -c16 -j8.
2. To conclude whether the LZ patch gives better performance, I think it needs to be run for a longer time.

Please let me know your opinion on the above; do we need to do anything more than what is mentioned?

With Regards,
Amit Kapila.

