Re: [WIP] Performance Improvement by reducing WAL for Update Operation - Mailing list pgsql-hackers
From | Noah Misch
---|---
Subject | Re: [WIP] Performance Improvement by reducing WAL for Update Operation
Date |
Msg-id | 20121024152748.GC22334@tornado.leadboat.com
In response to | Re: [WIP] Performance Improvement by reducing WAL for Update Operation (Amit kapila <amit.kapila@huawei.com>)
List | pgsql-hackers
On Wed, Oct 24, 2012 at 05:55:56AM +0000, Amit kapila wrote:
> Wednesday, October 24, 2012 5:51 AM Noah Misch wrote:
> > Stepping back a moment, I would expect this patch to change performance in at
> > least four ways (Heikki largely covered this upthread):
> >
> > a) High-concurrency workloads will improve thanks to reduced WAL insert
> > contention.
> > b) All workloads will degrade due to the CPU cost of identifying and
> > implementing the optimization.
> > c) Workloads starved for bulk WAL I/O will improve due to reduced WAL volume.
> > d) Workloads composed primarily of long transactions with high WAL volume will
> > improve due to having fewer end-of-WAL-segment fsync requests.
>
> All your points are a very good summarization of the work, but I think one point can be added:
> e) Reduced cost of computing the CRC and of copying data into the xlog buffer in XLogInsert(), due to the reduced size of the xlog record.

True.

> > Your benchmark numbers show small gains and losses for single-client
> > workloads, moving to moderate gains for 2-client workloads. This suggests
> > strong influence from (a), some influence from (b), and little influence from
> > (c) and (d). Actually, the response to scale evident in your numbers seems
> > too good to be true; why would (a) have such a large effect over the
> > transition from one client to two clients?
>
> I think if we look at it just from the point of view of LZ compression, there are predominantly two things: your point (b) and point (e) mentioned by me.
> For single threads, the cost of doing compression outweighs the savings in CRC cost and the other improvements in XLogInsert().
> However, when it comes to multiple threads, the cost reduction due to point (e) reduces the time spent under the lock, and hence we see such an effect from
> 1 client to 2 clients.

Note that the CRC calculation over variable-size data in the WAL record happens before taking WALInsertLock.

> > Also, for whatever reason, all
> > your numbers show fairly bad scaling. With the XLOG scale and LZ patches,
> > synchronous_commit=off, -F 80, and rec length 250, 8-client average
> > performance is only 2x that of 1-client average performance.

Correction: with the XLOG scale patch only, your benchmark runs show 8-client average performance as 2x that of 1-client average performance. With both the XLOG scale and LZ patches, it grows to almost 4x. However, both ought to be closer to 8x.

> > -Patch-             -tps@-c1- -tps@-c2- -tps@-c8- -WAL@-c8-
> > HEAD,-F80                 816      1644      6528  1821 MiB
> > xlogscale,-F80            824      1643      6551  1826 MiB
> > xlogscale+lz,-F80         717      1466      5924  1137 MiB
> > xlogscale+lz,-F100        753      1508      5948  1548 MiB
> >
> > Those are short runs with no averaging of multiple iterations; don't put too
> > much faith in the absolute numbers. Still, I consistently get linear scaling
> > from 1 client to 8 clients. Why might your results have been so different in
> > this regard?
>
> 1. The only reason you are seeing the difference in linear scalability could be that the numbers I posted for
> 8 threads are from a run with -c16 -j8. I shall run with -c8 and post the performance numbers. I am hoping that will match the way you see the numbers.

I doubt that. Your 2-client numbers also show scaling well below linear. With 8 cores, 16-client performance should not fall off compared to 8 clients. Perhaps 2 clients saturate your I/O under this workload, but 1 client does not. Granted, that theory doesn't explain all your numbers, such as the improvement for record length 50 @ -c1.
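To test that theory, here is a rough sketch of how the -c8 vs. -c16 -j8 comparison and an I/O-saturation check might be driven with pgbench. The scale factor, run length, and database name are illustrative assumptions, not the exact commands used in this thread:

    # Initialize pgbench tables at fillfactor 80 (scale factor is a guess).
    pgbench -i -s 100 -F 80 bench

    # Compare an 8-client run against the 16-client/8-thread configuration discussed above.
    pgbench -c 8 -j 8 -T 900 bench
    pgbench -c 16 -j 8 -T 900 bench

    # Watch disk utilization during a 2-client run to see whether I/O already
    # saturates at 2 clients (iostat is part of the sysstat package).
    iostat -x 5 &
    pgbench -c 2 -T 900 bench

If %util stays pinned near 100% during the 2-client run, that would support the I/O-saturation explanation.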
> 2. Now, if we see that in the results you have posted,
> a) there is not much performance difference between head and xlog scale.

Note that the xlog scale patch addresses a different workload:
http://archives.postgresql.org/message-id/505B3648.1040801@vmware.com

> b) with the LZ patch there is a decrease in performance.
> I think this can be because it ran for a very short time, as you have also mentioned.

Yes, that's possible.

> > It's also odd that your -F100 numbers tend to follow your -F80 numbers despite
> > the optimization kicking in far more frequently for the latter.
>
> The results for the LZ patch, averaged over three 15-minute runs, are:
>
> -Patch-             -tps@-c1- -tps@-c2- -tps@-c16-j8
> xlogscale+lz,-F80         663      1232         2498
> xlogscale+lz,-F100        660      1221         2361
>
> The result shows that avg. tps is better with -F80, which I think is what is expected.

Yes. Let me elaborate on the point I hoped to make. Based on my test above, -F80 more than doubles the bulk WAL savings compared to -F100. Your benchmark runs showed a 61.8% performance improvement at -F100 and a 62.5% performance improvement at -F80. If shrinking WAL increases performance, shrinking it more should increase performance more. Instead, you observed similar levels of improvement at both fill factors. Why?

> So to conclude, according to me, the following needs to be done:
>
> 1. To check the major discrepancy in the linear-scaling data, I shall take the data with a -c8 configuration rather than
> with -c16 -j8.

With unpatched HEAD, synchronous_commit=off, and sufficient I/O bandwidth, you should be able to get pgbench to scale linearly to 8 clients. You can then benchmark for effects (a), (b) and (e). With insufficient I/O bandwidth, you're benchmarking (c) and (d). (And/or other effects I haven't considered.)

> 2. To conclude whether the LZ patch gives better performance, I think it needs to be run for a longer time.

Agreed.

> Please let me know your opinion on the above; do we need to do anything more than what is mentioned?

I think the next step is to figure out what limits your scaling. Then we can form a theory about the meaning of your benchmark numbers.

nm
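For the longer -F80 vs. -F100 comparison suggested above, one possible way to drive the runs and account for WAL volume on a 9.2-era server is sketched below. The scale, duration, and database name are assumptions, synchronous_commit=off is presumed to be set in postgresql.conf (or via PGOPTIONS), and the xlog-location functions shown are the pre-PostgreSQL-10 names:

    # Fillfactor 100 run; repeat with "pgbench -i -s 100 -F 80 bench" for the -F80 case.
    pgbench -i -s 100 -F 100 bench
    psql -d bench -Atc "SELECT pg_current_xlog_location()"    # record start LSN
    pgbench -c 8 -j 8 -T 900 bench
    psql -d bench -Atc "SELECT pg_current_xlog_location()"    # record end LSN

    # WAL generated by the run is the distance between the two LSNs
    # (substitute the LSNs noted above for the placeholders):
    psql -d bench -Atc "SELECT pg_size_pretty(pg_xlog_location_diff('END_LSN', 'START_LSN'))"

Comparing the two WAL totals alongside the tps numbers would show whether the extra WAL savings at -F80 actually translate into additional throughput.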