From: Fabien COELHO
Subject: Re: checkpointer continuous flushing - V16
Msg-id: alpine.DEB.2.10.1602191012160.1254@sto
In response to: Re: checkpointer continuous flushing - V16  (Andres Freund <andres@anarazel.de>)
List: pgsql-hackers

Hello Andres,

> I don't want to post a full series right now, but my working state is
> available on
> http://git.postgresql.org/gitweb/?p=users/andresfreund/postgres.git;a=shortlog;h=refs/heads/checkpoint-flush
> git://git.postgresql.org/git/users/andresfreund/postgres.git checkpoint-flush

Below are the results of a lot of tests with pgbench exercising checkpoints
on the above version, as fetched.

Overall comments:
 - sorting & flushing is basically always a winner
 - benchmarking with short runs on large databases is a bad idea:
   the results are very different if a longer run is used
   (see andres00b vs andres00c)
 

# HOST/SOFT
 16 GB RAM, 2 cpu, 8 cores, 200 GB RAID1 HDD, ext4 FS, Ubuntu 12.04 LTS (precise)

# ABOUT THE REPORTED STATISTICS

 tps: the "excluding connection" time tps, the higher the better
 1-sec tps: average of the measured per-second tps
   note - it should be the same as the previous one, but due to various
   hazards in the trace, especially when things go badly and pg gets
   stuck, it may differ. Such hazards also explain why non-integer tps
   may be reported for some seconds.
 stddev: standard deviation of the per-second tps, the lower the better
 the five figures in brackets give a feel of the distribution:
 - min: minimal per-second tps seen in the trace
 - q1: first quartile of per-second tps seen in the trace
 - med: median per-second tps seen in the trace
 - q3: third quartile of per-second tps seen in the trace
 - max: maximal per-second tps seen in the trace
 the last percentage, dubbed "<=10.0", is the percentage of seconds where
 performance was at or below 10 tps: it measures how unresponsive pg was
 during the run
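
As a minimal sketch (mine, for illustration only): assuming the per-second
trace was captured from pgbench's stderr and contains progress lines such as
"progress: 12.0 s, 2574.1 tps, lat 1.55 ms stddev 0.34" (the exact format
varies a little between pgbench versions), the figures above can be
recomputed along these lines:

  #!/usr/bin/env python3
  # minimal sketch: recompute the reported per-second statistics from a
  # "pgbench -P 1" trace; the file argument and line format are assumptions
  import re, statistics, sys

  tps = []
  with open(sys.argv[1]) as trace:
      for line in trace:
          m = re.match(r"progress: [\d.]+ s, ([\d.]+) tps", line)
          if m:
              tps.append(float(m.group(1)))

  q1, med, q3 = statistics.quantiles(tps, n=4)   # quartiles of per-second tps
  slow = sum(1 for t in tps if t <= 10.0)        # "unresponsive" seconds

  print(f"1-sec tps {statistics.mean(tps):.1f} ± {statistics.pstdev(tps):.1f} "
        f"[{min(tps):.1f}, {q1:.1f}, {med:.1f}, {q3:.1f}, {max(tps):.1f}] "
        f"{100.0 * slow / len(tps):.1f}% <= 10.0")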
 

###### TINY2
 pgbench -M prepared -N -P 1 -T 4000 -j 2 -c 4   with scale = 10 (~ 200 MB)
 postgresql.conf:
   shared_buffers = 1GB
   max_wal_size = 1GB
   checkpoint_timeout = 300s
   checkpoint_completion_target = 0.8
   checkpoint_flush_after = { none, 0, 32, 64 }
 
 opts # |    tps / 1-sec tps ± stddev [ min q1 med q3 max ] <=10.0

 head 0 | 2574.1 / 2574.3 ± 367.4 [229.0, 2570.1, 2721.9, 2746.1, 2857.2] 0.0%
      1 | 2575.0 / 2575.1 ± 359.3 [  1.0, 2595.9, 2712.0, 2732.0, 2847.0] 0.1%
      2 | 2602.6 / 2602.7 ± 359.5 [ 54.0, 2607.1, 2735.1, 2768.1, 2908.0] 0.0%

    0 0 | 2583.2 / 2583.7 ± 296.4 [164.0, 2580.0, 2690.0, 2717.1, 2833.8] 0.0%
      1 | 2596.6 / 2596.9 ± 307.4 [296.0, 2590.5, 2707.9, 2738.0, 2847.8] 0.0%
      2 | 2604.8 / 2605.0 ± 300.5 [110.9, 2619.1, 2712.4, 2738.1, 2849.1] 0.0%

   32 0 | 2625.5 / 2625.5 ± 250.5 [  1.0, 2645.9, 2692.0, 2719.9, 2839.0] 0.1%
      1 | 2630.2 / 2630.2 ± 243.1 [301.8, 2654.9, 2697.2, 2726.0, 2837.4] 0.0%
      2 | 2648.3 / 2648.4 ± 236.7 [570.1, 2664.4, 2708.9, 2739.0, 2844.9] 0.0%

   64 0 | 2587.8 / 2587.9 ± 306.1 [ 83.0, 2610.1, 2680.0, 2731.0, 2857.1] 0.0%
      1 | 2591.1 / 2591.1 ± 305.2 [455.9, 2608.9, 2680.2, 2734.1, 2859.0] 0.0%
      2 | 2047.8 / 2046.4 ± 925.8 [  0.0, 1486.2, 2592.6, 2691.1, 3001.0] 0.2% ?
 

Pretty small setup, all data fit in buffers. Good tps performance all around
(best with checkpoint_flush_after = 32), and flushing shows a noticeable
(360 -> 240) reduction in tps stddev.
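
As an aside, the checkpoint_flush_after sweep above can be driven roughly as
in the sketch below. This is illustrative only: the sh() helper, the log file
names and the ALTER SYSTEM / restart-between-runs policy are assumptions, and
the "head" rows really come from running the unpatched branch, which a GUC
change alone cannot reproduce.

  #!/usr/bin/env python3
  # rough sketch of a checkpoint_flush_after sweep (three runs per setting);
  # paths and helper names are assumptions for illustration
  import subprocess

  def sh(cmd):
      subprocess.run(cmd, shell=True, check=True)

  for flush_after in ("0", "32", "64"):       # 0 = sorting only, no flushing
      for run in range(3):
          sh(f'psql -c "ALTER SYSTEM SET checkpoint_flush_after = {flush_after}"')
          sh("pg_ctl -D $PGDATA restart -w")  # restart so each run starts clean
          sh("pgbench -M prepared -N -P 1 -T 4000 -j 2 -c 4 "
             f"2> tiny2-{flush_after}-{run}.log")  # per-second trace to a file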

###### SMALL
 pgbench -M prepared -N -P 1 -T 4000 -j 2 -c 4   with scale = 120 (~ 2 GB)
 postgresql.conf:
   shared_buffers = 2GB
   checkpoint_timeout = 300s
   checkpoint_completion_target = 0.8
   checkpoint_flush_after = { none, 0, 32, 64 }
 
 opts # |   tps / 1-sec tps ± stddev [ min q1 med q3 max ] <=10.0

 head 0 | 209.2 / 204.2 ± 516.5 [0.0,   0.0,   4.0,    5.0, 2251.0] 82.3%
      1 | 207.4 / 204.2 ± 518.7 [0.0,   0.0,   4.0,    5.0, 2245.1] 82.3%
      2 | 217.5 / 211.0 ± 530.3 [0.0,   0.0,   3.0,    5.0, 2255.0] 82.0%
      3 | 217.8 / 213.2 ± 531.7 [0.0,   0.0,   4.0,    6.0, 2261.9] 81.7%
      4 | 230.7 / 223.9 ± 542.7 [0.0,   0.0,   4.0,    7.0, 2282.0] 80.7%

    0 0 | 734.8 / 735.5 ± 879.9 [0.0,   1.0,  16.5, 1748.3, 2281.1] 47.0%
      1 | 694.9 / 693.0 ± 849.0 [0.0,   1.0,  29.5, 1545.7, 2428.0] 46.4%
      2 | 735.3 / 735.5 ± 888.4 [0.0,   0.0,  12.0, 1781.2, 2312.1] 47.9%
      3 | 736.0 / 737.5 ± 887.1 [0.0,   1.0,  16.0, 1794.3, 2317.0] 47.5%
      4 | 734.9 / 735.1 ± 885.1 [0.0,   1.0,  15.5, 1781.0, 2297.1] 47.2%

   32 0 | 738.1 / 737.9 ± 415.8 [0.0, 553.0, 679.0,  753.0, 2312.1]  0.2%
      1 | 730.5 / 730.7 ± 413.2 [0.0, 546.5, 671.0,  744.0, 2319.0]  0.1%
      2 | 741.9 / 741.9 ± 416.5 [0.0, 556.0, 682.0,  756.0, 2331.0]  0.2%
      3 | 744.1 / 744.1 ± 414.4 [0.0, 555.5, 685.2,  758.0, 2285.1]  0.1%
      4 | 746.9 / 746.9 ± 416.6 [0.0, 566.6, 685.0,  759.0, 2308.1]  0.1%

   64 0 | 743.0 / 743.1 ± 416.5 [1.0, 555.0, 683.0,  759.0, 2353.0]  0.1%
      1 | 742.5 / 742.5 ± 415.6 [0.0, 558.2, 680.0,  758.2, 2296.0]  0.1%
      2 | 742.5 / 742.5 ± 415.9 [0.0, 559.0, 681.1,  757.0, 2310.0]  0.1%
      3 | 529.0 / 526.6 ± 410.9 [0.0, 245.0, 444.0,  701.0, 2380.9]  1.5% ??
      4 | 734.8 / 735.0 ± 414.1 [0.0, 550.0, 673.0,  754.0, 2298.0]  0.1%
 

Sorting brings a ~3.3x improvement in tps, and flushing significantly reduces
the tps stddev. Pg goes from 80% unresponsive to nearly always responsive.

###### MEDIUM
 pgbench -M prepared -N -P 1 -T 4000 -j 2 -c 4   with scale = 250 (~ 3.8 GB)
 postgresql.conf:
   shared_buffers = 4GB
   max_wal_size = 4GB
   checkpoint_timeout = 15min
   checkpoint_completion_target = 0.8
   checkpoint_flush_after = { none, 0, 32, 64 }
 
 opts # |    tps / 1-sec tps ± stddev [ min q1 med q3 max ] <=10.0

 head 0 |  214.8 /  211.8 ± 513.7 [0.0,   1.0,    4.0,    5.0, 2344.0] 82.4%
      1 |  219.2 /  215.0 ± 524.1 [0.0,   0.0,    4.0,    5.0, 2316.0] 82.2%
      2 |  240.9 /  234.6 ± 550.8 [0.0,   0.0,    4.0,    6.0, 2320.2] 81.0%

    0 0 | 1064.7 / 1065.3 ± 888.2 [0.0,  11.0, 1089.0, 2017.7, 2461.9] 24.7%
      1 | 1060.2 / 1061.2 ± 889.9 [0.0,  10.0, 1056.7, 2022.0, 2444.9] 25.1%
      2 | 1060.2 / 1061.4 ± 889.1 [0.0,   9.0, 1085.8, 2002.8, 2473.0] 25.6%

   32 0 | 1059.4 / 1059.4 ± 476.3 [3.0, 804.9,  980.0, 1123.0, 2448.1]  0.1%
      1 | 1062.5 / 1062.6 ± 475.6 [0.0, 807.0,  988.0, 1132.0, 2441.0]  0.1%
      2 | 1063.7 / 1063.7 ± 475.4 [0.0, 814.0,  987.0, 1131.2, 2432.1]  0.1%

   64 0 | 1052.6 / 1052.6 ± 475.3 [0.0, 793.0,  974.0, 1118.1, 2445.1]  0.1%
      1 | 1059.8 / 1059.8 ± 475.1 [0.0, 799.0,  987.5, 1131.0, 2457.1]  0.1%
      2 | 1058.5 / 1058.5 ± 472.8 [0.0, 807.0,  985.0, 1127.7, 2442.0]  0.1%
 

Sorting brings a ~4.8x improvement in tps, and flushing significantly reduces
the tps stddev. Pg goes from over 80% unresponsive to nearly always responsive.

Performance is significantly better than in "small" above, probably thanks to
the longer checkpoint timeout.


###### LARGE
 pgbench -M prepared -N -P 1 -T 7500 -j 2 -c 4   with scale = 1000 (~ 15 GB)
 postgresql.conf:
   shared_buffers = 4GB
   max_wal_size = 2GB
   checkpoint_timeout = 40min
   checkpoint_completion_target = 0.8
   checkpoint_flush_after = { none, 0, 32, 64 }
 
 opts # |   tps / 1-sec tps ± stddev [ min q1 med q3 max ] <=10.0

 head 0 |  68.7 /  65.3 ± 78.6 [0.0,  3.0,   6.0, 136.0, 291.0] 53.1%
      1 |  70.6 /  70.3 ± 80.1 [0.0,  4.0,  10.0, 151.0, 282.0] 50.1%
      2 |  74.3 /  75.8 ± 84.9 [0.0,  4.0,   9.0, 162.0, 311.2] 50.3%

    0 0 | 117.2 / 116.9 ± 83.8 [0.0, 14.0, 139.0, 193.0, 372.4] 24.0%
      1 | 117.3 / 117.8 ± 83.8 [0.0, 16.0, 140.0, 193.0, 279.0] 23.9%
      2 | 117.6 / 118.2 ± 84.1 [0.0, 16.0, 141.0, 194.0, 297.8] 23.7%

   32 0 | 114.2 / 114.2 ± 45.7 [0.0, 84.0, 100.0, 131.0, 613.6]  0.4%
      1 | 112.5 / 112.6 ± 44.0 [0.0, 83.0,  98.0, 130.0, 293.0]  0.2%
      2 | 108.0 / 108.0 ± 44.7 [0.0, 79.0,  94.0, 124.0, 303.6]  0.3%

   64 0 | 113.0 / 113.0 ± 45.5 [0.0, 83.0,  99.0, 131.0, 289.0]  0.4%
      1 |  80.0 /  80.3 ± 39.1 [0.0, 56.0,  72.0,  95.0, 281.0]  0.8% ??
      2 | 112.2 / 112.3 ± 44.5 [0.0, 82.0,  99.0, 129.0, 282.0]  0.3%
 

Data do not fit in the available memory, so there are plenty of read accesses.
Sorting still has some impact on tps performance (~1.6x), and flushing greatly
improves responsiveness.


###### ANDRES00
 pgbench -M prepared -N -P 1 -T 300 -c 16 -j 16   with scale = 800 (~ 13 GB)
 postgresql.conf:
   shared_buffers = 2GB
   max_wal_size = 100GB
   wal_level = hot_standby
   maintenance_work_mem = 2GB
   checkpoint_timeout = 30s
   checkpoint_completion_target = 0.8
   synchronous_commit = off
   checkpoint_flush_after = { none, 0, 32, 64 }
 
 opts # |   tps / 1-sec tps ± stddev [ min q1 med q3 max ] <=10.0

 head 0 | 328.7 / 329.9 ± 716.9 [0.0, 0.0,   0.0,    0.0, 3221.2] 77.7%
      1 | 338.2 / 338.7 ± 728.6 [0.0, 0.0,   0.0,   17.0, 3296.3] 75.0%
      2 | 304.5 / 304.3 ± 705.5 [0.0, 0.0,   0.0,    0.0, 3463.4] 79.3%

    0 0 | 425.6 / 464.0 ± 724.0 [0.0, 0.0,   0.0, 1000.6, 3363.7] 61.0%
      1 | 461.5 / 463.1 ± 735.8 [0.0, 0.0,   0.0, 1011.2, 3490.9] 58.7%
      2 | 452.4 / 452.6 ± 744.3 [0.0, 0.0,   0.0, 1078.9, 3631.9] 63.3%

   32 0 | 514.4 / 515.8 ± 651.8 [0.0, 0.0, 337.4,  808.3, 2876.0] 40.7%
      1 | 512.0 / 514.6 ± 661.6 [0.0, 0.0, 317.6,  690.8, 3315.8] 35.0%
      2 | 529.5 / 530.3 ± 673.0 [0.0, 0.0, 321.1,  906.4, 3360.8] 40.3%

   64 0 | 529.6 / 530.9 ± 668.2 [0.0, 0.0, 322.1,  786.1, 3538.0] 33.3%
      1 | 496.4 / 498.0 ± 606.6 [0.0, 0.0, 321.4,  746.0, 2629.6] 36.3%
      2 | 521.0 / 521.7 ± 657.0 [0.0, 0.0, 328.4,  737.9, 3262.9] 34.3%
 

Data just about hold in memory, maybe. The run is very short and the settings
are low; this is not representative of a sane installation, but is meant to
exercise a lot of checkpoints in a difficult situation. Sorting and flushing
do bring significant benefits.


###### ANDRES00b (same as ANDRES00 but scale 800 -> 1000)
 pgbench -M prepared -N -P 1 -T 300 -c 16 -j 16   with scale = 1000 (~ 15 GB)
 postgresql.conf:
   shared_buffers = 2GB
   max_wal_size = 100GB
   wal_level = hot_standby
   maintenance_work_mem = 2GB
   checkpoint_timeout = 30s
   checkpoint_completion_target = 0.8
   synchronous_commit = off
   checkpoint_flush_after = { none, 0, 32, 64 }
 
 opts # |   tps / 1-sec tps ± stddev [ min q1 med q3 max ] <=10.0

 head 0 | 150.2 / 150.3 ± 401.6 [0.0,   0.0,   0.0,   0.0, 2199.4] 75.1%
      1 | 139.2 / 139.2 ± 372.2 [0.0,   0.0,   0.0,   0.0, 2111.4] 78.3% ***
      2 | 127.3 / 127.1 ± 341.2 [0.0,   0.0,   0.0,  53.0, 2144.3] 74.7% ***

    0 0 | 199.0 / 209.2 ± 400.4 [0.0,   0.0,   0.0, 243.6, 1846.0] 65.7%
      1 | 220.4 / 226.7 ± 423.2 [0.0,   0.0,   0.0, 264.0, 1777.0] 63.5% *
      2 | 195.5 / 205.3 ± 337.9 [0.0,   0.0, 123.0, 212.0, 1721.9] 43.2%

   32 0 | 362.3 / 359.0 ± 308.4 [0.0, 200.0, 265.0, 416.4, 1816.6]  5.0%
      1 | 323.6 / 321.2 ± 327.1 [0.0, 142.9, 210.0, 353.4, 1907.0]  4.0%
      2 | 309.0 / 310.7 ± 381.3 [0.0, 122.0, 175.5, 298.0, 2090.4]  5.0%

   64 0 | 342.7 / 343.6 ± 331.1 [0.0, 143.0, 239.5, 409.9, 1623.6]  5.3%
      1 | 333.8 / 328.2 ± 356.3 [0.0, 132.9, 211.5, 358.1, 1629.1] 10.7% ??
      2 | 352.0 / 352.0 ± 332.3 [0.0, 163.5, 239.9, 400.1, 1643.4]  5.3%
 

A little bit larger than the previous run, so that the data do not really fit
in memory. The performance impact is significant compared to the previous run.
Sorting and flushing bring a ~2x improvement in tps, and unresponsiveness drops
from 75% to a much better 5%.

###### ANDRES00c (same as ANDRES00b but time 300 -> 4000)
 opts # |   tps / 1-sec tps ± stddev [ min q1 med q3 max ] <=10.0

 head 0 | 115.2 / 114.3 ± 256.4 [0.0,   0.0,  75.0, 131.1, 3389.0] 46.5%
      1 | 118.4 / 117.9 ± 248.3 [0.0,   0.0,  87.0, 151.0, 3603.6] 46.7%
      2 | 120.1 / 119.2 ± 254.4 [0.0,   0.0,  91.0, 143.0, 3307.8] 43.8%

    0 0 | 217.4 / 211.0 ± 237.1 [0.0, 139.0, 193.0, 239.0, 3115.4] 16.8%
      1 | 216.2 / 209.6 ± 244.9 [0.0, 138.9, 188.0, 231.0, 3331.3] 16.3%
      2 | 218.6 / 213.8 ± 246.7 [0.0, 137.0, 187.0, 232.0, 3229.6] 16.2%

   32 0 | 146.6 / 142.5 ± 234.5 [0.0,  59.0,  93.0, 151.1, 3294.7] 17.5%
      1 | 148.0 / 142.6 ± 239.2 [0.0,  64.0,  95.9, 144.0, 3361.8] 16.0%
      2 | 147.6 / 140.4 ± 233.2 [0.0,  59.4,  94.0, 148.0, 3108.4] 18.0%

   64 0 | 145.3 / 140.5 ± 233.6 [0.0,  61.0,  93.0, 147.7, 3212.6] 16.5%
      1 | 145.6 / 140.3 ± 233.3 [0.0,  58.0,  93.0, 146.0, 3351.8] 17.3%
      2 | 147.7 / 142.2 ± 233.2 [0.0,  61.0,  97.0, 148.4, 3616.3] 17.0%
 

The only difference between ANDRES00b and ANDRES00c is the duration, from
5 minutes to 66 minutes (with checkpoint_timeout = 30s, that is roughly 10 vs
130 checkpoint cycles). This shows that short runs can be widely misleading:
in particular, the longer run shows less than half the tps for some settings,
and the relative comparison of head vs sort vs sort+flush is different.

###### ANDRES00d (same as ANDRES00b but wal_level hot_standby->minimal)
 opts # |   tps / 1-sec tps ± stddev [ min q1 med q3 max ] <=10.0

 head 0 | 191.6 / 195.1 ± 439.3 [0.0,   0.0,   0.0,   0.0, 2540.2] 76.3%
      1 | 211.3 / 213.6 ± 461.9 [0.0,   0.0,   0.0,  13.0, 3203.7] 75.0%
      2 | 152.4 / 154.9 ± 217.6 [0.0,   0.0,  58.0, 235.6,  995.9] 39.3% ???

    0 0 | 247.2 / 251.7 ± 454.0 [0.0,   0.0,   0.0, 375.3, 2592.4] 67.7%
      1 | 215.4 / 232.7 ± 446.5 [0.0,   0.0,   0.0, 103.0, 3046.7] 72.3%
      2 | 160.6 / 160.8 ± 222.1 [0.0,   0.0,  80.0, 209.6,  885.3] 42.0% ???

   32 0 | 399.9 / 397.0 ± 356.6 [0.0,  67.0, 348.0, 572.8, 2604.2] 21.0%
      1 | 391.8 / 392.5 ± 371.7 [0.0,  85.5, 314.4, 549.3, 2590.3] 20.7%
      2 | 406.1 / 404.8 ± 380.6 [0.0,  95.0, 348.5, 569.0, 3383.7] 21.3%

   64 0 | 395.9 / 396.1 ± 352.4 [0.0,  89.5, 342.5, 556.0, 2366.9] 17.7%
      1 | 355.1 / 351.9 ± 296.7 [0.0, 172.5, 306.1, 468.1, 1663.5] 16.0%
      2 | 403.6 / 401.8 ± 390.5 [0.0,   0.0, 337.0, 636.1, 2591.3] 26.7% ???
 

###### ANDRES00e (same as ANDRES00b but maintenance_work_mem=2GB->64MB)
 opts # |   tps / 1-sec tps ± stddev [ min q1 med q3 max ] <=10.0

 head 0 | 153.5 / 161.3 ± 401.3 [0.0,   0.0,   0.0,   0.0, 2546.0] 82.0%
      1 | 170.7 / 175.9 ± 399.9 [0.0,   0.0,   0.0,  14.0, 2537.4] 74.7%
      2 | 184.7 / 190.4 ± 389.2 [0.0,   0.0,   0.0, 158.5, 2544.6] 69.3%

    0 0 | 211.2 / 227.8 ± 418.8 [0.0,   0.0,   0.0, 334.6, 2589.3] 65.7%
      1 | 221.7 / 226.0 ± 415.7 [0.0,   0.0,   0.0, 276.8, 2588.2] 68.4%
      2 | 232.5 / 233.2 ± 403.5 [0.0,   0.0,   0.0, 377.0, 2260.2] 62.0%

   32 0 | 373.2 / 374.4 ± 309.2 [0.0, 180.6, 321.8, 475.2, 2596.5] 11.3%
      1 | 348.7 / 348.1 ± 328.4 [0.0, 127.0, 284.1, 451.9, 2595.1] 17.3%
      2 | 376.3 / 375.3 ± 315.5 [0.0, 186.5, 329.6, 487.1, 2365.4] 15.3%

   64 0 | 388.9 / 387.8 ± 348.7 [0.0, 164.0, 305.9, 546.5, 2587.2] 15.0%
      1 | 380.3 / 378.7 ± 338.8 [0.0, 171.1, 317.4, 524.8, 2592.4] 16.7%
      2 | 369.8 / 367.4 ± 340.5 [0.0,  77.4, 320.6, 525.5, 2484.7] 20.7%
 

Hmmm, interesting: maintenance_work_mem seems to have some influence on
performance, although it is not very consistent between settings, probably
because when memory is used close to its limit, performance is quite sensitive
to how much remains available.

-- 
Fabien.
