Re: checkpointer continuous flushing - Mailing list pgsql-hackers
From | Fabien COELHO |
---|---|
Subject | Re: checkpointer continuous flushing |
Date | |
Msg-id | alpine.DEB.2.10.1508171431580.28260@sto |
In response to | Re: checkpointer continuous flushing (Andres Freund <andres@anarazel.de>) |
List | pgsql-hackers |
Hello Andres,

> On 2015-08-12 22:34:59 +0200, Fabien COELHO wrote:
>>  sort/flush : tps avg & stddev (percent of time beyond 10.0 tps)
>>   on   on   : 631 +- 131 (0.1%)
>>   on   off  : 564 +- 303 (12.0%)
>>   off  on   : 167 +- 315 (76.8%) # stuck...
>>   off  off  : 177 +- 305 (71.2%) # ~ current pg
>
> What exactly do you mean with 'stuck'?

I mean that during the I/O storms induced by the checkpoint, pgbench
sometimes gets stuck, i.e. it does not report its progress every second
(I run with "-P 1"). This occurs when sort is off, either with or without
flush. For instance, here is an extract from the off/off medium run:

  progress: 573.0 s, 5.0 tps, lat 933.022 ms stddev 83.977
  progress: 574.0 s, 777.1 tps, lat 7.161 ms stddev 37.059
  progress: 575.0 s, 148.9 tps, lat 4.597 ms stddev 10.708
  progress: 814.4 s, 0.0 tps, lat -nan ms stddev -nan
  progress: 815.0 s, 0.0 tps, lat -nan ms stddev -nan
  progress: 816.0 s, 0.0 tps, lat -nan ms stddev -nan
  progress: 817.0 s, 0.0 tps, lat -nan ms stddev -nan
  progress: 818.0 s, 0.0 tps, lat -nan ms stddev -nan
  progress: 819.0 s, 0.0 tps, lat -nan ms stddev -nan
  progress: 820.0 s, 0.0 tps, lat -nan ms stddev -nan
  progress: 821.0 s, 0.0 tps, lat -nan ms stddev -nan
  progress: 822.0 s, 0.0 tps, lat -nan ms stddev -nan
  progress: 823.0 s, 0.0 tps, lat -nan ms stddev -nan
  progress: 824.0 s, 0.0 tps, lat -nan ms stddev -nan
  progress: 825.0 s, 0.0 tps, lat -nan ms stddev -nan
  progress: 826.0 s, 0.0 tps, lat -nan ms stddev -nan

There is a 239.4-second gap in the pgbench output, between the reports at
575.0 s and 814.4 s. This occurs from time to time and may represent a
significant part of the run, and I count these "stuck" times as 0 tps.
Sometimes pgbench is stuck performance-wise but nevertheless manages to
report "0.0 tps" every second, as in the lines above after it got unstuck.

The actual origin of the issue with a stuck client (pgbench, libpq, OS,
postgres...) is unclear to me, but the whole system does not behave well
under an I/O storm anyway, and I have not succeeded in understanding where
pgbench is stuck when it does not report its progress. I tried some runs
under gdb, but then it did not get stuck and reported a lot of "0.0 tps"
during the storms.

Here are a few more figures with the v8 version of the patch, on a host
with 8 cores, 16 GB, RAID 1 HDD, under Ubuntu precise. I already reported
the medium case; the small case was run afterwards.

  small postgresql.conf:
    shared_buffers = 2GB
    checkpoint_timeout = 300s # this is the default
    checkpoint_completion_target = 0.8
    # initialization: pgbench -i -s 120

  medium postgresql.conf: ## ALREADY REPORTED
    shared_buffers = 4GB
    checkpoint_timeout = 15min
    checkpoint_completion_target = 0.8
    max_wal_size = 4GB
    # initialization: pgbench -i -s 250

  warmup> pgbench -T 1200 -M prepared -S -j 2 -c 4

  # 400 tps throttled test
  sh> pgbench -M prepared -N -P 1 -T 4000 -R 400 -L 100 -j 2 -c 4

    options    / percent of skipped/late transactions
    sort/flush /  small  medium
     on   on   :    3.5     2.7
     on   off  :   24.6    16.2
     off  on   :   66.1    68.4
     off  off  :   63.2    68.7

  # 200 tps throttled test
  sh> pgbench -M prepared -N -P 1 -T 4000 -R 200 -L 100 -j 2 -c 4

    options    / percent of skipped/late transactions
    sort/flush /  small  medium
     on   on   :    1.9     2.7
     on   off  :   14.3     9.5
     off  on   :   45.6    47.4
     off  off  :   47.4    48.8

  # 100 tps throttled test
  sh> pgbench -M prepared -N -P 1 -T 4000 -R 100 -L 100 -j 2 -c 4

    options    / percent of skipped/late transactions
    sort/flush /  small  medium
     on   on   :    0.9     1.8
     on   off  :    9.3     7.9
     off  on   :    5.0    13.0
     off  off  :   31.2    31.9

  # full speed 1 client
  sh> pgbench -M prepared -N -P 1 -T 4000

    options    / tps avg & stddev (percent of time below 10.0 tps)
    sort/flush /  small                 medium
     on   on   :  564 +- 148 ( 0.1%)    631 +- 131 ( 0.1%)
     on   off  :  470 +- 340 (21.7%)    564 +- 303 (12.0%)
     off  on   :  157 +- 296 (66.2%)    167 +- 315 (76.8%)
     off  off  :  154 +- 251 (61.5%)    177 +- 305 (71.2%)

  # full speed 2 threads 4 clients
  sh> pgbench -M prepared -N -P 1 -T 4000 -j 2 -c 4

    options    / tps avg & stddev (percent of time below 10.0 tps)
    sort/flush /  small                 medium
     on   on   :  757 +- 417 ( 0.1%)   1058 +- 455 ( 0.1%)
     on   off  :  752 +- 893 (48.4%)   1056 +- 942 (32.8%)
     off  on   :  173 +- 521 (83.0%)    170 +- 500 (88.3%)
     off  off  :  199 +- 512 (82.5%)    209 +- 506 (82.0%)

In all cases, "sort on & flush on" provides the best results, with a tps
speedup of 3 to 5 and high overall responsiveness (& lower latency).

-- 
Fabien.
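
For readers who want to compute a "percent of time below 10.0 tps" figure from their own "-P 1" output, here is a minimal Python sketch. It is not the script used for the numbers in this message, and its accounting rule is an assumption: each progress line is taken to cover the time elapsed since the previous line, and any time beyond the expected reporting interval is treated as a stall counted as 0 tps. The function name `percent_time_below` and its parameters are hypothetical.

```python
import re

# Matches pgbench "-P" progress lines such as
#   progress: 573.0 s, 5.0 tps, lat 933.022 ms stddev 83.977
PROGRESS_RE = re.compile(r"progress:\s*([0-9.]+)\s*s,\s*([0-9.]+)\s*tps")


def percent_time_below(lines, threshold=10.0, interval=1.0):
    """Return the percentage of covered time spent below `threshold` tps.

    Assumptions (not taken from the original message):
    - each report covers the time elapsed since the previous report;
    - time beyond `interval` seconds is a stall and counts as 0 tps;
    - the first report covers exactly `interval` seconds.
    """
    total = 0.0
    below = 0.0
    prev = None
    for line in lines:
        m = PROGRESS_RE.search(line)
        if m is None:
            continue  # not a progress line
        t, tps = float(m.group(1)), float(m.group(2))
        elapsed = interval if prev is None else t - prev
        prev = t
        if elapsed <= 0:
            continue  # ignore out-of-order or duplicate timestamps
        stalled = max(0.0, elapsed - interval)
        reported = elapsed - stalled
        total += elapsed
        below += stalled  # unreported time counts as 0 tps
        if tps < threshold:
            below += reported
    return 100.0 * below / total if total else 0.0
```

A typical use would be `percent_time_below(open("pgbench.out"))`, where `pgbench.out` is a hypothetical file holding the captured progress lines.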