Logged by: Kingter Wang
Email address: tew(at)vmware(dot)com
PostgreSQL version: 9.4
Operating system: Suse 11 sp3
Description:
A new option "--progress=NUM" is introduced into postgres 9.4,
it is designed to show pgbench progress report every NUM seconds.
The option can work properly if the NUM < 2148. For example,
pgbench can show progress report every 60 seconds as below:
$ ./pgbench --progress=60 -T 600
starting vacuum...end.
progress: 60.0 s, 527.0 tps, lat 120.424 ms stddev 58.133
progress: 120.1 s, 482.5 tps, lat 130.335 ms stddev 104.964
progress: 180.1 s, 378.3 tps, lat 168.859 ms stddev 403.032
progress: 240.0 s, 358.6 tps, lat 178.792 ms stddev 417.703
progress: 300.1 s, 363.0 tps, lat 176.089 ms stddev 447.271
progress: 420.0 s, 579.7 tps, lat 110.126 ms stddev 243.220
progress: 480.1 s, 138.7 tps, lat 461.893 ms stddev 980.454
progress: 540.6 s, 114.6 tps, lat 558.521 ms stddev 1048.582
progress: 600.0 s, 153.5 tps, lat 417.516 ms stddev 969.933
transaction type: TPC-B (sort of)
scaling factor: 10
query mode: simple
number of clients: 64
number of threads: 16
duration: 600 s
number of transactions actually processed: 211803
latency average: 181.513 ms
latency stddev: 465.317 ms
tps = 351.267829 (including connections establishing)
tps = 351.520664 (excluding connections establishing)
But pgbench will unexpectedly show progress report extremely frequently,
if NUM >= 2148.
For example, passed --progress=3600, however, the pgbench report progress
with a frequency less than 1 second.
$ ./pgbench --progress=3600 -T 259200
starting vacuum...end.
progress: 0.6 s, 77.6 tps, lat 140.995 ms stddev 39.355
progress: 0.6 s, 0.0 tps, lat -nan ms stddev -nan
progress: 0.6 s, 0.0 tps, lat -nan ms stddev -nan
progress: 0.6 s, 0.0 tps, lat -nan ms stddev -nan
progress: 0.6 s, 0.0 tps, lat -nan ms stddev -nan
progress: 0.6 s, 0.0 tps, lat -nan ms stddev -nan
progress: 0.6 s, 0.0 tps, lat -nan ms stddev -nan
progress: 0.6 s, 0.0 tps, lat -nan ms stddev -nan
progress: 0.6 s, 0.0 tps, lat -nan ms stddev -nan
progress: 0.6 s, 264.6 tps, lat 172.637 ms stddev 0.000
progress: 0.6 s, 0.0 tps, lat -nan ms stddev -nan
.
.
.
====Diagnosis====
Looking into the source code, the root cause is integer overflow of
"int progress", see source code from "pgbench.c.
The integer progress multiple by 1000,000 to convert from second to micro-second.
If progress >= 2148, then the value of progress * 1000,000 (2,148,000,000)
will larger than max value of 32 bits integer (2,147,483,647).
The integer overflow lead to negative value, hence,
the value of 'next_report' does not incremental but decremental.
Therefore, the 'now' (current time) will be always larger than 'next_report',
This result in the progress report is printed out for every loop
(very very frequently), rather than wait for next time up of progress.
====Fix Patch Description====
Fix Integer overflow which happened when covert progress from seconds to micro-seconds by
(progress * 1000000) if progress >= 2148.
Attached the patch.
====Test Done====
Test passed with the patch:
$pgbench --progress=3600 -j 16 -c 64 -T 259200 pgbench
starting vacuum...end.
progress: 3600.0 s, 410.6 tps, lat 155.848 ms stddev 773.983
progress: 7200.0 s, 410.5 tps, lat 155.906 ms stddev 740.046
progress: 10800.0 s, 459.6 tps, lat 139.261 ms stddev 771.408
progress: 14400.1 s, 496.4 tps, lat 128.880 ms stddev 649.061
Best Regards,
Kingter Wang