Thread: pgbench show progress report extremely frequently if "--progress" >= 2148 caused by integer overflow

Logged by: Kingter Wang
Email address: tew(at)vmware(dot)com
PostgreSQL version: 9.4
Operating system: Suse 11 sp3
Description:

A new option "--progress=NUM" is introduced into postgres 9.4,
it is designed to show pgbench progress report every NUM seconds.
The option can work properly if the NUM < 2148. For example,
pgbench can show progress report every 60 seconds as below:

    $ ./pgbench --progress=60 -T 600
    starting vacuum...end.
    progress: 60.0 s, 527.0 tps, lat 120.424 ms stddev 58.133
    progress: 120.1 s, 482.5 tps, lat 130.335 ms stddev 104.964
    progress: 180.1 s, 378.3 tps, lat 168.859 ms stddev 403.032
    progress: 240.0 s, 358.6 tps, lat 178.792 ms stddev 417.703
    progress: 300.1 s, 363.0 tps, lat 176.089 ms stddev 447.271
    progress: 420.0 s, 579.7 tps, lat 110.126 ms stddev 243.220
    progress: 480.1 s, 138.7 tps, lat 461.893 ms stddev 980.454
    progress: 540.6 s, 114.6 tps, lat 558.521 ms stddev 1048.582
    progress: 600.0 s, 153.5 tps, lat 417.516 ms stddev 969.933
    transaction type: TPC-B (sort of)
    scaling factor: 10
    query mode: simple
    number of clients: 64
    number of threads: 16
    duration: 600 s
    number of transactions actually processed: 211803
    latency average: 181.513 ms
    latency stddev: 465.317 ms
    tps = 351.267829 (including connections establishing)
    tps = 351.520664 (excluding connections establishing)

But pgbench will unexpectedly show progress report extremely frequently,
if NUM >= 2148.
For example, passed --progress=3600, however, the pgbench report progress
with a frequency less than 1 second.

    $ ./pgbench --progress=3600 -T 259200
    starting vacuum...end.
    progress: 0.6 s, 77.6 tps, lat 140.995 ms stddev 39.355
    progress: 0.6 s, 0.0 tps, lat -nan ms stddev -nan
    progress: 0.6 s, 0.0 tps, lat -nan ms stddev -nan
    progress: 0.6 s, 0.0 tps, lat -nan ms stddev -nan
    progress: 0.6 s, 0.0 tps, lat -nan ms stddev -nan
    progress: 0.6 s, 0.0 tps, lat -nan ms stddev -nan
    progress: 0.6 s, 0.0 tps, lat -nan ms stddev -nan
    progress: 0.6 s, 0.0 tps, lat -nan ms stddev -nan
    progress: 0.6 s, 0.0 tps, lat -nan ms stddev -nan
    progress: 0.6 s, 264.6 tps, lat 172.637 ms stddev 0.000
    progress: 0.6 s, 0.0 tps, lat -nan ms stddev -nan
        .
        .
        .

====Diagnosis====
Looking into the source code, the root cause is integer overflow of
"int progress", see source code from "pgbench.c.

The integer progress multiple by 1000,000 to convert from second to micro-second.
If progress >= 2148, then the value of progress * 1000,000 (2,148,000,000)
will larger than max value of 32 bits integer (2,147,483,647).

The integer overflow lead to negative value, hence,
the value of 'next_report' does not incremental but decremental.

Therefore, the 'now' (current time) will be always larger than 'next_report',
This result in the progress report is printed out for every loop
(very very frequently), rather than wait for next time up of progress.

====Fix Patch Description====

Fix Integer overflow which happened when covert progress from seconds to micro-seconds by
(progress * 1000000) if progress >= 2148.

Attached the patch.

====Test Done====

Test passed with the patch:
$pgbench --progress=3600 -j 16 -c 64 -T 259200 pgbench
starting vacuum...end.
progress: 3600.0 s, 410.6 tps, lat 155.848 ms stddev 773.983
progress: 7200.0 s, 410.5 tps, lat 155.906 ms stddev 740.046
progress: 10800.0 s, 459.6 tps, lat 139.261 ms stddev 771.408
progress: 14400.1 s, 496.4 tps, lat 128.880 ms stddev 649.061


Best Regards,

Kingter Wang



Attachment
Hello,

> The integer progress multiple by 1000,000 to convert from second to
> micro-second. If progress >= 2148, then the value of progress * 1000,000
> (2,148,000,000) will larger than max value of 32 bits integer
> (2,147,483,647).

Indeed. Shame on me! I had not thought of overflows because of the false
security of in64 target type. Thanks for the missing cast!

--
Fabien.