Re: pgbench rate limiting changes transaction latency computation - Mailing list pgsql-hackers

From Heikki Linnakangas
Subject Re: pgbench rate limiting changes transaction latency computation
Date
Msg-id a411e48c-f5af-45a4-873a-770c5a2da863@iki.fi
Whole thread Raw
In response to Re: pgbench rate limiting changes transaction latency computation  (Fabien COELHO <coelho@cri.ensmp.fr>)
Responses Re: pgbench rate limiting changes transaction latency computation
List pgsql-hackers
On 12/06/2019 09:23, Fabien COELHO wrote:
>> But that was just because I was showing a simplistic example. E.g.
>> here's a log of a vacuum finishing, and then another starting a few
>> seconds later (both vacuums lasting a fair while):
>>
>> progress: 139.0 s, 2438.4 tps, txtime 13.033 ms stddev 3.830, lat 17530.784 ms stddev 590.153, lag 17517.751 ms
>> progress: 140.0 s, 2489.0 tps, txtime 12.911 ms stddev 3.642, lat 17752.862 ms stddev 600.661, lag 17739.952 ms
>> progress: 141.0 s, 2270.0 tps, txtime 14.021 ms stddev 4.965, lat 17973.805 ms stddev 594.784, lag 17959.784 ms
>> progress: 142.0 s, 1408.0 tps, txtime 22.848 ms stddev 5.365, lat 18417.808 ms stddev 632.729, lag 18394.960 ms
>> progress: 143.0 s, 3001.0 tps, txtime 10.724 ms stddev 4.318, lat 18796.971 ms stddev 617.462, lag 18786.247 ms
>> progress: 144.0 s, 4678.0 tps, txtime 6.823 ms stddev 2.136, lat 18503.253 ms stddev 669.072, lag 18496.431 ms
>> progress: 145.0 s, 4577.0 tps, txtime 7.001 ms stddev 1.526, lat 18108.596 ms stddev 689.843, lag 18101.596 ms
>> progress: 146.0 s, 2596.0 tps, txtime 12.261 ms stddev 3.060, lat 17961.623 ms stddev 683.498, lag 17949.363 ms
>> progress: 147.0 s, 2654.0 tps, txtime 12.072 ms stddev 3.282, lat 18120.009 ms stddev 685.074, lag 18107.938 ms
>> progress: 148.0 s, 3471.0 tps, txtime 9.240 ms stddev 3.702, lat 18251.712 ms stddev 676.572, lag 18242.472 ms
>> progress: 149.0 s, 3056.0 tps, txtime 10.468 ms stddev 5.131, lat 18058.950 ms stddev 675.334, lag 18048.482 ms
>> progress: 150.0 s, 2319.0 tps, txtime 13.778 ms stddev 3.762, lat 18305.101 ms stddev 688.186, lag 18291.323 ms
>> progress: 151.0 s, 2355.0 tps, txtime 13.567 ms stddev 3.891, lat 18586.073 ms stddev 691.656, lag 18572.506 ms
>> progress: 152.0 s, 2321.0 tps, txtime 13.742 ms stddev 3.708, lat 18835.985 ms stddev 709.580, lag 18822.244 ms
>> progress: 153.0 s, 2360.0 tps, txtime 13.604 ms stddev 3.533, lat 19121.166 ms stddev 709.682, lag 19107.562 ms
>>
>> The period inbetween where no vacuum was running is imo considerably
>> harder to spot looking at 'lat'.
> 
> ISTM that the signal is pretty clear in whether the lag increases or
> decreases. Basically the database is 18 seconds behind its load, which is
> very bad if a user is waiting.

That was my thought too, when looking at this example. When there is 
already a long queue of transactions, the txtime of individual 
transactions doesn't matter much. The most important thing under that 
condition is how fast the system can dissolve the queue (or how fast it 
builds up even more). So the derivative of the lag or lat seems like the 
most important figure. We don't print exactly that, but it's roughly the 
same as the TPS. Jitter experienced by the user matters too, i.e. stddev 
of 'lat'.

To illustrate this, imagine that the server magically detected that 
there's a long queue of transactions. It would be beneficial to go into 
"batch mode", where it collects incoming transactions into larger 
batches. The effect of this imaginary batch mode is that the TPS rate 
increases by 50%, but the txtime also increases by 1000%, and becomes 
highly variable. Would that be a good tradeoff? I would say yes. The 
user is experiencing an 18 s delay anyway, and the increase in txtime 
would be insignificant compared to that, but the queue would be busted 
more quickly.

Of course, there is no such batch mode in PostgreSQL, and I wouldn't 
suggest trying to implement anything like that. In a different kind of 
application, you would rather maintain a steady txtime when the server 
is at full load, even if it means a lower overall TPS rate. And that 
feels like a more important goal than just TPS. I think we all agree on 
that. To simulate that kind of an application, though, you probably 
don't want to use -R, or you would use it with --latency-limit. Except 
clearly Andres is trying to do just that, which is why I'm still a bit 
confused :-).

- Heikki



pgsql-hackers by date:

Previous
From: legrand legrand
Date:
Subject: Re: Adaptive query optimization
Next
From: Masahiko Sawada
Date:
Subject: Re: [PATCH] Speedup truncates of relation forks