Re: pgbench rate limiting changes transaction latency computation - Mailing list pgsql-hackers
From | Andres Freund |
---|---|
Subject | Re: pgbench rate limiting changes transaction latency computation |
Date | |
Msg-id | 20190611232432.3clamcuejrxuvzqy@alap3.anarazel.de Whole thread Raw |
In response to | Re: pgbench rate limiting changes transaction latency computation (Heikki Linnakangas <hlinnaka@iki.fi>) |
Responses |
Re: pgbench rate limiting changes transaction latency computation
Re: pgbench rate limiting changes transaction latency computation |
List | pgsql-hackers |
Hi, On 2019-06-11 11:31:15 +0300, Heikki Linnakangas wrote: > It's not fair to say that its meaning was changed. Before 9.4, there was no > -R option. Well, my point is that -R changed the existing meaning of a field, and that that's not nice. > Yeah, I can see that the server-observed transaction latency would often be > more useful than what's printed now. But changing the current meaning > doesn't seem like a good idea. Well, then a *new* column should have been added for that value under -R. Although admittedly 'lat' is not a very good descriptor for the non -R behaviour. But anyway, to go forward, I think we should replace 'lat' with a 'txtime' (or similar) column that is not affected by -R. And then, under -R only, add a new 'txlat' column, that shows the 'current' meaning of lat under -R. Not convinced the names are right, but you get the gist. > As Fabien said, the reported latency is the latency seen by the > imaginary user of the system, and without -R, there's no lag so it's the > same number. See also how it works with --latency-limit. The limit is on the > reported latency, which includes the lag. Well, that's how it works in a lot of scenarios (e.g. interactive scenarios, where users give up). But there's also a lot where the amount of work doesn't decrease due to lateness, it just queues up (e.g. many batch / queue processing workloads). > The outlier behavior seems very visible in both of the above. The system > completely stalled for about 1-2 seconds. And then it takes a few seconds to > process the backlog and catch up. But that was just because I was showing a simplistic example. E.g. here's a log of a vacuum finishing, and then another starting a few seconds later (both vacuums lasting a fair while): progress: 139.0 s, 2438.4 tps, txtime 13.033 ms stddev 3.830, lat 17530.784 ms stddev 590.153, lag 17517.751 ms progress: 140.0 s, 2489.0 tps, txtime 12.911 ms stddev 3.642, lat 17752.862 ms stddev 600.661, lag 17739.952 ms progress: 141.0 s, 2270.0 tps, txtime 14.021 ms stddev 4.965, lat 17973.805 ms stddev 594.784, lag 17959.784 ms progress: 142.0 s, 1408.0 tps, txtime 22.848 ms stddev 5.365, lat 18417.808 ms stddev 632.729, lag 18394.960 ms progress: 143.0 s, 3001.0 tps, txtime 10.724 ms stddev 4.318, lat 18796.971 ms stddev 617.462, lag 18786.247 ms progress: 144.0 s, 4678.0 tps, txtime 6.823 ms stddev 2.136, lat 18503.253 ms stddev 669.072, lag 18496.431 ms progress: 145.0 s, 4577.0 tps, txtime 7.001 ms stddev 1.526, lat 18108.596 ms stddev 689.843, lag 18101.596 ms progress: 146.0 s, 2596.0 tps, txtime 12.261 ms stddev 3.060, lat 17961.623 ms stddev 683.498, lag 17949.363 ms progress: 147.0 s, 2654.0 tps, txtime 12.072 ms stddev 3.282, lat 18120.009 ms stddev 685.074, lag 18107.938 ms progress: 148.0 s, 3471.0 tps, txtime 9.240 ms stddev 3.702, lat 18251.712 ms stddev 676.572, lag 18242.472 ms progress: 149.0 s, 3056.0 tps, txtime 10.468 ms stddev 5.131, lat 18058.950 ms stddev 675.334, lag 18048.482 ms progress: 150.0 s, 2319.0 tps, txtime 13.778 ms stddev 3.762, lat 18305.101 ms stddev 688.186, lag 18291.323 ms progress: 151.0 s, 2355.0 tps, txtime 13.567 ms stddev 3.891, lat 18586.073 ms stddev 691.656, lag 18572.506 ms progress: 152.0 s, 2321.0 tps, txtime 13.742 ms stddev 3.708, lat 18835.985 ms stddev 709.580, lag 18822.244 ms progress: 153.0 s, 2360.0 tps, txtime 13.604 ms stddev 3.533, lat 19121.166 ms stddev 709.682, lag 19107.562 ms The period inbetween where no vacuum was running is imo considerably harder to spot looking at 'lat'. I guess you can argue that one can just look at tps instead. But for other rate limited cases that doesn't work as well: progress: 121.0 s, 961.0 tps, txtime 3.452 ms stddev 0.947, lat 3.765 ms stddev 1.268, lag 0.313 ms progress: 122.0 s, 979.0 tps, txtime 5.388 ms stddev 8.737, lat 7.378 ms stddev 11.137, lag 1.990 ms progress: 123.0 s, 1078.8 tps, txtime 3.679 ms stddev 1.278, lat 4.322 ms stddev 3.216, lag 0.643 ms progress: 124.0 s, 1082.2 tps, txtime 5.575 ms stddev 9.790, lat 8.716 ms stddev 15.317, lag 3.141 ms progress: 125.0 s, 990.0 tps, txtime 3.489 ms stddev 1.148, lat 3.817 ms stddev 1.456, lag 0.328 ms progress: 126.0 s, 955.0 tps, txtime 9.284 ms stddev 15.362, lat 14.210 ms stddev 22.084, lag 4.926 ms progress: 127.0 s, 960.0 tps, txtime 11.951 ms stddev 11.222, lat 17.732 ms stddev 21.066, lag 5.781 ms progress: 128.0 s, 945.9 tps, txtime 11.702 ms stddev 17.590, lat 23.791 ms stddev 45.327, lag 12.089 ms progress: 129.0 s, 1013.1 tps, txtime 19.871 ms stddev 19.407, lat 42.530 ms stddev 39.582, lag 22.659 ms progress: 130.0 s, 1004.7 tps, txtime 12.748 ms stddev 7.864, lat 19.827 ms stddev 20.084, lag 7.079 ms progress: 131.0 s, 1025.2 tps, txtime 9.005 ms stddev 16.524, lat 18.491 ms stddev 29.864, lag 9.485 ms progress: 132.0 s, 1015.9 tps, txtime 3.366 ms stddev 0.885, lat 3.640 ms stddev 1.182, lag 0.274 ms progress: 133.0 s, 1013.1 tps, txtime 4.749 ms stddev 8.485, lat 6.828 ms stddev 12.520, lag 2.079 ms progress: 134.0 s, 1026.0 tps, txtime 4.362 ms stddev 2.556, lat 4.879 ms stddev 3.158, lag 0.517 ms here e.g. the was a noticable slowdown, where looking at tps doesn't help (because we're still able to meet the tps goal). I still don't quite get when I would want to look at lat, when I have lag. They're always going to be close by. > For testing the server under full load, like during that catch up period, > testing without -R seems better. One area in which postgres is pretty weak, although less bad than we used to be, is in is predicatable latency. Most production servers aren't run under the highest possible throughput, therefore optimizing jitter under loaded but not breakneck speeds is important. And to be able to localize where such latency is introduced, it's important to see the precise moment things got slower / where performance recovered. > > I think we should just restore lat to a sane behaviour under -R, and if > > you want to have lat + lag as a separate column in -R mode, then let's > > do that. > > It seems like a bad idea to change the meaning of the value after the fact. > Would be good to at least rename it, to avoid confusion. Maybe that's not > too important for the interactive -P reports, but got to be mindful about > the numbers logged in the log file, at least. Given that 'lat' is currently in the file, and I would bet the majority of users of pgbench use it without -R, I'm not convinced that's something to care about. Most are going to interpret it the way it's computed without -R. Greetings, Andres Freund
pgsql-hackers by date: