Re: pgbench progress report improvements - Mailing list pgsql-hackers
From | Fabien COELHO |
---|---|
Subject | Re: pgbench progress report improvements |
Date | |
Msg-id | alpine.DEB.2.02.1309210829260.18614@sto Whole thread Raw |
In response to | Re: pgbench progress report improvements (Noah Misch <noah@leadboat.com>) |
Responses |
Re: pgbench progress report improvements
Re: pgbench progress report improvements Re: pgbench progress report improvements |
List | pgsql-hackers |
Hello Noah, Thanks a lot for all these comments! I'm not planning to apply all of them directly, especially removing features that I think really desirable. Please find a defense against some of your suggestions. I wish to wait for some more feedback about these arguments before spending time in heavy changes. >> Improve pgbench measurements & progress report > > These changes are loosely coupled; please separate them into several patch > files: I thought about this. I submitted a bunch of very small pgbench patches to the previous commit fest, and it resulted in a lot of duplicated tests by reviewers, which made it a waste of their precious time. ISTM that what you suggest will result in more reviewer time, which is the critical resource of the commitfest. Moreover, splitting the patch as you suggest, especially in 7 patches, would result in conflicting or dependent patches, which creates yet more issues for the reviews. There is *some* coupling, because the reports are about the measurements performed, and I change both to try to have something consistent in the end. I can separate the doc fix, but this is just one line... Basically, I tried to optimize reviewer time by a combined patch focussed on pgbench measurements and their reporting. But this does not prevent discussing features separately! >> - Use progress option both under init & bench. >> >> Activate progress report by default, every 5 seconds. >> When initializing, --quiet reverts to the old every 100,000 insertions >> behavior... > > Patch (1): Change the default from --progress=0 to --progress=5 > > This has a disadvantage of causing two extra gettimeofday() calls per > transaction. That's not cheap on all platforms; a user comparing > pgbench results across versions will need to make a point of disabling > this again to make his results comparable. Given that threat and > uncertainty about which default would be more popular, I think we should > drop this part. The gettimeofday call time is very small compared to network and database (disk!) accesses involved in a pgbench run. On my small laptop, gettimeofday takes about 0.00000002 seconds (0.02 µs -- micro seconds) per call, which is typically under 1/1000 of the fastest local-socket read-only one-table cache-hit prepared trivial transaction. This is a little different when initializing, but I put a guard on gettimeofday in that case. If this is a blocker, I can put 0, but I really do not think it is necessary because of the performance impact. About the default behavior : it should be what is more useful. I found that running pgbench to get significant results requires long minutes or even hours because of warmup time. Running a command for 30 minutes without any feedback is annoying. The second point is that the feedback would help user notice that the figures evolve slowly but significantly, and realise that their too short runs are not significant at all. So it seems to me that it should be the desirable behavior, and --progress=0 is always possible anyway for the very performance (over-)conscious tester. > Patch (2): Make --initialize mode respect --progress. > > The definition you've chosen for --quiet makes that option contrary to its own > name: message-per-100k-tuples is typically more voluminous than your proposed > new default of message-per-5s. In fact, since --quiet currently switches from > message-per-100k-tuples to message-per-5s, your patch gives it the exact > opposite of its present effect. > > During the 9.3 development cycle, we _twice_ made changes pertaining to > --initialize message frequency: > > http://www.postgresql.org/message-id/flat/20120620.170427.347012755716399568.t-ishii@sraoss.co.jp > http://www.postgresql.org/message-id/flat/1346472039.18010.10.camel@vanquo.pezone.net > > That gives me pause about working through yet another change; we keep burning > time on this minor issue. I totally agree that this "quiet" is not convincing! My actual opinion is that quiet should just mean quiet:-), i.e. no progress report. I tried to preserve the row-counting behavior because I thought that someone would object otherwise, but I would be really in favor of dropping the row-counting report behavior altogether and only keep the time triggered report. >> - Measure transaction latency instead of computing it. > > Patch (3): Report the standard deviation of latency. > > Seems potentially useful; see inline comments below. > > In my limited testing, the interesting latency cases involved stray > clusters of transactions showing 10x-100x mean latency. If that's > typical, I'd doubt mean/stddev is a great analytical tool. But I have > little reason to believe that my experience here was representative. The stddev measurement is sensitive to the cache hit ratio, so this is a good indicator of whether the steady state is reached. Indeed, mean is the most basic tool, and stddev the second most basic... They are better than nothing, and the stddev does add value. > Patch (4): Redefine "latency" as reported by pgbench and report "lag" more. > > If we do this, should we also augment the --log output to contain the figures > necessary to reconstruct this calculation? I think that would mean adding > fields for the time when the first command of the transaction was sent. Yep, I thought about it, but decided against for the following reason: the lag measure is only significant under throttling, which is not the typical use case for pgbench. When there is no throttling, there is no lag, so nothing to report. Also, having two formats would mean that existing tool that analyse these logs would have to deal with both. > ... > Patch (5) > > That theory is sound, but I would like at least one report reproducing that > behavior and finding that this change improved it. It was really annoying when I was debugging pgbench, because I did not know whether flukes I was seeing were due to this issue or elsewhere. Having the start time taken when the test really starts help rule out this as a potential issue for strange figures on short runs. The bug report I mentionned was a pretty short test with a *lot* of threads which took possibly several seconds to be started, but I cannot run that on my small laptop. >> - Updated documentation & help message. > > Include any relevant documentation and --help updates with the patch > necessitating them. If there are errors in the documentation today, put fixes > for those in a separate patch (7). I could do a separate patch for a one-liner doc update for homogeneity... > Additionally, this patch contains numerous whitespace-only changes. Hmmm. This is emacs fault:-) I'll (try) to remove them. > Patches (5) and (6) are nicely small and free of UI questions. At a minimum, > let's try to get those done in this CommitFest. I propose dropping patches > (1) and (2). I disagree: I thing that having --progress and --quiet work consistently both under initialization and bench is valuable. The other point is whether the default should be 0, and I defended why it should really not be 0, see above. > Patches (3) and (4) have open questions, but I think they have > good value potential. See, and evaluate, my arguments and answers. >> Sample output under benchmarking with --progress=1 > >> progress: 36.0 s, 10.9 tps, 91.864 +- 124.874 ms lat >> progress: 37.0 s, 115.2 tps, 8.678 +- 1.792 ms lat > > "x +- y" does not convey that y is a standard deviation. I'm refering to the ± sign, but in ascii because I doubt that I can use a UTF character:-) From http://en.wikipedia.org/wiki/Plus-minus_sign: "In experimental sciences, the sign commonly indicates the confidence interval or error in a measurement, often the standarddeviation or standard error." ISTM that "pgbench" is experimental science. > I suggest getting the term "stddev" in there somehow, maybe like this: > > progress: 37.0 s, 115.2 tps, latency avg=8.678 ms stddev=1.792 My issue is to try to keep the line width under control so as to avoid line breaks in the terminal. Under throttling, there is also the time lag appended to the line. Moreover, using 'xxx=figure" breaks simple "cut" pipelining to extract the figures, so I would prefer to stick to spaces. Maybe: progress: 36.0 s, 115.2 tps, lat avg 9.678 ms stddev 1.792, lag 0.143 ms but I liked my "+-" approach:-) >> - " -R, --rate=SPEC target rate in transactions per second\n" >> + " -R, --rate=NUM target rate in transactions per second\n" > > This would fall under patch (7) if you feel it's needed. This is really to be homogeneous with other options. "needed" is a strong word. >> - int64 wait = (int64) >> - throttle_delay * -log(getrand(thread, 1, 1000)/1000.0); >> + int64 wait = (int64) (throttle_delay * >> + 1.00055271703 * -log(getrand(thread, 1, 10000)/10000.0)); > > This quantizes the random number, then dequantizes it, then corrects for the > resolution lost in the round trip. Yep! > Shouldn't it just be: > > int64 wait = (int64) (throttle_delay * > -log(1 - pg_erand48(thread->random_state))); It could. However with this approach the inserted delay would be pretty much unbounded. The benefit of using 10000 is that the delay is bounded by about 10 times the expected average rate, so although it is throttled by a stochastic process, it stays quite active and steady. Basically using the quantization seemed a good compromise to me... >> - if ((!use_quiet) && (j % 100000 == 0)) >> + if (!progress && (j % 100000 == 0)) > > Project style is "progress == 0" rather than "!progress". I could. I used "!progress" because "progress == 0" is interpreted as "no progress", and not "progress with value zero", which would mean a report at every possible time. >> @@ -2143,8 +2174,15 @@ printResults(int ttype, int normal_xacts, int nclients, >> [...] > This output only happens when using --rate. That has made sense > heretofore, because in the absence of rate limiting, average latency can > be trivially calculated from tps. Since we will now track stddev as > well, let's print something like this even when --rate is not in use. It is also printed without --rate. There is a "if" above because there is one report with "lag" (under --rate), and one without. >> - float tps, total_run, latency; >> + double tps, total_run, latency, sqlat, stdev, lag; > > Why these variable type changes? Because C computes stuff in double anyway (all maths functions such as sqrt above are about double), and it was an oversight from my part to have use float in the first place in the previous patch. Thus I fixed it. -- Fabien.
pgsql-hackers by date: