Thread: Avoid stuck of pgbench due to skipped transactions
Hi,

I found that pgbench can get stuck when every transaction ends up being skipped and the number of transactions is not limited by the -t option.

For example, when I used a large rate (-R) for throttling and a small latency limit (-L) together with a duration (-T), pgbench got stuck.

 $ pgbench -T 5 -R 100000000 -L 1;

When we specify the number of transactions with -t, it doesn't get stuck because the number of skipped transactions is counted and checked during the loop. However, the timer expiration is not checked in the loop, although it is checked before and after a sleep for throttling.

I think it is better to check the timer expiration even in the loop of transaction skips and to finish pgbench successfully, because we should correctly report how many transactions were processed and skipped in this case too, and getting stuck would not be good anyway.

I attached a patch for this fix.

Regards,
Yugo Nagata

--
Yugo NAGATA <nagata@sraoss.co.jp>
Attachment
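The report above can be illustrated with a small model of the skip loop. This is only a sketch under stated assumptions, not pgbench code: `SkipModel`, `run_catchup`, the virtual clock, and the per-iteration cost are all hypothetical. It shows that when one skip-and-reschedule iteration costs more than the scheduling interval (10 ns at -R 100000000), the lag never shrinks, so only a timer check ends the loop.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model, not pgbench code. All times are in nanoseconds. */
typedef struct
{
    int64_t interval_ns;   /* time between scheduled transactions (1e9 / rate) */
    int64_t limit_ns;      /* latency limit (-L) */
    int64_t iter_cost_ns;  /* ASSUMED cost of one skip-and-reschedule iteration */
    int64_t end_ns;        /* virtual time at which the -T duration expires */
} SkipModel;

/*
 * Run the catch-up loop from virtual time now_ns, with the next transaction
 * scheduled at sched_ns. Returns the number of skipped transactions, or -1
 * if the loop is still running after max_iters (treated as "stuck" here).
 */
int64_t
run_catchup(const SkipModel *m, int64_t now_ns, int64_t sched_ns,
            bool check_timer, int64_t max_iters)
{
    int64_t skipped = 0;

    assert(m->interval_ns > 0 && m->iter_cost_ns > 0);

    /* "late" means the transaction is behind schedule by more than the limit */
    while (now_ns - sched_ns > m->limit_ns)
    {
        if (check_timer && now_ns >= m->end_ns)
            return skipped;             /* duration expired: stop skipping */
        if (skipped >= max_iters)
            return -1;                  /* give up: the loop is not converging */

        skipped++;                      /* count the skipped transaction */
        sched_ns += m->interval_ns;     /* schedule the next one */
        now_ns += m->iter_cost_ns;      /* the iteration itself takes time */
    }
    return skipped;
}
```

With an interval of 10 ns and an assumed iteration cost of 50 ns, the lag grows by 40 ns per iteration, so the loop only ends when `check_timer` is true. With a realistic interval (for example 1 ms), the same loop catches up after a couple of iterations.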
Hello Yugo-san,

> For example, when I used a large rate (-R) for throttling and a
> small latency limit (-L) together with a duration (-T), pgbench
> got stuck.
>
> $ pgbench -T 5 -R 100000000 -L 1;

Indeed, it does not get out of the catchup loop for a long time because even scheduling takes more time than the expected transaction time!

> I think it is better to check the timer expiration even in the loop
> of transaction skips and to finish pgbench successfully, because we
> should correctly report how many transactions were processed and
> skipped in this case too, and getting stuck would not be good
> anyway.
>
> I attached a patch for this fix.

The patch mostly works for me, and I agree that the bench should not be stuck in a loop for any parameters, even when "crazy" parameters are given…

However I'm not sure this is the right way to handle this issue.

The catch-up loop can be dropped and the automaton can loop over itself to reschedule. Doing that, as in the attached, fixes this issue, also makes progress reporting work properly in more cases, and reduces the number of lines of code. I did not add a test case because time-sensitive tests have been removed (which is too bad, IMHO).

--
Fabien.
Attachment
Hello Fabien,

On Sun, 13 Jun 2021 08:56:59 +0200 (CEST)
Fabien COELHO <coelho@cri.ensmp.fr> wrote:

> > I attached a patch for this fix.
>
> The patch mostly works for me, and I agree that the bench should not be
> stuck in a loop for any parameters, even when "crazy" parameters are given…
>
> However I'm not sure this is the right way to handle this issue.
>
> The catch-up loop can be dropped and the automaton can loop over itself to
> reschedule. Doing that, as in the attached, fixes this issue, also makes
> progress reporting work properly in more cases, and reduces the number of
> lines of code. I did not add a test case because time-sensitive tests have
> been removed (which is too bad, IMHO).

I agree with your way of fixing this. However, the progress reporting didn't work, because a break just continues the loop and we cannot return from advanceConnectionState to threadRun.

+ /* otherwise loop over PREPARE_THROTTLE */
  break;

I attached the fixed patch that uses return instead of break, and I confirmed that this made the progress reporting work properly.

Regards,
Yugo Nagata

--
Yugo NAGATA <nagata@sraoss.co.jp>
Attachment
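The break-versus-return point above is a general property of a C `switch` nested in a `for (;;)` loop, and it can be sketched in isolation. The code below is a hypothetical, much reduced state machine, not the pgbench source: `Client`, `advance`, `run_client`, and the state names are invented for illustration. A `break` leaves only the `switch`, so the function keeps looping internally. A `return` hands control back to the caller, which is where something like progress reporting could run.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical states, loosely modelled on a throttling state machine. */
typedef enum { ST_PREPARE_THROTTLE, ST_DONE } State;

typedef struct
{
    State state;
    int   late_txns;   /* transactions still behind schedule */
    int   skipped;     /* transactions counted as skipped so far */
} Client;

/* Advance the client. Returns true once the client is done. */
bool
advance(Client *c, bool yield_on_skip)
{
    assert(c->late_txns >= 0);

    for (;;)
    {
        switch (c->state)
        {
            case ST_PREPARE_THROTTLE:
                if (c->late_txns > 0)
                {
                    c->late_txns--;
                    c->skipped++;
                    if (yield_on_skip)
                        return false;   /* back to the caller after each skip */
                    break;              /* exits the switch only; for (;;) goes on */
                }
                c->state = ST_DONE;
                break;

            case ST_DONE:
                return true;
        }
    }
}

/* Hypothetical caller: counts how often it regains control before the end. */
int
run_client(Client *c, bool yield_on_skip)
{
    int returns_to_caller = 0;

    while (!advance(c, yield_on_skip))
        returns_to_caller++;    /* a real caller could report progress here */
    return returns_to_caller;
}
```

With three late transactions, the `break` variant skips all of them inside a single call, while the `return` variant gives the caller control back three times. Both count the same number of skipped transactions.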
>>> I attached a patch for this fix.
>>
>> The patch mostly works for me, and I agree that the bench should not be
>> stuck in a loop for any parameters, even when "crazy" parameters are given…
>>
>> However I'm not sure this is the right way to handle this issue.
>>
>> The catch-up loop can be dropped and the automaton can loop over itself to
>> reschedule. Doing that, as in the attached, fixes this issue, also makes
>> progress reporting work properly in more cases, and reduces the number of
>> lines of code. I did not add a test case because time-sensitive tests have
>> been removed (which is too bad, IMHO).
>
> I agree with your way of fixing this. However, the progress reporting didn't
> work, because a break just continues the loop and we cannot return from
> advanceConnectionState to threadRun.
>
> + /* otherwise loop over PREPARE_THROTTLE */
>   break;
>
> I attached the fixed patch that uses return instead of break, and I confirmed
> that this made the progress reporting work properly.

I'm hesitant to make such a structural change for a degenerate case linked to "insane" parameters, as pg is unlikely to reach 100 million tps, ever. It seems to me enough that the command is not blocked in such cases.

--
Fabien.
On Mon, 14 Jun 2021 08:47:40 +0200 (CEST)
Fabien COELHO <coelho@cri.ensmp.fr> wrote:

> >>> I attached a patch for this fix.
> >>
> >> The patch mostly works for me, and I agree that the bench should not be
> >> stuck in a loop for any parameters, even when "crazy" parameters are given…
> >>
> >> However I'm not sure this is the right way to handle this issue.
> >>
> >> The catch-up loop can be dropped and the automaton can loop over itself to
> >> reschedule. Doing that, as in the attached, fixes this issue, also makes
> >> progress reporting work properly in more cases, and reduces the number of
> >> lines of code. I did not add a test case because time-sensitive tests have
> >> been removed (which is too bad, IMHO).
> >
> > I agree with your way of fixing this. However, the progress reporting didn't
> > work, because a break just continues the loop and we cannot return from
> > advanceConnectionState to threadRun.
> >
> > + /* otherwise loop over PREPARE_THROTTLE */
> >   break;
> >
> > I attached the fixed patch that uses return instead of break, and I confirmed
> > that this made the progress reporting work properly.
>
> I'm hesitant to make such a structural change for a degenerate case linked
> to "insane" parameters, as pg is unlikely to reach 100 million tps, ever.
> It seems to me enough that the command is not blocked in such cases.

Sure. The change from "break" to "return" is just for making the progress reporting work in the loop, as you mentioned. However, my original intention was to avoid getting stuck in a corner case where an unrealistic parameter is used, and I agree with you that this change is not really necessary for handling such a special situation.

Regards,
Yugo Nagata

--
Yugo NAGATA <nagata@sraoss.co.jp>
On Mon, 14 Jun 2021 16:06:10 +0900
Yugo NAGATA <nagata@sraoss.co.jp> wrote:

> On Mon, 14 Jun 2021 08:47:40 +0200 (CEST)
> Fabien COELHO <coelho@cri.ensmp.fr> wrote:
>
> > > I attached the fixed patch that uses return instead of break, and I confirmed
> > > that this made the progress reporting work properly.
> >
> > I'm hesitant to make such a structural change for a degenerate case linked
> > to "insane" parameters, as pg is unlikely to reach 100 million tps, ever.
> > It seems to me enough that the command is not blocked in such cases.
>
> Sure. The change from "break" to "return" is just for making the progress
> reporting work in the loop, as you mentioned. However, my original intention
> was to avoid getting stuck in a corner case where an unrealistic parameter is
> used, and I agree with you that this change is not really necessary for
> handling such a special situation.

I attached the v2 patch to clarify that I withdrew the v3 patch.

Regards,
Yugo Nagata

--
Yugo NAGATA <nagata@sraoss.co.jp>
Attachment
The following review has been posted through the commitfest application:
make installcheck-world: tested, failed
Implements feature: tested, failed
Spec compliant: not tested
Documentation: not tested

Looks fine to me, as a way of catching this edge case.
Hello Greg,

On Tue, 22 Jun 2021 19:22:38 +0000
Greg Sabino Mullane <htamfids@gmail.com> wrote:

> The following review has been posted through the commitfest application:
> make installcheck-world: tested, failed
> Implements feature: tested, failed
> Spec compliant: not tested
> Documentation: not tested
>
> Looks fine to me, as a way of catching this edge case.

Thank you for looking into this!

'make installcheck-world' and 'Implements feature' are marked "failed", but did you find any problem with this patch?

--
Yugo NAGATA <nagata@sraoss.co.jp>
Apologies, just saw this. I found no problems, those "failures" were just me missing checkboxes on the commitfest interface. +1 on the patch.
Cheers,
Greg
On Tue, 10 Aug 2021 10:50:20 -0400
Greg Sabino Mullane <htamfids@gmail.com> wrote:

> Apologies, just saw this. I found no problems, those "failures" were just
> me missing checkboxes on the commitfest interface. +1 on the patch.

Thank you!

--
Yugo NAGATA <nagata@sraoss.co.jp>
On 2021/06/17 1:23, Yugo NAGATA wrote:
> I attached the v2 patch to clarify that I withdrew the v3 patch.

Thanks for the patch!

+ * For very unrealistic rates under -T, some skipped
+ * transactions are not counted because the catchup
+ * loop is not fast enough just to do the scheduling
+ * and counting at the expected speed.
+ *
+ * We do not bother with such a degenerate case.
+ */

ISTM that the patch changes pgbench so that it can skip counting some skipped transactions here even for realistic rates under -T. Of course, that would happen very rarely. Is this understanding right?

On the other hand, even without the patch, there seems to be no guarantee in the first place that all the skipped transactions are counted under -T. When the timer is exceeded in CSTATE_END_TX, a client ends without checking outstanding skipped transactions. Therefore the "issue" that some skipped transactions are not counted is not one the patch newly introduces. So that behavior change by the patch would be acceptable. Is this understanding right?

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
Hello Fujii-san,

> ISTM that the patch changes pgbench so that it can skip counting
> some skipped transactions here even for realistic rates under -T.
> Of course, that would happen very rarely. Is this understanding right?

Yes. The point is to get out of the scheduling loop when time has expired, as soon as it is known, instead of looping there for some possibly long time.

> On the other hand, even without the patch, there seems to be no guarantee
> in the first place that all the skipped transactions are counted under -T.
> When the timer is exceeded in CSTATE_END_TX, a client ends without
> checking outstanding skipped transactions.

Indeed. But that should be very few transactions under latency limit.

> Therefore the "issue" that some skipped transactions are not counted is
> not one the patch newly introduces.

Yep. The patch counts fewer of them though, because of the early exit introduced in the patch in the scheduling state. Before it could be stuck in the "while (late) { count; schedule; }" loop.

> So that behavior change by the patch would be acceptable. Is this
> understanding right?

I think so.

--
Fabien.
On 2021/09/04 15:27, Fabien COELHO wrote:
>
> Hello Fujii-san,
>
>> ISTM that the patch changes pgbench so that it can skip counting
>> some skipped transactions here even for realistic rates under -T.
>> Of course, that would happen very rarely. Is this understanding right?
>
> Yes. The point is to get out of the scheduling loop when time has expired, as soon as it is known, instead of looping there for some possibly long time.

Thanks for checking my understanding!

+ * For very unrealistic rates under -T, some skipped
+ * transactions are not counted because the catchup
+ * loop is not fast enough just to do the scheduling
+ * and counting at the expected speed.
+ *
+ * We do not bother with such a degenerate case.

So this comment is a bit misleading? What about updating this as follows?

------------------------------
Stop counting skipped transactions under -T as soon as the timer is exceeded. Because otherwise it can take a very long time to count all of them especially when quite a lot of them happen with unrealistically high rate setting in -R, which would prevent pgbench from ending immediately. Because of this behavior, note that there is no guarantee that all skipped transactions are counted under -T though there is under -t. This is OK in practice because it's very unlikely to happen with realistic setting.
------------------------------

>> So that behavior change by the patch would be acceptable. Is this understanding right?
>
> I think so.

+1

One question is: which version do we want to back-patch to?

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
Hello Fujii-san,

> Stop counting skipped transactions under -T as soon as the timer is
> exceeded. Because otherwise it can take a very long time to count all of
> them especially when quite a lot of them happen with unrealistically
> high rate setting in -R, which would prevent pgbench from ending
> immediately. Because of this behavior, note that there is no guarantee
> that all skipped transactions are counted under -T though there is under
> -t. This is OK in practice because it's very unlikely to happen with
> realistic setting.

Ok, I find this text quite clear.

> One question is: which version do we want to back-patch to?

If we consider it a "very minor bug fix" triggered by somewhat unrealistic options, I'd say 14 & dev, or possibly only dev.

--
Fabien.
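The "no guarantee that all skipped transactions are counted under -T" remark can be made concrete with a little arithmetic. This is a back-of-the-envelope sketch, not pgbench code: both function names are hypothetical, and the cost of counting one skipped transaction is an assumed figure, not a measurement.

```c
#include <assert.h>
#include <stdint.h>

/* Transactions the schedule generates during the run: rate * duration. */
int64_t
scheduled_txns(int64_t rate_tps, int64_t duration_s)
{
    return rate_tps * duration_s;
}

/*
 * Upper bound on the number of skips that can be counted before the timer
 * expires, given an ASSUMED cost (in nanoseconds) of counting one skip.
 */
int64_t
countable_skips(int64_t duration_s, int64_t cost_ns)
{
    assert(cost_ns > 0);
    return duration_s * INT64_C(1000000000) / cost_ns;
}
```

For `pgbench -T 5 -R 100000000`, `scheduled_txns(100000000, 5)` is 500,000,000. If counting one skip took 50 ns (an assumption), `countable_skips(5, 50)` would be 100,000,000, so at most a fifth of the scheduled transactions could be counted within the 5 seconds. Counting all of them would need about 25 seconds, which is why the loop has to stop at timer expiry instead.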
On 2021/09/07 18:24, Fabien COELHO wrote:
>
> Hello Fujii-san,
>
>> Stop counting skipped transactions under -T as soon as the timer is exceeded. Because otherwise it can take a very long time to count all of them especially when quite a lot of them happen with unrealistically high rate setting in -R, which would prevent pgbench from ending immediately. Because of this behavior, note that there is no guarantee that all skipped transactions are counted under -T though there is under -t. This is OK in practice because it's very unlikely to happen with realistic setting.
>
> Ok, I find this text quite clear.

Thanks for the check! So attached is the updated version of the patch.

>> One question is: which version do we want to back-patch to?
>
> If we consider it a "very minor bug fix" triggered by somewhat unrealistic options, I'd say 14 & dev, or possibly only dev.

Agreed. Since it's hard to imagine the issue happening in practice, we don't need to bother back-patching to the stable branches. So I'm thinking of committing the patch to 15dev and 14.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
Attachment
On 2021/09/08 23:40, Fujii Masao wrote:
> Agreed. Since it's hard to imagine the issue happening in practice,
> we don't need to bother back-patching to the stable branches.
> So I'm thinking of committing the patch to 15dev and 14.

Pushed. Thanks!

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION