Thread: pgsql: Refactor script execution state machine in pgbench.
Refactor script execution state machine in pgbench. The doCustom() function had grown into quite a mess. Rewrite it, in a more explicit state machine style, for readability. This also fixes one minor bug: if a script consisted entirely of meta commands, doCustom() never returned to the caller, so progress reports with the -P option were not printed. I don't want to backpatch this refactoring, and the bug is quite insignificant, so only commit this to master, and leave the bug unfixed in back-branches. Review and original bug report by Fabien Coelho. Discussion: <alpine.DEB.2.20.1607090850120.3412@sto> Branch ------ master Details ------- http://git.postgresql.org/pg/commitdiff/12788ae49e1933f463bc59a6efe46c4a01701b76 Modified Files -------------- src/bin/pgbench/pgbench.c | 1106 +++++++++++++++++++++++++++------------------ 1 file changed, 661 insertions(+), 445 deletions(-)
Heikki Linnakangas <heikki.linnakangas@iki.fi> writes: > Refactor script execution state machine in pgbench. There's at least one bug in this patch. It's caused pgbench to go into an infinite loop during the TAP tests on my buildfarm machine longfin. $ ps auxww | grep buildfarm | grep pgbench buildfarm 9033 100.0 0.1 2462896 5936 ?? R 4:06AM 719:14.18 /Users/buildfarm/bf-data/HEAD/inst/bin/pgbench--no-vacuum --client=5 --protocol=prepared --transactions=25 --file /Users/buildfarm/bf-data/HEAD/pgsql.build/src/bin/pgbench/tmp_check/data_main_JNh2/pgbench_script buildfarm 9023 0.0 0.2 2603480 18696 ?? S 4:06AM 0:00.34 /Users/buildfarm/bf-data/HEAD/inst/bin/postgres-D /Users/buildfarm/bf-data/HEAD/pgsql.build/src/bin/pgbench/tmp_check/data_main_JNh2/pgdata buildfarm 9009 0.0 0.1 2458948 12224 ?? S 4:06AM 0:04.56 /usr/bin/perl t/001_pgbench.pl buildfarm 9007 0.0 0.0 2433396 2220 ?? S 4:06AM 0:00.00 /bin/sh -c cd . && TESTDIR='/Users/buildfarm/bf-data/HEAD/pgsql.build/src/bin/pgbench'PATH="/Users/buildfarm/bf-data/HEAD/inst/bin:$PATH" PGPORT='65678'top_builddir='/Users/buildfarm/bf-data/HEAD/pgsql.build/src/bin/pgbench/../../..' PG_REGRESS='/Users/buildfarm/bf-data/HEAD/pgsql.build/src/bin/pgbench/../../../src/test/regress/pg_regress'prove -I ../../../src/test/perl/--verbose t/*.pl buildfarm 9002 0.0 0.0 2432936 2108 ?? S 4:06AM 0:00.01 /Applications/Xcode.app/Contents/Developer/usr/bin/make-C pgbench installcheck Some other buildfarm critters seem to have got through that successfully, so it's probably a portability issue not a hard failure. regards, tom lane
Hello Tom, > There's at least one bug in this patch. Indeed. > It's caused pgbench to go into an infinite loop during the TAP tests on > my buildfarm machine longfin. > > buildfarm 719:14.18 /Users/buildfarm/bf-data/HEAD/inst/bin/pgbench --no-vacuum --client=5 --protocol=prepared --transactions=25--file /Users/buildfarm/bf-data/HEAD/pgsql.build/src/bin/pgbench/tmp_check/data_main_JNh2/pgbench_script > > Some other buildfarm critters seem to have got through that successfully, > so it's probably a portability issue not a hard failure. Hmmm, not easy to identify, especially as I do not have a clang on MacOS... Could you catch the process to identify the infinite loop? What are the states of the five clients? I think that it may be stuck in doCustom or on threadRun. I'm not sure how it could be stuck in doCustom. Re-reading the code, I think that the "end_tx_processed" stuff could be removed in favor of a systematic return at the end of the transaction, but that does not explain anything wrt to an infinite loop: after 25 transactions it should have got to FINISHED and have exited. Maybe it can be stuck in threadRun in the "while (remains>0)" loop, which suggests that a remains-- has been forgotten... but I do not see how this is possible to forget a client by reading the code. So although I hastily declared that bugs would be easier to detect, without any additional clue it is hard to help. -- Fabien.
Fabien COELHO <coelho@cri.ensmp.fr> writes: > Hello Tom, >> It's caused pgbench to go into an infinite loop during the TAP tests on >> my buildfarm machine longfin. > Hmmm, not easy to identify, especially as I do not have a clang on > MacOS... > Could you catch the process to identify the infinite loop? > What are the states of the five clients? Yeah, I thought you'd ask for more info ;-). I am busy with wrapping the 9.6.0 release, but will go digging as soon as that's done. regards, tom lane
Fabien COELHO <coelho@cri.ensmp.fr> writes: > Could you catch the process to identify the infinite loop? I think the problem is here (pgbench.c lines 4550ff): bool ready; ... ready = FD_ISSET(sock, &input_mask); The result of FD_ISSET isn't a bool, it's an int. I can see in the looping process that the backend connection has socket FD 8, and I think what is happening is that FD_ISSET is returning 1<<8 and that's getting truncated to zero to fit in a bool (char), so that the code just below doesn't call doCustom and we never consume the waiting result. I couldn't reproduce this in a manual build, but perhaps the socket number assignments are a bit different in the buildfarm environment. Rather than s/bool/int/, I'm inclined to remove the "ready" variable altogether. We have exactly zero other places that make any assumptions about the width of FD_ISSET results, and I don't see a good reason why this needs to be the first one. I see a number of other things that look pretty infelicitous in this code --- for example, why is the loop at lines 4440ff break'ing after it comes across the first socket to wait for? That sure looks like it will fail to deal with more than one wait-able socket, rather destroying the point of having multiple threads. Will fix. regards, tom lane
Hello Tom, > I think the problem is here (pgbench.c lines 4550): Hmmm... Indeed "typedef char bool;". I thought postgresql bool was an int, I usually do "typedef enum { false, true } bool;" > I see a number of other things that look pretty infelicitous in this > code --- for example, why is the loop at lines 4440 break'ing after > it comes across the first socket to wait for? Indeed this seems strange. I will have a look at that as well. I'll try to look at these tonight, or if I can't by the end of the week. -- Fabien.
On 09/27/2016 02:25 AM, Tom Lane wrote: > Fabien COELHO <coelho@cri.ensmp.fr> writes: >> Could you catch the process to identify the infinite loop? > > I think the problem is here (pgbench.c lines 4550ff): > > bool ready; > ... > ready = FD_ISSET(sock, &input_mask); > > The result of FD_ISSET isn't a bool, it's an int. I can see in the > looping process that the backend connection has socket FD 8, and > I think what is happening is that FD_ISSET is returning 1<<8 and > that's getting truncated to zero to fit in a bool (char), so that > the code just below doesn't call doCustom and we never consume the > waiting result. Hah, good catch. It's quite a landmine that a macro named "is-something" doesn't return a boolean. I remember we had issues like this in some Postgres macros too, and added a "(<macro>) != 0)" into the macro definition to fix. Of course, FD_ISSET is not in our control, and "bool" is a Postgres thing anyway, so can't do that here. > Will fix. Thanks! - Heikki
> Will fix. Oops, I missed this line... Thanks! -- Fabien.