Thread: seems like a bug in pgbench -R
Hi, There seems to be a bug in pgbench when used with '-R' option, resulting in stuck pgbench process. Reproducing it is pretty easy: echo 'select 1' > select.sql while /bin/true; do pgbench -n -f select.sql -R 1000 -j 8 -c 8 -T 1 > /dev/null 2>&1; date; done; I do get a stuck pgbench within a few rounds (~10-20), but YMMV. It gets stuck like this (this is on REL_11_STABLE): 0x00007f3b1814fcb3 in __GI___select (nfds=nfds@entry=0, readfds=readfds@entry=0x7fff9d668ec0, writefds=writefds@entry=0x0, exceptfds=exceptfds@entry=0x0, timeout=timeout@entry=0x0) at ../sysdeps/unix/sysv/linux/select.c:41 41 return SYSCALL_CANCEL (select, nfds, readfds, writefds, exceptfds, (gdb) bt #0 0x00007f3b1814fcb3 in __GI___select (nfds=nfds@entry=0, readfds=readfds@entry=0x7fff9d668ec0, writefds=writefds@entry=0x0, exceptfds=exceptfds@entry=0x0, timeout=timeout@entry=0x0) at ../sysdeps/unix/sysv/linux/select.c:41 #1 0x000000000040834d in threadRun (arg=arg@entry=0x1c9f370) at pgbench.c:5810 #2 0x0000000000403e0a in main (argc=<optimized out>, argv=<optimized out>) at pgbench.c:5583 All the connections are already closed at this point (there is nothing in pg_stat_activity and log_disconnections=on confirms that), and then there's this: (gdb) p remains $1 = 0 (gdb) p nstate $1 = 1 (gdb) p state[0] $2 = {con = 0x0, id = 0, state = CSTATE_FINISHED, cstack = 0xf21b10, use_file = 0, command = 1, variables = 0xf2b0b0, nvariables = 4, vars_sorted = false, txn_scheduled = 699536519281, sleep_until = 0, txn_begin = {tv_sec = 699536, tv_nsec = 518478603}, stmt_begin = {tv_sec = 0, tv_nsec = 0}, prepared = {false <repeats 128 times>}, cnt = 132, ecnt = 0} So I guess this is a bug in 12788ae49e1933f463bc59a6efe46c4a01701b76, or one of the other commits touching this part of the code. cheers -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
> echo 'select 1' > select.sql > > while /bin/true; do > pgbench -n -f select.sql -R 1000 -j 8 -c 8 -T 1 > /dev/null 2>&1; > date; > done; Indeed. I'll look at it over the weekend. > So I guess this is a bug in 12788ae49e1933f463bc59a6efe46c4a01701b76, or > one of the other commits touching this part of the code. Yep, possibly. -- Fabien.
>> echo 'select 1' > select.sql >> >> while /bin/true; do >> pgbench -n -f select.sql -R 1000 -j 8 -c 8 -T 1 > /dev/null 2>&1; >> date; >> done; > > Indeed. I'll look at it over the weekend. > >> So I guess this is a bug in 12788ae49e1933f463bc59a6efe46c4a01701b76, or >> one of the other commits touching this part of the code. I could not reproduce this issue on head, but I confirm on 11.2. -- Fabien.
On 3/15/19 5:16 PM, Fabien COELHO wrote: > >>> echo 'select 1' > select.sql >>> >>> while /bin/true; do >>> pgbench -n -f select.sql -R 1000 -j 8 -c 8 -T 1 > /dev/null 2>&1; >>> date; >>> done; >> >> Indeed. I'll look at it over the weekend. >> >>> So I guess this is a bug in 12788ae49e1933f463bc59a6efe46c4a01701b76, or >>> one of the other commits touching this part of the code. > > I could not reproduce this issue on head, but I confirm on 11.2. > AFAICS on head it's fixed by 3bac77c48f166b9024a5ead984df73347466ae12 regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hello Tomas, >>>> So I guess this is a bug in 12788ae49e1933f463bc59a6efe46c4a01701b76, or >>>> one of the other commits touching this part of the code. >> >> I could not reproduce this issue on head, but I confirm on 11.2. >> > > AFAICS on head it's fixed by 3bac77c48f166b9024a5ead984df73347466ae12 Thanks for the information. I pinpointed the exact issue in one go: no surprise given that the patch was motivated by cleaning up this kind of external state machine changes which I thought doubtful and error prone. Attached is a fix to apply on pg11. -- Fabien.
Attachment
Hi Fabien, On Fri, Mar 15, 2019 at 4:17 PM, Fabien COELHO wrote: > >> echo 'select 1' > select.sql > >> > >> while /bin/true; do > >> pgbench -n -f select.sql -R 1000 -j 8 -c 8 -T 1 > /dev/null 2>&1; > >> date; > >> done; > > > > Indeed. I'll look at it over the weekend. > > > >> So I guess this is a bug in 12788ae49e1933f463bc59a6efe46c4a01701b76, or > >> one of the other commits touching this part of the code. > > I could not reproduce this issue on head, but I confirm on 11.2. I could reproduce the stuck on 11.4. On Sat, Mar 16, 2019 at 10:14 AM, Fabien COELHO wrote: > Attached is a fix to apply on pg11. I confirm the stuck doesn't happen after applying your patch. It passes make check-world. This change seems not to affect performance, so I didn't do any performance test. > + /* under throttling we may have finished the last client above */ > + if (remains == 0) > + break; If there are only CSTATE_WAIT_RESULT, CSTATE_SLEEP or CSTATE_THROTTLE clients, a thread needs to wait the results or sleep. In that logic, there are the case that a thread tried to wait the results when there are no clients wait the results, and this causes the issue. This is happened when there are only CSTATE_THROTLE clients and pgbench timeout is occured. Those clients will be finished and "remains" will be 0. I confirmed above codes prevent such a case. I almost think this is ready for committer, but I have one question. Is it better adding any check like if(maxsock != -1) before the select? else /* no explicit delay, select without timeout */ { nsocks = select(maxsock + 1, &input_mask, NULL, NULL, NULL); } -- Yoshikazu Imai
Hello Yoshikazu, >> I could not reproduce this issue on head, but I confirm on 11.2. > > I could reproduce the stuck on 11.4. > >> Attached is a fix to apply on pg11. > > I confirm the stuck doesn't happen after applying your patch. Ok, thanks for the feedback. >> + /* under throttling we may have finished the last client above */ >> + if (remains == 0) >> + break; > > If there are only CSTATE_WAIT_RESULT, CSTATE_SLEEP or CSTATE_THROTTLE clients, > a thread needs to wait the results or sleep. In that logic, there are the case > that a thread tried to wait the results when there are no clients wait the > results, and this causes the issue. This is happened when there are only > CSTATE_THROTLE clients and pgbench timeout is occured. Those clients will be > finished and "remains" will be 0. > > I confirmed above codes prevent such a case. Yep. > I almost think this is ready for committer, Almost great, then. > but I have one question. Is it better adding any check like if(maxsock > != -1) before the select? > > else /* no explicit delay, select without timeout */ > { > nsocks = select(maxsock + 1, &input_mask, NULL, NULL, NULL); > } I think that it is not necessary because this case cannot happen: If some clients are still running (remains > 0), they are either sleeping, in which case there would be a timeout, or they are waiting for something from the server, otherwise the script could be advanced further so there would be something else to do for the thread. We could check this by adding "Assert(maxsock != -1);" before this select, but I would not do that for a released version. -- Fabien.
On Wed, July 24, 2019 at 7:02 PM, Fabien COELHO wrote: > > but I have one question. Is it better adding any check like if(maxsock > > != -1) before the select? > > > > else /* no explicit delay, select without timeout */ > > { > > nsocks = select(maxsock + 1, &input_mask, NULL, NULL, NULL); } > > I think that it is not necessary because this case cannot happen: If some > clients are still running (remains > 0), they are either sleeping, in > which case there would be a timeout, or they are waiting for something > from the server, otherwise the script could be advanced further so there > would be something else to do for the thread. Ah, I understand. > We could check this by adding "Assert(maxsock != -1);" before this select, > but I would not do that for a released version. Yeah I also imagined that we can use Assert, but ah, it's released version. I got it. Thanks for telling that. So I'll mark this ready for committer. -- Yoshikazu Imai
Hello Yoshikazu, > |...] So I'll mark this ready for committer. Ok, thanks for the review. -- Fabien.
Fabien COELHO <coelho@cri.ensmp.fr> writes: >> |...] So I'll mark this ready for committer. > Ok, thanks for the review. LGTM, pushed. regards, tom lane