Thread: seems like a bug in pgbench -R

seems like a bug in pgbench -R

From
Tomas Vondra
Date:
Hi,

There seems to be a bug in pgbench when used with '-R' option, resulting
in stuck pgbench process. Reproducing it is pretty easy:

echo 'select 1' > select.sql

while /bin/true; do
  pgbench -n -f select.sql -R 1000 -j 8 -c 8 -T 1 > /dev/null 2>&1;
  date;
done;

I do get a stuck pgbench within a few rounds (~10-20), but YMMV. It gets
stuck like this (this is on REL_11_STABLE):

0x00007f3b1814fcb3 in __GI___select (nfds=nfds@entry=0,
readfds=readfds@entry=0x7fff9d668ec0, writefds=writefds@entry=0x0,
exceptfds=exceptfds@entry=0x0, timeout=timeout@entry=0x0)
    at ../sysdeps/unix/sysv/linux/select.c:41
41      return SYSCALL_CANCEL (select, nfds, readfds, writefds, exceptfds,
(gdb) bt
#0  0x00007f3b1814fcb3 in __GI___select (nfds=nfds@entry=0,
readfds=readfds@entry=0x7fff9d668ec0, writefds=writefds@entry=0x0,
exceptfds=exceptfds@entry=0x0, timeout=timeout@entry=0x0)
    at ../sysdeps/unix/sysv/linux/select.c:41
#1  0x000000000040834d in threadRun (arg=arg@entry=0x1c9f370) at
pgbench.c:5810
#2  0x0000000000403e0a in main (argc=<optimized out>, argv=<optimized
out>) at pgbench.c:5583

All the connections are already closed at this point (there is nothing
in pg_stat_activity and log_disconnections=on confirms that), and then
there's this:

(gdb) p remains
$1 = 0
(gdb) p nstate
$1 = 1
(gdb) p state[0]
$2 = {con = 0x0, id = 0, state = CSTATE_FINISHED, cstack = 0xf21b10,
use_file = 0, command = 1, variables = 0xf2b0b0, nvariables = 4,
vars_sorted = false, txn_scheduled = 699536519281, sleep_until = 0,
txn_begin = {tv_sec = 699536, tv_nsec = 518478603}, stmt_begin = {tv_sec
= 0,  tv_nsec = 0}, prepared = {false <repeats 128 times>}, cnt = 132,
ecnt = 0}

So I guess this is a bug in 12788ae49e1933f463bc59a6efe46c4a01701b76, or
one of the other commits touching this part of the code.

cheers

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: seems like a bug in pgbench -R

From
Fabien COELHO
Date:
> echo 'select 1' > select.sql
>
> while /bin/true; do
>  pgbench -n -f select.sql -R 1000 -j 8 -c 8 -T 1 > /dev/null 2>&1;
>  date;
> done;

Indeed. I'll look at it over the weekend.

> So I guess this is a bug in 12788ae49e1933f463bc59a6efe46c4a01701b76, or
> one of the other commits touching this part of the code.

Yep, possibly.

-- 
Fabien.


Re: seems like a bug in pgbench -R

From
Fabien COELHO
Date:
>> echo 'select 1' > select.sql
>> 
>> while /bin/true; do
>>  pgbench -n -f select.sql -R 1000 -j 8 -c 8 -T 1 > /dev/null 2>&1;
>>  date;
>> done;
>
> Indeed. I'll look at it over the weekend.
>
>> So I guess this is a bug in 12788ae49e1933f463bc59a6efe46c4a01701b76, or
>> one of the other commits touching this part of the code.

I could not reproduce this issue on head, but I confirm on 11.2.

-- 
Fabien.


Re: seems like a bug in pgbench -R

From
Tomas Vondra
Date:
On 3/15/19 5:16 PM, Fabien COELHO wrote:
> 
>>> echo 'select 1' > select.sql
>>>
>>> while /bin/true; do
>>>  pgbench -n -f select.sql -R 1000 -j 8 -c 8 -T 1 > /dev/null 2>&1;
>>>  date;
>>> done;
>>
>> Indeed. I'll look at it over the weekend.
>>
>>> So I guess this is a bug in 12788ae49e1933f463bc59a6efe46c4a01701b76, or
>>> one of the other commits touching this part of the code.
> 
> I could not reproduce this issue on head, but I confirm on 11.2.
> 

AFAICS on head it's fixed by 3bac77c48f166b9024a5ead984df73347466ae12

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: seems like a bug in pgbench -R

From
Fabien COELHO
Date:
Hello Tomas,

>>>> So I guess this is a bug in 12788ae49e1933f463bc59a6efe46c4a01701b76, or
>>>> one of the other commits touching this part of the code.
>>
>> I could not reproduce this issue on head, but I confirm on 11.2.
>>
>
> AFAICS on head it's fixed by 3bac77c48f166b9024a5ead984df73347466ae12

Thanks for the information.

I pinpointed the exact issue in one go: no surprise given that the patch 
was motivated by cleaning up this kind of external state machine changes 
which I thought doubtful and error prone.

Attached is a fix to apply on pg11.

-- 
Fabien.
Attachment

RE: seems like a bug in pgbench -R

From
"Imai, Yoshikazu"
Date:
Hi Fabien,

On Fri, Mar 15, 2019 at 4:17 PM, Fabien COELHO wrote:
> >> echo 'select 1' > select.sql
> >> 
> >> while /bin/true; do
> >>  pgbench -n -f select.sql -R 1000 -j 8 -c 8 -T 1 > /dev/null 2>&1;
> >>  date;
> >> done;
> >
> > Indeed. I'll look at it over the weekend.
> >
> >> So I guess this is a bug in 12788ae49e1933f463bc59a6efe46c4a01701b76, or
> >> one of the other commits touching this part of the code.
> 
> I could not reproduce this issue on head, but I confirm on 11.2.

I could reproduce the stuck on 11.4.

On Sat, Mar 16, 2019 at 10:14 AM, Fabien COELHO wrote:
> Attached is a fix to apply on pg11.

I confirm the stuck doesn't happen after applying your patch.

It passes make check-world.

This change seems not to affect performance, so I didn't do any performance
test.

> +        /* under throttling we may have finished the last client above */
> +        if (remains == 0)
> +            break;

If there are only CSTATE_WAIT_RESULT, CSTATE_SLEEP or CSTATE_THROTTLE clients,
a thread needs to wait the results or sleep. In that logic, there are the case
that a thread tried to wait the results when there are no clients wait the
results, and this causes the issue. This is happened when there are only
CSTATE_THROTLE clients and pgbench timeout is occured. Those clients will be
finished and "remains" will be 0.

I confirmed above codes prevent such a case.


I almost think this is ready for committer, but I have one question.

Is it better adding any check like if(maxsock != -1) before the select?


else                /* no explicit delay, select without timeout */
{
    nsocks = select(maxsock + 1, &input_mask, NULL, NULL, NULL);
}

--
Yoshikazu Imai



RE: seems like a bug in pgbench -R

From
Fabien COELHO
Date:
Hello Yoshikazu,

>> I could not reproduce this issue on head, but I confirm on 11.2.
>
> I could reproduce the stuck on 11.4.
>
>> Attached is a fix to apply on pg11.
>
> I confirm the stuck doesn't happen after applying your patch.

Ok, thanks for the feedback.

>> +        /* under throttling we may have finished the last client above */
>> +        if (remains == 0)
>> +            break;
>
> If there are only CSTATE_WAIT_RESULT, CSTATE_SLEEP or CSTATE_THROTTLE clients,
> a thread needs to wait the results or sleep. In that logic, there are the case
> that a thread tried to wait the results when there are no clients wait the
> results, and this causes the issue. This is happened when there are only
> CSTATE_THROTLE clients and pgbench timeout is occured. Those clients will be
> finished and "remains" will be 0.
>
> I confirmed above codes prevent such a case.

Yep.

> I almost think this is ready for committer,

Almost great, then.

> but I have one question. Is it better adding any check like if(maxsock 
> != -1) before the select?
>
> else                /* no explicit delay, select without timeout */
> {
>    nsocks = select(maxsock + 1, &input_mask, NULL, NULL, NULL);
> }

I think that it is not necessary because this case cannot happen: If some 
clients are still running (remains > 0), they are either sleeping, in 
which case there would be a timeout, or they are waiting for something 
from the server, otherwise the script could be advanced further so there 
would be something else to do for the thread.

We could check this by adding "Assert(maxsock != -1);" before this select, 
but I would not do that for a released version.

-- 
Fabien.



RE: seems like a bug in pgbench -R

From
"Imai, Yoshikazu"
Date:
On Wed, July 24, 2019 at 7:02 PM, Fabien COELHO wrote:
> > but I have one question. Is it better adding any check like if(maxsock
> > != -1) before the select?
> >
> > else                /* no explicit delay, select without timeout */
> > {
> >    nsocks = select(maxsock + 1, &input_mask, NULL, NULL, NULL); }
> 
> I think that it is not necessary because this case cannot happen: If some
> clients are still running (remains > 0), they are either sleeping, in
> which case there would be a timeout, or they are waiting for something
> from the server, otherwise the script could be advanced further so there
> would be something else to do for the thread.

Ah, I understand.

> We could check this by adding "Assert(maxsock != -1);" before this select,
> but I would not do that for a released version.

Yeah I also imagined that we can use Assert, but ah, it's released version.
I got it. Thanks for telling that.

So I'll mark this ready for committer.

--
Yoshikazu Imai




RE: seems like a bug in pgbench -R

From
Fabien COELHO
Date:
Hello Yoshikazu,

> |...] So I'll mark this ready for committer.

Ok, thanks for the review.

-- 
Fabien.



Re: seems like a bug in pgbench -R

From
Tom Lane
Date:
Fabien COELHO <coelho@cri.ensmp.fr> writes:
>> |...] So I'll mark this ready for committer.

> Ok, thanks for the review.

LGTM, pushed.

            regards, tom lane