Thread: pgbench bug candidate: negative "initial connection time"

pgbench bug candidate: negative "initial connection time"

From

"kuroda.hayato@fujitsu.com"

Date:

11 June 2021, 08:58:45

Hi Hackers,

I played pgbench with wrong parameters, and I found bug-candidate.
The latest commit in my source is 3a09d75.

1. Do initdb and start.
2. Initialize schema and data with "scale factor" = 1.
3. execute following command many times:

$ pgbench -c 101 -j 10 postgres

Then, sometimes the negative " initial connection time" was returned.
Lateyncy average is also strange.

```
$ pgbench -c 101 -j 10 postgres
starting vacuum...end.
pgbench: error: connection to server on socket "/tmp/.s.PGSQL.5432" failed: FATAL:  sorry, too many clients already
pgbench (PostgreSQL) 14.0
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 1
query mode: simple
number of clients: 101
number of threads: 10
number of transactions per client: 10
number of transactions actually processed: 910/1010
latency average = 41387686.662 ms
initial connection time = -372896921.586 ms
tps = 0.002440 (without initial connection time)
```

I sought pgbench.c and found a reason.
When a thread failed to get some connections, they do not fill any values to thread->bench_start in threadRun().
And if the failure is caused in the final thread (this means threads[nthreads - 1]->bench_start is zero),
the following if-statement sets bench_start to zero.

```
   6494                 /* first recorded benchmarking start time */
   6495                 if (bench_start == 0 || thread->bench_start < bench_start)
   6496                         bench_start = thread->bench_start;
```

The wrong bench_start propagates to printResult() and then some invalid values are appered.

```
   6509         printResults(&stats, pg_time_now() - bench_start, conn_total_duration,
   6510                                  bench_start - start_time, latency_late);
```

I cannot distinguish whether we have to fix it, but I attache the patch.
This simply ignores a result when therad->bench_start is zero.

How do you think?

Best Regards,
Hayato Kuroda
FUJITSU LIMITED

Attachment

ignore_failed_thread.patch

Re: pgbench bug candidate: negative "initial connection time"

From

Fabien COELHO

Date:

11 June 2021, 14:57:35


Hello Hayato-san,

> I played pgbench with wrong parameters,

That's good:-)

> and I found bug-candidate.
>
> 1. Do initdb and start.
> 2. Initialize schema and data with "scale factor" = 1.
> 3. execute following command many times:
>
> $ pgbench -c 101 -j 10 postgres
>
> Then, sometimes the negative " initial connection time" was returned.
> Lateyncy average is also strange.
>
> ```
> $ pgbench -c 101 -j 10 postgres
> starting vacuum...end.
> pgbench: error: connection to server on socket "/tmp/.s.PGSQL.5432" failed: FATAL:  sorry, too many clients already

Hmmm.

AFAICR there was a decision to generate a report even if something went 
very wrong, in this case some client could not connect, so some values 
are not initialized, hence the absurd figures, as you show below.

Maybe we should revisit this decision.

> initial connection time = -372896921.586 ms

> I sought pgbench.c and found a reason.

> When a thread failed to get some connections, they do not fill any values to thread->bench_start in threadRun().
> And if the failure is caused in the final thread (this means threads[nthreads - 1]->bench_start is zero),
> the following if-statement sets bench_start to zero.

> I cannot distinguish whether we have to fix it, but I attache the patch.
> This simply ignores a result when therad->bench_start is zero.

> How do you think?

Hmmm. Possibly. Another option could be not to report anything after some 
errors. I'm not sure, because it would depend on the use case. I guess the 
command returned an error status as well.

I'm going to give it some thoughts.

-- 
Fabien.

RE: pgbench bug candidate: negative "initial connection time"

From

"kuroda.hayato@fujitsu.com"

Date:

14 June 2021, 00:42:12

Dear Fabien,

Thank you for replying!

> Hmmm. Possibly. Another option could be not to report anything after some
> errors. I'm not sure, because it would depend on the use case. I guess the
> command returned an error status as well.

I did not know any use cases and decisions , but I vote to report nothing when error occurs.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED

Re: pgbench bug candidate: negative "initial connection time"

From

Yugo NAGATA

Date:

14 June 2021, 08:05:18

Hello Kuroda-san,

On Fri, 11 Jun 2021 08:58:45 +0000
"kuroda.hayato@fujitsu.com" <kuroda.hayato@fujitsu.com> wrote:

> Hi Hackers,
> 
> I played pgbench with wrong parameters, and I found bug-candidate.
> The latest commit in my source is 3a09d75.
> 
> 1. Do initdb and start.
> 2. Initialize schema and data with "scale factor" = 1.
> 3. execute following command many times:
> 
> $ pgbench -c 101 -j 10 postgres
> 
> Then, sometimes the negative " initial connection time" was returned.
> Lateyncy average is also strange.
> 
> ```
> $ pgbench -c 101 -j 10 postgres
> starting vacuum...end.
> pgbench: error: connection to server on socket "/tmp/.s.PGSQL.5432" failed: FATAL:  sorry, too many clients already
> pgbench (PostgreSQL) 14.0
> transaction type: <builtin: TPC-B (sort of)>
> scaling factor: 1
> query mode: simple
> number of clients: 101
> number of threads: 10
> number of transactions per client: 10
> number of transactions actually processed: 910/1010
> latency average = 41387686.662 ms
> initial connection time = -372896921.586 ms
> tps = 0.002440 (without initial connection time)
> ```
> 
> I sought pgbench.c and found a reason.
> When a thread failed to get some connections, they do not fill any values to thread->bench_start in threadRun().
> And if the failure is caused in the final thread (this means threads[nthreads - 1]->bench_start is zero),
> the following if-statement sets bench_start to zero.
> 
> ```
>    6494                 /* first recorded benchmarking start time */
>    6495                 if (bench_start == 0 || thread->bench_start < bench_start)
>    6496                         bench_start = thread->bench_start;
> ```
> 
> The wrong bench_start propagates to printResult() and then some invalid values are appered.
> 
> ```
>    6509         printResults(&stats, pg_time_now() - bench_start, conn_total_duration,
>    6510                                  bench_start - start_time, latency_late);
> ```
> 
> I cannot distinguish whether we have to fix it, but I attache the patch.
> This simply ignores a result when therad->bench_start is zero.


 +        /* skip if the thread faild to get connection */
+        if (thread->bench_start == 0)
+            continue;

It detects if a thread failed to get the initial connection by thread->bench_start == 0, but this assumes the initial
valueis zero. For ensuring this, I think it is better to initialize it in an early state, for example like this.
 

@@ -6419,6 +6419,7 @@ main(int argc, char **argv)
                initRandomState(&thread->ts_throttle_rs);
                initRandomState(&thread->ts_sample_rs);
                thread->logfile = NULL; /* filled in later */
+               thread->bench_start = 0;
                thread->latency_late = 0;
                initStats(&thread->stats, 0)

typo: faild -> failed

Regards,
Yugo Nagata
-- 
Yugo NAGATA <nagata@sraoss.co.jp>

Re: pgbench bug candidate: negative "initial connection time"

From

Yugo NAGATA

Date:

14 June 2021, 08:07:02

On Mon, 14 Jun 2021 00:42:12 +0000
"kuroda.hayato@fujitsu.com" <kuroda.hayato@fujitsu.com> wrote:

> Dear Fabien,
> 
> Thank you for replying!
> 
> > Hmmm. Possibly. Another option could be not to report anything after some 
> > errors. I'm not sure, because it would depend on the use case. I guess the 
> > command returned an error status as well.
> 
> I did not know any use cases and decisions , but I vote to report nothing when error occurs.

I would prefer to abort the thread whose connection got an error and report
results for other threads, as handled when doConnect fails in CSTATE_START_TX
state. 

In this case, we have to set the state to CSTATE_ABORT before going to 'done'
as fixed in the attached patch, in order to ensure that exit status is 2 and the
result reports "pgbench: fatal: Run was aborted; the above results are incomplete." 

Otherwise, if we want pgbench to exit immediately when a connection error occurs, 
we have tocall exit(1) to ensure the exit code is 1, of course. Anyway, it is wrong
that thecurrent pgbench exit successfully with exit code 0 when doConnnect fails.

Regards,
Yugo Nagata

-- 
Yugo NAGATA <nagata@sraoss.co.jp>

Attachment

pgbench_connect_abort.patch

Re: pgbench bug candidate: negative "initial connection time"

From

Fabien COELHO

Date:

14 June 2021, 09:30:14

>>> Hmmm. Possibly. Another option could be not to report anything after some
>>> errors. I'm not sure, because it would depend on the use case. I guess the
>>> command returned an error status as well.
>>
>> I did not know any use cases and decisions , but I vote to report nothing when error occurs.
>
> I would prefer to abort the thread whose connection got an error and report
> results for other threads, as handled when doConnect fails in CSTATE_START_TX
> state.

It is unclear to me whether it makes much sense to report performance when 
things go wrong. At least when a one connection per client bench is run 
ISTM that it should not proceed, because the bench could not even start 
as prescribe. When connection break while the bench has already started, 
maybe it makes more sense to proceed, although I guess that maybe 
reattempting connections would make also sense in such case.

> In this case, we have to set the state to CSTATE_ABORT before going to 'done'
> as fixed in the attached patch, in order to ensure that exit status is 2 and the
> result reports "pgbench: fatal: Run was aborted; the above results are incomplete."

Hmmm. I agree that at least reporting that there was an issue is a good 
idea.

> Otherwise, if we want pgbench to exit immediately when a connection error occurs,
> we have tocall exit(1) to ensure the exit code is 1, of course. Anyway, it is wrong
> that thecurrent pgbench exit successfully with exit code 0 when doConnnect fails.

Indeed, I can only agree on this one.

-- 
Fabien.

Re: pgbench bug candidate: negative "initial connection time"

From

Yugo NAGATA

Date:

16 June 2021, 15:59:34

Hello Fabien,

On Mon, 14 Jun 2021 11:30:14 +0200 (CEST)
Fabien COELHO <coelho@cri.ensmp.fr> wrote:

> >>> Hmmm. Possibly. Another option could be not to report anything after some
> >>> errors. I'm not sure, because it would depend on the use case. I guess the
> >>> command returned an error status as well.
> >>
> >> I did not know any use cases and decisions , but I vote to report nothing when error occurs.
> >
> > I would prefer to abort the thread whose connection got an error and report
> > results for other threads, as handled when doConnect fails in CSTATE_START_TX
> > state.
> 
> It is unclear to me whether it makes much sense to report performance when 
> things go wrong. At least when a one connection per client bench is run 
> ISTM that it should not proceed, because the bench could not even start 
> as prescribe. 

I agreed that when an initial connections fails we cannot start a bench
in the condition that the user wants and that we should stop early to let
the user know it and check the conf. 

I attached a patch, which is a fusion of my previous patch that changes the
state to CSTATE_ABORT when the socket get failure during the bench, and a
part of your patch attached in [1] that exits for initial failures.

[1] https://www.postgresql.org/message-id/alpine.DEB.2.22.394.2106141011100.1338009%40pseudo

> When connection break while the bench has already started, 
> maybe it makes more sense to proceed, 

The result would be incomplete also in this case. However, the reason why
it is worth to proceed is that  such information is still useful for users,
or we don't want to waste the bench that has already started?

> although I guess that maybe 
> reattempting connections would make also sense in such case.

This might become possible after pgbench gets the feature to retry in deadlock
or serialization errors. I am working on rebase of the patch  [2] and I will
submit this in a few days.

[2] https://www.postgresql.org/message-id/20210524112910.444fbfdfbff747bd3b9720ee@sraoss.co.jp

-- 
Yugo NAGATA <nagata@sraoss.co.jp>

Attachment

pgbench_connect_abort-2.patch

Re: pgbench bug candidate: negative "initial connection time"

From

Fabien COELHO

Date:

16 June 2021, 18:25:31

Hello Yugo-san,

>> When connection break while the bench has already started,
>> maybe it makes more sense to proceed,
>
> The result would be incomplete also in this case. However, the reason why
> it is worth to proceed is that  such information is still useful for users,
> or we don't want to waste the bench that has already started?

Hmmm. It depends on what the user is testing. If one is interested in 
client resilience under errors, the bench should probably attempt a 
reconnect. If one is interested in best performance when all is well,
then clearly something is amiss and there is no point to go on.

>> although I guess that maybe reattempting connections would make also 
>> sense in such case.
>
> This might become possible after pgbench gets the feature to retry in deadlock
> or serialization errors.

Yes, I agree that part of the needed infrastructure would be in place for 
that. As reconnecting is already in place under -c, so possibly it is just 
a matter of switching between states with some care.

> I am working on rebase of the patch [2] and I will submit this in a few 
> days.

Ok. Very good, I look forward to your submission! I'll be sure to look at 
it.

-- 
Fabien.

Re: pgbench bug candidate: negative "initial connection time"

From

Yugo NAGATA

Date:

17 June 2021, 03:55:56

On Wed, 16 Jun 2021 20:25:31 +0200 (CEST)
Fabien COELHO <coelho@cri.ensmp.fr> wrote:

> 
> Hello Yugo-san,
> 
> >> When connection break while the bench has already started,
> >> maybe it makes more sense to proceed,
> >
> > The result would be incomplete also in this case. However, the reason why
> > it is worth to proceed is that  such information is still useful for users,
> > or we don't want to waste the bench that has already started?
> 
> Hmmm. It depends on what the user is testing. If one is interested in 
> client resilience under errors, the bench should probably attempt a 
> reconnect. If one is interested in best performance when all is well,
> then clearly something is amiss and there is no point to go on.

Agreed. After --max-tries options is implemented on pgbench, we would be
able to add a new feature to allow users to choose if we try to reconnect
or not. However, we don't have it yet for now,  so we should just abort
the client and report the abortion at the end of the bench when a connection
or socket error occurs during the bench, as same the existing behaviour.

By the way, the issue of inital connection erros reported in this thread
will be fixed by the patch attached in my previous post (a major part are
written by you :-) ).  Is this acceptable for you?

Regards,
Yugo Nagata

-- 
Yugo NAGATA <nagata@sraoss.co.jp>

Re: pgbench bug candidate: negative "initial connection time"

From

Fabien COELHO

Date:

17 June 2021, 08:37:05

Hello Yugo-san,

> By the way, the issue of inital connection erros reported in this thread
> will be fixed by the patch attached in my previous post (a major part are
> written by you :-)

That does not, on its own, ensure that it is bug free:-)

> ).  Is this acceptable for you?

I disagree on two counts:

First, thread[0] should not appear.

Second, currently the *only* function to change the client state is 
advanceConnectionState, so it can be checked there and any bug is only 
there. We had issues before when several functions where doing updates, 
and it was a mess to understand what was going on. I really want that it 
stays that way, so I disagree with setting the state to ABORTED from 
threadRun. Moreover I do not see that it brings a feature, so ISTM that it 
is not an actual issue not to do it?

-- 
Fabien.

Re: pgbench bug candidate: negative "initial connection time"

From

Yugo NAGATA

Date:

17 June 2021, 09:16:42

Hello Fabien,

On Thu, 17 Jun 2021 10:37:05 +0200 (CEST)
Fabien COELHO <coelho@cri.ensmp.fr> wrote:

> > ).  Is this acceptable for you?
> 
> I disagree on two counts:
> 
> First, thread[0] should not appear.
> 
> Second, currently the *only* function to change the client state is 
> advanceConnectionState, so it can be checked there and any bug is only 
> there. We had issues before when several functions where doing updates, 
> and it was a mess to understand what was going on. I really want that it 
> stays that way, so I disagree with setting the state to ABORTED from 
> threadRun. Moreover I do not see that it brings a feature, so ISTM that it 
> is not an actual issue not to do it?

Ok. I gave up to change the state in threadRun. Instead, I changed the
condition at the end of bench, which enables to report abortion due to
socket errors.

+@@ -6480,7 +6490,7 @@ main(int argc, char **argv)
+ #endif                            /* ENABLE_THREAD_SAFETY */
+ 
+         for (int j = 0; j < thread->nstate; j++)
+-            if (thread->state[j].state == CSTATE_ABORTED)
++            if (thread->state[j].state != CSTATE_FINISHED)
+                 exit_code = 2;
+ 
+         /* aggregate thread level stats */

Does this make sense?

-- 
Yugo NAGATA <nagata@sraoss.co.jp>

Attachment

pgbench_connect_abort-3.patch

Re: pgbench bug candidate: negative "initial connection time"

From

Fabien COELHO

Date:

17 June 2021, 09:52:04

>> Second, currently the *only* function to change the client state is
>> advanceConnectionState, so it can be checked there and any bug is only
>> there. We had issues before when several functions where doing updates,
>> and it was a mess to understand what was going on. I really want that it
>> stays that way, so I disagree with setting the state to ABORTED from
>> threadRun. Moreover I do not see that it brings a feature, so ISTM that it
>> is not an actual issue not to do it?
>
> Ok. I gave up to change the state in threadRun. Instead, I changed the
> condition at the end of bench, which enables to report abortion due to
> socket errors.
>
> +@@ -6480,7 +6490,7 @@ main(int argc, char **argv)
> + #endif                            /* ENABLE_THREAD_SAFETY */
> +
> +         for (int j = 0; j < thread->nstate; j++)
> +-            if (thread->state[j].state == CSTATE_ABORTED)
> ++            if (thread->state[j].state != CSTATE_FINISHED)
> +                 exit_code = 2;
> +
> +         /* aggregate thread level stats */
>
> Does this make sense?

Yes, definitely.

-- 
Fabien.

Re: pgbench bug candidate: negative "initial connection time"

From

Kyotaro Horiguchi

Date:

18 June 2021, 08:26:03

At Thu, 17 Jun 2021 11:52:04 +0200 (CEST), Fabien COELHO <coelho@cri.ensmp.fr> wrote in 
> > Ok. I gave up to change the state in threadRun. Instead, I changed the
> > condition at the end of bench, which enables to report abortion due to
> > socket errors.
> >
> > +@@ -6480,7 +6490,7 @@ main(int argc, char **argv)
> > + #endif                            /* ENABLE_THREAD_SAFETY */
> > +
> > +         for (int j = 0; j < thread->nstate; j++)
> > +-            if (thread->state[j].state == CSTATE_ABORTED)
> > ++            if (thread->state[j].state != CSTATE_FINISHED)
> > +                 exit_code = 2;
> > +
> > +         /* aggregate thread level stats */
> >
> > Does this make sense?
> 
> Yes, definitely.

I sought for a simple way to enforce all client finishes with the
states abort or finished but I didn't find.  So +1 for the
change. However, as a matter of style. if we touch the code maybe we
want to enclose the if statement.

Doing this means we regard any state other than CSTATE_FINISHED as
aborted. So, the current goto's to done in threadRun are effectively
aborting a part or the all clients running on the thread. So for
example the following place:

pgbench.c:6713
>        /* must be something wrong */
>        pg_log_error("%s() failed: %m", SOCKET_WAIT_METHOD);
>        goto done;

Should say such like "thread %d aborted: %s() failed: ...".


I'm not sure what is the consensus here about the case where aborted
client can recoonect to the same server.  This patch doesn't that.  However, I think causing reconnection needs more
workthan accepted as a fix while beta.
 


====

+            /* as the bench is already running, we do not abort */
             pg_log_error("client %d aborted while establishing connection", st->id);
             st->state = CSTATE_ABORTED;

The comment looks strange that it is saying "we don't abort" while
setting the state to CSTATE_ABORT then showing "client %d aborted".


====
   if ((con = doConnect()) == NULL)
+  {
+    pg_log_fatal("connection for initialization failed");
     exit(1);

doConnect() prints an error emssage given from libpq.  So The
additional messaget is redundant.


====
   errno = THREAD_BARRIER_INIT(&barrier, nthreads);
   if (errno != 0)
+  {
     pg_log_fatal("could not initialize barrier: %m");
+    exit(1);

This is a run-time error. Maybe we should return 2 in that case.


===
     if (thread->logfile == NULL)
     {
       pg_log_fatal("could not open logfile \"%s\": %m", logpath);
-      goto done;
+      exit(1);

Maybe we should exit with 2 this case.  If we exit this case, we might
also want to exit when fclose() fails.  (Currently the error of
fclose() is ignored.)



===
+        /* coldly abort on connection failure */
+        pg_log_fatal("cannot create connection for thread %d client %d",
+               thread->tid, i);
+        exit(1);

It seems to me that the "thread %d client %d(not clinent id but the
client index within the thread)" doesn't make sense to users.  Even if
we showed a message like that, it should show only the global clinent
id (cstate->id).

I think that we should return with 2 here but we return with 1
in another place for the same reason..


===
         /* must be something wrong */
-        pg_log_fatal("%s() failed: %m", SOCKET_WAIT_METHOD);
+        pg_log_error("%s() failed: %m", SOCKET_WAIT_METHOD);
         goto done;

Why doesn't a fatal error cause an immediate exit?  (And if we change
this to fatal, we also need to change similar errors in the same
function to fatal.)

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: pgbench bug candidate: negative "initial connection time"

From

Fabien COELHO

Date:

18 June 2021, 12:54:27

Hello,

> Doing this means we regard any state other than CSTATE_FINISHED as
> aborted. So, the current goto's to done in threadRun are effectively
> aborting a part or the all clients running on the thread. So for
> example the following place:
>
> pgbench.c:6713
>>        /* must be something wrong */
>>        pg_log_error("%s() failed: %m", SOCKET_WAIT_METHOD);
>>        goto done;
>
> Should say such like "thread %d aborted: %s() failed: ...".

Yep, possibly.

> I'm not sure what is the consensus here about the case where aborted
> client can recoonect to the same server.  This patch doesn't that.

Non trivial future work.

> However, I think causing reconnection needs more work than accepted as 
> a fix while beta.

It is an entire project which requires some thinking about.

> +            /* as the bench is already running, we do not abort */
>             pg_log_error("client %d aborted while establishing connection", st->id);
>             st->state = CSTATE_ABORTED;
>
> The comment looks strange that it is saying "we don't abort" while
> setting the state to CSTATE_ABORT then showing "client %d aborted".

Indeed. There is abort from the client, which just means that it stops 
sending transaction, and abort for the process, which is basically 
"exit(1)".

> ====
>   if ((con = doConnect()) == NULL)
> +  {
> +    pg_log_fatal("connection for initialization failed");
>     exit(1);
>
> doConnect() prints an error emssage given from libpq.  So The
> additional messaget is redundant.

This is not the same for me: doConnect may fail but we may decide to go 
retry the connection later, or just one client may be disconnected but 
others are going on, which is different from deciding to stop the whole 
program, which deserves a message on its own.

> ====
>   errno = THREAD_BARRIER_INIT(&barrier, nthreads);
>   if (errno != 0)
> +  {
>     pg_log_fatal("could not initialize barrier: %m");
> +    exit(1);
>
> This is a run-time error. Maybe we should return 2 in that case.

Hmmm. Yep.

> ===
>     if (thread->logfile == NULL)
>     {
>       pg_log_fatal("could not open logfile \"%s\": %m", logpath);
> -      goto done;
> +      exit(1);
>
> Maybe we should exit with 2 this case.

Yep.

> If we exit this case, we might also want to exit when fclose() fails. 
> (Currently the error of fclose() is ignored.)

Not sure. I'd let it at that for now.

> ===
> +        /* coldly abort on connection failure */
> +        pg_log_fatal("cannot create connection for thread %d client %d",
> +               thread->tid, i);
> +        exit(1);
>
> It seems to me that the "thread %d client %d(not clinent id but the
> client index within the thread)" doesn't make sense to users.  Even if
> we showed a message like that, it should show only the global clinent
> id (cstate->id).

This is not obvious to me. I think that we should be homogeneous with what 
is already done around.

> I think that we should return with 2 here but we return with 1
> in another place for the same reason..

Possibly.

>         /* must be something wrong */
> -        pg_log_fatal("%s() failed: %m", SOCKET_WAIT_METHOD);
> +        pg_log_error("%s() failed: %m", SOCKET_WAIT_METHOD);
>         goto done;
>
> Why doesn't a fatal error cause an immediate exit?

Good point. I do not know, but I would expect it to be the case, and 
AFAICR it does not.

> (And if we change this to fatal, we also need to change similar errors 
> in the same function to fatal.)

Possibly.

I'll look into it over the week-end.

Thanks for the feedback!

-- 
Fabien.

Re: pgbench bug candidate: negative "initial connection time"

From

Fabien COELHO

Date:

18 June 2021, 13:58:48

>>>        /* must be something wrong */
>>>        pg_log_error("%s() failed: %m", SOCKET_WAIT_METHOD);
>>>        goto done;
>> 
>> Should say such like "thread %d aborted: %s() failed: ...".

After having a lookg, there are already plenty such cases. I'd say not to 
change anything for beta, and think of it for the next round.

>> ====
>>   errno = THREAD_BARRIER_INIT(&barrier, nthreads);
>>   if (errno != 0)
>> +  {
>>     pg_log_fatal("could not initialize barrier: %m");
>> +    exit(1);
>> 
>> This is a run-time error. Maybe we should return 2 in that case.

I think that you are right, but there are plenty such places where exit 
should be 2 instead of 1 if the doc is followed:

"""Errors during the run such as database errors or problems in the script 
will result in exit status 2."""

My beta take is to let these as they are, i.e. pretty inconsistent all 
over pgbench, and schedule a cleanup on the next round.

>> ===
>>     if (thread->logfile == NULL)
>>     {
>>       pg_log_fatal("could not open logfile \"%s\": %m", logpath);
>> -      goto done;
>> +      exit(1);
>> 
>> Maybe we should exit with 2 this case.
>
> Yep.

The bench is not even started, this is not really run time yet, 1 seems 
ok. The failure may be due to a typo in the path which comes from the 
user.

>> If we exit this case, we might also want to exit when fclose() fails. 
>> (Currently the error of fclose() is ignored.)
>
> Not sure. I'd let it at that for now.

I stand by this.

>> +        /* coldly abort on connection failure */
>> +        pg_log_fatal("cannot create connection for thread %d client %d",
>> +               thread->tid, i);
>> +        exit(1);
>> 
>> It seems to me that the "thread %d client %d(not clinent id but the
>> client index within the thread)" doesn't make sense to users.  Even if
>> we showed a message like that, it should show only the global clinent
>> id (cstate->id).
>
> This is not obvious to me. I think that we should be homogeneous with what is 
> already done around.

ok for only giving the global client id.

>> I think that we should return with 2 here but we return with 1
>> in another place for the same reason..
>
> Possibly.

Again for this one, the bench has not really started, so 1 seems fine.

>>         /* must be something wrong */
>> -        pg_log_fatal("%s() failed: %m", SOCKET_WAIT_METHOD);
>> +        pg_log_error("%s() failed: %m", SOCKET_WAIT_METHOD);
>>         goto done;
>> 
>> Why doesn't a fatal error cause an immediate exit?
>
> Good point. I do not know, but I would expect it to be the case, and AFAICR 
> it does not.
>
>> (And if we change this to fatal, we also need to change similar errors in 
>> the same function to fatal.)
>
> Possibly.

On second look, I think that error is fine, indeed we do not stop the 
process, so "fatal" it is not;

Attached Yugo-san patch with some updates discussed in the previous mails, 
so as to move things along.

-- 
Fabien.

Attachment

pgbench-conn-abort-4.patch

Re: pgbench bug candidate: negative "initial connection time"

From

Yugo NAGATA

Date:

18 June 2021, 15:46:05

Hello Horiguchi-san, Fabien,

On Fri, 18 Jun 2021 15:58:48 +0200 (CEST)
Fabien COELHO <coelho@cri.ensmp.fr> wrote:

> 
> >>>        /* must be something wrong */
> >>>        pg_log_error("%s() failed: %m", SOCKET_WAIT_METHOD);
> >>>        goto done;
> >> 
> >> Should say such like "thread %d aborted: %s() failed: ...".
> 
> After having a lookg, there are already plenty such cases. I'd say not to 
> change anything for beta, and think of it for the next round.

Agreed. Basically, I think the existing message should be left as is.

> >> ====
> >>   errno = THREAD_BARRIER_INIT(&barrier, nthreads);
> >>   if (errno != 0)
> >> +  {
> >>     pg_log_fatal("could not initialize barrier: %m");
> >> +    exit(1);
> >> 
> >> This is a run-time error. Maybe we should return 2 in that case.
> 
> I think that you are right, but there are plenty such places where exit 
> should be 2 instead of 1 if the doc is followed:
> 
> """Errors during the run such as database errors or problems in the script 
> will result in exit status 2."""
> 
> My beta take is to let these as they are, i.e. pretty inconsistent all 
> over pgbench, and schedule a cleanup on the next round.

As same as the below Fabian's comment about thread->logfile, 

> >> ===
> >>     if (thread->logfile == NULL)
> >>     {
> >>       pg_log_fatal("could not open logfile \"%s\": %m", logpath);
> >> -      goto done;
> >> +      exit(1);
> >> 
> >> Maybe we should exit with 2 this case.
> >
> > Yep.
> 
> The bench is not even started, this is not really run time yet, 1 seems 
> ok. The failure may be due to a typo in the path which comes from the 
> user.

the bench is not started at THREAD_BARRIER_INIT, so I think exit(1) is ok.  

> 
> >>         /* must be something wrong */
> >> -        pg_log_fatal("%s() failed: %m", SOCKET_WAIT_METHOD);
> >> +        pg_log_error("%s() failed: %m", SOCKET_WAIT_METHOD);
> >>         goto done;
> >> 
> >> Why doesn't a fatal error cause an immediate exit?
> >
> > Good point. I do not know, but I would expect it to be the case, and AFAICR 
> > it does not.
> >
> >> (And if we change this to fatal, we also need to change similar errors in 
> >> the same function to fatal.)
> >
> > Possibly.
> 
> On second look, I think that error is fine, indeed we do not stop the 
> process, so "fatal" it is not;

I replaced this 'fatal' with 'error' because we are aborting the client
instead of exit(1). When pgbench was rewritten to use common logging API
by the commit 30a3e772b40, somehow pg_log_fatal was used, but I am
wondering it should have be pg_log_error.

> Attached Yugo-san patch with some updates discussed in the previous mails, 
> so as to move things along.

Thank you for update. I agree with this fix.

Regards,
Yugo Nagata

-- 
Yugo NAGATA <nagata@sraoss.co.jp>

Re: pgbench bug candidate: negative "initial connection time"

From

Yugo NAGATA

Date:

29 July 2021, 04:23:25

Hello,

On Fri, 18 Jun 2021 15:58:48 +0200 (CEST)
Fabien COELHO <coelho@cri.ensmp.fr> wrote:

> Attached Yugo-san patch with some updates discussed in the previous mails, 
> so as to move things along.

I attached the patch rebased to a change due to 856de3b39cf.

Regards,
Yugo Nagata



-- 
Yugo NAGATA <nagata@sraoss.co.jp>

Attachment

pgbench-conn-abort-5.patch

Re: pgbench bug candidate: negative "initial connection time"

From

Fujii Masao

Date:

06 September 2021, 17:34:17


On 2021/07/29 13:23, Yugo NAGATA wrote:
> Hello,
> 
> On Fri, 18 Jun 2021 15:58:48 +0200 (CEST)
> Fabien COELHO <coelho@cri.ensmp.fr> wrote:
> 
>> Attached Yugo-san patch with some updates discussed in the previous mails,
>> so as to move things along.
> 
> I attached the patch rebased to a change due to 856de3b39cf.

+        pg_log_fatal("connection for initialization failed");
+        pg_log_fatal("setup connection failed");
+                pg_log_fatal("cannot create connection for client %d",

These fatal messages output when doConnect() fails should be a bit more consistent each other? For example,

     could not create connection for initialization
     could not create connection for setup
     could not create connection for client %d

I'm not sure, but *if* "xxx failed" is more proper for pgbench, what about

     connection for initialization failed
     connection for setup failed
     connection for client %d failed


> Exit status 1 indicates static problems such as invalid command-line options.
> Errors during the run such as database errors or problems in the script will
> result in exit status 2.

While reading the code and docs related to the patch, I found
these descriptions in pgbench docs. The first description needs to be
updated? Because even database error (e.g., failure of connection for setup)
can result in exit status 1 if it happens before the benchmark actually runs.

Regards,


-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

Re: pgbench bug candidate: negative "initial connection time"

From

Yugo NAGATA

Date:

23 September 2021, 22:26:45

Hello Fujii-san,

On Tue, 7 Sep 2021 02:34:17 +0900
Fujii Masao <masao.fujii@oss.nttdata.com> wrote:

> On 2021/07/29 13:23, Yugo NAGATA wrote:
> > Hello,
> > 
> > On Fri, 18 Jun 2021 15:58:48 +0200 (CEST)
> > Fabien COELHO <coelho@cri.ensmp.fr> wrote:
> > 
> >> Attached Yugo-san patch with some updates discussed in the previous mails,
> >> so as to move things along.
> > 
> > I attached the patch rebased to a change due to 856de3b39cf.
> 
> +        pg_log_fatal("connection for initialization failed");
> +        pg_log_fatal("setup connection failed");
> +                pg_log_fatal("cannot create connection for client %d",
> 
> These fatal messages output when doConnect() fails should be a bit more consistent each other? For example,
> 
>      could not create connection for initialization
>      could not create connection for setup
>      could not create connection for client %d

Ok. I fixed as your suggestion.

> > Exit status 1 indicates static problems such as invalid command-line options.
> > Errors during the run such as database errors or problems in the script will
> > result in exit status 2.
> 
> While reading the code and docs related to the patch, I found
> these descriptions in pgbench docs. The first description needs to be
> updated? Because even database error (e.g., failure of connection for setup)
> can result in exit status 1 if it happens before the benchmark actually runs.

That makes sense. Failures of setup connection or initial connection doesn't
seem 'static problems'. I rewrote this description to explain exit status 1
indicates also interal errors and early errors.

  Exit status 1 indicates static problems such as invalid command-line options
  or internal errors which are supposed to never occur.  Early errors that occur
  when starting benchmark such as initial connection failures also exit with
  status 1.

Regards,
Yugo Nagata

-- 
Yugo NAGATA <nagata@sraoss.co.jp>

Attachment

pgbench-conn-abort-6.patch

Re: pgbench bug candidate: negative "initial connection time"

From

Fujii Masao

Date:

24 September 2021, 02:26:12


On 2021/09/24 7:26, Yugo NAGATA wrote:
> That makes sense. Failures of setup connection or initial connection doesn't
> seem 'static problems'. I rewrote this description to explain exit status 1
> indicates also interal errors and early errors.
> 
>    Exit status 1 indicates static problems such as invalid command-line options
>    or internal errors which are supposed to never occur.  Early errors that occur
>    when starting benchmark such as initial connection failures also exit with
>    status 1.

LGTM. Barring any objection, I will commit the patch.

Regards,

-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

Re: pgbench bug candidate: negative "initial connection time"

From

Fujii Masao

Date:

29 September 2021, 13:11:53

On 2021/09/24 11:26, Fujii Masao wrote:
> 
> 
> On 2021/09/24 7:26, Yugo NAGATA wrote:
>> That makes sense. Failures of setup connection or initial connection doesn't
>> seem 'static problems'. I rewrote this description to explain exit status 1
>> indicates also interal errors and early errors.
>>
>>    Exit status 1 indicates static problems such as invalid command-line options
>>    or internal errors which are supposed to never occur.  Early errors that occur
>>    when starting benchmark such as initial connection failures also exit with
>>    status 1.
> 
> LGTM. Barring any objection, I will commit the patch.

I extracted two changes from the patch and pushed (also back-patched) them.

The remainings are the changes of handling of initial connection or
logfile open failures. I agree to push them at least for the master.
But I'm not sure if they should be back-patched. Without these changes,
even when those failures happen, pgbench proceeds the benchmark and
reports the result. But with the changes, pgbench exits immediately in
that case. I'm not sure if there are people who expect this behavior,
but if there are, maybe we should not change it at least at stable branches.
Thought?

BTW, when logfile fails to be opened, pgbench gets stuck due to commit
aeb57af8e6. So even if we decided not to back-patch those changes,
we should improve the handling of logfile open failure, to fix the issue.

Regards,

-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

Re: pgbench bug candidate: negative "initial connection time"

From

Michael Paquier

Date:

01 October 2021, 06:27:23

On Wed, Sep 29, 2021 at 10:11:53PM +0900, Fujii Masao wrote:
> BTW, when logfile fails to be opened, pgbench gets stuck due to commit
> aeb57af8e6. So even if we decided not to back-patch those changes,
> we should improve the handling of logfile open failure, to fix the issue.

There is an entry in the CF for this thread:
https://commitfest.postgresql.org/34/3219/

I have moved that to the next one as some pieces are missing.  If you
are planning to handle the rest, could you register your name as a
committer?
--
Michael

Attachment

signature.asc

Re: pgbench bug candidate: negative "initial connection time"

From

Fujii Masao

Date:

08 October 2021, 15:41:33


On 2021/10/01 15:27, Michael Paquier wrote:
> On Wed, Sep 29, 2021 at 10:11:53PM +0900, Fujii Masao wrote:
>> BTW, when logfile fails to be opened, pgbench gets stuck due to commit
>> aeb57af8e6. So even if we decided not to back-patch those changes,
>> we should improve the handling of logfile open failure, to fix the issue.
> 
> There is an entry in the CF for this thread:
> https://commitfest.postgresql.org/34/3219/
> 
> I have moved that to the next one as some pieces are missing.  If you
> are planning to handle the rest, could you register your name as a
> committer?

Thanks for letting me know that!
I registered myself as a committer of the patch again.


    pg_time_usec_t conn_duration;    /* cumulated connection and deconnection
                                     * delays */

BTW, while reading the patch, I found the above comment in pgbench.c.
"deconnection" seems a valid word in French (?), but isn't it better to
replace it with "disconnection"? Patch attached.


Regards,

-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

Attachment

deconnection.patch

Re: pgbench bug candidate: negative "initial connection time"

From

Fujii Masao

Date:

01 November 2021, 14:01:54


On 2021/09/29 22:11, Fujii Masao wrote:
> 
> 
> On 2021/09/24 11:26, Fujii Masao wrote:
>>
>>
>> On 2021/09/24 7:26, Yugo NAGATA wrote:
>>> That makes sense. Failures of setup connection or initial connection doesn't
>>> seem 'static problems'. I rewrote this description to explain exit status 1
>>> indicates also interal errors and early errors.
>>>
>>>    Exit status 1 indicates static problems such as invalid command-line options
>>>    or internal errors which are supposed to never occur.  Early errors that occur
>>>    when starting benchmark such as initial connection failures also exit with
>>>    status 1.
>>
>> LGTM. Barring any objection, I will commit the patch.
> 
> I extracted two changes from the patch and pushed (also back-patched) them.
> 
> The remainings are the changes of handling of initial connection or
> logfile open failures. I agree to push them at least for the master.
> But I'm not sure if they should be back-patched. Without these changes,
> even when those failures happen, pgbench proceeds the benchmark and
> reports the result. But with the changes, pgbench exits immediately in
> that case. I'm not sure if there are people who expect this behavior,
> but if there are, maybe we should not change it at least at stable branches.
> Thought?

The current behavior should be improved, but not a bug.
So I don't think that the patch needs to be back-patched.
Barring any objection, I will push the attached patch to the master.

Regards,

-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

Attachment

pgbench-conn-abort-7.patch

Re: pgbench bug candidate: negative "initial connection time"

From

Fujii Masao

Date:

01 November 2021, 14:02:44


On 2021/10/09 0:41, Fujii Masao wrote:
> 
> 
> On 2021/10/01 15:27, Michael Paquier wrote:
>> On Wed, Sep 29, 2021 at 10:11:53PM +0900, Fujii Masao wrote:
>>> BTW, when logfile fails to be opened, pgbench gets stuck due to commit
>>> aeb57af8e6. So even if we decided not to back-patch those changes,
>>> we should improve the handling of logfile open failure, to fix the issue.
>>
>> There is an entry in the CF for this thread:
>> https://commitfest.postgresql.org/34/3219/
>>
>> I have moved that to the next one as some pieces are missing.  If you
>> are planning to handle the rest, could you register your name as a
>> committer?
> 
> Thanks for letting me know that!
> I registered myself as a committer of the patch again.
> 
> 
>      pg_time_usec_t conn_duration;    /* cumulated connection and deconnection
>                                       * delays */
> 
> BTW, while reading the patch, I found the above comment in pgbench.c.
> "deconnection" seems a valid word in French (?), but isn't it better to
> replace it with "disconnection"? Patch attached.

Barring any objection, I will push this patch.

Regards,

-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

Re: pgbench bug candidate: negative "initial connection time"

From

Fujii Masao

Date:

02 November 2021, 14:11:39


On 2021/11/01 23:01, Fujii Masao wrote:
>> The remainings are the changes of handling of initial connection or
>> logfile open failures. I agree to push them at least for the master.
>> But I'm not sure if they should be back-patched. Without these changes,
>> even when those failures happen, pgbench proceeds the benchmark and
>> reports the result. But with the changes, pgbench exits immediately in
>> that case. I'm not sure if there are people who expect this behavior,
>> but if there are, maybe we should not change it at least at stable branches.
>> Thought?
> 
> The current behavior should be improved, but not a bug.
> So I don't think that the patch needs to be back-patched.
> Barring any objection, I will push the attached patch to the master.

Pushed. Thanks!

I also pushed the typo-fix patch that I proposed upthread.

Regards,

-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

Re: pgbench bug candidate: negative "initial connection time"

From

Yugo NAGATA

Date:

04 November 2021, 00:31:52

On Tue, 2 Nov 2021 23:11:39 +0900
Fujii Masao <masao.fujii@oss.nttdata.com> wrote:

> 
> 
> On 2021/11/01 23:01, Fujii Masao wrote:
> >> The remainings are the changes of handling of initial connection or
> >> logfile open failures. I agree to push them at least for the master.
> >> But I'm not sure if they should be back-patched. Without these changes,
> >> even when those failures happen, pgbench proceeds the benchmark and
> >> reports the result. But with the changes, pgbench exits immediately in
> >> that case. I'm not sure if there are people who expect this behavior,
> >> but if there are, maybe we should not change it at least at stable branches.
> >> Thought?
> > 
> > The current behavior should be improved, but not a bug.
> > So I don't think that the patch needs to be back-patched.
> > Barring any objection, I will push the attached patch to the master.
> 
> Pushed. Thanks!

Thanks!

> 
> I also pushed the typo-fix patch that I proposed upthread.
> 
> Regards,
> 
> -- 
> Fujii Masao
> Advanced Computing Technology Center
> Research and Development Headquarters
> NTT DATA CORPORATION
> 
> 


-- 
Yugo NAGATA <nagata@sraoss.co.jp>