Re: Suggestion to add --continue-client-on-abort option to pgbench - Mailing list pgsql-hackers

From Yugo Nagata
Subject Re: Suggestion to add --continue-client-on-abort option to pgbench
Date
Msg-id 20250925111721.f88194cd7fa840f156231f88@sraoss.co.jp
Whole thread Raw
In response to Re: Suggestion to add --continue-client-on-abort option to pgbench  (Yugo Nagata <nagata@sraoss.co.jp>)
Responses Re: Suggestion to add --continue-client-on-abort option to pgbench
List pgsql-hackers
On Thu, 25 Sep 2025 11:09:40 +0900
Yugo Nagata <nagata@sraoss.co.jp> wrote:

> On Thu, 25 Sep 2025 02:19:27 +0900
> Fujii Masao <masao.fujii@gmail.com> wrote:
> 
> > On Tue, Sep 23, 2025 at 11:58 AM Rintaro Ikeda
> > <ikedarintarof@oss.nttdata.com> wrote:
> > > I think the issue is a new bug because we have transitioned to CSTATE_ABORT
> > > immediately after queries failed, without executing discardUntilSync().
> 
> Agreed. 
>  
> > If so, the fix in discardUntilSync() should be part of the continue-on-error
> > patch. The assertion failure fix should be a separate patch, since only
> > that needs to be backpatched (the failure can also occur in back branches).
> 
> +1
> 
> > 
> > > I've attached a patch that fixes the assertion error. The content of v1 patch by
> > > Mr. Nagata is also included. I would appreciate it if you review my patch.
> 
> > + switch (PQresultStatus(res))
> > + {
> > + case PGRES_PIPELINE_SYNC:
> > + received_sync = true;
> > 
> > In the PGRES_PIPELINE_SYNC case, PQclear(res) isn't called but should be.
> > 
> > + case PGRES_FATAL_ERROR:
> > + PQclear(res);
> > + goto done;
> > 
> > I don't think we need goto here. How about this instead?
> > 
> > -----------------------
> > @@ -3525,11 +3525,18 @@ discardUntilSync(CState *st)
> >                          * results have been discarded.
> >                          */
> >                         st->num_syncs = 0;
> > -                       PQclear(res);
> >                         break;
> >                 }
> > +               /*
> > +                * Stop receiving further results if PGRES_FATAL_ERROR
> > is returned
> > +                * (e.g., due to a connection failure) before
> > PGRES_PIPELINE_SYNC,
> > +                * since PGRES_PIPELINE_SYNC will never arrive.
> > +                */
> > +               else if (PQresultStatus(res) == PGRES_FATAL_ERROR)
> > +                       break;
> >                 PQclear(res);
> >         }
> > +       PQclear(res);
> > 
> >         /* exit pipeline */
> >         if (PQexitPipelineMode(st->con) != 1)
> > -----------------------
> 
> I think Fujii-san's version is better because Ikeda-san's version doesn't
> consider the case where PGRES_PIPELINE_SYNC is followed by another one.
> In that situation, the loop would terminate without getting NULL, which would
> causes an error in PQexitPipelineMode().
> 
> However, I would like to suggest an alternative solution: checking the connection
> status when readCommandResponse() returns false. This seems more straightforwad,
> since the cause of the error can be investigated immediately after it is detected.
> 
> @@ -4024,7 +4043,10 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg)
>                                         if (PQpipelineStatus(st->con) != PQ_PIPELINE_ON)
>                                                 st->state = CSTATE_END_COMMAND;
>                                 }
> -                               else if (canRetryError(st->estatus))
> +                               else if (PQstatus(st->con) == CONNECTION_BAD)
> +                                       st->state = CSTATE_ABORTED;
> +                               else if ((st->estatus == ESTATUS_OTHER_SQL_ERROR && continue_on_error) ||
> +                                                canRetryError(st->estatus))
>                                         st->state = CSTATE_ERROR;
>                                 else
>                                         st->state = CSTATE_ABORTED;
> 
> What do you think?
> 
> 
> Additionally, I noticed that in pipeline mode, the error message reported in
> readCommandResponse() is lost, because it is reset when PQgetResult() returned
> NULL to indicate the end of query processing. For example:
> 
>  pgbench: client 0 got an error in command 3 (SQL) of script 0; 
>  pgbench: client 1 got an error in command 3 (SQL) of script 0; 
> 
> This can be fixed this by saving the previous error message and using it
> for the report. After the fix:
> 
>  pgbench: client 0 got an error in command 3 (SQL) of script 0; FATAL:  terminating connection due to administrator
command
> 
> I've attached update patches.
> 
> 0001 fixes the assersion failure in commandError() and error message lost
> in readCommandResponse().
> 
> 0002 was the previous 0001 that adds --continiue-on-error, including the
> fix to handle connection termination errors.
> 
> 0003 is for improving error messages for errors that cause client abortion.

I think the patch 0001 should be back patched since the issues can occurs
even for retries of serialization failure or deadlock detection in pipeline mode.

Regards,
Yugo Nagata


-- 
Yugo Nagata <nagata@sraoss.co.jp>



pgsql-hackers by date:

Previous
From: Yugo Nagata
Date:
Subject: Re: Documentation fix on pgbench \aset command
Next
From: Fujii Masao
Date:
Subject: Re: Documentation fix on pgbench \aset command