Thread: Re: Suggestion to add --continue-client-on-abort option to pgbench

Re: Suggestion to add --continue-client-on-abort option to pgbench

From
Stepan Neretin
Date:


On Sat, May 10, 2025 at 8:45 PM ikedarintarof <ikedarintarof@oss.nttdata.com> wrote:
Hi hackers,

I would like to suggest adding a new option to pgbench, which enables
the client to continue processing transactions even if some errors occur
during a transaction.
Currently, a client stops sending requests when its transaction is
aborted due to reasons other than serialization failures or deadlocks. I
think in some cases, especially when using custom scripts, the client
should be able to rollback the failed transaction and start a new one.

For example, my custom script (insert_to_unique_column.sql) follows:
```
CREATE TABLE IF NOT EXISTS test (col1 serial, col2 int unique);
INSERT INTO test (col2) VALUES (random(0, 50000));
```
Assume we need to continuously apply load to the server using 5 clients
for a certain period of time. However, a client sometimes stops when its
transaction in my custom script is aborted due to a check constraint
violation. As a result, the load on the server is lower than expected,
which is the problem I want to address.

The proposed new option solves this problem. When
--continue-client-on-abort is set to true, the client rolls back the
failed transaction and starts a new one. This allows all 5 clients to
continuously apply load to the server, even if some transactions fail.

```
% bin/pgbench -d postgres -f ../insert_to_unique_column.sql -T 10
--failures-detailed --continue-client-on-error
transaction type: ../custom_script_insert.sql
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
maximum number of tries: 1
duration: 10 s
number of transactions actually processed: 33552
number of failed transactions: 21901 (39.495%)
number of serialization failures: 0 (0.000%)
number of deadlock failures: 0 (0.000%)
number of other failures: 21901 (39.495%)
latency average = 0.180 ms (including failures)
initial connection time = 2.857 ms
tps = 3356.092385 (without initial connection time)
```

I have attached the patch. I would appreciate your feedback.

Best regards,

Rintaro Ikeda
NTT DATA Corporation Japan

Hi Rintaro,

Thanks for the patch and explanation. I understand your goal is to ensure that pgbench clients continue running even when some transactions fail due to application-level errors (e.g., constraint violations), especially when running custom scripts.

However, I wonder if the intended behavior can't already be achieved using standard SQL constructs — specifically ON CONFLICT or careful transaction structure. For example, your sample script:

CREATE TABLE IF NOT EXISTS test (col1 serial, col2 int unique);
INSERT INTO test (col2) VALUES (random(0, 50000));

can be rewritten as:

\setrandom val 0 50000
INSERT INTO test (col2) VALUES (:val) ON CONFLICT DO NOTHING;

This avoids transaction aborts entirely in the presence of uniqueness violations and ensures the client continues to issue load without interruption. In many real-world benchmarking scenarios, this is the preferred and simplest approach.

So from that angle, could you elaborate on specific cases where this SQL-level workaround wouldn't be sufficient? Are there error types you intend to handle that cannot be gracefully avoided or recovered from using SQL constructs like ON CONFLICT, or SAVEPOINT/ROLLBACK TO?

Best regards,

Stepan Neretin 

RE: Suggestion to add --continue-client-on-abort option to pgbench

From
"Hayato Kuroda (Fujitsu)"
Date:
Dear Ikeda-san,

Thanks for starting the new thread! I have never known the issue before I heard at
PGConf.dev.

Few comments:

1.
This parameter seems a type of benchmark option. So should we set
benchmarking_option_set as well?

2.
Not sure, but exit-on-abort seems a similar option. What if both are specified?
Is it allowed?

3.
Can you consider a test case for the new parameter?

Best regards,
Hayato Kuroda
FUJITSU LIMITED



Re: Suggestion to add --continue-client-on-abort option to pgbench

From
Rintaro Ikeda
Date:
Dear Kuroda-san, hackers,


On 2025/06/04 21:57, Hayato Kuroda (Fujitsu) wrote:
> Dear Ikeda-san,
> 
> Thanks for starting the new thread! I have never known the issue before I heard at
> PGConf.dev.
> 
> Few comments:
> 
> 1.
> This parameter seems a type of benchmark option. So should we set
> benchmarking_option_set as well?
> 
> 2.
> Not sure, but exit-on-abort seems a similar option. What if both are specified?
> Is it allowed?
> 
> 3.
> Can you consider a test case for the new parameter?
> 
> Best regards,
> Hayato Kuroda
> FUJITSU LIMITED
> 
> 

Thank you for your valuable comment!

1. I should've also set benchmarking_option_set. I've modified it accordingly.

2. The exit-on-abort option and continue-on-error option are mutually exclusive. 
Therefore, I've updated the patch to throw a FATAL error when two options are 
set simultaneously. Corresponding explanation was also added.
(I'm wondering the name of parameter should be continue-on-abort so that users 
understand the two option are mutually exclusive.)

3. I've added the test.

Additionally, I modified the patch so that st->state does not transition to 
CSTATE_RETRY when a transaction fails and continue-on-error option is enabled. 
In the previous patch, we retry the failed transaction up to max-try times, 
which is unnecessary for our purpose: clients does not exit when its 
transactions fail.

I've attached the updated patch. 
v3-0001-Add-continue-on-error-option-to-pgbench.patch is identical to 
v4-0001-Add-continue-on-error-option-to-pgbench.patch. The v4-0002 patch is the 
diff from the previous patch.

Best regards,
Rintaro Ikeda
Attachment

RE: Suggestion to add --continue-client-on-abort option to pgbench

From
"Hayato Kuroda (Fujitsu)"
Date:
Dear Ikeda-san,

Thanks for updating the patch!

> 1. I should've also set benchmarking_option_set. I've modified it accordingly.

Confirmed it has been fixed. Thanks.

> 2. The exit-on-abort option and continue-on-error option are mutually exclusive. 
> Therefore, I've updated the patch to throw a FATAL error when two options are 
> set simultaneously. Corresponding explanation was also added.
> (I'm wondering the name of parameter should be continue-on-abort so that users
> understand the two option are mutually exclusive.)

Make sense, +1.

Here are new comments.

01. build failure
According to the cfbot [1], the documentation cannot be built. IIUC </para> seemed
to be missed here:
```
+       <para>
+        Note that this option can not be used together with
+        <option>--exit-on-abort</option>.
+      </listitem>
+     </varlistentry>
```

02. patch separation
How about separating the patch series like:

0001 - contains option handling and retry part, and documentation
0002 - contains accumulation/reporting part
0003 - contains tests.

I hope above style is more helpful for reviewers.

03. documentation
```
+        Note that this option can not be used together with
+        <option>--exit-on-abort</option>.
```

I feel we should add the similar description in `exit-on-abort` part.

04. documentation
```
+        Client rolls back the failed transaction and starts a new one when its
+        transaction fails due to the reason other than the deadlock and
+        serialization failure. This allows all clients specified with -c option
+        to continuously apply load to the server, even if some transactions fail.
```

I feel the description contains bit redundant part and misses the default behavior.
How about:
```
       <para>
        Clients survive when their transactions are aborted, and they continue
        their run. Without the option, clients exit when transactions they run
        are aborted.
       </para>
       <para>
        Note that serialization failures or deadlock failures do not abort the
        client, so they are not affected by this option.
        See <xref linkend="failures-and-retries"/> for more information.
       </para>
```

05. StatsData
```
+        * When continue-on-error option is specified,
+        * failed (the number of failed transactions) =
+        *   'other_sql_failures' (they got a error when continue-on-error option
+        *                                                 was specified).
```

Let me confirm one point; can serialization_failures and deadlock_failures be
counted when continue-on-error is true? If so, the comment seems not correct for me.
The formula can be 'serialization_failures' + 'deadlock_failures' + 'deadlock_failures'
in the case.

06. StatsData
Another point; can other_sql_failures be counted when the continue-on-error is NOT
specified? I feel it should be...

06. usage()
Added line is too long. According to program_help_ok(), the output by help should
be less than 80.

07.
Please run pgindent/pgperltidy, I got some diffs.

[1]: https://cirrus-ci.com/task/5210061275922432

Best regards,
Hayato Kuroda
FUJITSU LIMITED


On Mon, 9 Jun 2025 09:34:03 +0000
"Hayato Kuroda (Fujitsu)" <kuroda.hayato@fujitsu.com> wrote:

> > 2. The exit-on-abort option and continue-on-error option are mutually exclusive. 
> > Therefore, I've updated the patch to throw a FATAL error when two options are 
> > set simultaneously. Corresponding explanation was also added.

I don't think that's right since "abort" and "error" are different concept in pgbench.
(Here, "abort" refers to the termination of a client, not a transaction abort.)

The --exit-on-abort option forces to exit pgbench immediately when any client is aborted
due to some error. When the --continue-on-error option is not set, SQL errors other than
deadlock or serialization error cause a client to be aborted. On the other hand, when the option
is set, clients are not aborted due to any SQL errors; instead they continue to run after them.
However, clients can still be aborted for other reasons, such as connection failures or
meta-command errors (e.g., \set x 1/0). In these cases, the --exit-on-abort option remains
useful even when --continue-on-error is enabled. 

> > (I'm wondering the name of parameter should be continue-on-abort so that users
> > understand the two option are mutually exclusive.)

For the same reason as above, I believe --continue-on-error is a more accurate description
of the option's behavior.
 
> 02. patch separation
> How about separating the patch series like:
> 
> 0001 - contains option handling and retry part, and documentation
> 0002 - contains accumulation/reporting part
> 0003 - contains tests.
> 
> I hope above style is more helpful for reviewers.

I'm not sure whether it's necessary to split the patch, as the change doesn't seem very
complex. However, the current separation appears inconsistent. For example, patch 0001
modifies canRetryError(), but patch 0002 reverts that change, and so on.

> 
> 04. documentation
> ```
> +        Client rolls back the failed transaction and starts a new one when its
> +        transaction fails due to the reason other than the deadlock and
> +        serialization failure. This allows all clients specified with -c option
> +        to continuously apply load to the server, even if some transactions fail.
> ```
> 
> I feel the description contains bit redundant part and misses the default behavior.
> How about:
> ```
>        <para>
>         Clients survive when their transactions are aborted, and they continue
>         their run. Without the option, clients exit when transactions they run
>         are aborted.
>        </para>
>        <para>
>         Note that serialization failures or deadlock failures do not abort the
>         client, so they are not affected by this option.
>         See <xref linkend="failures-and-retries"/> for more information.
>        </para>
> ```

I think we can make it clearer as follows:

 Allows clients to continue their run even if an SQL statement fails due to errors other
 than serialization or deadlock. Without this option, the client is aborted after
 such errors.

 Note that serialization and deadlock failures never cause the client to be aborted,
 so they are not affected by this option. See <xref linkend="failures-and-retries"/>
 for more information.

That said, a review by a native English speaker would still be appreciated.

Also, we would need to update several parts of the documentation. For example, the 
"Failures and Serialization/Deadlock Retries" section should be revised to describe the
behavior change. In addition, we should update the explanations of output result examples
and logging, the description of the --failures-detailed option, and so on.

If transactions are not retried after SQL errors other than serialization or deadlock,
this should also be explicitly documented.


> 05. StatsData
> ```
> +        * When continue-on-error option is specified,
> +        * failed (the number of failed transactions) =
> +        *   'other_sql_failures' (they got a error when continue-on-error option
> +        *                                                 was specified).
> ```
> 
> Let me confirm one point; can serialization_failures and deadlock_failures be
> counted when continue-on-error is true? If so, the comment seems not correct for me.
> The formula can be 'serialization_failures' + 'deadlock_failures' + 'deadlock_failures'
> in the case.

+1

> 06. StatsData
> Another point; can other_sql_failures be counted when the continue-on-error is NOT
> specified? I feel it should be...

We could do that. However, if an SQL error other than a serialization or deadlock error
occurs when --continue-on-error is not set, pgbench will be aborted midway and the printed
results will be incomplete. Therefore, this might not make much sense.

> 06. usage()
> Added line is too long. According to program_help_ok(), the output by help should
> be less than 80.

+1


Here are additional comments from me.

@@ -4548,6 +4570,8 @@ getResultString(bool skipped, EStatus estatus)
                return "serialization";
            case ESTATUS_DEADLOCK_ERROR:
                return "deadlock";
+           case ESTATUS_OTHER_SQL_ERROR:
+               return "error (except serialization/deadlock)";

Strings returned by getResultString() are printed in the "time" field of the
log when both the -l and --failures-detailed options are set. Therefore, they
should be single words that do not contain any space characters. I wonder if
something like "other" or "other_sql_error" would be appropriate.


@@ -4099,6 +4119,7 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg)
                         * can retry the error.
                         */
                        st->state = timer_exceeded ? CSTATE_FINISHED :
+                           continue_on_error ? CSTATE_FAILURE :
                            doRetry(st, &now) ? CSTATE_RETRY : CSTATE_FAILURE;
                    }
                    else

This fix is not necessary because doRetry() (and canRetryError(), which is called
within it) will return false when continue_on_error is set (after applying patch 0002).


            case PGRES_NONFATAL_ERROR:
            case PGRES_FATAL_ERROR:
                st->estatus = getSQLErrorStatus(PQresultErrorField(res,
                                                                   PG_DIAG_SQLSTATE));
                if (canRetryError(st->estatus))
                {
                    if (verbose_errors)
                        commandError(st, PQerrorMessage(st->con));
                    goto error;
                }
                /* fall through */

            default:    
                /* anything else is unexpected */
                pg_log_error("client %d script %d aborted in command %d query %d: %s",
                             st->id, st->use_file, st->command, qrynum,
                             PQerrorMessage(st->con));
                goto error;
        }

When an SQL error other than a serialization or deadlock error occurs, an error message is
output via pg_log_error in this code path. However, I think this should be reported only
when verbose_errors is set, similar to how serialization and deadlock errors are handled when
--continue-on-error is enabled


Best regards,
Yugo Nagata

-- 
Yugo Nagata <nagata@sraoss.co.jp>



RE: Suggestion to add --continue-client-on-abort option to pgbench

From
"Hayato Kuroda (Fujitsu)"
Date:
Dear Nagata-san,

> > > 2. The exit-on-abort option and continue-on-error option are mutually
> exclusive.
> > > Therefore, I've updated the patch to throw a FATAL error when two options
> are
> > > set simultaneously. Corresponding explanation was also added.
>
> I don't think that's right since "abort" and "error" are different concept in pgbench.
> (Here, "abort" refers to the termination of a client, not a transaction abort.)
>
> The --exit-on-abort option forces to exit pgbench immediately when any client is
> aborted
> due to some error. When the --continue-on-error option is not set, SQL errors
> other than
> deadlock or serialization error cause a client to be aborted. On the other hand,
> when the option
> is set, clients are not aborted due to any SQL errors; instead they continue to run
> after them.
> However, clients can still be aborted for other reasons, such as connection
> failures or
> meta-command errors (e.g., \set x 1/0). In these cases, the --exit-on-abort option
> remains
> useful even when --continue-on-error is enabled.

To clarify: another approach is that allow --continue-on-error option to continue
running even when clients meet such errors. Which one is better?

> > 02. patch separation
> > How about separating the patch series like:
> >
> > 0001 - contains option handling and retry part, and documentation
> > 0002 - contains accumulation/reporting part
> > 0003 - contains tests.
> >
> > I hope above style is more helpful for reviewers.
>
> I'm not sure whether it's necessary to split the patch, as the change doesn't seem
> very
> complex. However, the current separation appears inconsistent. For example,
> patch 0001
> modifies canRetryError(), but patch 0002 reverts that change, and so on.

Either way is fine for me if they are changed from the current method.

>
> >
> > 04. documentation
> > ```
> > +        Client rolls back the failed transaction and starts a new one when its
> > +        transaction fails due to the reason other than the deadlock and
> > +        serialization failure. This allows all clients specified with -c option
> > +        to continuously apply load to the server, even if some transactions
> fail.
> > ```
> >
> > I feel the description contains bit redundant part and misses the default
> behavior.
> > How about:
> > ```
> >        <para>
> >         Clients survive when their transactions are aborted, and they continue
> >         their run. Without the option, clients exit when transactions they run
> >         are aborted.
> >        </para>
> >        <para>
> >         Note that serialization failures or deadlock failures do not abort the
> >         client, so they are not affected by this option.
> >         See <xref linkend="failures-and-retries"/> for more information.
> >        </para>
> > ```
>
> I think we can make it clearer as follows:

I do not have confident for English, native speaker is needed....

> > 06. usage()
> > Added line is too long. According to program_help_ok(), the output by help
> should
> > be less than 80.
>
> +1

FYI - I posted a patch which adds the test. You can apply and confirm how the function says.

[1]:
https://www.postgresql.org/message-id/OSCPR01MB1496610451F5896375B2562E6F56BA%40OSCPR01MB14966.jpnprd01.prod.outlook.com

Best regards,
Hayato Kuroda
FUJITSU LIMITED




Re: Suggestion to add --continue-client-on-abort option to pgbench

From
ikedarintarof
Date:
Hi,

Thank you very much for your valuable comments and kind advice. I'm 
currently working on revising the previous patch based on the feedback 
received. I would like to share my thoughts regarding the conditions 
under which the --continue-on-error option should initiate a new 
transaction or a new connection.

In my opinion, when the --continue-on-error option is enabled, pgbench 
clients does not need to start new transactions after network errors and 
other errors except for SQL-level errors.

Network errors are relatively rare, except in failover scenarios. 
Outside of failover, any network issues should be resolved rather than 
worked around. In the context of failover, the key metric is not TPS, 
but system downtime. While one might infer the timing of a failover by 
observing by using --progress option, you can easily determine the 
downtime by executing simple SQL query such as `psql -c 'SELECT 1` every 
second.


On 2025/06/26 18:47, Yugo Nagata wrote:
> On Thu, 26 Jun 2025 05:45:12 +0000
> "Hayato Kuroda (Fujitsu)" <kuroda.hayato@fujitsu.com> wrote:
> 
>> Dear Nagata-san,
>> 
>>> As I understand it, the current patch aims to allow continuation only 
>>> after
>>> SQL-level
>>> errors, such as constraint violations. That seems reasonable, as it 
>>> can simulate
>>> the
>>> behavior of applications that ignore or retry such errors (even 
>>> though retries are
>>> not
>>> implemented in the current patch).
>> 
>> Yes, no one has objections to retry in this case. This is a main part 
>> of the proposal.,
> 
> As I understand it, the proposed --continue-on-error option does not 
> retry the transaction
> in any case; it simply gives up on the transaction. That is, when an 
> SQL-level error occurs,
> the transaction is reported as "failed" rather than "retried", and the 
> random state is discarded.

Retrying the failed transaction is not necessary when the transaction 
failed due to SQL-level errors. Unlike real-world applications, pgbench 
does not need to complete specific transaction successfully. In the case 
of unique constraint violations, retrying the same transaction will 
likely to result in the same error again.

I want to hear your thoughts on this.

Best regards,
Rintaro Ikeda