Thread: Re: Suggestion to add --continue-client-on-abort option to pgbench
Hi hackers,
I would like to suggest adding a new option to pgbench, which enables
the client to continue processing transactions even if some errors occur
during a transaction.
Currently, a client stops sending requests when its transaction is
aborted due to reasons other than serialization failures or deadlocks. I
think in some cases, especially when using custom scripts, the client
should be able to rollback the failed transaction and start a new one.
For example, my custom script (insert_to_unique_column.sql) follows:
```
CREATE TABLE IF NOT EXISTS test (col1 serial, col2 int unique);
INSERT INTO test (col2) VALUES (random(0, 50000));
```
Assume we need to continuously apply load to the server using 5 clients
for a certain period of time. However, a client sometimes stops when its
transaction in my custom script is aborted due to a check constraint
violation. As a result, the load on the server is lower than expected,
which is the problem I want to address.
The proposed new option solves this problem. When
--continue-client-on-abort is set to true, the client rolls back the
failed transaction and starts a new one. This allows all 5 clients to
continuously apply load to the server, even if some transactions fail.
```
% bin/pgbench -d postgres -f ../insert_to_unique_column.sql -T 10
--failures-detailed --continue-client-on-error
transaction type: ../custom_script_insert.sql
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
maximum number of tries: 1
duration: 10 s
number of transactions actually processed: 33552
number of failed transactions: 21901 (39.495%)
number of serialization failures: 0 (0.000%)
number of deadlock failures: 0 (0.000%)
number of other failures: 21901 (39.495%)
latency average = 0.180 ms (including failures)
initial connection time = 2.857 ms
tps = 3356.092385 (without initial connection time)
```
I have attached the patch. I would appreciate your feedback.
Best regards,
Rintaro Ikeda
NTT DATA Corporation Japan
Hi Rintaro,
Thanks for the patch and explanation. I understand your goal is to ensure that pgbench
clients continue running even when some transactions fail due to application-level errors (e.g., constraint violations), especially when running custom scripts.
However, I wonder if the intended behavior can't already be achieved using standard SQL constructs — specifically ON CONFLICT
or careful transaction structure. For example, your sample script:
CREATE TABLE IF NOT EXISTS test (col1 serial, col2 int unique);
INSERT INTO test (col2) VALUES (random(0, 50000));
can be rewritten as:
\setrandom val 0 50000
INSERT INTO test (col2) VALUES (:val) ON CONFLICT DO NOTHING;
This avoids transaction aborts entirely in the presence of uniqueness violations and ensures the client continues to issue load without interruption. In many real-world benchmarking scenarios, this is the preferred and simplest approach.
So from that angle, could you elaborate on specific cases where this SQL-level workaround wouldn't be sufficient? Are there error types you intend to handle that cannot be gracefully avoided or recovered from using SQL constructs like ON CONFLICT
, or SAVEPOINT
/ROLLBACK TO
?
Best regards,
Stepan Neretin
RE: Suggestion to add --continue-client-on-abort option to pgbench
Dear Ikeda-san, Thanks for starting the new thread! I have never known the issue before I heard at PGConf.dev. Few comments: 1. This parameter seems a type of benchmark option. So should we set benchmarking_option_set as well? 2. Not sure, but exit-on-abort seems a similar option. What if both are specified? Is it allowed? 3. Can you consider a test case for the new parameter? Best regards, Hayato Kuroda FUJITSU LIMITED
Dear Kuroda-san, hackers, On 2025/06/04 21:57, Hayato Kuroda (Fujitsu) wrote: > Dear Ikeda-san, > > Thanks for starting the new thread! I have never known the issue before I heard at > PGConf.dev. > > Few comments: > > 1. > This parameter seems a type of benchmark option. So should we set > benchmarking_option_set as well? > > 2. > Not sure, but exit-on-abort seems a similar option. What if both are specified? > Is it allowed? > > 3. > Can you consider a test case for the new parameter? > > Best regards, > Hayato Kuroda > FUJITSU LIMITED > > Thank you for your valuable comment! 1. I should've also set benchmarking_option_set. I've modified it accordingly. 2. The exit-on-abort option and continue-on-error option are mutually exclusive. Therefore, I've updated the patch to throw a FATAL error when two options are set simultaneously. Corresponding explanation was also added. (I'm wondering the name of parameter should be continue-on-abort so that users understand the two option are mutually exclusive.) 3. I've added the test. Additionally, I modified the patch so that st->state does not transition to CSTATE_RETRY when a transaction fails and continue-on-error option is enabled. In the previous patch, we retry the failed transaction up to max-try times, which is unnecessary for our purpose: clients does not exit when its transactions fail. I've attached the updated patch. v3-0001-Add-continue-on-error-option-to-pgbench.patch is identical to v4-0001-Add-continue-on-error-option-to-pgbench.patch. The v4-0002 patch is the diff from the previous patch. Best regards, Rintaro Ikeda
Attachment
RE: Suggestion to add --continue-client-on-abort option to pgbench
Dear Ikeda-san, Thanks for updating the patch! > 1. I should've also set benchmarking_option_set. I've modified it accordingly. Confirmed it has been fixed. Thanks. > 2. The exit-on-abort option and continue-on-error option are mutually exclusive. > Therefore, I've updated the patch to throw a FATAL error when two options are > set simultaneously. Corresponding explanation was also added. > (I'm wondering the name of parameter should be continue-on-abort so that users > understand the two option are mutually exclusive.) Make sense, +1. Here are new comments. 01. build failure According to the cfbot [1], the documentation cannot be built. IIUC </para> seemed to be missed here: ``` + <para> + Note that this option can not be used together with + <option>--exit-on-abort</option>. + </listitem> + </varlistentry> ``` 02. patch separation How about separating the patch series like: 0001 - contains option handling and retry part, and documentation 0002 - contains accumulation/reporting part 0003 - contains tests. I hope above style is more helpful for reviewers. 03. documentation ``` + Note that this option can not be used together with + <option>--exit-on-abort</option>. ``` I feel we should add the similar description in `exit-on-abort` part. 04. documentation ``` + Client rolls back the failed transaction and starts a new one when its + transaction fails due to the reason other than the deadlock and + serialization failure. This allows all clients specified with -c option + to continuously apply load to the server, even if some transactions fail. ``` I feel the description contains bit redundant part and misses the default behavior. How about: ``` <para> Clients survive when their transactions are aborted, and they continue their run. Without the option, clients exit when transactions they run are aborted. </para> <para> Note that serialization failures or deadlock failures do not abort the client, so they are not affected by this option. See <xref linkend="failures-and-retries"/> for more information. </para> ``` 05. StatsData ``` + * When continue-on-error option is specified, + * failed (the number of failed transactions) = + * 'other_sql_failures' (they got a error when continue-on-error option + * was specified). ``` Let me confirm one point; can serialization_failures and deadlock_failures be counted when continue-on-error is true? If so, the comment seems not correct for me. The formula can be 'serialization_failures' + 'deadlock_failures' + 'deadlock_failures' in the case. 06. StatsData Another point; can other_sql_failures be counted when the continue-on-error is NOT specified? I feel it should be... 06. usage() Added line is too long. According to program_help_ok(), the output by help should be less than 80. 07. Please run pgindent/pgperltidy, I got some diffs. [1]: https://cirrus-ci.com/task/5210061275922432 Best regards, Hayato Kuroda FUJITSU LIMITED
On Mon, 9 Jun 2025 09:34:03 +0000 "Hayato Kuroda (Fujitsu)" <kuroda.hayato@fujitsu.com> wrote: > > 2. The exit-on-abort option and continue-on-error option are mutually exclusive. > > Therefore, I've updated the patch to throw a FATAL error when two options are > > set simultaneously. Corresponding explanation was also added. I don't think that's right since "abort" and "error" are different concept in pgbench. (Here, "abort" refers to the termination of a client, not a transaction abort.) The --exit-on-abort option forces to exit pgbench immediately when any client is aborted due to some error. When the --continue-on-error option is not set, SQL errors other than deadlock or serialization error cause a client to be aborted. On the other hand, when the option is set, clients are not aborted due to any SQL errors; instead they continue to run after them. However, clients can still be aborted for other reasons, such as connection failures or meta-command errors (e.g., \set x 1/0). In these cases, the --exit-on-abort option remains useful even when --continue-on-error is enabled. > > (I'm wondering the name of parameter should be continue-on-abort so that users > > understand the two option are mutually exclusive.) For the same reason as above, I believe --continue-on-error is a more accurate description of the option's behavior. > 02. patch separation > How about separating the patch series like: > > 0001 - contains option handling and retry part, and documentation > 0002 - contains accumulation/reporting part > 0003 - contains tests. > > I hope above style is more helpful for reviewers. I'm not sure whether it's necessary to split the patch, as the change doesn't seem very complex. However, the current separation appears inconsistent. For example, patch 0001 modifies canRetryError(), but patch 0002 reverts that change, and so on. > > 04. documentation > ``` > + Client rolls back the failed transaction and starts a new one when its > + transaction fails due to the reason other than the deadlock and > + serialization failure. This allows all clients specified with -c option > + to continuously apply load to the server, even if some transactions fail. > ``` > > I feel the description contains bit redundant part and misses the default behavior. > How about: > ``` > <para> > Clients survive when their transactions are aborted, and they continue > their run. Without the option, clients exit when transactions they run > are aborted. > </para> > <para> > Note that serialization failures or deadlock failures do not abort the > client, so they are not affected by this option. > See <xref linkend="failures-and-retries"/> for more information. > </para> > ``` I think we can make it clearer as follows: Allows clients to continue their run even if an SQL statement fails due to errors other than serialization or deadlock. Without this option, the client is aborted after such errors. Note that serialization and deadlock failures never cause the client to be aborted, so they are not affected by this option. See <xref linkend="failures-and-retries"/> for more information. That said, a review by a native English speaker would still be appreciated. Also, we would need to update several parts of the documentation. For example, the "Failures and Serialization/Deadlock Retries" section should be revised to describe the behavior change. In addition, we should update the explanations of output result examples and logging, the description of the --failures-detailed option, and so on. If transactions are not retried after SQL errors other than serialization or deadlock, this should also be explicitly documented. > 05. StatsData > ``` > + * When continue-on-error option is specified, > + * failed (the number of failed transactions) = > + * 'other_sql_failures' (they got a error when continue-on-error option > + * was specified). > ``` > > Let me confirm one point; can serialization_failures and deadlock_failures be > counted when continue-on-error is true? If so, the comment seems not correct for me. > The formula can be 'serialization_failures' + 'deadlock_failures' + 'deadlock_failures' > in the case. +1 > 06. StatsData > Another point; can other_sql_failures be counted when the continue-on-error is NOT > specified? I feel it should be... We could do that. However, if an SQL error other than a serialization or deadlock error occurs when --continue-on-error is not set, pgbench will be aborted midway and the printed results will be incomplete. Therefore, this might not make much sense. > 06. usage() > Added line is too long. According to program_help_ok(), the output by help should > be less than 80. +1 Here are additional comments from me. @@ -4548,6 +4570,8 @@ getResultString(bool skipped, EStatus estatus) return "serialization"; case ESTATUS_DEADLOCK_ERROR: return "deadlock"; + case ESTATUS_OTHER_SQL_ERROR: + return "error (except serialization/deadlock)"; Strings returned by getResultString() are printed in the "time" field of the log when both the -l and --failures-detailed options are set. Therefore, they should be single words that do not contain any space characters. I wonder if something like "other" or "other_sql_error" would be appropriate. @@ -4099,6 +4119,7 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg) * can retry the error. */ st->state = timer_exceeded ? CSTATE_FINISHED : + continue_on_error ? CSTATE_FAILURE : doRetry(st, &now) ? CSTATE_RETRY : CSTATE_FAILURE; } else This fix is not necessary because doRetry() (and canRetryError(), which is called within it) will return false when continue_on_error is set (after applying patch 0002). case PGRES_NONFATAL_ERROR: case PGRES_FATAL_ERROR: st->estatus = getSQLErrorStatus(PQresultErrorField(res, PG_DIAG_SQLSTATE)); if (canRetryError(st->estatus)) { if (verbose_errors) commandError(st, PQerrorMessage(st->con)); goto error; } /* fall through */ default: /* anything else is unexpected */ pg_log_error("client %d script %d aborted in command %d query %d: %s", st->id, st->use_file, st->command, qrynum, PQerrorMessage(st->con)); goto error; } When an SQL error other than a serialization or deadlock error occurs, an error message is output via pg_log_error in this code path. However, I think this should be reported only when verbose_errors is set, similar to how serialization and deadlock errors are handled when --continue-on-error is enabled Best regards, Yugo Nagata -- Yugo Nagata <nagata@sraoss.co.jp>
RE: Suggestion to add --continue-client-on-abort option to pgbench
Dear Nagata-san, > > > 2. The exit-on-abort option and continue-on-error option are mutually > exclusive. > > > Therefore, I've updated the patch to throw a FATAL error when two options > are > > > set simultaneously. Corresponding explanation was also added. > > I don't think that's right since "abort" and "error" are different concept in pgbench. > (Here, "abort" refers to the termination of a client, not a transaction abort.) > > The --exit-on-abort option forces to exit pgbench immediately when any client is > aborted > due to some error. When the --continue-on-error option is not set, SQL errors > other than > deadlock or serialization error cause a client to be aborted. On the other hand, > when the option > is set, clients are not aborted due to any SQL errors; instead they continue to run > after them. > However, clients can still be aborted for other reasons, such as connection > failures or > meta-command errors (e.g., \set x 1/0). In these cases, the --exit-on-abort option > remains > useful even when --continue-on-error is enabled. To clarify: another approach is that allow --continue-on-error option to continue running even when clients meet such errors. Which one is better? > > 02. patch separation > > How about separating the patch series like: > > > > 0001 - contains option handling and retry part, and documentation > > 0002 - contains accumulation/reporting part > > 0003 - contains tests. > > > > I hope above style is more helpful for reviewers. > > I'm not sure whether it's necessary to split the patch, as the change doesn't seem > very > complex. However, the current separation appears inconsistent. For example, > patch 0001 > modifies canRetryError(), but patch 0002 reverts that change, and so on. Either way is fine for me if they are changed from the current method. > > > > > 04. documentation > > ``` > > + Client rolls back the failed transaction and starts a new one when its > > + transaction fails due to the reason other than the deadlock and > > + serialization failure. This allows all clients specified with -c option > > + to continuously apply load to the server, even if some transactions > fail. > > ``` > > > > I feel the description contains bit redundant part and misses the default > behavior. > > How about: > > ``` > > <para> > > Clients survive when their transactions are aborted, and they continue > > their run. Without the option, clients exit when transactions they run > > are aborted. > > </para> > > <para> > > Note that serialization failures or deadlock failures do not abort the > > client, so they are not affected by this option. > > See <xref linkend="failures-and-retries"/> for more information. > > </para> > > ``` > > I think we can make it clearer as follows: I do not have confident for English, native speaker is needed.... > > 06. usage() > > Added line is too long. According to program_help_ok(), the output by help > should > > be less than 80. > > +1 FYI - I posted a patch which adds the test. You can apply and confirm how the function says. [1]: https://www.postgresql.org/message-id/OSCPR01MB1496610451F5896375B2562E6F56BA%40OSCPR01MB14966.jpnprd01.prod.outlook.com Best regards, Hayato Kuroda FUJITSU LIMITED
Hi, Thank you very much for your valuable comments and kind advice. I'm currently working on revising the previous patch based on the feedback received. I would like to share my thoughts regarding the conditions under which the --continue-on-error option should initiate a new transaction or a new connection. In my opinion, when the --continue-on-error option is enabled, pgbench clients does not need to start new transactions after network errors and other errors except for SQL-level errors. Network errors are relatively rare, except in failover scenarios. Outside of failover, any network issues should be resolved rather than worked around. In the context of failover, the key metric is not TPS, but system downtime. While one might infer the timing of a failover by observing by using --progress option, you can easily determine the downtime by executing simple SQL query such as `psql -c 'SELECT 1` every second. On 2025/06/26 18:47, Yugo Nagata wrote: > On Thu, 26 Jun 2025 05:45:12 +0000 > "Hayato Kuroda (Fujitsu)" <kuroda.hayato@fujitsu.com> wrote: > >> Dear Nagata-san, >> >>> As I understand it, the current patch aims to allow continuation only >>> after >>> SQL-level >>> errors, such as constraint violations. That seems reasonable, as it >>> can simulate >>> the >>> behavior of applications that ignore or retry such errors (even >>> though retries are >>> not >>> implemented in the current patch). >> >> Yes, no one has objections to retry in this case. This is a main part >> of the proposal., > > As I understand it, the proposed --continue-on-error option does not > retry the transaction > in any case; it simply gives up on the transaction. That is, when an > SQL-level error occurs, > the transaction is reported as "failed" rather than "retried", and the > random state is discarded. Retrying the failed transaction is not necessary when the transaction failed due to SQL-level errors. Unlike real-world applications, pgbench does not need to complete specific transaction successfully. In the case of unique constraint violations, retrying the same transaction will likely to result in the same error again. I want to hear your thoughts on this. Best regards, Rintaro Ikeda