Re: [HACKERS] WIP aPatch: Pgbench Serialization and deadlock errors - Mailing list pgsql-hackers
From | Yugo NAGATA |
---|---|
Subject | Re: [HACKERS] WIP aPatch: Pgbench Serialization and deadlock errors |
Date | |
Msg-id | 20210622235459.4b2e38e268488014226b3262@sraoss.co.jp Whole thread Raw |
In response to | Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors (Yugo NAGATA <nagata@sraoss.co.jp>) |
Responses |
Re: [HACKERS] WIP aPatch: Pgbench Serialization and deadlock errors
Re: [HACKERS] WIP aPatch: Pgbench Serialization and deadlock errors Re: [HACKERS] WIP aPatch: Pgbench Serialization and deadlock errors |
List | pgsql-hackers |
Hi hackers, On Mon, 24 May 2021 11:29:10 +0900 Yugo NAGATA <nagata@sraoss.co.jp> wrote: > Hi hackers, > > On Tue, 10 Mar 2020 09:48:23 +1300 > Thomas Munro <thomas.munro@gmail.com> wrote: > > > On Tue, Mar 10, 2020 at 8:43 AM Fabien COELHO <coelho@cri.ensmp.fr> wrote: > > > >> Thank you very much! I'm going to send a new patch set until the end of > > > >> this week (I'm sorry I was very busy in the release of Postgres Pro > > > >> 11...). > > > > > > > > Is anyone interested in rebasing this, and summarising what needs to > > > > be done to get it in? It's arguably a bug or at least quite > > > > unfortunate that pgbench doesn't work with SERIALIZABLE, and I heard > > > > that a couple of forks already ship Marina's patch set. > > I got interested in this and now looking into the patch and the past discussion. > If anyone other won't do it and there are no objection, I would like to rebase > this. Is that okay? I rebased and fixed the previous patches (v11) rewtten by Marina Polyakova, and attached the revised version (v12). v12-0001-Pgbench-errors-use-the-Variables-structure-for-c.patch - a patch for the Variables structure (this is used to reset client variables during the repeating of transactions after serialization/deadlock failures). v12-0002-Pgbench-errors-and-serialization-deadlock-retrie.patch - the main patch for handling client errors and repetition of transactions with serialization/deadlock failures (see the detailed description in the file). These are the revised versions from v11-0002 and v11-0003. v11-0001 (for the RandomState structure) is not included because this has been already committed (40923191944). V11-0004 (for a separate error reporting function) is not included neither because pgbench now uses common logging APIs (30a3e772b40). In addition to rebase on master, I updated the patch according with the review from Fabien COELHO [1] and discussions after this. Also, I added some other fixes through my reviewing the previous patch. [1] https://www.postgresql.org/message-id/alpine.DEB.2.21.1809081450100.10506%40lancre Following are fixes according with Fabian's review. > * Features > As far as the actual retry feature is concerned, I'd say we are nearly > there. However I have an issue with changing the behavior on meta command > and other sql errors, which I find not desirable. ... > I do not think that these changes of behavior are desirable. Meta command and > miscellaneous SQL errors should result in immediatly aborting the whole run, > because the client test code itself could not run correctly or the SQL sent > was somehow wrong, which is also the client's fault, and the server > performance bench does not make much sense in such conditions. > > ISTM that the focus of this patch should only be to handle some server > runtime errors that can be retryed, but not to change pgbench behavior on > other kind of errors. If these are to be changed, ISTM that it would be a > distinct patch and would require some discussion, and possibly an option > to enable it or not if some use case emerge. AFA this patch is concerned, > I'd suggest to let that out. Previously, all SQL and meta command errors could be retried, but I fixed to allow only serialization & deadlock errors to be retried. > Doc says "you cannot use an infinite number of retries without latency-limit..." > > Why should this be forbidden? At least if -T timeout takes precedent and > shortens the execution, ISTM that there could be good reason to test that. > Maybe it could be blocked only under -t if this would lead to an non-ending > run. I fixed to allow to use --max-tries with -T option even if latency-limit is not used. > As "--print-errors" is really for debug, maybe it could be named > "--debug-errors". I'm not sure that having "--debug" implying this option > is useful: As there are two distinct options, the user may be allowed > to trigger one or the other as they wish? print-errors was renamed to debug-errors. > makeVariableValue error message is not for debug, but must be kept in all > cases, and the false returned must result in an immediate abort. Same thing about > lookupCreateVariable, an invalid name is a user error which warrants an immediate > abort. Same thing again about coerce* functions or evalStandardFunc... > Basically, most/all added "debug_level >= DEBUG_ERRORS" are not desirable. "DEBUG_ERRORS" messages unrelated to serialization & deadlock errors were removed. > sendRollback(): I'd suggest to simplify. The prepare/extended statement stuff is > really about the transaction script, not dealing with errors, esp as there is no > significant advantage in preparing a "ROLLBACK" statement which is short and has > no parameters. I'd suggest to remove this function and just issue > PQsendQuery("ROLLBACK;") in all cases. Now, we just issue PQsendQuery("ROLLBACK;"). > In copyVariables, I'd simplify > > + if (source_var->svalue == NULL) > + dest_var->svalue = NULL; > + else > + dest_var->svalue = pg_strdup(source_var->svalue); > >as: > dest_var->value = (source_var->svalue == NULL) ? NULL : pg_strdup(source_var->svalue); Fixed using a ternary operator. > + if (sqlState) -> if (sqlState != NULL) ? Fixed. > Function getTransactionStatus name does not seem to correspond fully to what the > function does. There is a passthru case which should be either avoided or > clearly commented. This was renamed to checkTransactionStatus according with [2]. [2] https://www.postgresql.org/message-id/c262e889315625e0fc0d77ca78fe2eac%40postgrespro.ru > - commandFailed(st, "SQL", "perhaps the backend died while processing"); > + clientAborted(st, > + "perhaps the backend died while processing"); > > keep on one line? This fix that replaced commandFailed with clientAborted was removed. (See below) > + if (doRetry(st, &now)) > + st->state = CSTATE_RETRY; > + else > + st->state = CSTATE_FAILURE; > > -> st->state = doRetry(st, &now) ? CSTATE_RETRY : CSTATE_FAILURE; Fixed using a ternary operator. > * Comments > "There're different types..." -> "There are different types..." > "after the errors and"... -> "after errors and"... > "the default value of max_tries is set to 1" -> "the default value > of max_tries is 1" > "We cannot retry the transaction" -> "We cannot retry a transaction" > "may ultimately succeed or get a failure," -> "may ultimately succeed or fail," Fixed. > Overall, the comment text in StatsData is very clear. However they are not > clearly linked to the struct fields. I'd suggest that earch field when used > should be quoted, so as to separate English from code, and the struct name > should always be used explicitely when possible. The comment in StatsData was fixed to clarify what each filed in this struct represents. > I'd insist in a comment that "cnt" does not include "skipped" transactions > (anymore). StatsData.cnt has a comment "number of successful transactions, not including 'skipped'", and CState.cnt has a comment "skipped and failed transactions are also counted here". > * Documentation: > ISTM that there are too many "the": > - "turns on the option ..." -> "turns on option ..." > - "When the option ..." -> "When option ..." > - "By default the option ..." -> "By default option ..." > - "only if the option ..." -> "only if option ..." > - "combined with the option ..." -> "combined with option ..." > - "without the option ..." -> "without option ..." The previous patch used a lot of "the option xxxx", but I fixed them to "the xxxx option" because I found that the documentation uses such way for referring to a certain option. For example, - You can (and, for most purposes, probably should) increase the number of rows by using the <option>-s</option> (scale factor) option. - The prefix can be changed by using the <option>--log-prefix</option> option. - If the <option>-j</option> option is 2 or higher, so that there are multiple worker threads, > - "is the sum of all the retries" -> "is the sum of all retries" > "infinite" -> "unlimited" > "not retried at all" -> "not retried" (maybe several times). > "messages of all errors" -> "messages about all errors". > "It is assumed that the scripts used do not contain" -> > "It is assumed that pgbench scripts do not contai Fixed. Following are additional fixes based on my review on the previous patch. * About error reporting In the previous patch, commandFailed() was changed to report an error that doesn't immediately abort the client, and clientAborted() was added to report an abortion of the client. In the attached patch, behaviors around errors other than serialization and deadlock are not changed and such errors cause the client to abort, so commandFaile() is used without any changes to report a client abortion, and commandError() is added to report an error that can be retried under --debug-error. * About progress reporting In the previous patch, the number of failures was reported only when any transaction was failed, and statistics of retry was reported only when any transaction was retried. This means, the number of columns in the reporting were different depending on the interval. This was odd and harder to parse the output. In the attached patch, the number of failures is always reported, and the retry statistic is reported when max-tries is not 1. * About result outputs In the previous patch, the number of failed transaction, the number of retried transaction, and the number of total retries were reported as: number of failures: 324 (3.240%) ... number of retried: 5629 (56.290%) number of retries: 103299 I think this was confusable. Especially, it was unclear for me what "retried" and "retries" represent repectively. Therefore, in the attached patch, they are reported as: number of transactions failed: 324 (3.240%) ... number of transactions retried: 5629 (56.290%) number of total retries: 103299 which clarify that first two are the numbers of transactions and the last one is the number of retries over all transactions. * Abourt average connection time In the previous patch, this was calculated as "conn_total_duration / total->cnt" where conn_total_duration is the cumulated connection time sumed over threads and total->cnt is the number of transaction that is successfully processed. However, the average connection time could be overestimated because conn_total_duration includes a connection time of failed transaction due to serialization and deadlock errors. So, in the attached patch, this is calculated as "conn_total_duration / total->cnt + failures". Regards, Yugo Nagata -- Yugo NAGATA <nagata@sraoss.co.jp>
Attachment
pgsql-hackers by date: