Thread: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors
Hello, hackers!

Now in pgbench we can test only transactions with Read Committed isolation level because client sessions are disconnected forever on serialization failures. There were some proposals and discussions about it (see message here [1] and thread here [2]).

I suggest a patch where pgbench client sessions are not disconnected because of serialization or deadlock failures and these failures are mentioned in reports. In details:
- transaction with one of these failures continue run normally, but its result is rolled back;
- if there were these failures during script execution this "transaction" is marked appropriately in logs;
- numbers of "transactions" with these failures are printed in progress, in aggregation logs and in the end with other results (all and for each script);

Advanced options:
- mostly for testing built-in scripts: you can set the default transaction isolation level by the appropriate benchmarking option (-I);
- for more detailed reports: to know per-statement serialization and deadlock failures you can use the appropriate benchmarking option (--report-failures).

Also: TAP tests for new functionality and changed documentation with new examples.

Patches are attached. Any suggestions are welcome!

P.S. Does this use case (do not retry transaction with serialization or deadlock failure) is most interesting or failed transactions should be retried (and how much times if there seems to be no hope of success...)?

[1] https://www.postgresql.org/message-id/4EC65830020000250004323F%40gw.wicourts.gov
[2] https://www.postgresql.org/message-id/flat/alpine.DEB.2.02.1305182259550.1473%40localhost6.localdomain6#alpine.DEB.2.02.1305182259550.1473@localhost6.localdomain6

--
Marina Polyakova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On Wed, Jun 14, 2017 at 4:48 AM, Marina Polyakova <m.polyakova@postgrespro.ru> wrote: > Now in pgbench we can test only transactions with Read Committed isolation > level because client sessions are disconnected forever on serialization > failures. There were some proposals and discussions about it (see message > here [1] and thread here [2]). > > I suggest a patch where pgbench client sessions are not disconnected because > of serialization or deadlock failures and these failures are mentioned in > reports. In details: > - transaction with one of these failures continue run normally, but its > result is rolled back; > - if there were these failures during script execution this "transaction" is > marked > appropriately in logs; > - numbers of "transactions" with these failures are printed in progress, in > aggregation logs and in the end with other results (all and for each > script); > > Advanced options: > - mostly for testing built-in scripts: you can set the default transaction > isolation level by the appropriate benchmarking option (-I); > - for more detailed reports: to know per-statement serialization and > deadlock failures you can use the appropriate benchmarking option > (--report-failures). > > Also: TAP tests for new functionality and changed documentation with new > examples. > > Patches are attached. Any suggestions are welcome! Sounds like a good idea. Please add to the next CommitFest and review somebody else's patch in exchange for having your own patch reviewed. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
> Sounds like a good idea. Thank you! > Please add to the next CommitFest Done: https://commitfest.postgresql.org/14/1170/ > and review > somebody else's patch in exchange for having your own patch reviewed. Of course, I remember about it. -- Marina Polyakova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Hi,

On 2017-06-14 11:48:25 +0300, Marina Polyakova wrote:
> Now in pgbench we can test only transactions with Read Committed isolation
> level because client sessions are disconnected forever on serialization
> failures. There were some proposals and discussions about it (see message
> here [1] and thread here [2]).

> I suggest a patch where pgbench client sessions are not disconnected because
> of serialization or deadlock failures and these failures are mentioned in
> reports.

I think that's a good idea and sorely needed.

> In details:
> - if there were these failures during script execution this "transaction" is
> marked appropriately in logs;
> - numbers of "transactions" with these failures are printed in progress, in
> aggregation logs and in the end with other results (all and for each
> script);

I guess that'll include a 'rolled-back %' or 'retried %' somewhere?

> Advanced options:
> - mostly for testing built-in scripts: you can set the default transaction
> isolation level by the appropriate benchmarking option (-I);

I'm less convinced of the need of that, you can already set arbitrary
connection options with
PGOPTIONS='-c default_transaction_isolation=serializable' pgbench

> P.S. Does this use case (do not retry transaction with serialization or
> deadlock failure) is most interesting or failed transactions should be
> retried (and how much times if there seems to be no hope of success...)?

I can't quite parse that sentence, could you restate?

- Andres
On Thu, Jun 15, 2017 at 2:16 PM, Andres Freund <andres@anarazel.de> wrote: > On 2017-06-14 11:48:25 +0300, Marina Polyakova wrote: >> I suggest a patch where pgbench client sessions are not disconnected because >> of serialization or deadlock failures and these failures are mentioned in >> reports. > > I think that's a good idea and sorely needed. +1 >> P.S. Does this use case (do not retry transaction with serialization or >> deadlock failure) is most interesting or failed transactions should be >> retried (and how much times if there seems to be no hope of success...)? > > I can't quite parse that sentence, could you restate? The way I read it was that the most interesting solution would retry a transaction from the beginning on a serialization failure or deadlock failure. Most people who use serializable transactions (at least in my experience) run though a framework that does that automatically, regardless of what client code initiated the transaction. These retries are generally hidden from the client code -- it just looks like the transaction took a bit longer. Sometimes people will have a limit on the number of retries. I never used such a limit and never had a problem, because our implementation of serializable transactions will not throw a serialization failure error until one of the transactions involved in causing it has successfully committed -- meaning that the retry can only hit this again on a *new* set of transactions. Essentially, the transaction should only count toward the TPS rate when it eventually completes without a serialization failure. Marina, did I understand you correctly? -- Kevin Grittner VMware vCenter Server https://www.vmware.com/
Kevin Grittner wrote: > On Thu, Jun 15, 2017 at 2:16 PM, Andres Freund <andres@anarazel.de> wrote: > > On 2017-06-14 11:48:25 +0300, Marina Polyakova wrote: > >> P.S. Does this use case (do not retry transaction with serialization or > >> deadlock failure) is most interesting or failed transactions should be > >> retried (and how much times if there seems to be no hope of success...)? > > > > I can't quite parse that sentence, could you restate? > > The way I read it was that the most interesting solution would retry > a transaction from the beginning on a serialization failure or > deadlock failure. As far as I understand her proposal, it is exactly the opposite -- if a transaction fails, it is discarded. And this P.S. note is asking whether this is a good idea, or would we prefer that failing transactions are retried. I think it's pretty obvious that transactions that failed with some serializability problem should be retried. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Fri, Jun 16, 2017 at 9:18 AM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > Kevin Grittner wrote: >> On Thu, Jun 15, 2017 at 2:16 PM, Andres Freund <andres@anarazel.de> wrote: >> > On 2017-06-14 11:48:25 +0300, Marina Polyakova wrote: > >> >> P.S. Does this use case (do not retry transaction with serialization or >> >> deadlock failure) is most interesting or failed transactions should be >> >> retried (and how much times if there seems to be no hope of success...)? >> > >> > I can't quite parse that sentence, could you restate? >> >> The way I read it was that the most interesting solution would retry >> a transaction from the beginning on a serialization failure or >> deadlock failure. > > As far as I understand her proposal, it is exactly the opposite -- if a > transaction fails, it is discarded. And this P.S. note is asking > whether this is a good idea, or would we prefer that failing > transactions are retried. > > I think it's pretty obvious that transactions that failed with > some serializability problem should be retried. +1 for retry with reporting of retry rates -- Thomas Munro http://www.enterprisedb.com
On Thu, Jun 15, 2017 at 4:18 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > Kevin Grittner wrote: > As far as I understand her proposal, it is exactly the opposite -- if a > transaction fails, it is discarded. And this P.S. note is asking > whether this is a good idea, or would we prefer that failing > transactions are retried. > > I think it's pretty obvious that transactions that failed with > some serializability problem should be retried. Agreed all around. -- Kevin Grittner VMware vCenter Server https://www.vmware.com/
> Hi,

Hello!

> I think that's a good idea and sorely needed.

Thanks, I'm very glad to hear it!

>> - if there were these failures during script execution this "transaction" is
>> marked appropriately in logs;
>> - numbers of "transactions" with these failures are printed in progress, in
>> aggregation logs and in the end with other results (all and for each
>> script);
>
> I guess that'll include a 'rolled-back %' or 'retried %' somewhere?

Not exactly; see the documentation:

+ If transaction has serialization / deadlock failure or them both (last thing
+ is possible if used script contains several transactions; see
+ <xref linkend="transactions-and-scripts"
+ endterm="transactions-and-scripts-title"> for more information), its
+ <replaceable>time</> will be reported as <literal>serialization failure</> /
+ <literal>deadlock failure</> /
+ <literal>serialization and deadlock failures</> appropriately.
+ Example with serialization, deadlock and both these failures:
+<screen>
+1 128 24968 0 1496759158 426984
+0 129 serialization failure 0 1496759158 427023
+3 129 serialization failure 0 1496759158 432662
+2 128 serialization failure 0 1496759158 432765
+0 130 deadlock failure 0 1496759159 460070
+1 129 serialization failure 0 1496759160 485188
+2 129 serialization and deadlock failures 0 1496759160 485339
+4 130 serialization failure 0 1496759160 485465
+</screen>

From the proposals in the next messages of this thread, I have understood that the most interesting case is to retry the failed transaction. Do you think it's better to write, for example, 'rolled-back after % retries (serialization failure)' or 'time (retried % times, serialization and deadlock failures)'?

>> Advanced options:
>> - mostly for testing built-in scripts: you can set the default transaction
>> isolation level by the appropriate benchmarking option (-I);
>
> I'm less convinced of the need of that, you can already set arbitrary
> connection options with
> PGOPTIONS='-c default_transaction_isolation=serializable' pgbench

Oh, thanks, I forgot about it =[

>> P.S. Does this use case (do not retry transaction with serialization or
>> deadlock failure) is most interesting or failed transactions should be
>> retried (and how much times if there seems to be no hope of success...)?
>
> I can't quite parse that sentence, could you restate?

Álvaro Herrera later in this thread understood my text correctly:

> As far as I understand her proposal, it is exactly the opposite -- if a
> transaction fails, it is discarded. And this P.S. note is asking
> whether this is a good idea, or would we prefer that failing
> transactions are retried.

With his explanation, has my text become clearer?

--
Marina Polyakova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
>>> P.S. Does this use case (do not retry transaction with serialization >>> or >>> deadlock failure) is most interesting or failed transactions should >>> be >>> retried (and how much times if there seems to be no hope of >>> success...)? >> >> I can't quite parse that sentence, could you restate? > > The way I read it was that the most interesting solution would retry > a transaction from the beginning on a serialization failure or > deadlock failure. Most people who use serializable transactions (at > least in my experience) run though a framework that does that > automatically, regardless of what client code initiated the > transaction. These retries are generally hidden from the client > code -- it just looks like the transaction took a bit longer. > Sometimes people will have a limit on the number of retries. I > never used such a limit and never had a problem, because our > implementation of serializable transactions will not throw a > serialization failure error until one of the transactions involved > in causing it has successfully committed -- meaning that the retry > can only hit this again on a *new* set of transactions. > > Essentially, the transaction should only count toward the TPS rate > when it eventually completes without a serialization failure. > > Marina, did I understand you correctly? Álvaro Herrera in next message of this thread has understood my text right: > As far as I understand her proposal, it is exactly the opposite -- if a > transaction fails, it is discarded. And this P.S. note is asking > whether this is a good idea, or would we prefer that failing > transactions are retried. And thank you very much for your explanation how and why transactions with failures should be retried! I'll try to implement all of it. -- Marina Polyakova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
>>>> P.S. Does this use case (do not retry transaction with serialization or
>>>> deadlock failure) is most interesting or failed transactions should be
>>>> retried (and how much times if there seems to be no hope of success...)?
>>>
>>> I can't quite parse that sentence, could you restate?
>>
>> The way I read it was that the most interesting solution would retry
>> a transaction from the beginning on a serialization failure or
>> deadlock failure.
>
> As far as I understand her proposal, it is exactly the opposite -- if a
> transaction fails, it is discarded. And this P.S. note is asking
> whether this is a good idea, or would we prefer that failing
> transactions are retried.

Yes, that is what I meant, thank you!

> I think it's pretty obvious that transactions that failed with
> some serializability problem should be retried.

Thank you for voting :)

--
Marina Polyakova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On Fri, Jun 16, 2017 at 5:31 AM, Marina Polyakova <m.polyakova@postgrespro.ru> wrote: > And thank you very much for your explanation how and why transactions with > failures should be retried! I'll try to implement all of it. To be clear, part of "retrying from the beginning" means that if a result from one statement is used to determine the content (or whether to run) a subsequent statement, that first statement must be run in the new transaction and the results evaluated again to determine what to use for the later statement. You can't simply replay the statements that were run during the first try. For examples, to help get a feel of why that is, see: https://wiki.postgresql.org/wiki/SSI -- Kevin Grittner VMware vCenter Server https://www.vmware.com/
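To make this concrete, here is a minimal client-side sketch of a "retry from the beginning" loop in libpq. The table, the balance test and the function names are hypothetical; only SQLSTATE 40001 (serialization_failure) and 40P01 (deadlock_detected) are the error codes PostgreSQL actually reports for these two cases. The SELECT is re-issued on every attempt precisely because its result may differ in the new snapshot.

#include <stdbool.h>
#include <stdlib.h>
#include <string.h>
#include <libpq-fe.h>

static bool
retryable_error(const PGresult *res)
{
	const char *sqlstate = PQresultErrorField(res, PG_DIAG_SQLSTATE);

	return sqlstate != NULL &&
		(strcmp(sqlstate, "40001") == 0 ||	/* serialization_failure */
		 strcmp(sqlstate, "40P01") == 0);	/* deadlock_detected */
}

/* Returns the number of tries used on success, -1 on failure. */
static int
transfer_with_retries(PGconn *conn, int max_tries)
{
	for (int tries = 1; tries <= max_tries; tries++)
	{
		PGresult   *res;
		bool		ok = true;
		bool		retry = false;

		PQclear(PQexec(conn, "BEGIN ISOLATION LEVEL SERIALIZABLE"));

		/*
		 * The first statement's result decides whether the second one runs,
		 * so it has to be executed and evaluated again on every attempt.
		 */
		res = PQexec(conn, "SELECT abalance FROM pgbench_accounts WHERE aid = 1");
		if (PQresultStatus(res) != PGRES_TUPLES_OK || PQntuples(res) != 1)
		{
			ok = false;
			retry = retryable_error(res);
		}
		else if (atoi(PQgetvalue(res, 0, 0)) >= 100)
		{
			PQclear(res);
			res = PQexec(conn, "UPDATE pgbench_accounts "
						 "SET abalance = abalance - 100 WHERE aid = 1");
			if (PQresultStatus(res) != PGRES_COMMAND_OK)
			{
				ok = false;
				retry = retryable_error(res);
			}
		}
		PQclear(res);

		if (ok)
		{
			/* a complete version would also classify a failing COMMIT,
			 * which can itself raise 40001 at this isolation level */
			PQclear(PQexec(conn, "COMMIT"));
			return tries;		/* counts toward the TPS rate only now */
		}

		PQclear(PQexec(conn, "ROLLBACK"));	/* leave the aborted transaction */
		if (!retry)
			break;				/* non-retryable error: give up */
	}
	return -1;					/* failed even after the allowed tries */
}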
> To be clear, part of "retrying from the beginning" means that if a
> result from one statement is used to determine the content (or
> whether to run) a subsequent statement, that first statement must be
> run in the new transaction and the results evaluated again to
> determine what to use for the later statement. You can't simply
> replay the statements that were run during the first try. For
> examples, to help get a feel of why that is, see:
> https://wiki.postgresql.org/wiki/SSI
Thank you again! :))
--
Marina Polyakova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Hello Marina,

A few comments about the submitted patches.

I agree that improving the error handling ability of pgbench is a good thing, although I'm not sure about the implications...

About the "retry" discussion: I agree that retry is the relevant option from an application point of view.

ISTM that the retry implementation should be implemented somehow in the automaton, restarting the same script from the beginning.

As pointed out in the discussion, the same values/commands should be executed, which suggests that random generated values should be the same on the retry runs, so that for a simple script the same operations are attempted. This means that the random generator state must be kept & reinstated for a client on retries. Currently the random state is in the thread, which is not convenient for this purpose, so it should be moved in the client so that it can be saved at transaction start and reinstated on retries.

The number of retries and maybe failures should be counted, maybe with some adjustable maximum, as suggested.

About 0001:

In accumStats, just use one level if, the two levels bring nothing.

In doLog, added columns should be at the end of the format. The number of columns MUST NOT change when different issues arise, so that it works well with cut/... unix commands, so inserting a sentence such as "serialization and deadlock failures" is a bad idea.

threadRun: the point of the progress format is to fit on one not too wide line on a terminal and to allow some simple automatic processing. Adding a verbose sentence in the middle of it is not the way to go.

About tests: I do not understand why test 003 includes 2 transactions. It would seem more logical to have two scripts.

About 0003:

I'm not sure that there should be a new option to report failures, the information when relevant should be integrated in a clean format into the existing reports... Maybe the "per command latency" report/option should be renamed if it becomes more general.

About 0004:

The documentation must not be in a separate patch, but in the same patch as their corresponding code.

--
Fabien.
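A rough sketch of the "keep & reinstate the random state" idea from the message above. The structure and field names are hypothetical stand-ins for the relevant part of pgbench's client state; they only illustrate saving the pg_erand48() seed at transaction start and putting it back before a retry.

#include <string.h>

typedef struct RetryState
{
	unsigned short random_state[3];	/* pg_erand48() seed at transaction start */
	int			first_command;		/* command that starts the transaction */
} RetryState;

typedef struct Client				/* stand-in for the relevant CState fields */
{
	unsigned short random_state[3];	/* per client now, instead of per thread */
	int			command;			/* current command index */
	RetryState	retry_state;		/* kept by value: no malloc/free cycles */
	int			retries;			/* retries of the current transaction */
} Client;

/* called when a transaction starts */
static void
save_retry_state(Client *st)
{
	memcpy(st->retry_state.random_state, st->random_state,
		   sizeof(st->random_state));
	st->retry_state.first_command = st->command;
}

/* called when a serialization/deadlock failure is going to be retried */
static void
restore_retry_state(Client *st)
{
	memcpy(st->random_state, st->retry_state.random_state,
		   sizeof(st->random_state));
	st->command = st->retry_state.first_command;
	st->retries++;
}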
> Hello Marina,

Hello, Fabien!

> A few comments about the submitted patches.

Thank you very much for them!

> I agree that improving the error handling ability of pgbench is a good
> thing, although I'm not sure about the implications...

Could you tell me a little more exactly what implications you are worried about?

> About the "retry" discussion: I agree that retry is the relevant
> option from an application point of view.

I'm glad to hear it!

> ISTM that the retry implementation should be implemented somehow in
> the automaton, restarting the same script from the beginning.

If there are several transactions in this script - don't you think that we should restart only the failed transaction?..

> As pointed out in the discussion, the same values/commands should be
> executed, which suggests that random generated values should be the
> same on the retry runs, so that for a simple script the same
> operations are attempted. This means that the random generator state
> must be kept & reinstated for a client on retries. Currently the
> random state is in the thread, which is not convenient for this
> purpose, so it should be moved in the client so that it can be saved
> at transaction start and reinstated on retries.

I think about it in the same way =)

> The number of retries and maybe failures should be counted, maybe with
> some adjustable maximum, as suggested.

If we fix the maximum number of attempts, the maximum number of failures for one script execution will be bounded above (number_of_transactions_in_script * maximum_number_of_attempts). Do you think we should make an option in the program to limit this number much more?

> About 0001:
>
> In accumStats, just use one level if, the two levels bring nothing.

Thanks, I agree =[

> In doLog, added columns should be at the end of the format.

I have inserted them earlier because these columns are not optional. Do you think they should be optional?

> The number of columns MUST NOT change when different issues arise, so that it
> works well with cut/... unix commands, so inserting a sentence such as
> "serialization and deadlock failures" is a bad idea.

Thanks, I agree again.

> threadRun: the point of the progress format is to fit on one not too
> wide line on a terminal and to allow some simple automatic processing.
> Adding a verbose sentence in the middle of it is not the way to go.

I was thinking about it.. Thanks, I'll try to make it shorter.

> About tests: I do not understand why test 003 includes 2 transactions.
> It would seem more logical to have two scripts.

Ok!

> About 0003:
>
> I'm not sure that there should be a new option to report failures,
> the information when relevant should be integrated in a clean format
> into the existing reports... Maybe the "per command latency"
> report/option should be renamed if it becomes more general.

I have tried not to change other parts of the program as much as possible. But if you think that it will be more useful to change the option I'll do it.

> About 0004:
>
> The documentation must not be in a separate patch, but in the same
> patch as their corresponding code.

Ok!

--
Marina Polyakova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Hello Marina,

>> I agree that improving the error handling ability of pgbench is a good
>> thing, although I'm not sure about the implications...
>
> Could you tell me a little more exactly what implications you are worried
> about?

The current error handling is either "close connection" or maybe in some cases even "exit". If this is changed, then the client may continue execution in some unforeseen state and behave unexpectedly. We'll see.

>> ISTM that the retry implementation should be implemented somehow in
>> the automaton, restarting the same script from the beginning.
>
> If there are several transactions in this script - don't you think that we
> should restart only the failed transaction?..

On some transaction failures based on their status. My point is that the retry process must be implemented clearly with a new state in the client automaton. Exactly when the transition to this new state must be taken is another issue.

>> The number of retries and maybe failures should be counted, maybe with
>> some adjustable maximum, as suggested.
>
> If we fix the maximum number of attempts, the maximum number of failures for
> one script execution will be bounded above (number_of_transactions_in_script
> * maximum_number_of_attempts). Do you think we should make an option in the
> program to limit this number much more?

Probably not. I think that there should be a configurable maximum of retries on a transaction, which may be 0 by default if we want to be upward compatible with the current behavior, or maybe something else.

>> In doLog, added columns should be at the end of the format.
>
> I have inserted them earlier because these columns are not optional. Do you
> think they should be optional?

I think that new non-optional columns should be at the end of the existing non-optional columns so that existing scripts which may process the output may not need to be updated.

>> I'm not sure that there should be a new option to report failures,
>> the information when relevant should be integrated in a clean format
>> into the existing reports... Maybe the "per command latency"
>> report/option should be renamed if it becomes more general.
>
> I have tried not to change other parts of the program as much as possible.
> But if you think that it will be more useful to change the option I'll do it.

I think that the option should change if its naming becomes less relevant, which is to be determined. AFAICS, ISTM that new measures should be added to the various existing reports unconditionally (i.e. without a new option), so maybe no new option would be needed.

--
Fabien.
> The current error handling is either "close connection" or maybe in
> some cases even "exit". If this is changed, then the client may
> continue execution in some unforeseen state and behave unexpectedly.
> We'll see.

Thanks, now I understand this.

>>> ISTM that the retry implementation should be implemented somehow in
>>> the automaton, restarting the same script from the beginning.
>>
>> If there are several transactions in this script - don't you think
>> that we should restart only the failed transaction?..
>
> On some transaction failures based on their status. My point is that
> the retry process must be implemented clearly with a new state in the
> client automaton. Exactly when the transition to this new state must
> be taken is another issue.

I agree with you that it should be done in this way.

>>> The number of retries and maybe failures should be counted, maybe
>>> with some adjustable maximum, as suggested.
>>
>> If we fix the maximum number of attempts, the maximum number of
>> failures for one script execution will be bounded above
>> (number_of_transactions_in_script * maximum_number_of_attempts). Do
>> you think we should make an option in the program to limit this number
>> much more?
>
> Probably not. I think that there should be a configurable maximum of
> retries on a transaction, which may be 0 by default if we want to be
> upward compatible with the current behavior, or maybe something else.

I propose the option --max-attempts-number=NUM where NUM cannot be less than 1. I propose it because I think that, for example, --max-attempts-number=100 is better than --max-retries-number=99. And maybe it's better to set its default value to 1 too because retrying of shell commands can produce new errors..

>>> In doLog, added columns should be at the end of the format.
>>
>> I have inserted them earlier because these columns are not optional. Do
>> you think they should be optional?
>
> I think that new non-optional columns should be at the end of the
> existing non-optional columns so that existing scripts which may
> process the output may not need to be updated.

Thanks, I agree with you :)

>>> I'm not sure that there should be a new option to report failures,
>>> the information when relevant should be integrated in a clean format
>>> into the existing reports... Maybe the "per command latency"
>>> report/option should be renamed if it becomes more general.
>>
>> I have tried not to change other parts of the program as much as possible.
>> But if you think that it will be more useful to change the option I'll
>> do it.
>
> I think that the option should change if its naming becomes less
> relevant, which is to be determined. AFAICS, ISTM that new measures
> should be added to the various existing reports unconditionally (i.e.
> without a new option), so maybe no new option would be needed.

Thanks! I didn't think about it in this way..

--
Marina Polyakova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
>>>> The number of retries and maybe failures should be counted, maybe with
>>>> some adjustable maximum, as suggested.
>>>
>>> If we fix the maximum number of attempts, the maximum number of failures
>>> for one script execution will be bounded above
>>> (number_of_transactions_in_script * maximum_number_of_attempts). Do you
>>> think we should make an option in the program to limit this number much more?
>>
>> Probably not. I think that there should be a configurable maximum of
>> retries on a transaction, which may be 0 by default if we want to be
>> upward compatible with the current behavior, or maybe something else.
>
> I propose the option --max-attempts-number=NUM where NUM cannot be less than
> 1. I propose it because I think that, for example, --max-attempts-number=100
> is better than --max-retries-number=99. And maybe it's better to set its
> default value to 1 too because retrying of shell commands can produce new
> errors..

Personally, I like counting retries because it also counts the number of times the transaction actually failed for some reason. But this is a marginal preference, and one can be switched to the other easily.

--
Fabien.
On Thu, Jun 15, 2017 at 10:16 PM, Andres Freund <andres@anarazel.de> wrote:
> On 2017-06-14 11:48:25 +0300, Marina Polyakova wrote:
>> Advanced options:
>> - mostly for testing built-in scripts: you can set the default transaction
>> isolation level by the appropriate benchmarking option (-I);
>
> I'm less convinced of the need of that, you can already set arbitrary
> connection options with
> PGOPTIONS='-c default_transaction_isolation=serializable' pgbench
Right, there is already a way to specify the default isolation level using environment variables.
However, once we make pgbench work with various isolation levels, users may want to run pgbench multiple times in a row with different isolation levels. A command line option would be very convenient in this case.
In addition, the isolation level is a vital parameter for interpreting benchmark results correctly. Often, graphs with pgbench results are titled with the pgbench command line. Having the isolation level specified on the command line would naturally fit into this titling scheme.
Of course, this is solely a usability question and it's fair enough to live without such a command line option. But I'm +1 for adding this option.
The Russian Postgres Company
Hello everyone!

There's the second version of my patch for pgbench. Now transactions with serialization and deadlock failures are rolled back and retried until they end successfully or their number of attempts reaches the maximum. In details:

- You can set the maximum number of attempts by the appropriate benchmarking option (--max-attempts-number). Its default value is 1 partly because retrying of shell commands can produce new errors.

- Statistics of attempts and failures are printed in progress, in transaction / aggregation logs and in the end with other results (all and for each script). The transaction failure is reported here only if the last retry of this transaction fails.

- Also failures and average numbers of transaction attempts are printed per-command with average latencies if you use the appropriate benchmarking option (--report-per-command, -r) (it replaces the option --report-latencies as I was advised here [1]). Average numbers of transaction attempts are printed only for commands which start transactions.

As usual: TAP tests for new functionality and changed documentation with new examples.

Patch is attached. Any suggestions are welcome!

[1] https://www.postgresql.org/message-id/alpine.DEB.2.20.1707031321370.3419%40lancre

--
Marina Polyakova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Hello Marina,

> There's the second version of my patch for pgbench. Now transactions
> with serialization and deadlock failures are rolled back and retried
> until they end successfully or their number of attempts reaches the maximum.
> In details:
> - You can set the maximum number of attempts by the appropriate
> benchmarking option (--max-attempts-number). Its default value is 1
> partly because retrying of shell commands can produce new errors.
>
> - Statistics of attempts and failures are printed in progress, in
> transaction / aggregation logs and in the end with other results (all
> and for each script). The transaction failure is reported here only if
> the last retry of this transaction fails.
>
> - Also failures and average numbers of transaction attempts are printed
> per-command with average latencies if you use the appropriate
> benchmarking option (--report-per-command, -r) (it replaces the option
> --report-latencies as I was advised here [1]). Average numbers of
> transaction attempts are printed only for commands which start
> transactions.
>
> As usual: TAP tests for new functionality and changed documentation with
> new examples.

Here are a round of comments on the current version of the patch:

* About the feature

There is a latent issue about what is a transaction. For pgbench a transaction is a full script execution. For postgresql, it is a statement or a BEGIN/END block, several of which may appear in a script. From a retry perspective, you may retry from a SAVEPOINT within a BEGIN/END block... I'm not sure how to make general sense of all this, so this is just a comment without attached action for now.

As the default is not to retry, which is the upward compatible behavior, I think that the changes should not change much the current output bar counting the number of failures.

I would consider using "try/tries" instead of "attempt/attempts" as it is shorter. An English native speaker opinion would be welcome on that point.

* About the code

ISTM that the code interacts significantly with various patches under review or ready for committers. Not sure how to deal with that, there will be some rebasing work...

I'm fine with renaming "is_latencies" to "report_per_command", which is more logical & generic.

"max_attempt_number": I'm against typing fields again in their name, aka "hungarian naming". I'd suggest "max_tries" or "max_attempts".

"SimpleStats attempts": I disagree with using this floating point oriented structure to count integers. I would suggest "int64 tries" instead, which should be enough for the purpose.

LastBeginState -> RetryState? I'm not sure why this state is a pointer in CState. Putting the struct would avoid malloc/free cycles. Index "-1" may be used to tell it is not set if necessary.

"CSTATE_RETRY_FAILED_TRANSACTION" -> "CSTATE_RETRY" is simpler and clear enough.

In CState and some code, a failure is a failure, maybe one boolean would be enough. It need only be differentiated when counting, and you have (deadlock_failure || serialization_failure) everywhere.

Some variables, such as "int attempt_number", should be in the client structure, not in the client? Generally, try to use block variables if possible to keep the state clearly disjoints. If there could be NO new variable at the doCustom level that would be great, because that would ensure that there is no machine state mixup hidden in these variables.

I am wondering whether the RETRY & FAILURE states could/should be merged:

  on RETRY:
    -> count retry
    -> actually retry if < max_tries (reset client state, jump to command)
    -> else count failure and skip to end of script

The start and end of transaction detection seem expensive (malloc, ...) and assume one statement per command (what about "BEGIN \; ... \; COMMIT;"?), which is not necessarily the case; this limitation should be documented. ISTM that the space normalization should be avoided, and something simpler/lighter should be devised? Possibly it should consider handling SAVEPOINT.

I disagree about exit in ParseScript if the transaction block is not completed, especially as it misses out on combined statements/queries ("BEGIN \; stuff... \; COMMIT") and would break an existing feature.

There are strange character issues in comments, eg "??ontinuous".

Option "max-attempt-number" -> "max-tries"

I would put the client random state initialization with the state initialization, not with the connection.

* About tracing

Progress is expected to be short, not detailed. Only add the number of failures and retries if max retry is not 1.

* About reporting

I think that too much is reported. I advised to do that, but nevertheless it is a little bit steep. At least, it should not report the number of tries/attempts when the max number is one. Simple counting should be reported for failures, not floats...

I would suggest a more compact one-line report about failures:

  "number of failures: 12 (0.001%, deadlock: 7, serialization: 5)"

* About the TAP tests

They are too expensive, with 3 initdb. I think that they should be integrated in the existing tests, as a patch has been submitted to rework the whole pgbench tap test infrastructure.

For now, at most one initdb and several small tests inside.

* About the documentation

I'm not sure that the feature needs pre-eminence in the documentation, because most of the time there is no retry as none is needed, there is no failure, so this is rather a special (although useful) case for people playing with serializable and other advanced features.

Smaller updates, without dedicated examples, should be enough.

If a transaction is skipped, there were no tries, so the corresponding number of attempts is 0, not one.

--
Fabien.
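The merged RETRY state suggested above could look roughly like the following doCustom()-style branch. All names (state constants, fields, helper) are hypothetical and only illustrate the proposed control flow, not the patch itself.

#include <stdbool.h>
#include <stdint.h>

typedef enum
{
	CSTATE_START_COMMAND,
	CSTATE_RETRY,
	CSTATE_END_TX
} AutomatonState;

typedef struct Client				/* stand-in for the relevant CState fields */
{
	AutomatonState state;
	int			command;			/* current command index */
	int			tx_begin_command;	/* command that began the transaction */
	int			retries;			/* retries of the current transaction */
	int64_t		failed;				/* transactions that failed for good */
} Client;

/*
 * One branch of a doCustom()-like switch: a failed transaction either goes
 * around for another try or is counted as a failure and skipped.
 */
static void
on_retry_state(Client *st, int max_tries, bool timer_exceeded)
{
	st->retries++;

	if (st->retries < max_tries && !timer_exceeded)
	{
		/* reset the client state (random seed, variables, ...) and jump
		 * back to the command that began the transaction */
		st->command = st->tx_begin_command;
		st->state = CSTATE_START_COMMAND;
	}
	else
	{
		/* give up: count a failure and skip to the end of the script */
		st->failed++;
		st->state = CSTATE_END_TX;
	}
}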
> LastBeginState -> RetryState? I'm not sure why this state is a pointer in > CState. Putting the struct would avoid malloc/free cycles. Index "-1" may be > used to tell it is not set if necessary. Another detail I forgot about this point: there may be a memory leak on variables copies, ISTM that the "variables" array is never freed. I was not convinced by the overall memory management around variables to begin with, and it is even less so with their new copy management. Maybe having a clean "Variables" data structure could help improve the situation. -- Fabien.
> Here are a round of comments on the current version of the patch:

Thank you very much again!

> There is a latent issue about what is a transaction. For pgbench a
> transaction is a full script execution. For postgresql, it is a statement
> or a BEGIN/END block, several of which may appear in a script. From a
> retry perspective, you may retry from a SAVEPOINT within a BEGIN/END
> block... I'm not sure how to make general sense of all this, so this is
> just a comment without attached action for now.

Yes it is. That's why I wrote several notes about it in the documentation where there may be a misunderstanding:

+ Transactions with serialization or deadlock failures (or with both
+ of them if used script contains several transactions; see
+ <xref linkend="transactions-and-scripts"
+ endterm="transactions-and-scripts-title"> for more information) are
+ marked separately and their time is not reported as for skipped
+ transactions.

+ <refsect2 id="transactions-and-scripts">
+ <title id="transactions-and-scripts-title">What is the <quote>Transaction</> Actually Performed in <application>pgbench</application>?</title>

+ If a transaction has serialization and/or deadlock failures, its
+ <replaceable>time</> will be reported as <literal>serialization failure</>,
+ <literal>deadlock failure</>, or
+ <literal>serialization and deadlock failures</>, respectively.
 </para>
+ <note>
+ <para>
+ Transactions can have both serialization and deadlock failures if the
+ used script contained several transactions. See
+ <xref linkend="transactions-and-scripts"
+ endterm="transactions-and-scripts-title"> for more information.
+ </para>
+ </note>
+ <note>
+ <para>
+ The number of transactions attempts within the interval can be greater than
+ the number of transactions within this interval multiplied by the maximum
+ attempts number. See <xref linkend="transactions-and-scripts"
+ endterm="transactions-and-scripts-title"> for more information.
+ </para>
+ </note>
+ <note>
+ <para>The total sum of per-command failures of each type can be greater
+ than the number of transactions with reported failures.
+ See <xref linkend="transactions-and-scripts"
+ endterm="transactions-and-scripts-title"> for more information.
+ </para>
+ </note>

And I didn't make rollbacks to savepoints after the failure because they cannot help with serialization failures at all: after a rollback to a savepoint a new attempt will always be unsuccessful.

> I would consider using "try/tries" instead of "attempt/attempts" as it
> is shorter. An English native speaker opinion would be welcome on that
> point.

Thank you, I'll change it.

> I'm fine with renaming "is_latencies" to "report_per_command", which
> is more logical & generic.

Glad to hear it!

> "max_attempt_number": I'm against typing fields again in their name,
> aka "hungarian naming". I'd suggest "max_tries" or "max_attempts".

Ok!

> "SimpleStats attempts": I disagree with using this floating point
> oriented structure to count integers. I would suggest "int64 tries"
> instead, which should be enough for the purpose.

I'm not sure that it is enough. Firstly there may be several transactions in the script, so to count the average number of attempts you should know the total number of run transactions. Secondly I think that the stddev for the number of attempts can be quite interesting and often it is not close to zero.

> LastBeginState -> RetryState? I'm not sure why this state is a pointer
> in CState. Putting the struct would avoid malloc/free cycles. Index
> "-1" may be used to tell it is not set if necessary.

Thanks, I agree that it's better to do it this way.

> "CSTATE_RETRY_FAILED_TRANSACTION" -> "CSTATE_RETRY" is simpler and
> clear enough.

Ok!

> In CState and some code, a failure is a failure, maybe one boolean
> would be enough. It need only be differentiated when counting, and you
> have (deadlock_failure || serialization_failure) everywhere.

I agree with you. I'll change it.

> Some variables, such as "int attempt_number", should be in the client
> structure, not in the client? Generally, try to use block variables if
> possible to keep the state clearly disjoints. If there could be NO new
> variable at the doCustom level that would be great, because that would
> ensure that there is no machine state mixup hidden in these variables.

Do you mean the code cleanup for the doCustom function? Because if I do so there will be two code styles for state blocks and their variables in this function..

> I am wondering whether the RETRY & FAILURE states could/should be merged:
>
>   on RETRY:
>     -> count retry
>     -> actually retry if < max_tries (reset client state, jump to command)
>     -> else count failure and skip to end of script
>
> The start and end of transaction detection seem expensive (malloc,
> ...) and assume one statement per command (what about "BEGIN \; ...
> \; COMMIT;"?), which is not necessarily the case; this limitation should
> be documented. ISTM that the space normalization should be avoided,
> and something simpler/lighter should be devised? Possibly it should
> consider handling SAVEPOINT.

I divided these states because if there's a failed transaction block you should end it before retrying. It means to go to states CSTATE_START_COMMAND -> CSTATE_WAIT_RESULT -> CSTATE_END_COMMAND with the appropriate command. How do you propose not to go to these states?

About malloc - I agree with you that it should be done without malloc/free.

About savepoints - as I wrote you earlier, I didn't make rollbacks to savepoints after the failure, because they cannot help with serialization failures at all: after a rollback to a savepoint a new attempt will always be unsuccessful.

> I disagree about exit in ParseScript if the transaction block is not
> completed, especially as it misses out on combined statements/queries
> ("BEGIN \; stuff... \; COMMIT") and would break an existing feature.

Thanks, I'll fix it for usual transaction blocks that don't end in the scripts.

> There are strange character issues in comments, eg "??ontinuous".

Oh, I'm sorry. I'll fix it too.

> Option "max-attempt-number" -> "max-tries"
>
> I would put the client random state initialization with the state
> initialization, not with the connection.
>
> * About tracing
>
> Progress is expected to be short, not detailed. Only add the number of
> failures and retries if max retry is not 1.

Ok!

> * About reporting
>
> I think that too much is reported. I advised to do that, but
> nevertheless it is a little bit steep.
>
> At least, it should not report the number of tries/attempts when the
> max number is one.

Ok!

> Simple counting should be reported for failures, not floats...
>
> I would suggest a more compact one-line report about failures:
>
>   "number of failures: 12 (0.001%, deadlock: 7, serialization: 5)"

I think there may be a misunderstanding, because a script can contain several transactions and get both failures.

> * About the TAP tests
>
> They are too expensive, with 3 initdb. I think that they should be
> integrated in the existing tests, as a patch has been submitted to
> rework the whole pgbench tap test infrastructure.
>
> For now, at most one initdb and several small tests inside.

Ok!

> * About the documentation
>
> I'm not sure that the feature needs pre-eminence in the documentation,
> because most of the time there is no retry as none is needed, there is
> no failure, so this is rather a special (although useful) case for
> people playing with serializable and other advanced features.
>
> Smaller updates, without dedicated examples, should be enough.

Maybe there should be some examples to prepare people for what they can see in the output of the program? Of course, now failures are special cases because they disconnect their clients until the end of the program and ruin all the results. I hope that if this patch is committed there will be many more cases with retried failures.

> If a transaction is skipped, there were no tries, so the corresponding
> number of attempts is 0, not one.

Oh, I'm sorry, it is a typo in the documentation.

--
Marina Polyakova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
> Another detail I forgot about this point: there may be a memory leak > on variables copies, ISTM that the "variables" array is never freed. > > I was not convinced by the overall memory management around variables > to begin with, and it is even less so with their new copy management. > Maybe having a clean "Variables" data structure could help improve the > situation. Ok! -- Marina Polyakova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Hello,

> [...] I didn't make rollbacks to savepoints after the failure because
> they cannot help with serialization failures at all: after a rollback to
> a savepoint a new attempt will always be unsuccessful.

Not necessarily? It depends on where the locks triggering the issue are set, if they are all set after the savepoint it could work on a second attempt.

>> "SimpleStats attempts": I disagree with using this floating point
>> oriented structure to count integers. I would suggest "int64 tries"
>> instead, which should be enough for the purpose.
>
> I'm not sure that it is enough. Firstly there may be several transactions in
> the script, so to count the average number of attempts you should know the
> total number of run transactions. Secondly I think that the stddev for the
> number of attempts can be quite interesting and often it is not close to zero.

I would prefer to have a real motivation to add this complexity in the report and in the code. Without that, a simple int seems better for now. It can be improved later if the need really arises.

>> Some variables, such as "int attempt_number", should be in the client
>> structure, not in the client? Generally, try to use block variables if
>> possible to keep the state clearly disjoints. If there could be NO new
>> variable at the doCustom level that would be great, because that would
>> ensure that there is no machine state mixup hidden in these variables.
>
> Do you mean the code cleanup for the doCustom function? Because if I do so
> there will be two code styles for state blocks and their variables in this
> function..

I think that any variable shared between states is a recipe for bugs if it is not reset properly, so they should be avoided. Maybe there are already too many of them, then too bad, not a reason to add more. The status before the automaton was a nightmare.

>> I am wondering whether the RETRY & FAILURE states could/should be merged:
>
> I divided these states because if there's a failed transaction block you
> should end it before retrying.

Hmmm. Maybe I'm wrong. I'll think about it.

>> I would suggest a more compact one-line report about failures:
>>
>>   "number of failures: 12 (0.001%, deadlock: 7, serialization: 5)"
>
> I think there may be a misunderstanding, because a script can contain several
> transactions and get both failures.

I do not understand. Both failure numbers are on the compact line I suggested.

--
Fabien.
>> I was not convinced by the overall memory management around variables >> to begin with, and it is even less so with their new copy management. >> Maybe having a clean "Variables" data structure could help improve the >> situation. > > Ok! Note that there is something for psql (src/bin/psql/variable.c) which may or may not be shared. It should be checked before recoding eventually the same thing. -- Fabien.
On 13-07-2017 19:32, Fabien COELHO wrote:
> Hello,

Hi!

>> [...] I didn't make rollbacks to savepoints after the failure because
>> they cannot help with serialization failures at all: after a rollback to
>> a savepoint a new attempt will always be unsuccessful.
>
> Not necessarily? It depends on where the locks triggering the issue
> are set, if they are all set after the savepoint it could work on a
> second attempt.

Don't you mean the deadlock failures, where a rollback to a savepoint can really help? And could you please give an example where a rollback to a savepoint can help to end its subtransaction successfully after a serialization failure?

>>> "SimpleStats attempts": I disagree with using this floating point
>>> oriented structure to count integers. I would suggest "int64 tries"
>>> instead, which should be enough for the purpose.
>>
>> I'm not sure that it is enough. Firstly there may be several transactions
>> in the script, so to count the average number of attempts you should know
>> the total number of run transactions. Secondly I think that the stddev for
>> the number of attempts can be quite interesting and often it is not close
>> to zero.
>
> I would prefer to have a real motivation to add this complexity in the
> report and in the code. Without that, a simple int seems better for
> now. It can be improved later if the need really arises.

Ok!

>>> Some variables, such as "int attempt_number", should be in the client
>>> structure, not in the client? Generally, try to use block variables if
>>> possible to keep the state clearly disjoints. If there could be NO new
>>> variable at the doCustom level that would be great, because that would
>>> ensure that there is no machine state mixup hidden in these variables.
>>
>> Do you mean the code cleanup for the doCustom function? Because if I do so
>> there will be two code styles for state blocks and their variables in this
>> function..
>
> I think that any variable shared between states is a recipe for bugs
> if it is not reset properly, so they should be avoided. Maybe there
> are already too many of them, then too bad, not a reason to add more.
> The status before the automaton was a nightmare.

Ok!

>>> I would suggest a more compact one-line report about failures:
>>>
>>>   "number of failures: 12 (0.001%, deadlock: 7, serialization: 5)"
>>
>> I think there may be a misunderstanding, because a script can contain
>> several transactions and get both failures.
>
> I do not understand. Both failure numbers are on the compact line I
> suggested.

I mean that the sum of transactions with serialization failure and transactions with deadlock failure can be greater than the total sum of transactions with failures. But if you think it's ok, I'll change it and write the appropriate note in the documentation.

--
Marina Polyakova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
>>> I was not convinced by the overall memory management around variables >>> to begin with, and it is even less so with their new copy management. >>> Maybe having a clean "Variables" data structure could help improve >>> the >>> situation. > > Note that there is something for psql (src/bin/psql/variable.c) which > may or may not be shared. It should be checked before recoding > eventually the same thing. Thank you very much for pointing this file! As I checked this is another structure: here there's a simple list, while in pgbench we should know if the list is sorted and the number of elements in the list. How do you think, is it a good idea to name a variables structure in pgbench in the same way (VariableSpace) or it should be different not to be confused (Variables, for example)? -- Marina Polyakova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Hello Marina,

>> Not necessarily? It depends on where the locks triggering the issue
>> are set, if they are all set after the savepoint it could work on a
>> second attempt.
>
> Don't you mean the deadlock failures, where a rollback to a savepoint can
> really help?

Yes, I mean deadlock failures can roll back to a savepoint and work on a second attempt.

> And could you please give an example where a rollback to a savepoint can
> help to end its subtransaction successfully after a serialization
> failure?

I do not know whether this is possible with serialization failures. It might be if the stuff before and after the savepoint are somehow unrelated...

> [...] I mean that the sum of transactions with serialization failure and
> transactions with deadlock failure can be greater than the total sum
> of transactions with failures.

Hmmm. Ok.

A "failure" is a transaction (in the sense of pgbench) that could not make it to the end, even after retries. If there is a rollback and then a retry which works, it is not a failure.

Now deadlock or serialization errors, which trigger retries, are worth counting as well, although they are not "failures". So my format proposal was over optimistic, and the number of deadlocks and serializations should better be on a retry count line.

Maybe something like:
  ...
  number of failures: 12 (0.004%)
  number of retries: 64 (deadlocks: 29, serialization: 35)

--
Fabien.
>> Note that there is something for psql (src/bin/psql/variable.c) which >> may or may not be shared. It should be checked before recoding >> eventually the same thing. > > Thank you very much for pointing this file! As I checked this is another > structure: here there's a simple list, while in pgbench we should know > if the list is sorted and the number of elements in the list. How do you > think, is it a good idea to name a variables structure in pgbench in the > same way (VariableSpace) or it should be different not to be confused > (Variables, for example)? Given that the number of variables of a pgbench script is expected to be pretty small, I'm not sure that the sorting stuff is worth the effort. My suggestion is really to look at both implementations and to answer the question "should pgbench share its variable implementation with psql?". If the answer is yes, then the relevant part of the implementation should be moved to fe_utils, and that's it. If the answer is no, then implement something in pgbench directly. -- Fabien.
>>> Not necessarily? It depends on where the locks triggering the issue
>>> are set, if they are all set after the savepoint it could work on a
>>> second attempt.
>>
>> Don't you mean the deadlock failures, where a rollback to a savepoint can
>> really help?
>
> Yes, I mean deadlock failures can roll back to a savepoint and work on
> a second attempt.
>
>> And could you please give an example where a rollback to a savepoint
>> can help to end its subtransaction successfully after a serialization
>> failure?
>
> I do not know whether this is possible with serialization failures.
> It might be if the stuff before and after the savepoint are somehow
> unrelated...

If you mean, for example, the updates of different tables - a rollback to a savepoint doesn't help. And I'm not sure that we should do all the stuff for savepoint rollbacks because:
- as I see it now it only makes sense for the deadlock failures;
- if there's a failure, which savepoint should we roll back to and restart the execution from? Maybe go to the last one, and if it is not successful go to the previous one, etc. Retrying the entire transaction may take less time..

>> [...] I mean that the sum of transactions with serialization failure
>> and transactions with deadlock failure can be greater than the total
>> sum of transactions with failures.
>
> Hmmm. Ok.
>
> A "failure" is a transaction (in the sense of pgbench) that could not
> make it to the end, even after retries. If there is a rollback and then
> a retry which works, it is not a failure.
>
> Now deadlock or serialization errors, which trigger retries, are worth
> counting as well, although they are not "failures". So my format
> proposal was over optimistic, and the number of deadlocks and
> serializations should better be on a retry count line.
>
> Maybe something like:
>   ...
>   number of failures: 12 (0.004%)
>   number of retries: 64 (deadlocks: 29, serialization: 35)

Ok! How do you like the idea of using the same format (the total number of transactions with failures and the number of retries for each failure type) in other places (log, aggregation log, progress) if the values are not "default" (= no failures and no retries)?

--
Marina Polyakova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
> Given that the number of variables of a pgbench script is expected to > be pretty small, I'm not sure that the sorting stuff is worth the > effort. I think it is a good insurance if there're many variables.. > My suggestion is really to look at both implementations and to answer > the question "should pgbench share its variable implementation with > psql?". > > If the answer is yes, then the relevant part of the implementation > should be moved to fe_utils, and that's it. > > If the answer is no, then implement something in pgbench directly. The structure of variables is different, the container structure of the variables is different, so I think that the answer is no. -- Marina Polyakova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
>> If the answer is no, then implement something in pgbench directly. > > The structure of variables is different, the container structure of the > variables is different, so I think that the answer is no. Ok, fine. My point was just to check before proceeding. -- Fabien.
> And I'm not sure that we should do all the stuff for savepoint rollbacks > because: > - as I see it now it only makes sense for the deadlock failures; > - if there's a failure, which savepoint should we roll back to before starting > the execution again? ISTM that this is the point of having a savepoint in the first place, the ability to restart the transaction at that point if something failed? > Maybe go to the last one, and if that is not successful go to the previous > one, etc. Retrying the entire transaction may take less time.. Well, I do not know that. My 0.02 € is that if there was a savepoint then it is naturally the restarting point of a transaction which hits some recoverable error. Well, the short version may be to only do a full transaction retry and to document that for now savepoints are not handled, leaving that for future work if the need arises. >> Maybe something like: >> ... >> number of failures: 12 (0.004%) >> number of retries: 64 (deadlocks: 29, serialization: 35) > > Ok! How do you like the idea of using the same format (the total number of > transactions with failures and the number of retries for each failure type) > in other places (log, aggregation log, progress) if the values are not > "default" (= no failures and no retries)? For progress the output must be short and readable, and probably we do not care about whether retries came from this or that, so I would leave that out. For log and aggregated log possibly that would make more sense, but it must stay easy to parse. -- Fabien.
> Ok, fine. My point was just to check before proceeding. And I'm very grateful for that :) -- Marina Polyakova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
> Well, the short version may be to only do a full transaction retry and > to document that for now savepoints are not handled, and to let that > for future work if need arises. I agree with you. > For progress the output must be short and readable, and probably we do > not care about whether retries came from this or that, so I would let > that out. > > For log and aggregated log possibly that would make more sense, but it > must stay easy to parse. Ok! -- Marina Polyakova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Hello again! Here is the third version of the patch for pgbench thanks to Fabien Coelho comments. As in the previous one, transactions with serialization and deadlock failures are rolled back and retried until they end successfully or their number of tries reaches maximum. Differences from the previous version: * Some code cleanup :) In particular, the Variables structure for managing client variables and only one new TAP tests file (as they were recommended here [1] and here [2]). * There's no error if the last transaction in the script is not completed. But the transactions started in the previous scripts and/or not ending in the current script are not rolled back and retried after the failure. Such a script try is reported as failed because it contains a failure that was not rolled back and retried. * Usually the retries and/or failures are printed if they are not equal to zero. In transaction/aggregation logs the failures are always printed and the retries are printed if max_tries is greater than 1. This is done to keep the general format of the log consistent during the execution of the program. The patch is attached. Any suggestions are welcome! [1] https://www.postgresql.org/message-id/alpine.DEB.2.20.1707121338090.12795%40lancre [2] https://www.postgresql.org/message-id/alpine.DEB.2.20.1707121142300.12795%40lancre -- Marina Polyakova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Attachment
Hi, On 2017-07-21 19:32:02 +0300, Marina Polyakova wrote: > Here is the third version of the patch for pgbench thanks to Fabien Coelho > comments. As in the previous one, transactions with serialization and > deadlock failures are rolled back and retried until they end successfully or > their number of tries reaches maximum. Just had a need for this feature, and took this to a short test drive. So some comments: - it'd be useful to display a retry percentage of all transactions, similar to what's displayed for failed transactions. - it appears that we now unconditionally do not disregard a connection after a serialization / deadlock failure. Good. But that's useful far beyond just deadlocks / serialization errors, and should probably be exposed. - it'd be useful to also conveniently display the number of retried transactions, rather than the total number of retries. Nice feature! - Andres
On Fri, Aug 11, 2017 at 10:50 PM, Andres Freund <andres@anarazel.de> wrote:
> On 2017-07-21 19:32:02 +0300, Marina Polyakova wrote:
> > Here is the third version of the patch for pgbench thanks to Fabien Coelho
> > comments. As in the previous one, transactions with serialization and
> > deadlock failures are rolled back and retried until they end successfully or
> > their number of tries reaches maximum.
> Just had a need for this feature, and took this to a short test
> drive. So some comments:
> - it'd be useful to display a retry percentage of all transactions,
> similar to what's displayed for failed transactions.
> - it appears that we now unconditionally do not disregard a connection
> after a serialization / deadlock failure. Good. But that's useful far
> beyond just deadlocks / serialization errors, and should probably be exposed.
Yes, it would be nice not to disregard a connection after other errors too. However, I'm not sure if we should retry the *same* transaction on errors beyond deadlocks / serialization errors. For example, in case of a division-by-zero or unique-violation error it would be more natural to give up on the current transaction and continue with the next one.
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
> Hi, Hello! > Just had a need for this feature, and took this to a short test > drive. So some comments: > - it'd be useful to display a retry percentage of all transactions, > similar to what's displayed for failed transactions. > - it'd be useful to also conveniently display the number of retried > transactions, rather than the total number of retries. Ok! > - it appears that we now unconditionally do not disregard a connection > after a serialization / deadlock failure. Good. But that's useful far > beyond just deadlocks / serialization errors, and should probably be > exposed. I agree that it will be useful. But how do you propose to print the results if there are many types of errors? I'm afraid that the progress report can be very long although it is expected that it will be rather short [1]. The per statement report can also be very long.. > Nice feature! Thanks and thank you for your comments :) [1] https://www.postgresql.org/message-id/alpine.DEB.2.20.1707121142300.12795%40lancre -- Marina Polyakova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Hello, > Here is the third version of the patch for pgbench thanks to Fabien Coelho > comments. As in the previous one, transactions with serialization and > deadlock failures are rolled back and retried until they end successfully or > their number of tries reaches maximum. Here is some partial review. Patch applies cleanly. It compiles with warnings, please fix them: pgbench.c:2624:28: warning: ‘failure_status’ may be used uninitialized in this function pgbench.c:2697:34: warning: ‘command’ may be used uninitialized in this function I do not think that the error handling feature needs preeminence in the final report, compare to scale, number of clients and so. The number of tries should be put further on. I would spell "number of tries" instead of "tries number" which seems to suggest that each try is attributed a number. "sql" -> "SQL". For the per statement latency final report, I do not think it is worth distinguishing the kind of retry at this level, because ISTM that serialization & deadlocks are unlikely to appear simultaneously. I would just report total failures and total tries on this report. We only have 2 errors now, but if more are added I'm pretty sure that we would not want to have more columns... Moreover the 25 characters alignment is ugly, better use a much smaller alignment. I'm okay with having details shown in the "log to file" group report. The documentation does not seem consistent. It discusses "the very last fields" and seem to suggest that there are two, but the example trace below just adds one field. If you want a paragraph you should add <para>, skipping a line does not work (around "All values are computed for ..."). I do not understand the second note of the --max-tries documentation. It seems to suggest that some script may not end their own transaction... which should be an error in my opinion? Some explanations would be welcome. I'm not sure that "Retries" deserves a type of its own for two counters. The "retries" in RetriesState may be redundant with these. The failures are counted on simple counters while retries have a type, this is not consistent. I suggest to just use simple counters everywhere. I'm ok with having the detail report tell about failures & retries only when some occured. typo: sucessufully -> successfully If a native English speaker could provide an opinion on that, and more generally review the whole documentation, it would be great. I think that the rand functions should really take a random_state pointer argument, not a Thread or Client. I'm at odds that FailureStatus does not have a clean NO_FAILURE state, and that it is merged with misc failures. I'm not sure that initRetries, mergeRetries, getAllRetries really deserve a function. I do not thing that there should be two accum Functions. Just extend the existing one, and adding zero to zero is not a problem. I guess that in the end pgbench & psql variables will have to be merged if pgbench expression engine is to be used by psql as well, but this is not linked to this patch. The tap tests seems over-complicated and heavy with two pgbench run in parallel... I'm not sure we really want all that complexity for this somehow small feature. Moreover pgbench can run several scripts, I'm not sure why two pgbench would need to be invoked. Could something much simpler and lighter be proposed instead to test the feature? The added code does not conform to Pg C style. For instance, if brace should be aligned to the if. Please conform the project style. 
The is_transaction_block_end seems simplistic. ISTM that it would not work with compound commands. It should be clearly documented somewhere. Also find attached two scripts I used for some testing: psql < dl_init.sql pgbench -f dl_trans.sql -c 8 -T 10 -P 1 -- Fabien. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Attachment
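As an aside, here is a minimal sketch of the "simple counters everywhere" and single-accumulator suggestions from the review above; the type and function names are hypothetical and do not come from the patch.

    #include <stdbool.h>
    #include <stdint.h>

    /* Plain counters, as suggested, instead of a dedicated "Retries" type. */
    typedef struct
    {
        int64_t cnt;        /* transactions that completed successfully */
        int64_t failures;   /* transactions that could not be completed */
        int64_t retries;    /* total retries, serialization + deadlock together */
    } ScriptStats;

    /*
     * One accumulator is enough: callers that had no failure or retry simply
     * pass zero, and adding zero to zero costs nothing.
     */
    static void
    accumScriptStats(ScriptStats *stats, bool failed, int64_t retries)
    {
        if (failed)
            stats->failures++;
        else
            stats->cnt++;
        stats->retries += retries;
    }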
> Hello, Hi! I'm very sorry that I did not answer for so long, I was very busy in the release of Postgres Pro 10 :( >> Here is the third version of the patch for pgbench thanks to Fabien >> Coelho comments. As in the previous one, transactions with >> serialization and deadlock failures are rolled back and retried until >> they end successfully or their number of tries reaches maximum. > > Here is some partial review. Thank you very much for it! > It compiles with warnings, please fix them: > > pgbench.c:2624:28: warning: ‘failure_status’ may be used > uninitialized in this function > pgbench.c:2697:34: warning: ‘command’ may be used uninitialized in > this function Ok! > I do not think that the error handling feature needs preeminence in the > final report, compare to scale, number of clients and so. The number > of tries should be put further on. I added it here only because both this field and field "transaction type" are transaction characteristics. I have some doubts where to add it. On the one hand, the number of clients, the number of transactions per client and the number of transactions actually processed form a good logical block which I don't want to divide. On the other hand, the number of clients and the number of transactions per client are parameters, but the number of transactions actually processed is one of the program results. Where, in your opinion, would it be better to add the maximum number of transaction tries? > I would spell "number of tries" instead of "tries number" which seems > to > suggest that each try is attributed a number. "sql" -> "SQL". Ok! > For the per statement latency final report, I do not think it is worth > distinguishing the kind of retry at this level, because ISTM that > serialization & deadlocks are unlikely to appear simultaneously. I > would just report total failures and total tries on this report. We > only have 2 errors now, but if more are added I'm pretty sure that we > would not want to have more columns... Thanks, I agree with you. > Moreover the 25 characters > alignment is ugly, better use a much smaller alignment. The variables for the numbers of failures and retries are of type int64 since the variable for the total number of transactions has the same type. That's why such a large alignment (as I understand it now, enough 20 characters). Do you prefer floating alignemnts, depending on the maximum number of failures/retries for any command in any script? > I'm okay with having details shown in the "log to file" group report. I think that the output format of retries statistics should be same everywhere, so I would just like to output the total number of retries here. > The documentation does not seem consistent. It discusses "the very last > fields" > and seem to suggest that there are two, but the example trace below > just > adds one field. I'm sorry, I do not understand what you are talking about. I used commands and the files from the end of your message ("psql < dl_init.sql" and "pgbench -f dl_trans.sql -c 8 -T 10 -P 1"), and I got this output from pgbench: starting vacuum...ERROR: relation "pgbench_branches" does not exist (ignoring this error and continuing anyway) ERROR: relation "pgbench_tellers" does not exist (ignoring this error and continuing anyway) ERROR: relation "pgbench_history" does not exist (ignoring this error and continuing anyway) end. 
progress: 1.0 s, 14.0 tps, lat 9.094 ms stddev 5.304 progress: 2.0 s, 25.0 tps, lat 284.934 ms stddev 450.692, 1 failed progress: 3.0 s, 21.0 tps, lat 337.942 ms stddev 473.210, 1 failed progress: 4.0 s, 11.0 tps, lat 459.041 ms stddev 499.908, 2 failed progress: 5.0 s, 28.0 tps, lat 220.219 ms stddev 411.390, 2 failed progress: 6.0 s, 5.0 tps, lat 402.695 ms stddev 492.526, 2 failed progress: 7.0 s, 24.0 tps, lat 343.249 ms stddev 626.181, 2 failed progress: 8.0 s, 14.0 tps, lat 505.396 ms stddev 501.836, 1 failed progress: 9.0 s, 40.0 tps, lat 180.080 ms stddev 381.335, 1 failed progress: 10.0 s, 1.0 tps, lat 0.000 ms stddev 0.000, 1 failed transaction type: dl_trans.sql transaction maximum tries number: 1 scaling factor: 1 query mode: simple number of clients: 8 number of threads: 1 duration: 10 s number of transactions actually processed: 191 number of failures: 14 (7.330 %) latency average = 356.701 ms latency stddev = 564.942 ms tps = 18.735807 (including connections establishing) tps = 18.744898 (excluding connections establishing) As I understand it, in the documentation "the very last fields" refer to the aggregation logging which is not used here. So what's the problem? > If you want a paragraph you should add <para>, skipping a line does not > work (around "All values are computed for ..."). Sorry, thanks =[ > I do not understand the second note of the --max-tries documentation. > It seems to suggest that some script may not end their own > transaction... > which should be an error in my opinion? Some explanations would be > welcome. As you told me here [1], "I disagree about exit in ParseScript if the transaction block is not completed <...> and would break an existing feature.". Maybe it's be better to say this: In pgbench you can use scripts in which the transaction blocks do not end. Be careful in this case because transactions that span over more than one script are not rolled back and will not be retried in case of an error. In such cases, the script in which the error occurred is reported as failed. ? > I'm not sure that "Retries" deserves a type of its own for two > counters. Ok! > The "retries" in RetriesState may be redundant with these. The "retries" in RetriesState have a different goal: they sum up not all the retries during the execution of the current script but the retries for the current transaction. > The failures are counted on simple counters while retries have a type, > this is not consistent. I suggest to just use simple counters > everywhere. Ok! > I'm ok with having the detail report tell about failures & retries only > when some occured. Ok! > typo: sucessufully -> successfully Thanks! =[ > If a native English speaker could provide an opinion on that, and more > generally review the whole documentation, it would be great. I agree with you)) > I think that the rand functions should really take a random_state > pointer > argument, not a Thread or Client. Thanks, I agree. > I'm at odds that FailureStatus does not have a clean NO_FAILURE state, > and that it is merged with misc failures. :) It is funny but for the code it really did not matter) > I'm not sure that initRetries, mergeRetries, getAllRetries really > deserve a function. Ok! > I do not thing that there should be two accum Functions. Just extend > the existing one, and adding zero to zero is not a problem. Ok! > I guess that in the end pgbench & psql variables will have to be merged > if pgbench expression engine is to be used by psql as well, but this is > not linked to this patch. Ok! 
> The tap tests seems over-complicated and heavy with two pgbench run in > parallel... I'm not sure we really want all that complexity for this > somehow small feature. Moreover pgbench can run several scripts, I'm > not > sure why two pgbench would need to be invoked. Could something much > simpler and lighter be proposed instead to test the feature? Firstly, two pgbench need to be invoked because we don't know which of them will get a deadlock failure. Secondly, I tried much simplier tests but all of them failed sometimes although everything was ok: - tests in which pgbench runs 5 clients and 10 transactions per client for a serialization/deadlock failure on any client (sometimes there are no failures when it is expected that they will be) - tests in which pgbench runs 30 clients and 400 transactions per client for a serialization/deadlock failure on any client (sometimes there are no failures when it is expected that they will be) - tests in which the psql session starts concurrently and you use sleep commands to wait pgbench for 10 seconds (sometimes it does not work) Only advisory locks help me not to get such errors in the tests :( > The added code does not conform to Pg C style. For instance, if brace > should be aligned to the if. Please conform the project style. I'm sorry, thanks =[ > The is_transaction_block_end seems simplistic. ISTM that it would not > work with compound commands. It should be clearly documented somewhere. Thanks, I'll fix it. > Also find attached two scripts I used for some testing: > > psql < dl_init.sql > pgbench -f dl_trans.sql -c 8 -T 10 -P 1 [1] https://www.postgresql.org/message-id/alpine.DEB.2.20.1707121142300.12795%40lancre -- Marina Polyakova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
> I suggest a patch where pgbench client sessions are not disconnected because of > serialization or deadlock failures and these failures are mentioned in reports. > In details: > - transaction with one of these failures continue run normally, but its result > is rolled back; > - if there were these failures during script execution this "transaction" is marked > appropriately in logs; > - numbers of "transactions" with these failures are printed in progress, in > aggregation logs and in the end with other results (all and for each script); Hm, I took a look at both threads about the patch and it seems to me that it's now overcomplicated. With the recently committed enhancements of pgbench (\if, \when) it becomes close to impossible to retry a transaction in case of failure. So the initial approach of just rolling back such a transaction looks more attractive. -- Teodor Sigaev E-mail: teodor@sigaev.ru WWW: http://www.sigaev.ru/
> Hm, I took a look on both thread about patch and it seems to me now it's > overcomplicated. With recently committed enhancements of pgbench (\if, > \when) it becomes close to impossible to retry transation in case of > failure. So, initial approach just to rollback such transaction looks > more attractive. Yep. I think that the best approach for now is simply to reset (command zero, random generator) and start over the whole script, without attempting to be more intelligent. The limitations should be clearly documented (one transaction per script), though. That would be a significant enhancement already. -- Fabien.
On 25-03-2018 15:23, Fabien COELHO wrote: >> Hm, I took a look on both thread about patch and it seems to me now >> it's overcomplicated. With recently committed enhancements of pgbench >> (\if, \when) it becomes close to impossible to retry transation in >> case of failure. So, initial approach just to rollback such >> transaction looks more attractive. > > Yep. Many thanks to both of you! I'm working on a patch in this direction.. > I think that the best approach for now is simply to reset (command > zero, random generator) and start over the whole script, without > attempting to be more intelligent. The limitations should be clearly > documented (one transaction per script), though. That would be a > significant enhancement already. I'm not sure that we can always do this, because we can get new errors until we finish the failed transaction block, and we need to destroy the conditional stack.. -- Marina Polyakova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Hello Marina, > Many thanks to both of you! I'm working on a patch in this direction.. > >> I think that the best approach for now is simply to reset (command >> zero, random generator) and start over the whole script, without >> attempting to be more intelligent. The limitations should be clearly >> documented (one transaction per script), though. That would be a >> significant enhancement already. > > I'm not sure that we can always do this, because we can get new errors until > we finish the failed transaction block, and we need to destroy the conditional > stack.. Sure. I'm suggesting, to keep things simple, that on failures the retry would always restart from the beginning of the script by resetting everything, indeed including the conditional stack, the random generator state, the variable values, and so on. This means enforcing, somehow, that one script is one transaction. If the user does not do that, it would be their decision and the result becomes unpredictable on errors (eg some sub-transactions could be executed more than once). Then if more is needed, that could be for another patch. -- Fabien.
On 26-03-2018 18:53, Fabien COELHO wrote: > Hello Marina, Hello! >> Many thanks to both of you! I'm working on a patch in this direction.. >> >>> I think that the best approach for now is simply to reset (command >>> zero, random generator) and start over the whole script, without >>> attempting to be more intelligent. The limitations should be clearly >>> documented (one transaction per script), though. That would be a >>> significant enhancement already. >> >> I'm not sure that we can always do this, because we can get new errors >> until we finish the failed transaction block, and we need destroy the >> conditional stack.. > > Sure. I'm suggesting so as to simplify that on failures the retry > would always restarts from the beginning of the script by resetting > everything, indeed including the conditional stack, the random > generator state, the variable values, and so on. > > This mean enforcing somehow one script is one transaction. > > If the user does not do that, it would be their decision and the > result becomes unpredictable on errors (eg some sub-transactions could > be executed more than once). > > Then if more is needed, that could be for another patch. Here is the fifth version of the patch for pgbench (based on the commit 4b9094eb6e14dfdbed61278ea8e51cc846e43579) where I tried to implement these ideas, thanks to your comments and those of Teodor Sigaev. Since we may need to execute commands to complete a failed transaction block, the script is now always executed completely. If there is a serialization/deadlock failure which can be retried, the script is executed again with the same random state and array of variables as before its first run. Meta commands errors as well as all SQL errors do not cause the aborting of the client. The first failure in the current script execution determines whether the script run will be retried or not, so only such failures (they have a retry) or errors (they are not retried) are reported. I tried to make fixes in accordance with your previous reviews ([1], [2], [3]): > I'm unclear about the added example added in the documentation. There > are 71% errors, but 100% of transactions are reported as processed. If > there were errors, then it is not a success, so the transaction were > not > processed? To me it looks inconsistent. Also, while testing, it seems > that > failed transactions are counted in tps, which I think is not > appropriate: > > > About the feature: > > sh> PGOPTIONS='-c default_transaction_isolation=serializable' \ > ./pgbench -P 1 -T 3 -r -M prepared -j 2 -c 4 > starting vacuum...end. > progress: 1.0 s, 10845.8 tps, lat 0.091 ms stddev 0.491, 10474 failed > # NOT 10845.8 TPS... > progress: 2.0 s, 10534.6 tps, lat 0.094 ms stddev 0.658, 10203 failed > progress: 3.0 s, 10643.4 tps, lat 0.096 ms stddev 0.568, 10290 failed > ... > number of transactions actually processed: 32028 # NO! > number of errors: 30969 (96.694 %) > latency average = 2.833 ms > latency stddev = 1.508 ms > tps = 10666.720870 (including connections establishing) # NO > tps = 10683.034369 (excluding connections establishing) # NO > ... > > For me this is all wrong. I think that the tps report is about > transactions > that succeeded, not mere attempts. I cannot say that a transaction > which aborted > was "actually processed"... as it was not. 
Fixed > The order of reported elements is not logical: > > maximum number of transaction tries: 100 > scaling factor: 10 > query mode: prepared > number of clients: 4 > number of threads: 2 > duration: 3 s > number of transactions actually processed: 967 > number of errors: 152 (15.719 %) > latency average = 9.630 ms > latency stddev = 13.366 ms > number of transactions retried: 623 (64.426 %) > number of retries: 32272 > > I would suggest to group everything about error handling in one block, > eg something like: > > scaling factor: 10 > query mode: prepared > number of clients: 4 > number of threads: 2 > duration: 3 s > number of transactions actually processed: 967 > number of errors: 152 (15.719 %) > number of transactions retried: 623 (64.426 %) > number of retries: 32272 > maximum number of transaction tries: 100 > latency average = 9.630 ms > latency stddev = 13.366 ms Fixed > Also, percent character should be stuck to its number: 15.719% to have > the style more homogeneous (although there seems to be pre-existing > inhomogeneities). > > I would replace "transaction tries/retried" by "tries/retried", > everything > is about transactions in the report anyway. > > Without reading the documentation, the overall report semantics is > unclear, > especially given the absurd tps results I got with the my first > attempt, > as failing transactions are counted as "processed". Fixed > About the code: > > I'm at lost with the 7 states added to the automaton, where I would > have hoped > that only 2 (eg RETRY & FAIL, or even less) would be enough. Fixed > I'm wondering whether the whole feature could be simplified by > considering that one script is one "transaction" (it is from the > report point of view at least), and that any retry is for the full > script only, from its beginning. That would remove the trying to guess > at transactions begin or end, avoid scanning manually for subcommands, > and so on. > - Would it make sense? > - Would it be ok for your use case? Fixed > The proposed version of the code looks unmaintainable to me. There are > 3 levels of nested "switch/case" with state changes at the deepest > level. > I cannot even see it on my screen which is not wide enough. Fixed > There should be a typedef for "random_state", eg something like: > > typedef struct { unsigned short data[3]; } RandomState; > > Please keep "const" declarations, eg "commandFailed". > > I think that choosing script should depend on the thread random state, > not > the client random state, so that a run would generate the same pattern > per > thread, independently of which client finishes first. > > I'm sceptical of the "--debug-fails" options. ISTM that --debug is > already there > and should just be reused. Fixed > I agree that function naming style is a already a mess, but I think > that > new functions you add should use a common style, eg "is_compound" vs > "canRetry". Fixed > Translating error strings to their enum should be put in a function. Removed > I'm not sure this whole thing should be done anyway. The processing of compound commands is removed. > The "node" is started but never stopped. Fixed > For file contents, maybe the << 'EOF' here-document syntax would help > instead > of using concatenated backslashed strings everywhere. I'm sorry, but I could not get it to work with regular expressions :( > I'd start by stating (i.e. documenting) that the features assumes that > one > script is just *one* transaction. 
> > Note that pgbench somehow already assumes that one script is one > transaction when it reports performance anyway. > > If you want 2 transactions, then you have to put them in two scripts, > which looks fine with me. Different transactions are expected to be > independent, otherwise they should be merged into one transaction. Fixed > Under these restrictions, ISTM that a retry is something like: > > case ABORTED: > if (we want to retry) { > // do necessary stats > // reset the initial state (random, vars, current command) > state = START_TX; // loop > } > else { > // count as failed... > state = FINISHED; // or done. > } > break; ... > I'm fine with having END_COMMAND skipping to START_TX if it can be done > easily and cleanly, esp without code duplication. I did not want to add the additional if-expressions possibly to most of the code in CSTATE_START_TX/CSTATE_END_TX/CSTATE_END_COMMAND, so CSTATE_FAILURE is used instead of CSTATE_END_COMMAND in case of failure, and CSTATE_RETRY is called before CSTATE_END_TX if there was a failure during the current script execution. > ISTM that ABORTED & FINISHED are currently exactly the same. That would > put a particular use to aborted. Also, there are many points where the > code may go to "aborted" state, so reusing it could help avoid > duplicating > stuff on each abort decision. To end and rollback the failed transaction block the script is always executed completely, and after the failure the following script command is executed.. [1] https://www.postgresql.org/message-id/alpine.DEB.2.20.1801031720270.20034%40lancre [2] https://www.postgresql.org/message-id/alpine.DEB.2.20.1801121309300.10810%40lancre [3] https://www.postgresql.org/message-id/alpine.DEB.2.20.1801121607310.13422%40lancre -- Marina Polyakova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Attachment
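For readers following the state-machine discussion, here is a rough, hedged sketch of the retry decision described above (replay the whole script from command zero with the saved random seed, or give up and count the run as failed). The structure and names below are simplified stand-ins and are not the code of the attached patch; pgbench's real automaton has many more states.

    #include <stdint.h>
    #include <string.h>

    typedef enum
    {
        CSTATE_START_TX,            /* (re)start the script's transaction */
        CSTATE_END_TX,              /* finish the script run, update stats */
        CSTATE_FINISHED
    } ClientState;

    typedef struct
    {
        ClientState state;
        int         command;            /* index of the current script command */
        uint32_t    retries;            /* retries of the current transaction */
        unsigned short random_state[3]; /* client's random seed */
    } Client;

    /*
     * Called once a failed transaction block has been rolled back by running
     * the script to its end; max_tries is assumed to be at least 1.
     */
    static void
    decideRetry(Client *st, uint32_t max_tries,
                const unsigned short seed_at_tx_start[3])
    {
        if (st->retries + 1 < max_tries)
        {
            /* replay with the same random seed; variables are reset elsewhere */
            st->retries++;
            st->command = 0;
            memcpy(st->random_state, seed_at_tx_start, sizeof(st->random_state));
            st->state = CSTATE_START_TX;
        }
        else
        {
            /* give up: the whole script run is counted as failed */
            st->state = CSTATE_END_TX;
        }
    }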
Conception of the max-retry option seems strange to me: if the number of retries reaches the max-retry option, then we just increment the counter of failed transactions and try again (possibly with different random numbers). At the end we should distinguish the number of error transactions from the number of failed transactions, and to find this difference the documentation suggests rerunning pgbench with debugging on. Maybe I didn't catch the idea, but it seems to me max-tries should be removed. On a transaction serialization or deadlock error pgbench should increment the counter of failed transactions, reset the conditional stack, variables, etc (but not the random generator) and then start a new transaction from the first line of the script. Marina Polyakova wrote: > [...] -- Teodor Sigaev E-mail: teodor@sigaev.ru WWW: http://www.sigaev.ru/
> Conception of the max-retry option seems strange to me: if the number of retries > reaches the max-retry option, then we just increment the counter of failed > transactions and try again (possibly with different random numbers). At the end > we should distinguish the number of error transactions from the number of failed > transactions, and to find this difference the documentation suggests rerunning > pgbench with debugging on. > > Maybe I didn't catch the idea, but it seems to me max-tries should be > removed. On a transaction serialization or deadlock error pgbench should > increment the counter of failed transactions, reset the conditional stack, > variables, etc (but not the random generator) and then start a new transaction > from the first line of the script. ISTM that the idea is that the client application should give up at some point and report an error to the end user, kind of a "timeout" on trying, and that max-retry would implement this logic of giving up: the transaction which was intended, represented by a given initial random generator state, could not be committed after some number of iterations. Maybe the max retry should rather be expressed in time rather than number of attempts, or both approaches could be implemented? But there is a logic of retrying the same (try again what the client wanted) vs retrying something different (another client need is served). -- Fabien.
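To illustrate the "count or time, or both" idea above, here is a small hedged sketch of how such a giving-up policy could be expressed; the parameter names and the zero-means-unlimited convention are assumptions, not part of the submitted patch (only "canRetry" is borrowed from the naming discussion earlier in the thread).

    #include <stdbool.h>
    #include <stdint.h>

    /*
     * A retry is allowed while neither limit is exceeded.  Here a limit of zero
     * means "no limit"; with both limits at zero a failed transaction is never
     * retried.
     */
    static bool
    canRetry(uint32_t tries, uint32_t max_tries,
             int64_t tries_duration_us, int64_t max_tries_time_us)
    {
        if (max_tries > 0 && tries >= max_tries)
            return false;
        if (max_tries_time_us > 0 && tries_duration_us >= max_tries_time_us)
            return false;
        return true;
    }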
On 29-03-2018 22:39, Fabien COELHO wrote: >> Conception of max-retry option seems strange for me. if number of >> retries reaches max-retry option, then we just increment counter of >> failed transaction and try again (possibly, with different random >> numbers). Then the client starts another script, but by chance or by the number of scripts it can be the same. >> At the end we should distinguish number of error transaction and >> failed transaction, to found this difference documentation suggests >> to rerun pgbench with debugging on. If I understood you correctly, this difference is the total number of retries and this is included in all reports. >> May be I didn't catch an idea, but it seems to me max-tries should be >> removed. On transaction searialization or deadlock error pgbench >> should increment counter of failed transaction, resets conditional >> stack, variables, etc but not a random generator and then start new >> transaction for the first line of script. When I sent the first version of the patch there were only rollbacks, and the idea to retry failed transactions was approved (see [1], [2], [3], [4]). And thank you, I fixed the patch to reset the client variables in case of errors too, and not only in case of retries (see attached, it is based on the commit 3da7502cd00ddf8228c9a4a7e4a08725decff99c). > ISTM that there is the idea is that the client application should give > up at some point are report an error to the end user, kind of a > "timeout" on trying, and that max-retry would implement this logic of > giving up: the transaction which was intented, represented by a given > initial random generator state, could not be committed as if after > some iterations. > > Maybe the max retry should rather be expressed in time rather than > number of attempts, or both approach could be implemented? But there > is a logic of retrying the same (try again what the client wanted) vs > retrying something different (another client need is served). I'm afraid that we will have a problem in debugging mode: should we report a failure (which will be retried) or an error (which will not be retried)? Because only after executing the following script commands (to rollback this transaction block) we will know the time that we spent on the execution of the current script.. [1] https://www.postgresql.org/message-id/CACjxUsOfbn72EaH4i_OuzdY-0PUYfg1Y3o8G27tEA8fJOaPQEw%40mail.gmail.com [2] https://www.postgresql.org/message-id/20170615211806.sfkpiy2acoavpovl%40alvherre.pgsql [3] https://www.postgresql.org/message-id/CAEepm%3D3TRTc9Fy%3DfdFThDa4STzPTR6w%3DRGfYEPikEkc-Lcd%2BMw%40mail.gmail.com [4] https://www.postgresql.org/message-id/CACjxUsOQw%3DvYjPWZQ29GmgWU8ZKj336OGiNQX5Z2W-AcV12%2BNw%40mail.gmail.com -- Marina Polyakova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Attachment
Hello, hackers! Here is the seventh version of the patch for error handling and retrying of transactions with serialization/deadlock failures in pgbench (based on the commit a08dc711952081d63577fc182fcf955958f70add). I added the option --max-tries-time which is an implementation of Fabien Coelho's proposal in [1]: the transaction with serialization or deadlock failure can be retried if the total time of all its tries is less than this limit (in ms). This option can be combined with the option --max-tries. But if none of them are used, failed transactions are not retried at all. Also: * Now when the first failure occurs in the transaction it is always reported as a failure since only after the remaining commands of this transaction are executed do we find out whether we can try again or not. Therefore the messages about retrying or ending the failed transaction are added to the "fails" debugging level so you can distinguish failures (which are retried) and errors (which are not retried). * Fix a report on the latency average because the total time includes time for both errors and successful transactions. * Code cleanup (including tests). [1] https://www.postgresql.org/message-id/alpine.DEB.2.20.1803292134380.16472%40lancre > Maybe the max retry should rather be expressed in time rather than > number > of attempts, or both approach could be implemented? -- Marina Polyakova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Attachment
On Wed, 04 Apr 2018 16:07:25 +0300 Marina Polyakova <m.polyakova@postgrespro.ru> wrote: > [...] Hi, I did a little review of your patch. It seems to work as expected, documentation and tests are there. Still, I have a few comments. There are a lot of checks like "if (debug_level >= DEBUG_FAILS)" with a corresponding fprintf(stderr..). I think it's time to do it like in the main code and wrap this in some function like log(level, msg). In the CSTATE_RETRY state used_time is used only in printing but is calculated more often than needed. In my opinion Debuglevel should be renamed to DebugLevel, which looks nicer; there is also DEBUGLEVEl (where the last letter is in lower case), which is very confusing. I have checked the overall functionality of this patch, but haven't checked any special cases yet. -- Ildus Kurbangaliev Postgres Professional: http://www.postgrespro.com Russian Postgres Company
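A minimal sketch of the kind of wrapper suggested here for the repeated "if (debug_level >= DEBUG_FAILS) fprintf(stderr, ...)" pattern; apart from DEBUG_FAILS, which appears in the patch, the enum values and the function name are illustrative assumptions.

    #include <stdarg.h>
    #include <stdio.h>

    typedef enum
    {
        DEBUG_NO,       /* no debugging output */
        DEBUG_FAILS,    /* report failures and retries only */
        DEBUG_ALL       /* report everything */
    } DebugLevel;

    static DebugLevel debug_level = DEBUG_NO;   /* set from the command line */

    /* Print the message only if the requested level is enabled. */
    static void
    pgbench_log(DebugLevel level, const char *fmt, ...)
    {
        va_list ap;

        if (debug_level < level)
            return;

        va_start(ap, fmt);
        vfprintf(stderr, fmt, ap);
        va_end(ap);
    }

A call site would then read, for example, pgbench_log(DEBUG_FAILS, "client %d repeats the failed transaction (try %d)\n", id, tries);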
> Hi, I did a little review of your patch. It seems to work as > expected, documentation and tests are there. Still, I have a few comments. Hello! Thank you very much! I attached the fixed version of the patch (based on the commit 94c1f9ba11d1241a2b3b2be7177604b26b08bc3d). Also, thanks to Fabien Coelho's comments outside of this thread, I removed the option --max-tries-time; the option --latency-limit can now be used to limit the time of transaction tries. > There are a lot of checks like "if (debug_level >= DEBUG_FAILS)" with > a corresponding fprintf(stderr..). I think it's time to do it like in the > main code and wrap this in some function like log(level, msg). I agree, fixed. > In the CSTATE_RETRY state used_time is used only in printing but is calculated > more often than needed. Sorry, fixed. > In my opinion Debuglevel should be renamed to DebugLevel, which looks > nicer; there is also DEBUGLEVEl (where the last letter is in lower case), which > is very confusing. Sorry for these typos =[ Fixed. > I have checked the overall functionality of this patch, but haven't checked > any special cases yet. -- Marina Polyakova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Attachment
Hello Marina, FYI the v8 patch does not apply anymore, mostly because of a recent perl reindentation. I think that I'll have time for a round of review in the first half of July. Providing a rebased patch before then would be nice. -- Fabien.
Fabien COELHO wrote: > I think that I'll have time for a round of review in the first half of July. > Providing a rebased patch before then would be nice. Note that even in the absence of a rebased patch, you can apply to an older checkout if you have some limited window of time for a review. Looking over the diff, I find that this patch tries to do too much and needs to be split up. At a minimum there is a preliminary patch that introduces the error reporting stuff (errstart etc); there are other thread-related changes (for example to the random generation functions) that probably belong in a separate one too. Not sure if there are other smaller patches hidden inside the rest. On elog/errstart: we already have a convention for what ereport() calls look like; I suggest to use that instead of inventing your own. With that, is there a need for elog()? In the backend we have it because $HISTORY but there's no need for that here -- I propose to lose elog() and use only ereport everywhere. Also, I don't see that you need errmsg_internal() at all; let's lose it too. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hello Alvaro, >> I think that I'll have time for a round of review in the first half of July. >> Providing a rebased patch before then would be nice. > Note that even in the absence of a rebased patch, you can apply to an > older checkout if you have some limited window of time for a review. Yes, sure. I'd like to bring this feature to a committable state, so it will have to be rebased at some point anyway. > Looking over the diff, I find that this patch tries to do too much and > needs to be split up. Yep, I agree that it would help the reviewing process. On the other hand I have bad memories about maintaining dependent patches which interfere significantly. Maybe that is not the case with this feature. Thanks for the advice. -- Fabien.
Hello, Fabien COELHO wrote: > > Looking over the diff, I find that this patch tries to do too much and > > needs to be split up. > > Yep, I agree that it would help the reviewing process. On the other hand I > have bad memories about maintaining dependent patches which interfere > significantly. Sure. I suggest not posting these patches separately -- instead, post as a series of commits in a single email, attaching files from "git format-patch". -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hello! Fabien and Alvaro, thank you very much! And sorry for such a late reply (I was a bit busy and the making of ereport took some time..) :-( Below is a rebased version of the patch (commit 9effb63e0dd12b0704cd8e11106fe08ff5c9d685) divided into several smaller patches: v9-0001-Pgbench-errors-use-the-RandomState-structure-for-.patch - a patch for the RandomState structure (this is used to reset a client's random seed during the repeating of transactions after serialization/deadlock failures). v9-0002-Pgbench-errors-use-the-Variables-structure-for-cl.patch - a patch for the Variables structure (this is used to reset client variables during the repeating of transactions after serialization/deadlock failures). v9-0003-Pgbench-errors-use-the-ereport-macro-to-report-de.patch - a patch for the ereport() macro (this is used to report client failures that do not cause an abort, and this depends on the level of debugging). - implementation: if possible, use the local ErrorData structure during the errstart()/errmsg()/errfinish() calls. Otherwise use a static variable protected by a mutex if necessary. To do all of this, the function appendPQExpBufferVA is exported from libpq. v9-0004-Pgbench-errors-and-serialization-deadlock-retries.patch - the main patch for handling client errors and repetition of transactions with serialization/deadlock failures (see the detailed description in the file). Any suggestions are welcome! On 08-05-2018 9:00, Fabien COELHO wrote: > Hello Marina, > > FYI the v8 patch does not apply anymore, mostly because of a recent > perl reindentation. > > I think that I'll have time for a round of review in the first half of > July. Providing a rebased patch before then would be nice. They are attached, but a little delayed due to testing.. On 08-05-2018 13:58, Alvaro Herrera wrote: > Looking over the diff, I find that this patch tries to do too much and > needs to be split up. At a minimum there is a preliminary patch that > introduces the error reporting stuff (errstart etc); there are other > thread-related changes (for example to the random generation functions) > that probably belong in a separate one too. Not sure if there are > other > smaller patches hidden inside the rest. Here is an attempt to do it.. > On elog/errstart: we already have a convention for what ereport() calls > look like; I suggest to use that instead of inventing your own. With > that, is there a need for elog()? In the backend we have it because > $HISTORY but there's no need for that here -- I propose to lose elog() > and use only ereport everywhere. Also, I don't see that you need > errmsg_internal() at all; let's lose it too. I agree, done. But there are some changes to make such a design thread-safe.. -- Marina Polyakova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Attachment
Hello Marina,

> v9-0001-Pgbench-errors-use-the-RandomState-structure-for-.patch
> - a patch for the RandomState structure (this is used to reset a client's
> random seed during the repeating of transactions after serialization/deadlock
> failures).

A few comments about this first patch.

Patch applies cleanly, compiles, global & pgbench "make check" ok.

I'm mostly ok with the changes, which cleanly separate the different use of random between threads (script choice, throttle delay, sampling...) and client (random*() calls). This change is necessary so that a client can restart a transaction deterministically (at the client level at least), which is the ultimate aim of the patch series.

A few remarks:

The RandomState struct is 6 bytes, which will induce some padding when used. This is life and pre-existing. No problem.

ISTM that the struct itself does not need a name, i.e. "typedef struct { ... } RandomState" is enough.

There could be clear comments, say in the TState and CState structs, about what randomness is impacted (i.e. script choices, etc.).

getZipfianRand, computeHarmonicZipfian: The "thread" parameter was justified because it was used for two fields. As the random state is separated, I'd suggest that the other argument should be a zipfcache pointer.

While reading your patch, it occurs to me that a run is not deterministic at the thread level under throttling and sampling, because the random state is solicited differently depending on when a transaction ends. This suggests that maybe each use of the thread's random_state should have its own random state.

In passing, and totally unrelated to this patch:

I've always been a little puzzled about why a quite small 48-bit internal state random generator is used. I understand the need for pg to have a portable & state-controlled thread-safe random generator, but why this particular small one was chosen escapes me. The source code (src/port/erand48.c, copyright in 1993...) looks optimized for 16-bit architectures, which is probably pretty inefficient to run on 64-bit architectures. Maybe this could be updated with something more consistent with today's processors, providing more quality at a lower cost.

--
Fabien.
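For illustration, the separation discussed above might look roughly like the sketch below. The field names and the extra per-thread states are assumptions made for this example, not the actual contents of the patch.

    /*
     * Sketch: one small random state per purpose.  pg_erand48() works on a
     * 48-bit state held in three unsigned shorts, hence the xseed array.
     */
    typedef struct
    {
        unsigned short xseed[3];    /* 48-bit state for pg_erand48() */
    } RandomState;

    typedef struct
    {
        int         id;             /* client id */
        RandomState random_state;   /* used only by the client's random*() calls */
        /* ... variables, connection, per-client stats, etc. ... */
    } CState;

    typedef struct
    {
        int         tid;            /* thread id */
        RandomState choose_rs;      /* random state for script selection */
        RandomState throttle_rs;    /* random state for throttling delays */
        RandomState sample_rs;      /* random state for log sampling */
        /* ... */
    } TState;

With something along these lines, a retried transaction can restore the client's random_state from a copy taken at transaction start, without disturbing the thread-level streams used for script choice, throttling and sampling.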
Hello Marina,

> v9-0002-Pgbench-errors-use-the-Variables-structure-for-cl.patch
> - a patch for the Variables structure (this is used to reset client variables
> during the repeating of transactions after serialization/deadlock failures).

About this second patch:

This extracts the variable holding structure, so that it is easier to reset the variables to their initial state on transaction failures, the management of which is the ultimate aim of this patch series. It is also cleaner this way.

Patch applies cleanly on top of the previous one (there are no real interactions with it). It compiles cleanly. Global & pgbench "make check" are both ok.

The structure typedef does not need a name. "typedef struct { } V...".

I tend to disagree with naming things after their type, eg "array". I'd suggest "vars" instead. "nvariables" could be "nvars" for consistency with that and "vars_sorted", and because "foo.variables->nvariables" starts looking heavy.

I'd suggest putting the "Variables" type declaration just after the "Variable" type declaration in the file.

--
Fabien.
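A rough sketch of the extracted container, using the shorter names suggested above ("vars", "nvars", "vars_sorted"); the exact fields are assumptions for the example, not the patch's code:

    #include <stdbool.h>

    typedef struct
    {
        char       *name;           /* variable's name */
        char       *svalue;         /* its value in string form, if known */
        /* ... a parsed/typed value would also live here ... */
    } Variable;

    typedef struct
    {
        Variable   *vars;           /* array of variable definitions */
        int         nvars;          /* number of variables */
        bool        vars_sorted;    /* are the variables sorted by name? */
    } Variables;

Resetting a client after a serialization/deadlock failure then amounts to restoring a saved copy of its Variables, which is the point of pulling these fields into one struct.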
On 09-06-2018 9:55, Fabien COELHO wrote:
> Hello Marina,

Hello!

>> v9-0001-Pgbench-errors-use-the-RandomState-structure-for-.patch
>> - a patch for the RandomState structure (this is used to reset a
>> client's random seed during the repeating of transactions after
>> serialization/deadlock failures).
>
> A few comments about this first patch.

Thank you very much!

> Patch applies cleanly, compiles, global & pgbench "make check" ok.
>
> I'm mostly ok with the changes, which cleanly separate the different
> use of random between threads (script choice, throttle delay,
> sampling...) and client (random*() calls).

Glad to hear it :)

> This change is necessary so that a client can restart a transaction
> deterministically (at the client level at least), which is the
> ultimate aim of the patch series.
>
> A few remarks:
>
> The RandomState struct is 6 bytes, which will induce some padding when
> used. This is life and pre-existing. No problem.
>
> ISTM that the struct itself does not need a name, i.e. "typedef struct
> { ... } RandomState" is enough.

Ok!

> There could be clear comments, say in the TState and CState structs,
> about what randomness is impacted (i.e. script choices, etc.).

Thank you, I'll add them.

> getZipfianRand, computeHarmonicZipfian: The "thread" parameter was
> justified because it was used for two fields. As the random state is
> separated, I'd suggest that the other argument should be a zipfcache
> pointer.

I agree with you and I will change it.

> While reading your patch, it occurs to me that a run is not
> deterministic at the thread level under throttling and sampling,
> because the random state is solicited differently depending on when a
> transaction ends. This suggests that maybe each use of the thread's
> random_state should have its own random state.

Thank you, I'll fix this.

> In passing, and totally unrelated to this patch:
>
> I've always been a little puzzled about why a quite small 48-bit
> internal state random generator is used. I understand the need for pg
> to have a portable & state-controlled thread-safe random generator,
> but why this particular small one was chosen escapes me. The source
> code (src/port/erand48.c, copyright in 1993...) looks optimized for
> 16-bit architectures, which is probably pretty inefficient to run on
> 64-bit architectures. Maybe this could be updated with something more
> consistent with today's processors, providing more quality at a lower
> cost.

This sounds interesting, thanks! *went to look for a multiplier and a summand that are large enough and are mutually prime..*

--
Marina Polyakova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
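As an aside on the generator question, and purely as an illustration of "something more consistent with today's processors" (this is not part of the patch, nor a concrete proposal): a 64-bit-state generator such as splitmix64 needs one multiplication and a few shifts per output, and the constants below are the commonly published ones.

    #include <stdint.h>

    /* splitmix64: one 64-bit multiply-xor-shift step per output value */
    static uint64_t
    splitmix64_next(uint64_t *state)
    {
        uint64_t z = (*state += UINT64_C(0x9E3779B97F4A7C15));

        z = (z ^ (z >> 30)) * UINT64_C(0xBF58476D1CE4E5B9);
        z = (z ^ (z >> 27)) * UINT64_C(0x94D049BB133111EB);
        return z ^ (z >> 31);
    }

    /* uniform double in [0,1), analogous to what pg_erand48() returns */
    static double
    splitmix64_uniform(uint64_t *state)
    {
        /* keep the top 53 bits so the double gets full precision */
        return (splitmix64_next(state) >> 11) * (1.0 / 9007199254740992.0);
    }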
On 09-06-2018 16:31, Fabien COELHO wrote:
> Hello Marina,

Hello!

>> v9-0002-Pgbench-errors-use-the-Variables-structure-for-cl.patch
>> - a patch for the Variables structure (this is used to reset client
>> variables during the repeating of transactions after
>> serialization/deadlock failures).
>
> About this second patch:
>
> This extracts the variable holding structure, so that it is easier to
> reset the variables to their initial state on transaction failures,
> the management of which is the ultimate aim of this patch series.
>
> It is also cleaner this way.
>
> Patch applies cleanly on top of the previous one (there are no real
> interactions with it). It compiles cleanly. Global & pgbench "make
> check" are both ok.

:-)

> The structure typedef does not need a name. "typedef struct { } V...".

Ok!

> I tend to disagree with naming things after their type, eg "array".
> I'd suggest "vars" instead. "nvariables" could be "nvars" for
> consistency with that and "vars_sorted", and because
> "foo.variables->nvariables" starts looking heavy.
>
> I'd suggest putting the "Variables" type declaration just after the
> "Variable" type declaration in the file.

Thank you, I agree and I'll fix all this.

--
Marina Polyakova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Hello Marina, > v9-0003-Pgbench-errors-use-the-ereport-macro-to-report-de.patch > - a patch for the ereport() macro (this is used to report client failures > that do not cause an aborts and this depends on the level of debugging). ISTM that abort() is called under FATAL. > - implementation: if possible, use the local ErrorData structure during the > errstart()/errmsg()/errfinish() calls. Otherwise use a static variable > protected by a mutex if necessary. To do all of this export the function > appendPQExpBufferVA from libpq. This patch applies cleanly on top of the other ones (there are minimal interactions), compiles cleanly, global & pgbench "make check" are ok. IMO this patch is more controversial than the other ones. It is not really related to the aim of the patch series, which could do without, couldn't it? Moreover, it changes pgbench current behavior, which might be admissible, but should be discussed clearly. I'd suggest that it should be an independent submission, unrelated to the pgbench error management patch. The code adapts/duplicates existing server-side "ereport" stuff and brings it to the frontend, where the logging needs are somehow quite different. I'd prefer to avoid duplication and/or have some code sharing. If it really needs to be duplicated, I'd suggest to put all this stuff in separated files. If we want to do that, I think that it would belong to fe_utils, and where it could/should be used by all front-end programs. I do not understand why names are changed, eg ELEVEL_FATAL instead of FATAL. ISTM that part of the point of the move would be to be homogeneous, which suggests that the same names should be reused. For logging purposes, ISTM that the "elog" macro interface is nicer, closer to the existing "fprintf(stderr", as it would not introduce the additional parentheses hack for "rest". I see no actual value in creating on the fly a dynamic buffer through plenty macros and functions as the end result is just to print the message out to stderr in the end. errfinishImpl: fprintf(stderr, "%s", error->message.data); This looks like overkill. From reading the code, this does not look like an improvement: fprintf(stderr, "invalid socket: %s", PQerrorMessage(st->con)); vs ereport(ELEVEL_LOG, (errmsg("invalid socket: %s", PQerrorMessage(st->con)))); The whole complexity of the server-side interface only make sense because TRY/CATCH stuff and complex logging requirements (eg several outputs) in the backend. The patch adds quite some code and complexity without clear added value that I can see. The semantics of the existing code is changed, the FATAL levels calls abort() and replace existing exit(1) calls. Maybe you want an ERROR level as well. My 0.02€: maybe you just want to turn fprintf(stderr, format, ...); // then possibly exit or abort depending... into elog(level, format, ...); which maybe would exit or abort depending on level, and possibly not actually report under some levels and/or some conditions. For that, it could enough to just provide an nice "elog" function. In conclusion, which you can disagree with because maybe I have missed something... anyway I currently think that: - it should be an independent submission - possibly at "fe_utils" level - possibly just a nice "elog" function is enough, if so just do that. -- Fabien.
On 10-06-2018 10:38, Fabien COELHO wrote: > Hello Marina, Hello! >> v9-0003-Pgbench-errors-use-the-ereport-macro-to-report-de.patch >> - a patch for the ereport() macro (this is used to report client >> failures that do not cause an aborts and this depends on the level of >> debugging). > > ISTM that abort() is called under FATAL. If you mean abortion of the client, this is not an abortion of the main program. >> - implementation: if possible, use the local ErrorData structure >> during the errstart()/errmsg()/errfinish() calls. Otherwise use a >> static variable protected by a mutex if necessary. To do all of this >> export the function appendPQExpBufferVA from libpq. > > This patch applies cleanly on top of the other ones (there are minimal > interactions), compiles cleanly, global & pgbench "make check" are ok. :-) > IMO this patch is more controversial than the other ones. > > It is not really related to the aim of the patch series, which could > do without, couldn't it? > I'd suggest that it should be an independent submission, unrelated to > the pgbench error management patch. I suppose that this is related; because of my patch there may be a lot of such code (see v7 in [1]): - fprintf(stderr, - "malformed variable \"%s\" value: \"%s\"\n", - var->name, var->svalue); + if (debug_level >= DEBUG_FAILS) + { + fprintf(stderr, + "malformed variable \"%s\" value: \"%s\"\n", + var->name, var->svalue); + } - if (debug) + if (debug_level >= DEBUG_ALL) fprintf(stderr, "client %d sending %s\n", st->id, sql); That's why it was suggested to make the error function which hides all these things (see [2]): There is a lot of checks like "if (debug_level >= DEBUG_FAILS)" with corresponding fprintf(stderr..) I think it's time to do it like in the main code, wrap with some function like log(level, msg). > Moreover, it changes pgbench current > behavior, which might be admissible, but should be discussed clearly. > The semantics of the existing code is changed, the FATAL levels calls > abort() and replace existing exit(1) calls. Maybe you want an ERROR > level as well. Oh, thanks, I agree with you. And I do not want to change the program exit code without good reasons, but I'm sorry I may not know all pros and cons in this matter.. Or did you also mean other changes? > The code adapts/duplicates existing server-side "ereport" stuff and > brings it to the frontend, where the logging needs are somehow quite > different. > > I'd prefer to avoid duplication and/or have some code sharing. I was recommended to use the same interface in [3]: On elog/errstart: we already have a convention for what ereport() calls look like; I suggest to use that instead of inventing your own. > If it > really needs to be duplicated, I'd suggest to put all this stuff in > separated files. If we want to do that, I think that it would belong > to fe_utils, and where it could/should be used by all front-end > programs. I'll try to do it.. > I do not understand why names are changed, eg ELEVEL_FATAL instead of > FATAL. ISTM that part of the point of the move would be to be > homogeneous, which suggests that the same names should be reused. Ok! > For logging purposes, ISTM that the "elog" macro interface is nicer, > closer to the existing "fprintf(stderr", as it would not introduce the > additional parentheses hack for "rest". I was also recommended to use ereport() instead of elog() in [3]: With that, is there a need for elog()? 
In the backend we have it because $HISTORY but there's no need for that here -- I propose to lose elog() and use only ereport everywhere. > I see no actual value in creating on the fly a dynamic buffer through > plenty macros and functions as the end result is just to print the > message out to stderr in the end. > > errfinishImpl: fprintf(stderr, "%s", error->message.data); > > This looks like overkill. From reading the code, this does not look > like an improvement: > > fprintf(stderr, "invalid socket: %s", PQerrorMessage(st->con)); > > vs > > ereport(ELEVEL_LOG, (errmsg("invalid socket: %s", > PQerrorMessage(st->con)))); > > The whole complexity of the server-side interface only make sense > because TRY/CATCH stuff and complex logging requirements (eg several > outputs) in the backend. The patch adds quite some code and complexity > without clear added value that I can see. > My 0.02€: maybe you just want to turn > > fprintf(stderr, format, ...); > // then possibly exit or abort depending... > > into > > elog(level, format, ...); > > which maybe would exit or abort depending on level, and possibly not > actually report under some levels and/or some conditions. For that, it > could enough to just provide an nice "elog" function. I agree that elog() can be coded in this way. To use ereport() I need a structure to store the error level as a condition to exit. > In conclusion, which you can disagree with because maybe I have missed > something... anyway I currently think that: > > - it should be an independent submission > > - possibly at "fe_utils" level > > - possibly just a nice "elog" function is enough, if so just do that. I hope I answered all this above.. [1] https://www.postgresql.org/message-id/453fa52de88477df2c4a2d82e09e461c%40postgrespro.ru [2] https://www.postgresql.org/message-id/20180405180807.0bc1114f%40wp.localdomain [3] https://www.postgresql.org/message-id/20180508105832.6o3uf3npfpjgk5m7%40alvherre.pgsql -- Marina Polyakova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Hello Marina, > I suppose that this is related; because of my patch there may be a lot of > such code (see v7 in [1]): > > - fprintf(stderr, > - "malformed variable \"%s\" value: > \"%s\"\n", > - var->name, var->svalue); > + if (debug_level >= DEBUG_FAILS) > + { > + fprintf(stderr, > + "malformed variable \"%s\" > value: \"%s\"\n", > + var->name, var->svalue); > + } > > - if (debug) > + if (debug_level >= DEBUG_ALL) > fprintf(stderr, "client %d sending %s\n", st->id, > sql); I'm not sure that debug messages needs to be kept after debug, if it is about debugging pgbench itself. That is debatable. > That's why it was suggested to make the error function which hides all these > things (see [2]): > >> There is a lot of checks like "if (debug_level >= DEBUG_FAILS)" with >> corresponding fprintf(stderr..) I think it's time to do it like in the >> main code, wrap with some function like log(level, msg). Yep. I did not wrote that, but I agree with an "elog" suggestion to switch if (...) { fprintf(...); exit/abort/continue/... } to a simpler: elog(level, ...) >> Moreover, it changes pgbench current behavior, which might be >> admissible, but should be discussed clearly. > >> The semantics of the existing code is changed, the FATAL levels calls >> abort() and replace existing exit(1) calls. Maybe you want an ERROR >> level as well. > > Oh, thanks, I agree with you. And I do not want to change the program exit > code without good reasons, but I'm sorry I may not know all pros and cons in > this matter.. > > Or did you also mean other changes? AFAICR I meant switching exit to abort in some cases. >> The code adapts/duplicates existing server-side "ereport" stuff and >> brings it to the frontend, where the logging needs are somehow quite >> different. >> >> I'd prefer to avoid duplication and/or have some code sharing. > > I was recommended to use the same interface in [3]: > >>> On elog/errstart: we already have a convention for what ereport() >>> calls look like; I suggest to use that instead of inventing your own. The "elog" interface already exists, it is not an invention. "ereport" is a hack which is somehow necessary in some cases. I prefer a simple function call if possible for the purpose, and ISTM that this is the case. >> If it really needs to be duplicated, I'd suggest to put all this stuff >> in separated files. If we want to do that, I think that it would belong >> to fe_utils, and where it could/should be used by all front-end >> programs. > > I'll try to do it.. Dunno. If you only need one "elog" function which prints a message to stderr and decides whether to abort/exit/whatevrer, maybe it can just be kept in pgbench. If there are are several complicated functions and macros, better with a file. So I'd say it depends. >> For logging purposes, ISTM that the "elog" macro interface is nicer, >> closer to the existing "fprintf(stderr", as it would not introduce the >> additional parentheses hack for "rest". > > I was also recommended to use ereport() instead of elog() in [3]: Probably. Are you hoping that advises from different reviewers should be consistent? That seems optimistic:-) >>> With that, is there a need for elog()? In the backend we have it >>> because $HISTORY but there's no need for that here -- I propose to >>> lose elog() and use only ereport everywhere. See commit 8a07ebb3c172 which turns some ereport into elog... >> My 0.02€: maybe you just want to turn >> >> fprintf(stderr, format, ...); >> // then possibly exit or abort depending... 
>> >> into >> >> elog(level, format, ...); >> >> which maybe would exit or abort depending on level, and possibly not >> actually report under some levels and/or some conditions. For that, it >> could enough to just provide an nice "elog" function. > > I agree that elog() can be coded in this way. To use ereport() I need a > structure to store the error level as a condition to exit. Yep. That is a lot of complication which are justified server side where logging requirements are special, but in this case I see it as overkill. So my current view is that if you only need an "elog" function, it is simpler to add it to "pgbench.c". -- Fabien.
On 2018-Jun-13, Fabien COELHO wrote: > > > > With that, is there a need for elog()? In the backend we have > > > > it because $HISTORY but there's no need for that here -- I > > > > propose to lose elog() and use only ereport everywhere. > > See commit 8a07ebb3c172 which turns some ereport into elog... For context: in the backend, elog() is only used for internal messages (i.e. "can't-happen" conditions), and ereport() is used for user-facing messages. There are many things ereport() has that elog() doesn't, such as additional message fields (HINT, DETAIL, etc) that I think could have some use in pgbench as well. If you use elog() then you can't have that. Another difference is that in the backend, elog() messages are never translated, while ereport() message are translated. Since pgbench is translatable I think it would be best to keep those things in sync, to avoid confusion. (Although of course you could do it differently in pgbench than backend.) One thing that just came to mind is that pgbench uses some src/fe_utils stuff. I hope having ereport() doesn't cause a conflict with that ... BTW I think abort() is not the right thing, as it'll cause core dumps if enabled. Why not just exit(1)? -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 13-06-2018 22:59, Alvaro Herrera wrote:
> For context: in the backend, elog() is only used for internal messages
> (i.e. "can't-happen" conditions), and ereport() is used for user-facing
> messages. There are many things ereport() has that elog() doesn't, such
> as additional message fields (HINT, DETAIL, etc) that I think could have
> some use in pgbench as well. If you use elog() then you can't have that.

AFAIU the pgbench error messages were not originally supposed to have these fields, so would it be good to change the final output to stderr?.. For example:

- fprintf(stderr, "%s", PQerrorMessage(con));
- fprintf(stderr, "(ignoring this error and continuing anyway)\n");
+ ereport(LOG,
+         (errmsg("Ignoring the server error and continuing anyway"),
+          errdetail("%s", PQerrorMessage(con))));

- fprintf(stderr, "%s", PQerrorMessage(con));
- if (sqlState && strcmp(sqlState, ERRCODE_UNDEFINED_TABLE) == 0)
- {
-     fprintf(stderr, "Perhaps you need to do initialization (\"pgbench -i\") in database \"%s\"\n", PQdb(con));
- }
-
- exit(1);
+ ereport(ERROR,
+         (errmsg("Server error"),
+          errdetail("%s", PQerrorMessage(con)),
+          sqlState && strcmp(sqlState, ERRCODE_UNDEFINED_TABLE) == 0 ?
+          errhint("Perhaps you need to do initialization (\"pgbench -i\") in database \"%s\"\n",
+                  PQdb(con)) : 0));

--
Marina Polyakova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On 13-06-2018 22:44, Fabien COELHO wrote: > Hello Marina, > >> I suppose that this is related; because of my patch there may be a lot >> of such code (see v7 in [1]): >> >> - fprintf(stderr, >> - "malformed variable \"%s\" value: \"%s\"\n", >> - var->name, var->svalue); >> + if (debug_level >= DEBUG_FAILS) >> + { >> + fprintf(stderr, >> + "malformed variable \"%s\" value: \"%s\"\n", >> + var->name, var->svalue); >> + } >> >> - if (debug) >> + if (debug_level >= DEBUG_ALL) >> fprintf(stderr, "client %d sending %s\n", st->id, sql); > > I'm not sure that debug messages needs to be kept after debug, if it > is about debugging pgbench itself. That is debatable. AFAICS it is not about debugging pgbench itself, but about more detailed information that can be used to understand what exactly happened during its launch. In the case of errors this helps to distinguish between failures or errors by type (including which limit for retries was violated and how far it was exceeded for the serialization/deadlock errors). >>> The code adapts/duplicates existing server-side "ereport" stuff and >>> brings it to the frontend, where the logging needs are somehow quite >>> different. >>> >>> I'd prefer to avoid duplication and/or have some code sharing. >> >> I was recommended to use the same interface in [3]: >> >>>> On elog/errstart: we already have a convention for what ereport() >>>> calls look like; I suggest to use that instead of inventing your >>>> own. > > The "elog" interface already exists, it is not an invention. "ereport" > is a hack which is somehow necessary in some cases. I prefer a simple > function call if possible for the purpose, and ISTM that this is the > case. > That is a lot of complication which are justified server side > where logging requirements are special, but in this case I see it as > overkill. I think we need ereport() if we want to make detailed error messages (see examples in [1]).. >>> If it really needs to be duplicated, I'd suggest to put all this >>> stuff in separated files. If we want to do that, I think that it >>> would belong to fe_utils, and where it could/should be used by all >>> front-end programs. >> >> I'll try to do it.. > > Dunno. If you only need one "elog" function which prints a message to > stderr and decides whether to abort/exit/whatevrer, maybe it can just > be kept in pgbench. If there are are several complicated functions and > macros, better with a file. So I'd say it depends. > So my current view is that if you only need an "elog" function, it is > simpler to add it to "pgbench.c". Thank you! >>> For logging purposes, ISTM that the "elog" macro interface is nicer, >>> closer to the existing "fprintf(stderr", as it would not introduce >>> the >>> additional parentheses hack for "rest". >> >> I was also recommended to use ereport() instead of elog() in [3]: > > Probably. Are you hoping that advises from different reviewers should > be consistent? That seems optimistic:-) To make the patch committable there should be no objection to it.. [1] https://www.postgresql.org/message-id/c89fcc380a19380260b5ea463efc1416%40postgrespro.ru -- Marina Polyakova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Hello Alvaro,

> For context: in the backend, elog() is only used for internal messages
> (i.e. "can't-happen" conditions), and ereport() is used for user-facing
> messages. There are many things ereport() has that elog() doesn't, such
> as additional message fields (HINT, DETAIL, etc) that I think could have
> some use in pgbench as well. If you use elog() then you can't have that.
> [...]

Ok. Then forget elog, but I'm pretty against having a kind of ereport, which looks like overkill to me, because:

(1) the syntax is pretty heavy, and does not look like a function.

(2) the implementation allocates a string buffer for the message; this is overkill for pgbench, which only needs to print to stderr once. This makes sense server-side because the generated message may be output several times (eg stderr, file logging, to the client), and the implementation has to work with cpp implementations which do not handle varargs (and maybe other reasons).

So I would be in favor of having just a simpler error function. Incidentally, one already exists, "pgbench_error", and could be improved, extended, replaced. There is also "syntax_error".

> One thing that just came to mind is that pgbench uses some src/fe_utils
> stuff. I hope having ereport() doesn't cause a conflict with that ...

Currently ereport does not exist client-side. I do not think that this patch is the right moment to decide to do that. Also, there are some "elog" calls in libpq, but they are compiled out with a "#ifndef FRONTEND".

> BTW I think abort() is not the right thing, as it'll cause core dumps if
> enabled. Why not just exit(1)?

Yes, I agree and already reported that.

Conclusion:

My current opinion is that I'm pretty against bringing "ereport" to the front-end on this specific pgbench patch. I agree with you that "elog" would be misleading there as well, for the arguments you developed above. I'd suggest having just one clean and simple pgbench-internal function to handle errors and possibly exit, debug... Something like

    void pgb_error(FATAL, "error %d raised", 12);

Implemented as

    void pgb_error(int/enum XXX level, const char *format, ...)
    {
        test level and maybe return immediately (eg debug);
        print to stderr;
        exit/abort/return depending;
    }

Then if some advanced error handling is introduced for front-end programs, possibly through some macros, then it would be time to improve upon that.

--
Fabien.
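A minimal sketch of such a function, filling in the pseudocode above; the level names, the global debug flag, and the use of exit(1) rather than abort() are assumptions for the example, not the patch's actual code:

    #include <stdarg.h>
    #include <stdbool.h>
    #include <stdio.h>
    #include <stdlib.h>

    typedef enum
    {
        PGB_DEBUG,                  /* printed only when --debug is given */
        PGB_LOG,                    /* always printed, execution continues */
        PGB_ERROR,                  /* printed; the caller then aborts the client */
        PGB_FATAL                   /* printed, the whole run exits */
    } pgb_error_level;

    static bool debug = false;      /* would be set from the -d/--debug option */

    static void
    pgb_error(pgb_error_level level, const char *fmt, ...)
    {
        va_list     ap;

        /* skip the message entirely when it is debug-only noise */
        if (level == PGB_DEBUG && !debug)
            return;

        va_start(ap, fmt);
        vfprintf(stderr, fmt, ap);
        va_end(ap);

        /* exit(1) rather than abort(), as discussed, to avoid core dumps */
        if (level == PGB_FATAL)
            exit(1);
    }

A call site such as the existing if (debug) fprintf(stderr, "client %d sending %s\n", st->id, sql); would then become a single pgb_error(PGB_DEBUG, ...) call, and the various "if (debug_level >= ...)" guards disappear from the callers.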
Hello Marina, > v9-0004-Pgbench-errors-and-serialization-deadlock-retries.patch > - the main patch for handling client errors and repetition of transactions > with serialization/deadlock failures (see the detailed description in the > file). Here is a review for the last part of your v9 version. Patch does not "git apply" (may anymore): error: patch failed: doc/src/sgml/ref/pgbench.sgml:513 error: doc/src/sgml/ref/pgbench.sgml: patch does not apply However I could get it to apply with the "patch" command. Then patch compiles, global & pgbench "make check" are ok. Feature ======= The patch adds the ability to restart transactions (i.e. the full script) on some errors, which is a good thing as it allows to exercice postgres performance in more realistic scenarii. * -d/--debug: I'm not in favor in requiring a mandatory text argument on this option. It is not pratical, the user has to remember it, and it is a change. I'm sceptical of the overall debug handling changes. Maybe we could have multiple -d which lead to higher debug level, but I'm not sure that it can be made to work for this case and still be compatible with the previous behavior. Maybe you need a specific option for your purpose, eg "--debug-retry"? Code ==== * The implementation is less complex that the previous submission, which is a good thing. I'm not sure that all the remaining complexity is still fully needed. * I'm reserved about the whole ereport thing, see comments in other messages. Leves ELEVEL_LOG_CLIENT_{FAIL,ABORTED} & LOG_MAIN look unclear to me. In particular, the "CLIENT" part is not very useful. If the distinction makes sense, I would have kept "LOG" for the initial one and add other ones for ABORT and PGBENCH, maybe. * There are no comments about "retries" in StatData, CState and Command structures. * Also, for StatData, I would like to understand the logic between cnt, skipped, retries, retried, errors, ... so a clear information about the expected invariant if any would be welcome. One has to go in the code to understand how these fields relate one to the other. * "errors_in_failed_tx" is some subcounter of "errors", for a special case. Why it is there fails me [I finally understood, and I think it should be removed, see end of review]. If we wanted to distinguish, then we should distinguish homogeneously: maybe just count the different error types, eg have things like "deadlock_errors", "serializable_errors", "other_errors", "internal_pgbench_errors" which would be orthogonal one to the other, and "errors" could be recomputed from these. * How "errors" differs from "ecnt" is unclear to me. * FailureStatus states are not homogeneously named. I'd suggest to use *_FAILURE for all cases. The miscellaneous case should probably be the last. I do not understand the distinction between ANOTHER_FAILURE & IN_FAILED_SQL_TRANSACTION. Why should it be needed? [again, see and of review] * I do not understand the comments on CState enum: "First, remember the failure in CSTATE_FAILURE. Then process other commands of the failed transaction if any" Why would other commands be processed at all if the transaction is aborted? For me any error must leads to the rollback and possible retry of the transaction. This comment needs to be clarified. It should also say that on FAILURE, it will go either to RETRY or ABORTED. See below my comments about doCustom. It is unclear to me why their could be several failures within a transaction, as I would have stopped that it would be aborted on the first one. 
* I do not undestand the purpose of first_failure. The comment should explain why it would need to be remembered. From my point of view, I'm not fully convinced that it should. * commandFailed: I think that it should be kept much simpler. In particular, having errors on errors does not help much: on ELEVEL_FATAL, it ignores the actual reported error and generates another error of the same level, so that the initial issue is hidden. Even if these are can't happen cases, hidding the origin if it occurs looks unhelpful. Just print it directly, and maybe abort if you think that it is a can't happen case. * copyRandomState: just use sizeof(RandomState) instead of making assumptions about the contents of the struct. Also, this function looks pretty useless, why not just do a plain assignment? * copyVariables: lacks comments to explain that the destination is cleaned up and so on. The cleanup phase could probaly be in a distinct function, so that the code would be clearer. Maybe the function variable names are too long. if (current_source->svalue) in the context of a guard for a strdup, maybe: if (current_source->svalue != NULL) * executeCondition: this hides client automaton state changes which were clearly visible beforehand in the switch, and the different handling of if & elif is also hidden. I'm against this unnecessary restructuring and to hide such an information, all state changes should be clearly seen in the state switch so that it is easier to understand and follow. I do not see why touching the conditional stack on internal errors (evaluateExpr failure) brings anything, the whole transaction will be aborted anyway. * doCustom changes. On CSTATE_START_COMMAND, it considers whether to retry on the end. For me, this cannot happen: if some command failed, then it should have skipped directly to the RETRY state, so that you cannot get to the end of the script with an error. Maybe you could assert that the state of the previous command is NO_FAILURE, though. On CSTATE_FAILURE, the next command is possibly started. Although there is some consistency with the previous point, I think that it totally breaks the state automaton where now a command can start while the whole transaction is in failing state anyway. There was no point in starting it in the first place. So, for me, the FAILURE state should record/count the failure, then skip to RETRY if a retry is decided, else proceed to ABORT. Nothing else. This is much clearer that way. Then RETRY should reinstate the global state and proceed to start the *first* command again. The current RETRY state does memory allocations to generate a message with buffer allocation and so on. This looks like a costly and useless operation. If the user required "retries", then this is normal behavior, the retries are counted and will be printed out in the final report, and there is no point in printing out every single one of them. Maybe you want that debugging, but then coslty operations should be guarded. It is unclear to me why backslash command errors are turned to FAILURE instead of ABORTED: there is no way they are going to be retried, so maybe they should/could skip directly to ABORTED? Function executeCondition is a bad idea, as stated above. * reporting The number of transactions above the latency limit report can be simplified. Remove the if and just use one printf f with a %s for the optional comment. I'm not sure this optional comment is useful there. Before the patch, ISTM that all lines relied on one printf. 
you have changed to a style where a collection of printf is used to compose a line. I'd suggest to keep to the previous one-printf-prints-one-line style, where possible. You have added 20-columns alignment prints. This looks like too much and generates much too large lines. Probably 10 (billion) would be enough. Some people try to parse the output, so it should be deterministic. I'd add the needed columns always if appropriate (i.e. under retry), even if none occured. * processXactStats: An else is replaced by a detailed stats, with the initial "no detailed stats" comment kept. The function is called both in the thenb & else branch. The structure does not make sense anymore. I'm not sure this changed was needed. * getLatencyUsed: declared "double" so "return 0.0". * typo: ruin -> run; probably others, I did not check for them in detail. TAP Tests ========= On my laptop, tests last 5.5 seconds before the patch, and about 13 seconds after. This is much too large. Pgbench TAP tests do not deserve to take over twice as much time as before just on this patch. One reason which explains this large time is there is a new script with a new created instance. I'd suggest to append tests to the existing 2 scripts, depending on whether they need a running instance or not. Secondly, I think that the design of the tests are too heavy. For such a feature, ISTM enough to check that it works, i.e. one test for deadlocks (trigger one or a few deadlocks), idem for serializable, maybe idem for other errors if any. The challenge is to do that reliably and efficiently, i.e. so that the test does not rely on chance and is still quite efficient. The trick you use is to run an interactive psql in parallel to pgbench so as to play with concurrent locks. That is interesting, but deserves more comments and explanatation, eg before the test functions. Maybe this could be achieved within pgbench by using some wait stuff in PL/pgSQL so that concurrent client can wait one another based on data in unlogged table updated by a CALL within an "embedded" transactions? Not sure. Otherwise, maybe (simple) pgbench-side thread barrier could help, but this would require more thinking. Anyway, TAP tests should be much lighter (in total time), and if possible much simpler. The latency limit to 900 ms try is a bad idea because it takes a lot of time. I did such tests before and they were removed by Tom Lane because of determinism and time issues. I would comment this test out for now. Documentation ============= Not looked at in much details for now. Just a few comments: Having the "most important settings" on line 1-6 and 8 (i.e. skipping 7) looks silly. The important ones should simply be the first ones, and the 8th is not that important, or it is in 7th position. I do not understand why there is so much text about in failed sql transaction stuff, while we are mainly interested in serialization & deadlock errors, and this only falls in some "other" category. There seems to be more details about other errors that about deadlocks & serializable errors. The reporting should focus on what is of interest, either all errors, or some detailed split of these errors. The documentation should state clearly what are the counted errors, and then what are their effects on the reported stats. The "Errors and Serialization/Deadlock Retries" section is a good start in that direction, but it does not talk about pgbench internal errors (eg "cos(true)"). I think it should more explicit about errors. 
Option --max-tries default value should be spelled out in the doc. "Client's run is aborted", do you mean "Pgbench run is aborted"? "If a failed transaction block does not terminate in the current script": this just looks like a very bad idea, and explains my general ranting above about this error condition. ISTM that the only reasonable option is that a pgbench script should be inforced as a transaction, or a set of transactions, but cannot be a "piece" of transaction, i.e. pgbench script with "BEGIN;" but without a corresponding "COMMIT" is a user error and warrants an abort, so that there is no need to manage these "in aborted transaction" errors every where and report about them and document them extensively. This means adding a check when a script is finished or starting that PQtransactionStatus(const PGconn *conn) == PQTRANS_IDLE, and abort if not with a fatal error. Then we can forget about these "in tx errors" counting, reporting and so on, and just have to document the restriction. -- Fabien.
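To make the control flow suggested in this review concrete, here is a toy model of the failure/retry part of the client state machine. All names and fields below are illustrative assumptions, not the patch's actual code: CSTATE_FAILURE only records the failure and picks the next state, and CSTATE_RETRY restores the saved state before restarting the script from its first command.

    #include <stdint.h>

    typedef enum
    {
        CSTATE_START_COMMAND,
        CSTATE_FAILURE,
        CSTATE_RETRY,
        CSTATE_ABORTED
    } ClientState;

    typedef struct
    {
        ClientState state;
        int         command;        /* index of the current command */
        int         tries_left;     /* remaining retries allowed (--max-tries) */
        int64_t     retries;        /* retries performed so far */
        int64_t     failures;       /* failures that were not retried */
    } Client;

    static void
    advance_after_failure(Client *st)
    {
        switch (st->state)
        {
            case CSTATE_FAILURE:
                /* the failed transaction is assumed to have been rolled back */
                if (st->tries_left > 0)
                    st->state = CSTATE_RETRY;
                else
                {
                    st->failures++;              /* reported in the final stats */
                    st->state = CSTATE_ABORTED;
                }
                break;

            case CSTATE_RETRY:
                st->tries_left--;
                st->retries++;
                /* restore the random seed and variables saved at tx start here */
                st->command = 0;                 /* restart from the first command */
                st->state = CSTATE_START_COMMAND;
                break;

            default:
                break;
        }
    }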
On 09-07-2018 16:05, Fabien COELHO wrote: > Hello Marina, Hello, Fabien! > Here is a review for the last part of your v9 version. Thank you very much for this! > Patch does not "git apply" (may anymore): > error: patch failed: doc/src/sgml/ref/pgbench.sgml:513 > error: doc/src/sgml/ref/pgbench.sgml: patch does not apply Sorry, I'll send a new version soon. > However I could get it to apply with the "patch" command. > > Then patch compiles, global & pgbench "make check" are ok. :-) > Feature > ======= > > The patch adds the ability to restart transactions (i.e. the full > script) > on some errors, which is a good thing as it allows to exercice postgres > performance in more realistic scenarii. > > * -d/--debug: I'm not in favor in requiring a mandatory text argument > on this > option. It is not pratical, the user has to remember it, and it is a > change. > I'm sceptical of the overall debug handling changes. Maybe we could > have > multiple -d which lead to higher debug level, but I'm not sure that it > can be > made to work for this case and still be compatible with the previous > behavior. > Maybe you need a specific option for your purpose, eg "--debug-retry"? As you wrote in [1], adding an additional option is also a bad idea: > I'm sceptical of the "--debug-fails" options. ISTM that --debug is > already there > and should just be reused. Maybe it's better to use an optional argument/arguments for compatibility (--debug[=fails] or --debug[=NUM])? But if we use the numbers, now I can see only 2 levels, and there's no guarantee that they will no change.. > Code > ==== > > * The implementation is less complex that the previous submission, > which is a good thing. I'm not sure that all the remaining complexity > is still fully needed. > > * I'm reserved about the whole ereport thing, see comments in other > messages. Thank you, I'll try to implement the error reporting in the way you suggested. > Leves ELEVEL_LOG_CLIENT_{FAIL,ABORTED} & LOG_MAIN look unclear to me. > In particular, the "CLIENT" part is not very useful. If the > distinction makes sense, I would have kept "LOG" for the initial one > and > add other ones for ABORT and PGBENCH, maybe. Ok! > * There are no comments about "retries" in StatData, CState and Command > structures. > > * Also, for StatData, I would like to understand the logic between cnt, > skipped, retries, retried, errors, ... so a clear information about the > expected invariant if any would be welcome. One has to go in the code > to > understand how these fields relate one to the other. > > <...> > > * How "errors" differs from "ecnt" is unclear to me. Thank you, I'll fix this. > * commandFailed: I think that it should be kept much simpler. In > particular, having errors on errors does not help much: on > ELEVEL_FATAL, it ignores the actual reported error and generates > another error of the same level, so that the initial issue is hidden. > Even if these are can't happen cases, hidding the origin if it occurs > looks unhelpful. Just print it directly, and maybe abort if you think > that it is a can't happen case. Oh, thanks, my mistake( > * copyRandomState: just use sizeof(RandomState) instead of making > assumptions > about the contents of the struct. Also, this function looks pretty > useless, > why not just do a plain assignment? > > * copyVariables: lacks comments to explain that the destination is > cleaned up > and so on. The cleanup phase could probaly be in a distinct function, > so that > the code would be clearer. 
Maybe the function variable names are too > long. Thank you, I'll fix this. > if (current_source->svalue) > > in the context of a guard for a strdup, maybe: > > if (current_source->svalue != NULL) I'm sorry, I'll fix this. > * I do not understand the comments on CState enum: "First, remember the > failure > in CSTATE_FAILURE. Then process other commands of the failed > transaction if any" > Why would other commands be processed at all if the transaction is > aborted? > For me any error must leads to the rollback and possible retry of the > transaction. This comment needs to be clarified. It should also say > that on FAILURE, it will go either to RETRY or ABORTED. See below my > comments about doCustom. > > It is unclear to me why their could be several failures within a > transaction, as I would have stopped that it would be aborted on the > first one. > > * I do not undestand the purpose of first_failure. The comment should > explain > why it would need to be remembered. From my point of view, I'm not > fully > convinced that it should. > > <...> > > * executeCondition: this hides client automaton state changes which > were > clearly visible beforehand in the switch, and the different handling of > if & elif is also hidden. > > I'm against this unnecessary restructuring and to hide such an > information, > all state changes should be clearly seen in the state switch so that it > is > easier to understand and follow. > > I do not see why touching the conditional stack on internal errors > (evaluateExpr failure) brings anything, the whole transaction will be > aborted > anyway. > > * doCustom changes. > > On CSTATE_START_COMMAND, it considers whether to retry on the end. > For me, this cannot happen: if some command failed, then it should have > skipped directly to the RETRY state, so that you cannot get to the end > of the script with an error. Maybe you could assert that the state of > the > previous command is NO_FAILURE, though. > > On CSTATE_FAILURE, the next command is possibly started. Although there > is some > consistency with the previous point, I think that it totally breaks the > state > automaton where now a command can start while the whole transaction is > in failing state anyway. There was no point in starting it in the first > place. > > So, for me, the FAILURE state should record/count the failure, then > skip > to RETRY if a retry is decided, else proceed to ABORT. Nothing else. > This is much clearer that way. > > Then RETRY should reinstate the global state and proceed to start the > *first* > command again. > > <...> > > It is unclear to me why backslash command errors are turned to FAILURE > instead of ABORTED: there is no way they are going to be retried, so > maybe they should/could skip directly to ABORTED? > > Function executeCondition is a bad idea, as stated above. So do you propose to execute the command "ROLLBACK" without calculating its latency etc. if we are in a failed transaction and clear the conditional stack after each failure? Also just to be clear: do you want to have the state CSTATE_ABORTED for client abortion and another state for interrupting the current transaction? > The current RETRY state does memory allocations to generate a message > with buffer allocation and so on. This looks like a costly and useless > operation. If the user required "retries", then this is normal > behavior, > the retries are counted and will be printed out in the final report, > and there is no point in printing out every single one of them. 
> Maybe you want that debugging, but then coslty operations should be > guarded. I think we need these debugging messages because, for example, if you use the option --latency-limit, you we will never know in advance whether the serialization/deadlock failure will be retried or not. They also help to understand which limit of retries was violated or how close we were to these limits during the execution of a specific transaction. But I agree with you that they are costly and can be skipped if the failure type is never retried. Maybe it is better to split them into multiple error function calls?.. > * reporting > > The number of transactions above the latency limit report can be > simplified. > Remove the if and just use one printf f with a %s for the optional > comment. > I'm not sure this optional comment is useful there. Oh, thanks, my mistake( > Before the patch, ISTM that all lines relied on one printf. you have > changed to a style where a collection of printf is used to compose a > line. I'd suggest to keep to the previous one-printf-prints-one-line > style, where possible. Ok! > You have added 20-columns alignment prints. This looks like too much > and > generates much too large lines. Probably 10 (billion) would be enough. I have already asked you about this in [2]: > The variables for the numbers of failures and retries are of type int64 > since the variable for the total number of transactions has the same > type. That's why such a large alignment (as I understand it now, enough > 20 characters). Do you prefer floating alignemnts, depending on the > maximum number of failures/retries for any command in any script? > Some people try to parse the output, so it should be deterministic. I'd > add > the needed columns always if appropriate (i.e. under retry), even if > none > occured. Ok! > * processXactStats: An else is replaced by a detailed stats, with the > initial > "no detailed stats" comment kept. The function is called both in the > thenb > & else branch. The structure does not make sense anymore. I'm not sure > this changed was needed. > > * getLatencyUsed: declared "double" so "return 0.0". > > * typo: ruin -> run; probably others, I did not check for them in > detail. Oh, thanks, my mistakes( > TAP Tests > ========= > > On my laptop, tests last 5.5 seconds before the patch, and about 13 > seconds > after. This is much too large. Pgbench TAP tests do not deserve to take > over > twice as much time as before just on this patch. > > One reason which explains this large time is there is a new script > with a new created instance. I'd suggest to append tests to the > existing 2 scripts, depending on whether they need a running instance > or not. Ok! All new tests that do not need a running instance are already added to the file 002_pgbench_no_server.pl. > Secondly, I think that the design of the tests are too heavy. For such > a feature, ISTM enough to check that it works, i.e. one test for > deadlocks (trigger one or a few deadlocks), idem for serializable, > maybe idem for other errors if any. > > <...> > > The latency limit to 900 ms try is a bad idea because it takes a lot of > time. > I did such tests before and they were removed by Tom Lane because of > determinism > and time issues. I would comment this test out for now. Ok! If it doesn't bother you - can you tell more about the causes of these determinism issues?.. Tests for some other failures that cannot be retried are already added to 001_pgbench_with_server.pl. 
> The challenge is to do that reliably and efficiently, i.e. so that the > test does > not rely on chance and is still quite efficient. > > The trick you use is to run an interactive psql in parallel to pgbench > so as to > play with concurrent locks. That is interesting, but deserves more > comments > and explanatation, eg before the test functions. > > Maybe this could be achieved within pgbench by using some wait stuff > in PL/pgSQL so that concurrent client can wait one another based on > data in unlogged table updated by a CALL within an "embedded" > transactions? Not sure. > > <...> > > Anyway, TAP tests should be much lighter (in total time), and if > possible much simpler. I'll try, thank you.. > Otherwise, maybe (simple) pgbench-side thread > barrier could help, but this would require more thinking. Tests must pass if we use --disable-thread-safety.. > Documentation > ============= > > Not looked at in much details for now. Just a few comments: > > Having the "most important settings" on line 1-6 and 8 (i.e. skipping > 7) looks > silly. The important ones should simply be the first ones, and the 8th > is not > that important, or it is in 7th position. Ok! > I do not understand why there is so much text about in failed sql > transaction > stuff, while we are mainly interested in serialization & deadlock > errors, and > this only falls in some "other" category. There seems to be more > details about > other errors that about deadlocks & serializable errors. > > The reporting should focus on what is of interest, either all errors, > or some > detailed split of these errors. > > <...> > > * "errors_in_failed_tx" is some subcounter of "errors", for a special > case. Why it is there fails me [I finally understood, and I think it > should be removed, see end of review]. If we wanted to distinguish, > then we should distinguish homogeneously: maybe just count the > different error types, eg have things like "deadlock_errors", > "serializable_errors", "other_errors", "internal_pgbench_errors" which > would be orthogonal one to the other, and "errors" could be recomputed > from these. Thank you, I agree with you. Unfortunately each new error type adds a new 1 or 2 columns of maximum width 20 to the per-statement report (to report errors and possibly retries of this type in this statement) and we already have 2 new columns for all errors and retries. So I'm not sure that we need add anything other than statistics only about all the errors and all the retries in general. > The documentation should state clearly what > are the counted errors, and then what are their effects on the reported > stats. > The "Errors and Serialization/Deadlock Retries" section is a good start > in that > direction, but it does not talk about pgbench internal errors (eg > "cos(true)"). > I think it should more explicit about errors. Thank you, I'll try to improve it. > Option --max-tries default value should be spelled out in the doc. If you mean that it is set to 1 if neither of the options --max-tries or --latency-limit is explicitly used, I'll fix this. > "Client's run is aborted", do you mean "Pgbench run is aborted"? No, other clients continue their run as usual. > * FailureStatus states are not homogeneously named. I'd suggest to use > *_FAILURE for all cases. The miscellaneous case should probably be the > last. I do not understand the distinction between ANOTHER_FAILURE & > IN_FAILED_SQL_TRANSACTION. Why should it be needed? 
[again, see and of > review] > > <...> > > "If a failed transaction block does not terminate in the current > script": > this just looks like a very bad idea, and explains my general ranting > above about this error condition. ISTM that the only reasonable option > is that a pgbench script should be inforced as a transaction, or a set > of > transactions, but cannot be a "piece" of transaction, i.e. pgbench > script > with "BEGIN;" but without a corresponding "COMMIT" is a user error and > warrants an abort, so that there is no need to manage these "in aborted > transaction" errors every where and report about them and document them > extensively. > > This means adding a check when a script is finished or starting that > PQtransactionStatus(const PGconn *conn) == PQTRANS_IDLE, and abort if > not > with a fatal error. Then we can forget about these "in tx errors" > counting, > reporting and so on, and just have to document the restriction. Ok! [1] https://www.postgresql.org/message-id/alpine.DEB.2.20.1801031720270.20034%40lancre [2] https://www.postgresql.org/message-id/e4c5e8cefa4a8e88f1273b0f1ee29e56@postgrespro.ru -- Marina Polyakova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Hello Marina, >> * -d/--debug: I'm not in favor in requiring a mandatory text argument on >> this option. > > As you wrote in [1], adding an additional option is also a bad idea: Hey, I'm entitled to some internal contradictions:-) >> I'm sceptical of the "--debug-fails" options. ISTM that --debug is >> already there and should just be reused. I was thinking that you could just use the existing --debug, not change its syntax. My point was that --debug exists, and you could just print the messages when under --debug. > Maybe it's better to use an optional argument/arguments for compatibility > (--debug[=fails] or --debug[=NUM])? But if we use the numbers, now I can see > only 2 levels, and there's no guarantee that they will no change.. Optional arguments to options (!) are not really clean things, so I'd like to avoid going onto this path, esp. as I cannot see any other instance in pgbench or elsewhere in postgres, and I personnaly consider these as a bad idea. So if absolutely necessary, a new option is still better than changing --debug syntax. If not necessary, then it is better:-) >> * I'm reserved about the whole ereport thing, see comments in other >> messages. > > Thank you, I'll try to implement the error reporting in the way you > suggested. Dunno if it is a good idea either. The committer word is the good one in the end:-à > Thank you, I'll fix this. > I'm sorry, I'll fix this. You do not have to thank me or being sorry on every comment I do, once a the former is enough, and there is no need for the later. >> * doCustom changes. >> >> On CSTATE_FAILURE, the next command is possibly started. Although there >> is some consistency with the previous point, I think that it totally >> breaks the state automaton where now a command can start while the >> whole transaction is in failing state anyway. There was no point in >> starting it in the first place. >> >> So, for me, the FAILURE state should record/count the failure, then skip >> to RETRY if a retry is decided, else proceed to ABORT. Nothing else. >> This is much clearer that way. >> >> Then RETRY should reinstate the global state and proceed to start the >> *first* command again. >> <...> >> >> It is unclear to me why backslash command errors are turned to FAILURE >> instead of ABORTED: there is no way they are going to be retried, so >> maybe they should/could skip directly to ABORTED? > So do you propose to execute the command "ROLLBACK" without calculating its > latency etc. if we are in a failed transaction and clear the conditional > stack after each failure? > Also just to be clear: do you want to have the state CSTATE_ABORTED for > client abortion and another state for interrupting the current transaction? I do not understand what "interrupting the current transaction" means. A transaction is either committed or rollbacked, I do not know about "interrupted". When it is rollbacked, probably some stats will be collected in passing, I'm fine with that. If there is an error in a pgbench script, the transaction is aborted, which means for me that the script execution is stopped where it was, and either it is restarted from the beginning (retry) or counted as failure (not retry, just aborted, really). If by interrupted you mean that one script begins a transaction and another ends it, as I said in the review I think that this strange case should be forbidden, so that all the code and documentation trying to manage that can be removed. 
>> The current RETRY state does memory allocations to generate a message >> with buffer allocation and so on. This looks like a costly and useless >> operation. If the user required "retries", then this is normal behavior, >> the retries are counted and will be printed out in the final report, >> and there is no point in printing out every single one of them. >> Maybe you want that debugging, but then coslty operations should be >> guarded. > > I think we need these debugging messages because, for example, Debugging message should cost only when under debug. When not under debug, there should be no debugging message, and there should be no cost for building and discarding such messages in the executed code path beyond testing whether program is under debug. > if you use the option --latency-limit, you we will never know in advance > whether the serialization/deadlock failure will be retried or not. ISTM that it will be shown final report. If I want debug, I ask for --debug, otherwise I think that the command should do what it was asked for, i.e. run scripts, collect performance statistics and show them at the end. In particular, when running with retries is enabled, the user is expecting deadlock/serialization errors, so that they are not "errors" as such for them. > They also help to understand which limit of retries was violated or how > close we were to these limits during the execution of a specific > transaction. But I agree with you that they are costly and can be > skipped if the failure type is never retried. Maybe it is better to > split them into multiple error function calls?.. Debugging message costs should only be incurred when under --debug, not otherwise. >> You have added 20-columns alignment prints. This looks like too much and >> generates much too large lines. Probably 10 (billion) would be enough. > > I have already asked you about this in [2]: Probably:-) >> The variables for the numbers of failures and retries are of type int64 >> since the variable for the total number of transactions has the same >> type. That's why such a large alignment (as I understand it now, enough >> 20 characters). Do you prefer floating alignemnts, depending on the >> maximum number of failures/retries for any command in any script? An int64 counter is not likely to reach its limit anytime soon:-) If the column display limit is ever reached, ISTM that then the text is just misaligned, which is a minor and rare inconvenience. If very wide columns are used, then it does not fit my terminal and the report text will always be wrapped around, which makes it harder to read, every time. >> The latency limit to 900 ms try is a bad idea because it takes a lot of >> time. I did such tests before and they were removed by Tom Lane because >> of determinism and time issues. I would comment this test out for now. > > Ok! If it doesn't bother you - can you tell more about the causes of these > determinism issues?.. Tests for some other failures that cannot be retried > are already added to 001_pgbench_with_server.pl. Some farm animals are very slow, so you cannot really assume much about time one way or another. >> Otherwise, maybe (simple) pgbench-side thread >> barrier could help, but this would require more thinking. > > Tests must pass if we use --disable-thread-safety.. Sure. My wording was misleading. I just meant a synchronisation barrier between concurrent clients, which could be managed with one thread. Anyway, it is probably overkill for the problem at hand, so just forget. 
>> I do not understand why there is so much text about in failed sql >> transaction stuff, while we are mainly interested in serialization & >> deadlock errors, and this only falls in some "other" category. There >> seems to be more details about other errors that about deadlocks & >> serializable errors. >> >> The reporting should focus on what is of interest, either all errors, >> or some detailed split of these errors. >> >> <...> >> >> * "errors_in_failed_tx" is some subcounter of "errors", for a special >> case. Why it is there fails me [I finally understood, and I think it >> should be removed, see end of review]. If we wanted to distinguish, >> then we should distinguish homogeneously: maybe just count the >> different error types, eg have things like "deadlock_errors", >> "serializable_errors", "other_errors", "internal_pgbench_errors" which >> would be orthogonal one to the other, and "errors" could be recomputed >> from these. > > Thank you, I agree with you. Unfortunately each new error type adds a new 1 > or 2 columns of maximum width 20 to the per-statement report The fact that some data are collected does not mean that they should all be reported in detail. We can have detailed error count and report the sum of this errors for instance, or have some more verbose/detailed reports as options (eg --latencies does just that). >> <...> >> >> "If a failed transaction block does not terminate in the current script": >> this just looks like a very bad idea, and explains my general ranting >> above about this error condition. ISTM that the only reasonable option >> is that a pgbench script should be inforced as a transaction, or a set of >> transactions, but cannot be a "piece" of transaction, i.e. pgbench script >> with "BEGIN;" but without a corresponding "COMMIT" is a user error and >> warrants an abort, so that there is no need to manage these "in aborted >> transaction" errors every where and report about them and document them >> extensively. >> >> This means adding a check when a script is finished or starting that >> PQtransactionStatus(const PGconn *conn) == PQTRANS_IDLE, and abort if not >> with a fatal error. Then we can forget about these "in tx errors" counting, >> reporting and so on, and just have to document the restriction. > > Ok! Good:-) ISTM that this would remove a significant amount of complexity from the code and documentation. -- Fabien.
On 11-07-2018 16:24, Fabien COELHO wrote: > Hello Marina, > >>> * -d/--debug: I'm not in favor in requiring a mandatory text argument >>> on this option. >> >> As you wrote in [1], adding an additional option is also a bad idea: > > Hey, I'm entitled to some internal contradictions:-) ... and discussions will be continue forever %-) >>> I'm sceptical of the "--debug-fails" options. ISTM that --debug is >>> already there and should just be reused. > > I was thinking that you could just use the existing --debug, not > change its syntax. My point was that --debug exists, and you could > just print > the messages when under --debug. Now I understand you better, thanks. I think it will be useful to receive only messages about failures, because they and progress reports can be lost in many other debug messages such as "client %d sending ..." / "client %d executing ..." / "client %d receiving". >> Maybe it's better to use an optional argument/arguments for >> compatibility (--debug[=fails] or --debug[=NUM])? But if we use the >> numbers, now I can see only 2 levels, and there's no guarantee that >> they will no change.. > > Optional arguments to options (!) are not really clean things, so I'd > like to avoid going onto this path, esp. as I cannot see any other > instance in pgbench or elsewhere in postgres, AFAICS they are used in pg_waldump (option --stats[=record]) and in psql (option --help[=topic]). > and I personnaly > consider these as a bad idea. > So if absolutely necessary, a new option is still better than changing > --debug syntax. If not necessary, then it is better:-) Ok! >>> * I'm reserved about the whole ereport thing, see comments in other >>> messages. >> >> Thank you, I'll try to implement the error reporting in the way you >> suggested. > > Dunno if it is a good idea either. The committer word is the good one > in the end:-à I agree with you that ereport has good reasons to be non-trivial in the backend and it does not have the same in pgbench.. >>> * doCustom changes. > >>> >>> On CSTATE_FAILURE, the next command is possibly started. Although >>> there is some consistency with the previous point, I think that it >>> totally breaks the state automaton where now a command can start >>> while the whole transaction is in failing state anyway. There was no >>> point in starting it in the first place. >>> >>> So, for me, the FAILURE state should record/count the failure, then >>> skip >>> to RETRY if a retry is decided, else proceed to ABORT. Nothing else. >>> This is much clearer that way. >>> >>> Then RETRY should reinstate the global state and proceed to start the >>> *first* command again. >>> <...> >>> >>> It is unclear to me why backslash command errors are turned to >>> FAILURE >>> instead of ABORTED: there is no way they are going to be retried, so >>> maybe they should/could skip directly to ABORTED? > >> So do you propose to execute the command "ROLLBACK" without >> calculating its latency etc. if we are in a failed transaction and >> clear the conditional stack after each failure? > >> Also just to be clear: do you want to have the state CSTATE_ABORTED >> for client abortion and another state for interrupting the current >> transaction? > > I do not understand what "interrupting the current transaction" means. > A transaction is either committed or rollbacked, I do not know about > "interrupted". I mean that IIUC the server usually only reports the error and you must manually send the command "END" or "ROLLBACK" to rollback a failed transaction. 
> When it is rollbacked, probably some stats will be > collected in passing, I'm fine with that. > > If there is an error in a pgbench script, the transaction is aborted, > which means for me that the script execution is stopped where it was, > and either it is restarted from the beginning (retry) or counted as > failure (not retry, just aborted, really). > > If by interrupted you mean that one script begins a transaction and > another ends it, as I said in the review I think that this strange > case should be forbidden, so that all the code and documentation > trying to > manage that can be removed. Ok! >>> The current RETRY state does memory allocations to generate a message >>> with buffer allocation and so on. This looks like a costly and >>> useless >>> operation. If the user required "retries", then this is normal >>> behavior, >>> the retries are counted and will be printed out in the final report, >>> and there is no point in printing out every single one of them. >>> Maybe you want that debugging, but then coslty operations should be >>> guarded. >> >> I think we need these debugging messages because, for example, > > Debugging message should cost only when under debug. When not under > debug, there should be no debugging message, and there should be no > cost for building and discarding such messages in the executed code > path beyond > testing whether program is under debug. > >> if you use the option --latency-limit, you we will never know in >> advance whether the serialization/deadlock failure will be retried or >> not. > > ISTM that it will be shown final report. If I want debug, I ask for > --debug, otherwise I think that the command should do what it was > asked for, i.e. run scripts, collect performance statistics and show > them at the end. > > In particular, when running with retries is enabled, the user is > expecting deadlock/serialization errors, so that they are not "errors" > as such for > them. > >> They also help to understand which limit of retries was violated or >> how close we were to these limits during the execution of a specific >> transaction. But I agree with you that they are costly and can be >> skipped if the failure type is never retried. Maybe it is better to >> split them into multiple error function calls?.. > > Debugging message costs should only be incurred when under --debug, > not otherwise. Ok! IIUC instead of this part of the code initPQExpBuffer(&errmsg_buf); printfPQExpBuffer(&errmsg_buf, "client %d repeats the failed transaction (try %d", st->id, st->retries + 1); if (max_tries) appendPQExpBuffer(&errmsg_buf, "/%d", max_tries); if (latency_limit) { appendPQExpBuffer(&errmsg_buf, ", %.3f%% of the maximum time of tries was used", getLatencyUsed(st, &now)); } appendPQExpBufferStr(&errmsg_buf, ")\n"); pgbench_error(DEBUG_FAIL, "%s", errmsg_buf.data); termPQExpBuffer(&errmsg_buf); can we try something like this? PGBENCH_ERROR_START(DEBUG_FAIL) { PGBENCH_ERROR("client %d repeats the failed transaction (try %d", st->id, st->retries + 1); if (max_tries) PGBENCH_ERROR("/%d", max_tries); if (latency_limit) { PGBENCH_ERROR(", %.3f%% of the maximum time of tries was used", getLatencyUsed(st, &now)); } PGBENCH_ERROR(")\n"); } PGBENCH_ERROR_END(); >>> You have added 20-columns alignment prints. This looks like too much >>> and >>> generates much too large lines. Probably 10 (billion) would be >>> enough. 
>> >> I have already asked you about this in [2]: > > Probably:-) > >>> The variables for the numbers of failures and retries are of type >>> int64 >>> since the variable for the total number of transactions has the same >>> type. That's why such a large alignment (as I understand it now, >>> enough >>> 20 characters). Do you prefer floating alignemnts, depending on the >>> maximum number of failures/retries for any command in any script? > > An int64 counter is not likely to reach its limit anytime soon:-) If > the column display limit is ever reached, ISTM that then the text is > just misaligned, which is a minor and rare inconvenience. If very wide > columns are used, then it does not fit my terminal and the report text > will always be wrapped around, which makes it harder to read, every > time. Ok! >>> The latency limit to 900 ms try is a bad idea because it takes a lot >>> of time. I did such tests before and they were removed by Tom Lane >>> because of determinism and time issues. I would comment this test out >>> for now. >> >> Ok! If it doesn't bother you - can you tell more about the causes of >> these determinism issues?.. Tests for some other failures that cannot >> be retried are already added to 001_pgbench_with_server.pl. > > Some farm animals are very slow, so you cannot really assume much > about time one way or another. Thanks! >>> I do not understand why there is so much text about in failed sql >>> transaction stuff, while we are mainly interested in serialization & >>> deadlock errors, and this only falls in some "other" category. There >>> seems to be more details about other errors that about deadlocks & >>> serializable errors. >>> >>> The reporting should focus on what is of interest, either all errors, >>> or some detailed split of these errors. >>> >>> <...> >>> >>> * "errors_in_failed_tx" is some subcounter of "errors", for a special >>> case. Why it is there fails me [I finally understood, and I think it >>> should be removed, see end of review]. If we wanted to distinguish, >>> then we should distinguish homogeneously: maybe just count the >>> different error types, eg have things like "deadlock_errors", >>> "serializable_errors", "other_errors", "internal_pgbench_errors" >>> which >>> would be orthogonal one to the other, and "errors" could be >>> recomputed >>> from these. >> >> Thank you, I agree with you. Unfortunately each new error type adds a >> new 1 or 2 columns of maximum width 20 to the per-statement report > > The fact that some data are collected does not mean that they should > all be reported in detail. We can have detailed error count and report > the sum of this errors for instance, or have some more > verbose/detailed reports > as options (eg --latencies does just that). Ok! -- Marina Polyakova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
On 2018-Jul-11, Marina Polyakova wrote: > can we try something like this? > > PGBENCH_ERROR_START(DEBUG_FAIL) > { > PGBENCH_ERROR("client %d repeats the failed transaction (try %d", > st->id, st->retries + 1); > if (max_tries) > PGBENCH_ERROR("/%d", max_tries); > if (latency_limit) > { > PGBENCH_ERROR(", %.3f%% of the maximum time of tries was used", > getLatencyUsed(st, &now)); > } > PGBENCH_ERROR(")\n"); > } > PGBENCH_ERROR_END(); I didn't quite understand what these PGBENCH_ERROR() functions/macros are supposed to do. Care to explain? -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Just a quick skim while refreshing what were those error reporting API changes about ... On 2018-May-21, Marina Polyakova wrote: > v9-0001-Pgbench-errors-use-the-RandomState-structure-for-.patch > - a patch for the RandomState structure (this is used to reset a client's > random seed during the repeating of transactions after > serialization/deadlock failures). LGTM, though I'd rename the random_state struct members so that it wouldn't look as confusing. Maybe that's just me. > v9-0002-Pgbench-errors-use-the-Variables-structure-for-cl.patch > - a patch for the Variables structure (this is used to reset client > variables during the repeating of transactions after serialization/deadlock > failures). Please don't allocate Variable structs one by one. First time allocate some decent number (say 8) and then enlarge by duplicating size. That way you save realloc overhead. We use this technique everywhere else, no reason do different here. Other than that, LGTM. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
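For concreteness, the allocation pattern suggested here could look roughly like this (just a sketch; the Variables struct layout, field names and function name are illustrative, not taken from the patch):

typedef struct
{
    Variable   *vars;       /* array of variables */
    int         nvars;      /* number of variables in use */
    int         max_vars;   /* allocated length of vars */
} Variables;

static void
enlargeVariables(Variables *variables, int needed)
{
    if (variables->max_vars >= needed)
        return;                 /* enough room already */

    /* start with a decent number, then grow by doubling */
    if (variables->max_vars == 0)
        variables->max_vars = 8;
    while (variables->max_vars < needed)
        variables->max_vars *= 2;

    variables->vars = (Variable *)
        pg_realloc(variables->vars, variables->max_vars * sizeof(Variable));
}

The doubling amortizes the realloc overhead when many variables are created, which is the point of the suggestion.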
> can we try something like this? > > PGBENCH_ERROR_START(DEBUG_FAIL) > { > PGBENCH_ERROR("client %d repeats the failed transaction (try %d", Argh, no? I was thinking of something much more trivial: pgbench_error(DEBUG, "message format %d %s...", 12, "hello world"); If you really need some complex dynamic buffer, and I would prefer that you avoid that, then the fallback is: if (level >= DEBUG) { initPQstuff(&msg); ... pgbench_error(DEBUG, "fixed message... %s\n", msg); freePQstuff(&msg); } The point is to avoid building the message with dynamic allocation and so if in the end it is not used. -- Fabien.
On 11-07-2018 20:49, Alvaro Herrera wrote: > On 2018-Jul-11, Marina Polyakova wrote: > >> can we try something like this? >> >> PGBENCH_ERROR_START(DEBUG_FAIL) >> { >> PGBENCH_ERROR("client %d repeats the failed transaction (try %d", >> st->id, st->retries + 1); >> if (max_tries) >> PGBENCH_ERROR("/%d", max_tries); >> if (latency_limit) >> { >> PGBENCH_ERROR(", %.3f%% of the maximum time of tries was used", >> getLatencyUsed(st, &now)); >> } >> PGBENCH_ERROR(")\n"); >> } >> PGBENCH_ERROR_END(); > > I didn't quite understand what these PGBENCH_ERROR() functions/macros > are supposed to do. Care to explain? They are used only to print a string with the given arguments to stderr. Probably this could just be the function pgbench_error and not a macro.. P.S. This is my mistake: I did not take into account that PGBENCH_ERROR_END does not know the elevel, which it needs in order to call exit(1) if the elevel >= ERROR. -- Marina Polyakova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
On 11-07-2018 21:04, Alvaro Herrera wrote: > Just a quick skim while refreshing what were those error reporting API > changes about ... Thank you! > On 2018-May-21, Marina Polyakova wrote: > >> v9-0001-Pgbench-errors-use-the-RandomState-structure-for-.patch >> - a patch for the RandomState structure (this is used to reset a >> client's >> random seed during the repeating of transactions after >> serialization/deadlock failures). > > LGTM, though I'd rename the random_state struct members so that it > wouldn't look as confusing. Maybe that's just me. IIUC, do you like "xseed" instead of "data"? typedef struct RandomState { - unsigned short data[3]; + unsigned short xseed[3]; } RandomState; Or do you want to rename "random_state" in the structures RetryState / CState / TState? Thanks to Fabien Coelho' comments in [1], TState can contain several RandomStates for different purposes, something like this: /* * Thread state */ typedef struct { ... /* * Separate randomness for each thread. Each thread option uses its own * random state to make all of them independent of each other and therefore * deterministic at the thread level. */ RandomState choose_script_rs; /* random state for selecting a script */ RandomState throttling_rs; /* random state for transaction throttling */ RandomState sampling_rs; /* random state for log sampling */ ... } TState; >> v9-0002-Pgbench-errors-use-the-Variables-structure-for-cl.patch >> - a patch for the Variables structure (this is used to reset client >> variables during the repeating of transactions after >> serialization/deadlock >> failures). > > Please don't allocate Variable structs one by one. First time allocate > some decent number (say 8) and then enlarge by duplicating size. That > way you save realloc overhead. We use this technique everywhere else, > no reason do different here. Other than that, LGTM. Ok! [1] https://www.postgresql.org/message-id/alpine.DEB.2.21.1806090810090.5307%40lancre > While reading your patch, it occurs to me that a run is not > deterministic > at the thread level under throttling and sampling, because the random > state is sollicited differently depending on when transaction ends. > This > suggest that maybe each thread random_state use should have its own > random > state. -- Marina Polyakova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
On 11-07-2018 22:34, Fabien COELHO wrote: >> can we try something like this? >> >> PGBENCH_ERROR_START(DEBUG_FAIL) >> { >> PGBENCH_ERROR("client %d repeats the failed transaction (try %d", > > Argh, no? I was thinking of something much more trivial: > > pgbench_error(DEBUG, "message format %d %s...", 12, "hello world"); > > If you really need some complex dynamic buffer, and I would prefer > that you avoid that, then the fallback is: > > if (level >= DEBUG) > { > initPQstuff(&msg); > ... > pgbench_error(DEBUG, "fixed message... %s\n", msg); > freePQstuff(&msg); > } > > The point is to avoid building the message with dynamic allocation and > so > if in the end it is not used. Ok! About avoidance - I'm afraid there's one more piece of debugging code with the same problem: else if (command->type == META_COMMAND) { ... initPQExpBuffer(&errmsg_buf); printfPQExpBuffer(&errmsg_buf, "client %d executing \\%s", st->id, argv[0]); for (i = 1; i < argc; i++) appendPQExpBuffer(&errmsg_buf, " %s", argv[i]); appendPQExpBufferChar(&errmsg_buf, '\n'); ereport(ELEVEL_DEBUG, (errmsg("%s", errmsg_buf.data))); termPQExpBuffer(&errmsg_buf); -- Marina Polyakova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
>> The point is to avoid building the message with dynamic allocation and so >> if in the end it is not used. > > Ok! About avoidance - I'm afraid there's one more piece of debugging code > with the same problem: Indeed. I'd like to avoid all instances, so that PQExpBufferData is not needed anywhere, if possible. If not possible, then too bad, but I'd prefer to make do with formatted prints only, for simplicity. -- Fabien.
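For instance, the meta-command trace quoted above needs no buffer at all if it is kept as plain formatted prints, roughly as unpatched pgbench does it (a sketch; the guard would be whatever debug test the patch ends up with):

if (debug)
{
    int         i;

    /* print the meta-command and its arguments directly, no buffer */
    fprintf(stderr, "client %d executing \\%s", st->id, argv[0]);
    for (i = 1; i < argc; i++)
        fprintf(stderr, " %s", argv[i]);
    fprintf(stderr, "\n");
}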
Hello, hackers! Here there's a tenth version of the patch for error handling and retrying of transactions with serialization/deadlock failures in pgbench (based on the commit e0ee93053998b159e395deed7c42e02b1f921552) thanks to the comments of Fabien Coelho and Alvaro Herrera in this thread. v10-0001-Pgbench-errors-use-the-RandomState-structure-for.patch - a patch for the RandomState structure (this is used to reset a client's random seed during the repeating of transactions after serialization/deadlock failures). v10-0002-Pgbench-errors-use-a-separate-function-to-report.patch - a patch for a separate error reporting function (this is used to report client failures that do not cause an aborts and this depends on the level of debugging). v10-0003-Pgbench-errors-use-the-Variables-structure-for-c.patch - a patch for the Variables structure (this is used to reset client variables during the repeating of transactions after serialization/deadlock failures). v10-0004-Pgbench-errors-and-serialization-deadlock-retrie.patch - the main patch for handling client errors and repetition of transactions with serialization/deadlock failures (see the detailed description in the file). As Fabien wrote in [5], some of the new tests were too slow. Earlier on my laptop they increased the testing time of pgbench from 5.5 seconds to 12.5 seconds. In the new version the testing time of pgbench takes about 7 seconds. These tests include one test for serialization failure and retry, as well as one test for deadlock failure and retry. Both of them are in file 001_pgbench_with_server.pl, each test uses only one pgbench run, they use PL/pgSQL scripts instead of a parallel psql session. Any suggestions are welcome! All that was fixed from the previous version: [1] https://www.postgresql.org/message-id/alpine.DEB.2.21.1806090810090.5307%40lancre > ISTM that the struct itself does not need a name, ie. "typedef struct { > ... } RandomState" is enough. > There could be clear comments, say in the TState and CState structs, > about > what randomness is impacted (i.e. script choices, etc.). > getZipfianRand, computeHarmonicZipfian: The "thread" parameter was > justified because it was used for two fieds. As the random state is > separated, I'd suggest that the other argument should be a zipfcache > pointer. > While reading your patch, it occurs to me that a run is not > deterministic > at the thread level under throttling and sampling, because the random > state is sollicited differently depending on when transaction ends. > This > suggest that maybe each thread random_state use should have its own > random > state. [2] https://www.postgresql.org/message-id/alpine.DEB.2.21.1806091514060.3655%40lancre > The structure typedef does not need a name. "typedef struct { } V...". > I tend to disagree with naming things after their type, eg "array". I'd > suggest "vars" instead. "nvariables" could be "nvars" for consistency > with > that and "vars_sorted", and because "foo.variables->nvariables" starts > looking heavy. > I'd suggest but "Variables" type declaration just after "Variable" type > declaration in the file. [3] https://www.postgresql.org/message-id/alpine.DEB.2.21.1806100837380.3655%40lancre > The semantics of the existing code is changed, the FATAL levels calls > abort() and replace existing exit(1) calls. Maybe you want an ERROR > level as well. > I do not understand why names are changed, eg ELEVEL_FATAL instead of > FATAL. 
ISTM that part of the point of the move would be to be > homogeneous, > which suggests that the same names should be reused. [4] https://www.postgresql.org/message-id/alpine.DEB.2.21.1807081014260.17811%40lancre > I'd suggest to have just one clean and simple pgbench internal function > to > handle errors and possibly exit, debug... Something like > > void pgb_error(FATAL, "error %d raised", 12); > > Implemented as > > void pgb_error(int/enum XXX level, const char * format, ...) > { > test level and maybe return immediately (eg debug); > print to stderr; > exit/abort/return depending; > } [5] https://www.postgresql.org/message-id/alpine.DEB.2.21.1807091451520.17811%40lancre > Leves ELEVEL_LOG_CLIENT_{FAIL,ABORTED} & LOG_MAIN look unclear to me. > In particular, the "CLIENT" part is not very useful. If the > distinction makes sense, I would have kept "LOG" for the initial one > and > add other ones for ABORT and PGBENCH, maybe. > * There are no comments about "retries" in StatData, CState and Command > structures. > * Also, for StatData, I would like to understand the logic between cnt, > skipped, retries, retried, errors, ... so a clear information about the > expected invariant if any would be welcome. One has to go in the code > to > understand how these fields relate one to the other. > * "errors_in_failed_tx" is some subcounter of "errors", for a special > case. Why it is there fails me [I finally understood, and I think it > should be removed, see end of review]. If we wanted to distinguish, > then > we should distinguish homogeneously: maybe just count the different > error > types, eg have things like "deadlock_errors", "serializable_errors", > "other_errors", "internal_pgbench_errors" which would be orthogonal one > to > the other, and "errors" could be recomputed from these. > * How "errors" differs from "ecnt" is unclear to me. > * FailureStatus states are not homogeneously named. I'd suggest to use > *_FAILURE for all cases. The miscellaneous case should probably be the > last. > * I do not understand the comments on CState enum: "First, remember the > failure > in CSTATE_FAILURE. Then process other commands of the failed > transaction if any" > Why would other commands be processed at all if the transaction is > aborted? > For me any error must leads to the rollback and possible retry of the > transaction. > ... > So, for me, the FAILURE state should record/count the failure, then > skip > to RETRY if a retry is decided, else proceed to ABORT. Nothing else. > This is much clearer that way. > > Then RETRY should reinstate the global state and proceed to start the > *first* > command again. > * commandFailed: I think that it should be kept much simpler. In > particular, having errors on errors does not help much: on > ELEVEL_FATAL, > it ignores the actual reported error and generates another error of the > same level, so that the initial issue is hidden. Even if these are > can't > happen cases, hidding the origin if it occurs looks unhelpful. Just > print > it directly, and maybe abort if you think that it is a can't happen > case. > * copyRandomState: just use sizeof(RandomState) instead of making > assumptions > about the contents of the struct. Also, this function looks pretty > useless, > why not just do a plain assignment? > * copyVariables: lacks comments to explain that the destination is > cleaned up > and so on. The cleanup phase could probaly be in a distinct function, > so that > the code would be clearer. Maybe the function variable names are too > long. 
> > if (current_source->svalue) > > in the context of a guard for a strdup, maybe: > > if (current_source->svalue != NULL) > * executeCondition: this hides client automaton state changes which > were > clearly visible beforehand in the switch, and the different handling of > if & elif is also hidden. > > I'm against this unnecessary restructuring and to hide such an > information, > all state changes should be clearly seen in the state switch so that it > is > easier to understand and follow. > > I do not see why touching the conditional stack on internal errors > (evaluateExpr failure) brings anything, the whole transaction will be > aborted > anyway. > The current RETRY state does memory allocations to generate a message > with buffer allocation and so on. This looks like a costly and useless > operation. If the user required "retries", then this is normal > behavior, > the retries are counted and will be printed out in the final report, > and there is no point in printing out every single one of them. > Maybe you want that debugging, but then coslty operations should be > guarded. > The number of transactions above the latency limit report can be > simplified. > Remove the if and just use one printf f with a %s for the optional > comment. > I'm not sure this optional comment is useful there. > Before the patch, ISTM that all lines relied on one printf. you have > changed to a style where a collection of printf is used to compose a > line. > I'd suggest to keep to the previous one-printf-prints-one-line style, > where possible. > You have added 20-columns alignment prints. This looks like too much > and > generates much too large lines. Probably 10 (billion) would be enough. > > Some people try to parse the output, so it should be deterministic. I'd > add > the needed columns always if appropriate (i.e. under retry), even if > none > occured. > * processXactStats: An else is replaced by a detailed stats, with the > initial > "no detailed stats" comment kept. The function is called both in the > thenb > & else branch. The structure does not make sense anymore. I'm not sure > this changed was needed. > * getLatencyUsed: declared "double" so "return 0.0". > * typo: ruin -> run; probably others, I did not check for them in > detail. > On my laptop, tests last 5.5 seconds before the patch, and about 13 > seconds > after. This is much too large. Pgbench TAP tests do not deserve to take > over > twice as much time as before just on this patch. > > One reason which explains this large time is there is a new script with > a > new created instance. I'd suggest to append tests to the existing 2 > scripts, depending on whether they need a running instance or not. > > Secondly, I think that the design of the tests are too heavy. For such > a > feature, ISTM enough to check that it works, i.e. one test for > deadlocks > (trigger one or a few deadlocks), idem for serializable, maybe idem for > other errors if any. > > The challenge is to do that reliably and efficiently, i.e. so that the > test does > not rely on chance and is still quite efficient. > > The trick you use is to run an interactive psql in parallel to pgbench > so as to > play with concurrent locks. That is interesting, but deserves more > comments > and explanatation, eg before the test functions. > > Maybe this could be achieved within pgbench by using some wait stuff in > PL/pgSQL so that concurrent client can wait one another based on data > in > unlogged table updated by a CALL within an "embedded" transactions? Not > sure. ... 
> > Anyway, TAP tests should be much lighter (in total time), and if > possible > much simpler. > > The latency limit to 900 ms try is a bad idea because it takes a lot of > time. > I did such tests before and they were removed by Tom Lane because of > determinism > and time issues. I would comment this test out for now. > Documentation > ... > Having the "most important settings" on line 1-6 and 8 (i.e. skipping > 7) looks > silly. The important ones should simply be the first ones, and the 8th > is not > that important, or it is in 7th position. > > I do not understand why there is so much text about in failed sql > transaction > stuff, while we are mainly interested in serialization & deadlock > errors, and > this only falls in some "other" category. There seems to be more > details about > other errors that about deadlocks & serializable errors. > > The reporting should focus on what is of interest, either all errors, > or some > detailed split of these errors. The documentation should state clearly > what > are the counted errors, and then what are their effects on the reported > stats. > The "Errors and Serialization/Deadlock Retries" section is a good start > in that > direction, but it does not talk about pgbench internal errors (eg > "cos(true)"). > I think it should more explicit about errors. > > Option --max-tries default value should be spelled out in the doc. [6] https://www.postgresql.org/message-id/alpine.DEB.2.21.1807111435250.27883%40lancre > So if absolutely necessary, a new option is still better than changing > --debug syntax. If not necessary, then it is better:-) > The fact that some data are collected does not mean that they should > all > be reported in detail. We can have detailed error count and report the > sum > of this errors for instance, or have some more verbose/detailed reports > as options (eg --latencies does just that). [7] https://www.postgresql.org/message-id/20180711180417.3ytmmwmonsr5lra7%40alvherre.pgsql > LGTM, though I'd rename the random_state struct members so that it > wouldn't look as confusing. Maybe that's just me. > Please don't allocate Variable structs one by one. First time allocate > some decent number (say 8) and then enlarge by duplicating size. That > way you save realloc overhead. We use this technique everywhere else, > no reason do different here. Other than that, LGTM. [8] https://www.postgresql.org/message-id/alpine.DEB.2.21.1807112124210.27883%40lancre > If you really need some complex dynamic buffer, and I would prefer > that you avoid that, then the fallback is: > > if (level >= DEBUG) > { > initPQstuff(&msg); > ... > pgbench_error(DEBUG, "fixed message... %s\n", msg); > freePQstuff(&msg); > } > > The point is to avoid building the message with dynamic allocation and > so > if in the end it is not used. -- Marina Polyakova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Attachment
Hello Marina, > v10-0001-Pgbench-errors-use-the-RandomState-structure-for.patch > - a patch for the RandomState structure (this is used to reset a client's > random seed during the repeating of transactions after serialization/deadlock > failures). About this v10 part 1: Patch applies cleanly, compiles, global & local make check both ok. The random state is cleanly separated so that it will be easy to reset it on client error handling. ISTM that the pgbench side is deterministic with the separation of the seeds for different uses. Code is clean, comments are clear. I'm wondering what is the rationale for the "xseed" field name? In particular, what does the "x" stand for? -- Fabien.
On 07-08-2018 19:21, Fabien COELHO wrote: > Hello Marina, Hello, Fabien! >> v10-0001-Pgbench-errors-use-the-RandomState-structure-for.patch >> - a patch for the RandomState structure (this is used to reset a >> client's random seed during the repeating of transactions after >> serialization/deadlock failures). > > About this v10 part 1: > > Patch applies cleanly, compiles, global & local make check both ok. > > The random state is cleanly separated so that it will be easy to reset > it on client error handling. ISTM that the pgbench side is > deterministic with > the separation of the seeds for different uses. > > Code is clean, comments are clear. :-) > I'm wondering what is the rationale for the "xseed" field name? In > particular, what does the "x" stand for? I called it "...seed" instead of "data" because "data" is perhaps too general a name for use here (but I'm not entirely sure what Alvaro Herrera meant in [1], see my answer in [2]). I called it "xseed" to match the arguments of the functions _dorand48 / pg_erand48 / pg_jrand48 in the file erand48.c. IIUC they use a linear congruential generator, and perhaps "xseed" means the sequence named X of 48-bit pseudorandom values (X_0, X_1, ... X_n), where X_0 is the seed / start value. [1] https://www.postgresql.org/message-id/20180711180417.3ytmmwmonsr5lra7@alvherre.pgsql > LGTM, though I'd rename the random_state struct members so that it > wouldn't look as confusing. Maybe that's just me. [2] https://www.postgresql.org/message-id/cb2cde10e4e7a10a38b48e9cae8fbd28%40postgrespro.ru -- Marina Polyakova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
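To make the naming concrete: the field holds exactly the 48-bit state that the erand48.c routines advance in place, e.g. (the wrapper name here is purely illustrative, not from the patch):

typedef struct RandomState
{
    unsigned short xseed[3];    /* X_n: 48-bit state advanced by pg_erand48() & co. */
} RandomState;

/* draw a uniform double in [0, 1) from a given random state */
static double
getUniformRand(RandomState *random_state)
{
    return pg_erand48(random_state->xseed);
}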
Hello Marina, > v10-0002-Pgbench-errors-use-a-separate-function-to-report.patch > - a patch for a separate error reporting function (this is used to report > client failures that do not cause an aborts and this depends on the level of > debugging). Patch applies cleanly, compiles, global & local make check ok. This patch improves/homogenizes logging & error reporting in pgbench, in preparation for another patch which will manage transaction restarts in some cases. However ISTM that it is not as necessary as the previous one, i.e. we could do without it to get the desired feature, so I see it more as a refactoring done "in passing", and I'm wondering whether it is really worth it because it adds some new complexity, so I'm not sure of the net benefit. Anyway, I still have quite a few comments/suggestions on this version. * ErrorLevel If ErrorLevel is used for things which are not errors, its name should not include "Error"? Maybe "LogLevel"? I'm at odds with the proposed levels. ISTM that pgbench internal errors which warrant an immediate exit should be dubbed "FATAL", which would leave the "ERROR" name for... errors, eg SQL errors. I'd suggest to use an INFO level for the PGBENCH_DEBUG function, and to keep LOG for main program messages, so that all use case are separate. Or, maybe the distinction between LOG/INFO is unclear so info is not necessary. I'm unsure about the "log_min_messages" variable name, I'd suggest "log_level". I do not see the asserts on LOG >= log_min_messages as useful, because the level can only be LOG or DEBUG anyway. This point also suggest that maybe "pgbench_error" is misnamed as well (ok, I know I suggested it in place of ereport, but e stands for error there), as it is called on errors, but is also on other things. Maybe "pgbench_log"? Or just simply "log" or "report", as it is really an local function, which does not need a prefix? That would mean that "pgbench_simple_error", which is indeed called on errors, could keep its initial name "pgbench_error", and be called on errors. Alternatively, the debug/logging code could be let as it is (i.e. direct print to stderr) and the function only called when there is some kind of error, in which case it could be named with "error" in its name (or elog/ereport...). * PQExpBuffer I still do not see a positive value from importing PQExpBuffer complexity and cost into pgbench, as the resulting code is not very readable and it adds malloc/free cycles, so I'd try to avoid using PQExpBuf as much as possible. ISTM that all usages could be avoided in the patch, and most should be avoided even if ExpBuffer is imported because it is really useful somewhere. - to call pgbench_error from pgbench_simple_error, you can do a pgbench_log_va(level, format, va_list) version called both from pgbench_error & pgbench_simple_error. - for PGBENCH_DEBUG function, do separate calls per type, the very small partial code duplication is worth avoiding ExpBuf IMO. - for doCustom debug: I'd just let the printf as it is, with a comment, as it is really very internal stuff for debug. Or I'd just snprintf a something in a static buffer. - for syntax_error: it should terminate, so it should call pgbench_error(FATAL, ...). Idem, I'd either keep the printf then call pgbench_error(FATAL, "syntax error found\n") for a final message, or snprintf in a static buffer. - for listAvailableScript: I'd simply call "pgbench_error(LOG" several time, once per line. I see building a string with a format (printfExpBuf..) 
and then calling the pgbench_error function with just a "%s" format on the result as not very elegant, because the second format is somehow hacked around. * bool client I'm unconvince by this added boolean just to switch the level on encountered errors. I'd suggest to let lookupCreateVariable, putVariable* as they are, call pgbench_error with a level which does not stop the execution, and abort if necessary from the callers with a "aborted because of putVariable/eval/... error" message, as it was done before. pgbench_error calls pgbench_error. Hmmm, why not. -- Fabien.
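For reference, the shared va_list variant mentioned in the review could be sketched like this (assuming the thread's ErrorLevel enum, a global log_min_messages threshold, and the convention that FATAL exits; the names follow the discussion above or are illustrative):

static void
pgbench_log_va(ErrorLevel level, const char *fmt, va_list args)
{
    /* print only messages at or above the configured threshold */
    if (level >= log_min_messages)
        vfprintf(stderr, fmt, args);

    if (level >= ELEVEL_FATAL)
        exit(1);
}

/* full reporting function, used for errors and other messages */
static void
pgbench_error(ErrorLevel level, const char *fmt, ...)
{
    va_list     args;

    va_start(args, fmt);
    pgbench_log_va(level, fmt, args);
    va_end(args);
}

/* simple variant that is only called on errors */
static void
pgbench_simple_error(const char *fmt, ...)
{
    va_list     args;

    va_start(args, fmt);
    pgbench_log_va(ELEVEL_LOG, fmt, args);
    va_end(args);
}

Both entry points share the same formatting path, so there is only one place that tests the level and decides whether to exit.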
On 09-08-2018 12:28, Fabien COELHO wrote: > Hello Marina, Hello! >> v10-0002-Pgbench-errors-use-a-separate-function-to-report.patch >> - a patch for a separate error reporting function (this is used to >> report client failures that do not cause an aborts and this depends on >> the level of debugging). > > Patch applies cleanly, compiles, global & local make check ok. :-) > This patch improves/homogenizes logging & error reporting in pgbench, > in preparation for another patch which will manage transaction > restarts in some cases. > > However ISTM that it is not as necessary as the previous one, i.e. we > could do without it to get the desired feature, so I see it more as a > refactoring done "in passing", and I'm wondering whether it is really > worth it because it adds some new complexity, so I'm not sure of the > net benefit. We discussed this starting with [1]: >>>> IMO this patch is more controversial than the other ones. >>>> >>>> It is not really related to the aim of the patch series, which could >>>> do without, couldn't it? >>> >>>> I'd suggest that it should be an independent submission, unrelated >>>> to >>>> the pgbench error management patch. >>> >>> I suppose that this is related; because of my patch there may be a >>> lot >>> of such code (see v7 in [1]): >>> >>> - fprintf(stderr, >>> - "malformed variable \"%s\" value: \"%s\"\n", >>> - var->name, var->svalue); >>> + if (debug_level >= DEBUG_FAILS) >>> + { >>> + fprintf(stderr, >>> + "malformed variable \"%s\" value: \"%s\"\n", >>> + var->name, var->svalue); >>> + } >>> >>> - if (debug) >>> + if (debug_level >= DEBUG_ALL) >>> fprintf(stderr, "client %d sending %s\n", st->id, sql); >> >> I'm not sure that debug messages needs to be kept after debug, if it >> is >> about debugging pgbench itself. That is debatable. > > AFAICS it is not about debugging pgbench itself, but about more > detailed > information that can be used to understand what exactly happened during > its launch. In the case of errors this helps to distinguish between > failures or errors by type (including which limit for retries was > violated and how far it was exceeded for the serialization/deadlock > errors). > >>> That's why it was suggested to make the error function which hides >>> all >>> these things (see [2]): >>> >>> There is a lot of checks like "if (debug_level >= DEBUG_FAILS)" with >>> corresponding fprintf(stderr..) I think it's time to do it like in >>> the >>> main code, wrap with some function like log(level, msg). >> >> Yep. I did not wrote that, but I agree with an "elog" suggestion to >> switch >> >> if (...) { fprintf(...); exit/abort/continue/... } >> >> to a simpler: >> >> elog(level, ...) > Anyway, I still have quite a few comments/suggestions on this version. Thank you very much for them! > * ErrorLevel > > If ErrorLevel is used for things which are not errors, its name should > not include "Error"? Maybe "LogLevel"? On the one hand, this sounds better for me too. On the other hand, will not this be in some kind of conflict with error level codes in elog.h?.. /* Error level codes */ #define DEBUG5 10 /* Debugging messages, in categories of * decreasing detail. */ #define DEBUG4 11 ... > I'm at odds with the proposed levels. ISTM that pgbench internal > errors which warrant an immediate exit should be dubbed "FATAL", Ok! > which > would leave the "ERROR" name for... errors, eg SQL errors. I'd suggest > to use an INFO level for the PGBENCH_DEBUG function, and to keep LOG > for main program messages, so that all use case are separate. 
Or, > maybe the distinction between LOG/INFO is unclear so info is not > necessary. The messages of the errors in SQL and meta commands are printed only if the option --debug-fails is used so I'm not sure that they should have a higher error level than main program messages (ERROR vs LOG). About an INFO level for the PGBENCH_DEBUG function - ISTM that some main program messages such as "dropping old tables...\n" or ..." tuples (%d%%) done (elapsed %.2f s, remaining %.2f s)\n" can also use it.. About that all use cases were separate - in the current version the level LOG also includes messages about abortions of the clients. > I'm unsure about the "log_min_messages" variable name, I'd suggest > "log_level". > > I do not see the asserts on LOG >= log_min_messages as useful, because > the level can only be LOG or DEBUG anyway. Ok! > This point also suggest that maybe "pgbench_error" is misnamed as well > (ok, I know I suggested it in place of ereport, but e stands for error > there), as it is called on errors, but is also on other things. Maybe > "pgbench_log"? Or just simply "log" or "report", as it is really an > local function, which does not need a prefix? That would mean that > "pgbench_simple_error", which is indeed called on errors, could keep > its initial name "pgbench_error", and be called on errors. About the name "log" - we already have the function doLog, so perhaps the name "report" will be better.. But like with ErrorLevel will not this be in some kind of conflict with ereport which is also used for the levels DEBUG... / LOG / INFO? > Alternatively, the debug/logging code could be let as it is (i.e. > direct print to stderr) and the function only called when there is > some kind of error, in which case it could be named with "error" in > its name (or elog/ereport...). As I wrote in [2]: > because of my patch there may be a lot > of such code (see v7 in [1]): > > - fprintf(stderr, > - "malformed variable \"%s\" value: \"%s\"\n", > - var->name, var->svalue); > + if (debug_level >= DEBUG_FAILS) > + { > + fprintf(stderr, > + "malformed variable \"%s\" value: \"%s\"\n", > + var->name, var->svalue); > + } > > - if (debug) > + if (debug_level >= DEBUG_ALL) > fprintf(stderr, "client %d sending %s\n", st->id, sql); > > That's why it was suggested to make the error function which hides all > these things (see [2]): > > There is a lot of checks like "if (debug_level >= DEBUG_FAILS)" with > corresponding fprintf(stderr..) I think it's time to do it like in the > main code, wrap with some function like log(level, msg). And IIUC macros will not help in the absence of __VA_ARGS__. > * PQExpBuffer > > I still do not see a positive value from importing PQExpBuffer > complexity and cost into pgbench, as the resulting code is not very > readable and it adds malloc/free cycles, so I'd try to avoid using > PQExpBuf as much as possible. ISTM that all usages could be avoided in > the patch, and most should be avoided even if ExpBuffer is imported > because it is really useful somewhere. > > - to call pgbench_error from pgbench_simple_error, you can do a > pgbench_log_va(level, format, va_list) version called both from > pgbench_error & pgbench_simple_error. > > - for PGBENCH_DEBUG function, do separate calls per type, the very > small partial code duplication is worth avoiding ExpBuf IMO. > > - for doCustom debug: I'd just let the printf as it is, with a > comment, as it is really very internal stuff for debug. Or I'd just > snprintf a something in a static buffer. 
> > - for syntax_error: it should terminate, so it should call > pgbench_error(FATAL, ...). Idem, I'd either keep the printf then call > pgbench_error(FATAL, "syntax error found\n") for a final message, > or snprintf in a static buffer. > > - for listAvailableScript: I'd simply call "pgbench_error(LOG" several > time, once per line. > > I see building a string with a format (printfExpBuf..) and then > calling the pgbench_error function with just a "%s" format on the > result as not very elegant, because the second format is somehow > hacked around. Ok! About using a static buffer in doCustom debug or in syntax_error - I'm not sure that this is always possible because ISTM that the variable name can be quite large. > * bool client > > I'm unconvince by this added boolean just to switch the level on > encountered errors. > > I'd suggest to let lookupCreateVariable, putVariable* as they are, > call pgbench_error with a level which does not stop the execution, and > abort if necessary from the callers with a "aborted because of > putVariable/eval/... error" message, as it was done before. There's one more problem: if this is a client failure, an error message inside any of these functions should be printed at the level DEBUG_FAILS; otherwise it should be printed at the level LOG. Or do you suggest using the error level as an argument for these functions? > pgbench_error calls pgbench_error. Hmmm, why not. [1] https://www.postgresql.org/message-id/alpine.DEB.2.21.1806100837380.3655%40lancre [2] https://www.postgresql.org/message-id/b692de21caaed13c59f31c06d0098488%40postgrespro.ru -- Marina Polyakova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Hello Marina, >> I'd suggest to let lookupCreateVariable, putVariable* as they are, >> call pgbench_error with a level which does not stop the execution, and >> abort if necessary from the callers with a "aborted because of >> putVariable/eval/... error" message, as it was done before. > There's one more problem: if this is a client failure, an error message > inside any of these functions should be printed at the level DEBUG_FAILS; > otherwise it should be printed at the level LOG. Or do you suggest using the > error level as an argument for these functions? No. I suggest that the called function does only one simple thing, probably "DEBUG", and that the *caller* prints a message if it is unhappy about the failure of the called function, as it is currently done. This also allows the caller to provide context, eg "setting variable %s failed while <some specific context>". The user can rerun under debug for more precision if they need it. I'm still not overenthusiastic about these changes, and still think that it should be an independent patch, not submitted together with the "retry on error" feature. -- Fabien.
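In other words, something along these lines (purely illustrative; the callee reports at a debug level only, while the caller decides what the failure means and adds its own context):

if (!putVariable(st, "startup", var, value))
{
    /* the caller knows the context and chooses the consequence */
    pgbench_error(ELEVEL_LOG,
                  "client %d aborted: could not set variable \"%s\"\n",
                  st->id, var);
    st->state = CSTATE_ABORTED;
}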
On 10-08-2018 11:33, Fabien COELHO wrote: > Hello Marina, > >>> I'd suggest to let lookupCreateVariable, putVariable* as they are, >>> call pgbench_error with a level which does not stop the execution, >>> and >>> abort if necessary from the callers with a "aborted because of >>> putVariable/eval/... error" message, as it was done before. >> >> There's one more problem: if this is a client failure, an error >> message inside any of these functions should be printed at the level >> DEBUG_FAILS; otherwise it should be printed at the level LOG. Or do >> you suggest using the error level as an argument for these functions? > > No. I suggest that the called function does only one simple thing, > probably "DEBUG", and that the *caller* prints a message if it is > unhappy about the failure of the called function, as it is currently > done. This also allows the caller to provide context, eg > "setting variable %s failed while <some specific context>". The user > can rerun under debug for more precision if they need it. Ok! > I'm still not overenthusiastic about these changes, and still think > that it should be an independent patch, not submitted together with > the "retry on error" feature. In the next version I will put the error patch last, so it will be possible to compare the "retry on error" feature with it and without it, and let the committer decide which is better) -- Marina Polyakova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
On Thu, Aug 09, 2018 at 06:17:22PM +0300, Marina Polyakova wrote: > > * ErrorLevel > > > > If ErrorLevel is used for things which are not errors, its name should > > not include "Error"? Maybe "LogLevel"? > > On the one hand, this sounds better for me too. On the other hand, will not > this be in some kind of conflict with error level codes in elog.h?.. I think it shouldn't because those error levels are backends levels. pgbench is a client side utility with its own code, it shares some code with libpq and other utilities, but elog.h isn't one of them. > > This point also suggest that maybe "pgbench_error" is misnamed as well > > (ok, I know I suggested it in place of ereport, but e stands for error > > there), as it is called on errors, but is also on other things. Maybe > > "pgbench_log"? Or just simply "log" or "report", as it is really an > > local function, which does not need a prefix? That would mean that > > "pgbench_simple_error", which is indeed called on errors, could keep > > its initial name "pgbench_error", and be called on errors. > > About the name "log" - we already have the function doLog, so perhaps the > name "report" will be better.. But like with ErrorLevel will not this be in > some kind of conflict with ereport which is also used for the levels > DEBUG... / LOG / INFO? +1 from me to keep initial name "pgbench_error". "pgbench_log" for new function looks nice to me. I think it is better than just "log", because "log" may conflict with natural logarithmic function (see "man 3 log"). > > pgbench_error calls pgbench_error. Hmmm, why not. I agree with Fabien. Calling pgbench_error() inside pgbench_error() could be dangerous. I think "fmt" checking could be removed, or we may use Assert() or fprintf()+exit(1) at least. -- Arthur Zakirov Postgres Professional: http://www.postgrespro.com Russian Postgres Company
On 10-08-2018 15:53, Arthur Zakirov wrote: > On Thu, Aug 09, 2018 at 06:17:22PM +0300, Marina Polyakova wrote: >> > * ErrorLevel >> > >> > If ErrorLevel is used for things which are not errors, its name should >> > not include "Error"? Maybe "LogLevel"? >> >> On the one hand, this sounds better for me too. On the other hand, >> will not >> this be in some kind of conflict with error level codes in elog.h?.. > > I think it shouldn't because those error levels are backends levels. > pgbench is a client side utility with its own code, it shares some code > with libpq and other utilities, but elog.h isn't one of them. I agree with you on this :) I just meant that maybe it would be better to call this group in the same way because they are used in general for the same purpose?.. >> > This point also suggest that maybe "pgbench_error" is misnamed as well >> > (ok, I know I suggested it in place of ereport, but e stands for error >> > there), as it is called on errors, but is also on other things. Maybe >> > "pgbench_log"? Or just simply "log" or "report", as it is really an >> > local function, which does not need a prefix? That would mean that >> > "pgbench_simple_error", which is indeed called on errors, could keep >> > its initial name "pgbench_error", and be called on errors. >> >> About the name "log" - we already have the function doLog, so perhaps >> the >> name "report" will be better.. But like with ErrorLevel will not this >> be in >> some kind of conflict with ereport which is also used for the levels >> DEBUG... / LOG / INFO? > > +1 from me to keep initial name "pgbench_error". "pgbench_log" for new > function looks nice to me. I think it is better than just "log", > because "log" may conflict with natural logarithmic function (see "man > 3 > log"). Do you think that pgbench_log (or another whose name speaks only about logging) will look good, for example, with FATAL? Because this means that the logging function also processes errors and calls exit(1) if necessary.. >> > pgbench_error calls pgbench_error. Hmmm, why not. > > I agree with Fabien. Calling pgbench_error() inside pgbench_error() > could be dangerous. I think "fmt" checking could be removed, or we may > use Assert() I would like not to use Assert in this case because IIUC they are mostly used for testing. > or fprintf()+exit(1) at least. Ok! -- Marina Polyakova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
On Fri, Aug 10, 2018 at 04:46:04PM +0300, Marina Polyakova wrote: > > +1 from me to keep initial name "pgbench_error". "pgbench_log" for new > > function looks nice to me. I think it is better than just "log", > > because "log" may conflict with natural logarithmic function (see "man 3 > > log"). > > Do you think that pgbench_log (or another whose name speaks only about > logging) will look good, for example, with FATAL? Because this means that > the logging function also processes errors and calls exit(1) if necessary.. Yes, why not. "_log" just means that you want to log some message with the specified log level. Moreover those messages sometimes aren't errors: pgbench_error(LOG, "starting vacuum..."); > > I agree with Fabien. Calling pgbench_error() inside pgbench_error() > > could be dangerous. I think "fmt" checking could be removed, or we may > > use Assert() > > I would like not to use Assert in this case because IIUC they are mostly > used for testing. I'd vote to remove this check altogether. I don't see any place where it is possible to call pgbench_error() passing an empty "fmt". -- Arthur Zakirov Postgres Professional: http://www.postgrespro.com Russian Postgres Company
On 10-08-2018 17:19, Arthur Zakirov wrote: > On Fri, Aug 10, 2018 at 04:46:04PM +0300, Marina Polyakova wrote: >> > +1 from me to keep initial name "pgbench_error". "pgbench_log" for new >> > function looks nice to me. I think it is better than just "log", >> > because "log" may conflict with natural logarithmic function (see "man 3 >> > log"). >> >> Do you think that pgbench_log (or another whose name speaks only about >> logging) will look good, for example, with FATAL? Because this means >> that >> the logging function also processes errors and calls exit(1) if >> necessary.. > > Yes, why not. "_log" just means that you want to log some message with > the specified log level. Moreover those messages sometimes aren't > error: > > pgbench_error(LOG, "starting vacuum..."); "pgbench_log" is already used as the default filename prefix for transaction logging. >> > I agree with Fabien. Calling pgbench_error() inside pgbench_error() >> > could be dangerous. I think "fmt" checking could be removed, or we may >> > use Assert() >> >> I would like not to use Assert in this case because IIUC they are >> mostly >> used for testing. > > I'd vote to remove this check at all. I don't see any place where it is > possible to call pgbench_error() passing empty "fmt". pgbench_error(..., "%s", PQerrorMessage(con)); ? -- Marina Polyakova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Hello Marina, > v10-0003-Pgbench-errors-use-the-Variables-structure-for-c.patch > - a patch for the Variables structure (this is used to reset client variables > during the repeating of transactions after serialization/deadlock failures). This patch adds an explicit structure to manage Variables, which is useful to reset these on pgbench script retries, which is the purpose of the whole patch series. About part 3: Patch applies cleanly, * typo in comments: "varaibles" * About enlargeVariables: multiple INT_MAX error handling looks strange, especially as this code can never be triggered because pgbench would be dead long before having allocated INT_MAX variables. So I would not bother to add such checks. ISTM that if something is amiss it will fail in pg_realloc anyway. Also I do not like the ExpBuf stuff, as usual. I'm not sure that the size_t casts here and there are useful for any practical values likely to be encountered by pgbench. The exponential allocation seems overkill. I'd simply add a constant number of slots, with a simple rule: /* reallocated with a margin */ if (max_vars < needed) max_vars = needed + 8; So in the end the function should be much simpler. -- Fabien.
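Under that rule the whole function could boil down to something like the following sketch (the Variables/Variable field names and the use of pg_realloc are assumptions about the patch, shown for illustration only):

    #define VARIABLES_MARGIN 8        /* grow by a constant number of slots */

    static void
    enlargeVariables(Variables *vars, int needed)
    {
        /* reallocate with a margin once the current allocation is too small */
        if (vars->max_vars < needed)
        {
            vars->max_vars = needed + VARIABLES_MARGIN;
            vars->vars = (Variable *)
                pg_realloc(vars->vars, vars->max_vars * sizeof(Variable));
        }
    }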
> About part 3: > > Patch applies cleanly, I forgot: compiles, global & local "make check" are ok. -- Fabien.
On 12-08-2018 12:14, Fabien COELHO wrote: > HEllo Marina, Hello, Fabien! >> v10-0003-Pgbench-errors-use-the-Variables-structure-for-c.patch >> - a patch for the Variables structure (this is used to reset client >> variables during the repeating of transactions after >> serialization/deadlock failures). > > This patch adds an explicit structure to manage Variables, which is > useful to reset these on pgbench script retries, which is the purpose > of the whole patch series. > > About part 3: > > Patch applies cleanly, On 12-08-2018 12:17, Fabien COELHO wrote: >> About part 3: >> >> Patch applies cleanly, > > I forgot: compiles, global & local "make check" are ok. I'm glad to hear it :-) > * typo in comments: "varaibles" I'm sorry, I'll fix it. > * About enlargeVariables: > > multiple INT_MAX error handling looks strange, especially as this code > can never be triggered because pgbench would be dead long before > having allocated INT_MAX variables. So I would not bother to add such > checks. > ... > I'm not sure that the size_t cast here and there are useful for any > practical values likely to be encountered by pgbench. Looking at the code of the functions, for example, ParseScript and psql_scan_setup, where the integer variable is used for the size of the entire script - ISTM that you are right.. Therefore size_t casts will also be removed. > ISTM that if something is amiss it will fail in pg_realloc anyway. IIUC and physical RAM is not enough, this may depend on the size of the swap. > Also I do not like the ExpBuf stuff, as usual. > The exponential allocation seems overkill. I'd simply add a constant > number of slots, with a simple rule: > > /* reallocated with a margin */ > if (max_vars < needed) max_vars = needed + 8; > > So in the end the function should be much simpler. Ok! -- Marina Polyakova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Hello Marina, > v10-0004-Pgbench-errors-and-serialization-deadlock-retrie.patch > - the main patch for handling client errors and repetition of transactions > with serialization/deadlock failures (see the detailed description in the > file). Patch applies cleanly. It allows retrying a script (considered as a transaction) on serialization and deadlock errors, which is a very interesting extension but also impacts pgbench significantly. I'm waiting for the feature to be right before checking the documentation and tests in full. There are still some issues to resolve before checking that. Anyway, tests look reasonable. Taking advantage of transaction control from PL/pgSQL is a good use of this new feature. A few comments about the doc. According to the documentation, the feature is triggered by --max-tries and --latency-limit. I disagree with the latter, because it means that having a latency limit without retrying is not supported anymore. Maybe you can allow an "unlimited" max-tries, say with special value zero, and the latency limit does its job if set, over all tries. Doc: "error in meta commands" -> "meta command errors", for homogeneity with other cases? Detailed -r report. I understand from the doc that the retry number on the detailed per-statement report is to identify at what point errors occur? Probably this is more or less always at the same point on a given script, so that the most interesting feature is to report the number of retries at the script level. Doc: "never occur.." -> "never occur", or possibly "...". Doc: "Directly client errors" -> "Direct client errors". I'm still in favor of asserting that the SQL connection is idle (no tx in progress) at the beginning and/or end of a script, and reporting a user error if not, instead of writing complex caveats. If someone has a use case for that, then maybe it can be changed, but I cannot see any in a benchmarking context, and I can see how easy it is to have a buggy script with this allowed. I do not think that the RETRIES_ENABLED macro is a good thing. I'd suggest to write the condition four times. ISTM that "skipped" transactions are NOT "successful", so there is a problem with the comments. I believe that your formulas are probably right; it has more to do with what counts as "success". For the cnt decomposition, ISTM that "other transactions" are really "directly successful transactions". I'd suggest to put "ANOTHER_SQL_FAILURE" as the last option, otherwise "another" does not make sense yet. I'd suggest to name it "OTHER_SQL_FAILURE". In TState, field "uint32 retries": maybe it would be simpler to count "tries", which can be compared directly to the max tries set in the option? ErrorLevel: I have already commented on this in my review of 10.2. I'm not sure of the LOG -> DEBUG_FAIL changes. I do not understand the name "DEBUG_FAIL", as it is not related to debug; they just seem to be internal errors. META_ERROR maybe? inTransactionBlock: I disagree with any function other than doCustom changing the client state, because it makes understanding the state machine harder. There is already one exception to that (threadRun) that I wish to remove. All state changes must be performed explicitly in doCustom. The automaton skips to FAILURE on every possible error. I'm wondering whether it could do so only on SQL errors, because other failures will lead to ABORTED anyway? If there is no good reason to skip to FAILURE from some errors, I'd suggest to keep the previous behavior. 
Maybe the good reason is to do some counting, but this means that on e.g. metacommand errors the script would now loop over instead of aborting, which does not look like a desirable change of behavior. PQexec("ROLLBACK"): you are inserting a synchronous command, for which the thread will have to wait for the result, in the middle of a framework which takes great care to use only asynchronous stuff so that one thread can manage several clients efficiently. You cannot call PQexec there. From where I sit, I'd suggest to sendQuery("ROLLBACK"), then switch to a new state CSTATE_WAIT_ABORT_RESULT which would be similar to CSTATE_WAIT_RESULT, but on success would skip to RETRY or ABORT instead of proceeding to the next command. ISTM that it would be more logical to only get into RETRY if there is a retry, i.e. move the RETRY/ABORT test into FAILURE. For that, instead of "canRetry", maybe you want "doRetry", which tells that a retry is possible (the error is a serialization or deadlock failure) and that the current parameters allow it (timeout, max retries). * Minor C style comments: if / else if / else if ... on *_FAILURE: I'd suggest a switch. The following line removal does not seem useful, I'd have kept it: stats->cnt++; - if (skipped) copyVariables: I'm not convinced that the source_vars & nvars variables are that useful. memcpy(&(st->retry_state.random_state), &(st->random_state), sizeof(RandomState)); Is there a problem with "st->retry_state.random_state = st->random_state;" instead of memcpy? ISTM that simple assignments work in C. Idem in the reverse copy under RETRY. if (!copyVariables(&st->retry_state.variables, &st->variables)) { pgbench_error(LOG, "client %d aborted when preparing to execute a transaction\n", st->id); The message could be more precise, eg "client %d failed while copying variables", unless copyVariables already printed a message. As this is really an internal error from pgbench, I'd rather do a FATAL (direct exit) there. ISTM that the only possible failure is OOM here, and pgbench is in a very bad shape if it gets into that. commandFailed: I'm not thrilled by the added boolean, which is partially redundant with the second argument. if (per_script_stats) - accumStats(&sql_script[st->use_file].stats, skipped, latency, lag); + { + accumStats(&sql_script[st->use_file].stats, skipped, latency, lag, + st->failure_status, st->retries); + } } I do not see the point of changing the style here. -- Fabien.
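A rough sketch of that asynchronous ROLLBACK handling inside doCustom's state switch (the CState/CSTATE_* and doRetry names follow the discussion above; the surrounding state machine and error handling are simplified assumptions, not the patch's actual code):

    case CSTATE_FAILURE:
        /* start rolling back the failed transaction without blocking */
        if (!PQsendQuery(st->con, "ROLLBACK;"))
        {
            commandFailed(st, "SQL", "ROLLBACK failed");
            st->state = CSTATE_ABORTED;
            break;
        }
        st->state = CSTATE_WAIT_ABORT_RESULT;
        break;

    case CSTATE_WAIT_ABORT_RESULT:
        /* like CSTATE_WAIT_RESULT, but decide between retrying and aborting */
        if (PQisBusy(st->con))
            return;                /* not ready yet, come back later */
        res = PQgetResult(st->con);
        if (PQresultStatus(res) != PGRES_COMMAND_OK)
            st->state = CSTATE_ABORTED;
        else
            st->state = doRetry(st, &now) ? CSTATE_RETRY : CSTATE_ABORTED;
        PQclear(res);
        /* drain any remaining results of the ROLLBACK */
        while ((res = PQgetResult(st->con)) != NULL)
            PQclear(res);
        break;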
On 15-08-2018 11:50, Fabien COELHO wrote: > Hello Marina, Hello! >> v10-0004-Pgbench-errors-and-serialization-deadlock-retrie.patch >> - the main patch for handling client errors and repetition of >> transactions with serialization/deadlock failures (see the detailed >> description in the file). > > Patch applies cleanly. > > It allows retrying a script (considered as a transaction) on > serializable and deadlock errors, which is a very interesting > extension but also impacts pgbench significantly. > > I'm waiting for the feature to be right before checking in full the > documentation and tests. There are still some issues to resolve before > checking that. > > Anyway, tests look reasonable. Taking advantage of of transactions > control from PL/pgsql is a good use of this new feature. :-) > A few comments about the doc. > > According to the documentation, the feature is triggered by --max-tries > and > --latency-limit. I disagree with the later, because it means that > having > latency limit without retrying is not supported anymore. > > Maybe you can allow an "unlimited" max-tries, say with special value > zero, > and the latency limit does its job if set, over all tries. > > Doc: "error in meta commands" -> "meta command errors", for homogeneity > with > other cases? > ... > Doc: "never occur.." -> "never occur", or eventually "...". > > Doc: "Directly client errors" -> "Direct client errors". > ... > inTransactionBlock: I disagree with any function other than doCustom > changing > the client state, because it makes understanding the state machine > harder. There > is already one exception to that (threadRun) that I wish to remove. All > state > changes must be performed explicitely in doCustom. > ... > PQexec("ROOLBACK"): you are inserting a synchronous command, for which > the > thread will have to wait for the result, in a middle of a framework > which > takes great care to use only asynchronous stuff so that one thread can > manage several clients efficiently. You cannot call PQexec there. > From where I sit, I'd suggest to sendQuery("ROLLBACK"), then switch to > a new state CSTATE_WAIT_ABORT_RESULT which would be similar to > CSTATE_WAIT_RESULT, but on success would skip to RETRY or ABORT instead > of proceeding to the next command. > ... > memcpy(&(st->retry_state.random_state), &(st->random_state), > sizeof(RandomState)); > > Is there a problem with "st->retry_state.random_state = > st->random_state;" > instead of memcpy? ISTM that simple assignments work in C. Idem in the > reverse > copy under RETRY. Thank you, I'll fix this. > Detailed -r report. I understand from the doc that the retry number on > the > detailed per-statement report is to identify at what point errors > occur? > Probably this is more or less always at the same point on a given > script, > so that the most interesting feature is to report the number of retries > at the > script level. This may depend on various factors.. 
for example:

transaction type: pgbench_test_serialization.sql
scaling factor: 1
query mode: simple
number of clients: 2
number of threads: 1
duration: 10 s
number of transactions actually processed: 266
number of errors: 10 (3.623%)
number of serialization errors: 10 (3.623%)
number of retried: 75 (27.174%)
number of retries: 75
maximum number of tries: 2
latency average = 72.734 ms (including errors)
tps = 26.501162 (including connections establishing)
tps = 26.515082 (excluding connections establishing)
statement latencies in milliseconds, errors and retries:
     0.012   0   0  \set delta random(-5000, 5000)
     0.001   0   0  \set x1 random(1, 100000)
     0.001   0   0  \set x3 random(1, 2)
     0.001   0   0  \set x2 random(1, 1)
    19.837   0   0  UPDATE xy1 SET y = y + :delta WHERE x = :x1;
    21.239   5  36  UPDATE xy3 SET y = y + :delta WHERE x = :x3;
    21.360   5  39  UPDATE xy2 SET y = y + :delta WHERE x = :x2;

And you can always get the number of retries at the script level from the main report (if only one script is used) or from the report for each script (if multiple scripts are used). > I'm still in favor of asserting that the sql connection is idle (no tx > in > progress) at the beginning and/or end of a script, and report a user > error > if not, instead of writing complex caveats. > > If someone has a use-case for that, then maybe it can be changed, but I > cannot see any in a benchmarking context, and I can see how easy it is > to have a buggy script with this allowed. > > I do not think that the RETRIES_ENABLED macro is a good thing. I'd > suggest > to write the condition four times. Ok! > ISTM that "skipped" transactions are NOT "successful" so there are a > problem > with comments. I believe that your formula are probably right, it has > more to do > with what is "success". For cnt decomposition, ISTM that "other > transactions" > are really "directly successful transactions". I agree with you, but I also think that skipped transactions should not be considered errors. So we can write something like this: All the transactions are divided into several types depending on their execution. Firstly, they can be divided into transactions that we started to execute, and transactions which were skipped (it was too late to execute them). Secondly, started transactions fall into 2 main types depending on whether any command got a failure during the last execution of the transaction script. Thus the number of all transactions = skipped (it was too late to execute them) + cnt (the number of successful transactions) + ecnt (the number of failed transactions). A successful transaction can have several unsuccessful tries before a successful run. Thus cnt (the number of successful transactions) = retried (they got a serialization or a deadlock failure(s), but were successfully retried from the very beginning) + directly successful transactions (they were successfully completed on the first try). > I'd suggest to put "ANOTHER_SQL_FAILURE" as the last option, otherwise > "another" does not make sense yet. Maybe firstly put a general group, and then special cases?... > I'd suggest to name it "OTHER_SQL_FAILURE". Ok! > In TState, field "uint32 retries": maybe it would be simpler to count > "tries", > which can be compared directly to max tries set in the option? If you mean retries in CState - on the one hand, yes, on the other hand, statistics always use the number of retries... > ErrorLevel: I have already commented about in review about 10.2. I'm > not sure of > the LOG -> DEBUG_FAIL changes.
I do not understand the name > "DEBUG_FAIL", has it > is not related to debug, they just seem to be internal errors. > META_ERROR maybe? As I wrote to you in [1]: >> I'm at odds with the proposed levels. ISTM that pgbench internal >> errors which warrant an immediate exit should be dubbed "FATAL", > > Ok! > >> which >> would leave the "ERROR" name for... errors, eg SQL errors. >> ... > > The messages of the errors in SQL and meta commands are printed only if > the option --debug-fails is used so I'm not sure that they should have > a > higher error level than main program messages (ERROR vs LOG). Perhaps we can rename the levels DEBUG_FAIL and LOG to LOG and LOG_PGBENCH respectively. In this case the client error messages do not use debug error levels and the term "logging" is already used for transaction/aggregation logging... Therefore perhaps we can also combine the options --errors-detailed and --debug-fails into the option --fails-detailed=none|groups|all_messages. Here --fails-detailed=groups can be used to group errors in reports or logs by basic types. --fails-detailed=all_messages can add to this all error messages in the SQL/meta commands, and messages for processing the failed transaction (its end/retry). > The automaton skips to FAILURE on every possible error. I'm wondering > whether > it could do so only on SQL errors, because other fails will lead to > ABORTED > anyway? If there is no good reason to skip to FAILURE from some errors, > I'd > suggest to keep the previous behavior. Maybe the good reason is to do > some > counting, but this means that on eg metacommand errors now the script > would > loop over instead of aborting, which does not look like a desirable > change > of behavior. Even in the case of meta command errors we must prepare for CSTATE_END_TX and the execution of the next script: if necessary, clear the conditional stack and rollback the current transaction block. > ISTM that it would be more logical to only get into RETRY if there is a > retry, > i.e. move the test RETRY/ABORT in FAILURE. For that, instead of > "canRetry", > maybe you want "doRetry", which tells that a retry is possible (the > error > is serializable or deadlock) and that the current parameters allow it > (timeout, max retries). > > * Minor C style comments: > > if / else if / else if ... on *_FAILURE: I'd suggest a switch. > > The following line removal does not seem useful, I'd have kept it: > > stats->cnt++; > - > if (skipped) > > copyVariables: I'm not convinced that source_vars & nvars variables are > that > useful. > if (!copyVariables(&st->retry_state.variables, &st->variables)) { > pgbench_error(LOG, "client %d aborted when preparing to execute a > transaction\n", st->id); > > The message could be more precise, eg "client %d failed while copying > variables", unless copyVariables already printed a message. As this is > really > an internal error from pgbench, I'd rather do a FATAL (direct exit) > there. > ISTM that the only possible failure is OOM here, and pgbench is in a > very bad > shape if it gets into that. Ok! > commandFailed: I'm not thrilled by the added boolean, which is > partially > redundant with the second argument. Do you mean that it is partially redundant with the argument "cmd" and, for example, the meta commands errors always do not cause the abortions of the client? 
> if (per_script_stats) > - accumStats(&sql_script[st->use_file].stats, skipped, > latency, lag); > + { > + accumStats(&sql_script[st->use_file].stats, skipped, > latency, lag, > + st->failure_status, st->retries); > + } > } > > I do not see the point of changing the style here. If in such cases one command is placed on several lines, ISTM that the code is more understandable if curly brackets are used... [1] https://www.postgresql.org/message-id/fcc2512cdc9e6bc49d3b489181f454da%40postgrespro.ru -- Marina Polyakova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
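To make the decomposition above concrete with the numbers from the sample report earlier in this message (illustrative arithmetic only; skipped transactions are zero in that run):

    cnt (successful transactions)             = 266  ("transactions actually processed")
    ecnt (failed transactions)                = 10   ("number of errors")
    all started transactions = cnt + ecnt     = 266 + 10 = 276
    retried (successful only after retrying)  = 75
    directly successful = cnt - retried       = 266 - 75 = 191

which matches the reported percentages: 10/276 = 3.623% and 75/276 = 27.174%.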
Hello Marina, >> Detailed -r report. I understand from the doc that the retry number on >> the detailed per-statement report is to identify at what point errors >> occur? Probably this is more or less always at the same point on a >> given script, so that the most interesting feature is to report the >> number of retries at the script level. > > This may depend on various factors.. for example: > [...] > 21.239 5 36 UPDATE xy3 SET y = y + :delta WHERE x > = :x3; > 21.360 5 39 UPDATE xy2 SET y = y + :delta WHERE x > = :x2; Ok, not always the same point, and you confirm that it identifies where the error is raised which leads to a retry. > And you can always get the number of retries at the script level from the > main report (if only one script is used) or from the report for each script > (if multiple scripts are used). Ok. >> ISTM that "skipped" transactions are NOT "successful" so there are a >> problem with comments. I believe that your formula are probably right, >> it has more to do with what is "success". For cnt decomposition, ISTM >> that "other transactions" are really "directly successful >> transactions". > > I agree with you, but I also think that skipped transactions should not be > considered errors. I'm ok with having a special category for them in the explanations, which is neither success nor error. > So we can write something like this: > All the transactions are divided into several types depending on their > execution. Firstly, they can be divided into transactions that we started to > execute, and transactions which were skipped (it was too late to execute > them). Secondly, running transactions fall into 2 main types: is there any > command that got a failure during the last execution of the transaction > script or not? Thus Here is an attempt at having a more precise and shorter version, not sure it is much better than yours, though: """ Transactions are counted depending on their execution and outcome. First a transaction may have started or not: skipped transactions occur under --rate and --latency-limit when the client is too late to execute them. Secondly, a started transaction may ultimately succeed or fail on some error, possibly after some retries when --max-tries is not one. Thus """ > the number of all transactions = > skipped (it was too late to execute them) > cnt (the number of successful transactions) + > ecnt (the number of failed transactions). > > A successful transaction can have several unsuccessful tries before a > successfull run. Thus > > cnt (the number of successful transactions) = > retried (they got a serialization or a deadlock failure(s), but were > successfully retried from the very beginning) + > directly successfull transactions (they were successfully completed on > the first try). These above description is clearer for me. >> I'd suggest to put "ANOTHER_SQL_FAILURE" as the last option, otherwise >> "another" does not make sense yet. > > Maybe firstly put a general group, and then special cases?... I understand it more as a catch all default "none of the above" case. >> In TState, field "uint32 retries": maybe it would be simpler to count >> "tries", which can be compared directly to max tries set in the option? > > If you mean retries in CState - on the one hand, yes, on the other hand, > statistics always use the number of retries... Ok. >> The automaton skips to FAILURE on every possible error. I'm wondering >> whether it could do so only on SQL errors, because other fails will >> lead to ABORTED anyway? 
If there is no good reason to skip to FAILURE >> from some errors, I'd suggest to keep the previous behavior. Maybe the >> good reason is to do some counting, but this means that on eg >> metacommand errors now the script would loop over instead of aborting, >> which does not look like a desirable change of behavior. > > Even in the case of meta command errors we must prepare for CSTATE_END_TX and > the execution of the next script: if necessary, clear the conditional stack > and rollback the current transaction block. Seems ok. >> commandFailed: I'm not thrilled by the added boolean, which is partially >> redundant with the second argument. > > Do you mean that it is partially redundant with the argument "cmd" and, for > example, the meta commands errors always do not cause the abortions of the > client? Yes. And also I'm not sure we should want this boolean at all. > [...] > If in such cases one command is placed on several lines, ISTM that the code > is more understandable if curly brackets are used... Hmmm. Such basic style changes are avoided because they break backpatching, so we try to avoid gratuitous changes unless there is a strong added value, which does not seem to be the case here. -- Fabien.
On 17-08-2018 10:49, Fabien COELHO wrote: > Hello Marina, > >>> Detailed -r report. I understand from the doc that the retry number >>> on the detailed per-statement report is to identify at what point >>> errors occur? Probably this is more or less always at the same point >>> on a given script, so that the most interesting feature is to report >>> the number of retries at the script level. >> >> This may depend on various factors.. for example: >> [...] >> 21.239 5 36 UPDATE xy3 SET y = y + :delta >> WHERE x = :x3; >> 21.360 5 39 UPDATE xy2 SET y = y + :delta >> WHERE x = :x2; > > Ok, not always the same point, and you confirm that it identifies > where the error is raised which leads to a retry. Yes, I confirm this. I'll try to write more clearly about this in the documentation... >> So we can write something like this: > >> All the transactions are divided into several types depending on their >> execution. Firstly, they can be divided into transactions that we >> started to execute, and transactions which were skipped (it was too >> late to execute them). Secondly, running transactions fall into 2 main >> types: is there any command that got a failure during the last >> execution of the transaction script or not? Thus > > Here is an attempt at having a more precise and shorter version, not > sure it is much better than yours, though: > > """ > Transactions are counted depending on their execution and outcome. > First > a transaction may have started or not: skipped transactions occur > under --rate and --latency-limit when the client is too late to > execute them. Secondly, a started transaction may ultimately succeed > or fail on some error, possibly after some retries when --max-tries is > not one. Thus > """ Thank you! >>> I'd suggest to put "ANOTHER_SQL_FAILURE" as the last option, >>> otherwise "another" does not make sense yet. >> >> Maybe firstly put a general group, and then special cases?... > > I understand it more as a catch all default "none of the above" case. Ok! >>> commandFailed: I'm not thrilled by the added boolean, which is >>> partially >>> redundant with the second argument. >> >> Do you mean that it is partially redundant with the argument "cmd" >> and, for example, the meta commands errors always do not cause the >> abortions of the client? > > Yes. And also I'm not sure we should want this boolean at all. Perhaps we can use a separate function to print the messages about client's abortion, something like this (it is assumed that all abortions happen when processing SQL commands): static void clientAborted(CState *st, const char *message) { pgbench_error(..., "client %d aborted in command %d (SQL) of script %d; %s\n", st->id, st->command, st->use_file, message); } Or perhaps we can use a more detailed failure status so for each type of failure we always know the command name (argument "cmd") and whether the client is aborted. Something like this (but in comparison with the first variant ISTM overly complicated): /* * For the failures during script execution. */ typedef enum FailureStatus { NO_FAILURE = 0, /* * Failures in meta commands. In these cases the failed transaction is * terminated. */ META_SET_FAILURE, META_SETSHELL_FAILURE, META_SHELL_FAILURE, META_SLEEP_FAILURE, META_IF_FAILURE, META_ELIF_FAILURE, /* * Failures in SQL commands. In cases of serialization/deadlock failures a * failed transaction is re-executed from the very beginning if possible; * otherwise the failed transaction is terminated. 
*/ SERIALIZATION_FAILURE, DEADLOCK_FAILURE, OTHER_SQL_FAILURE, /* other failures in SQL commands that are not * listed by themselves above */ /* * Failures while processing SQL commands. In this case the client is * aborted. */ SQL_CONNECTION_FAILURE } FailureStatus; >> [...] >> If in such cases one command is placed on several lines, ISTM that the >> code is more understandable if curly brackets are used... > > Hmmm. Such basic style changes are avoided because they break > backpatching, so we try to avoid gratuitous changes unless there is a > strong added value, which does not seem to be the case here. Ok! -- Marina Polyakova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
>>>> commandFailed: I'm not thrilled by the added boolean, which is partially >>>> redundant with the second argument. >>> >>> Do you mean that it is partially redundant with the argument "cmd" and, >>> for example, the meta commands errors always do not cause the abortions of >>> the client? >> >> Yes. And also I'm not sure we should want this boolean at all. > > Perhaps we can use a separate function to print the messages about client's > abortion, something like this (it is assumed that all abortions happen when > processing SQL commands): > > static void > clientAborted(CState *st, const char *message) Possibly. > Or perhaps we can use a more detailed failure status so for each type of > failure we always know the command name (argument "cmd") and whether the > client is aborted. Something like this (but in comparison with the first > variant ISTM overly complicated): I agree. I do not think that it would be useful, given that the same thing is done on all meta-command error cases in the end. -- Fabien.
On 17-08-2018 14:04, Fabien COELHO wrote: > ... >> Or perhaps we can use a more detailed failure status so for each type >> of failure we always know the command name (argument "cmd") and >> whether the client is aborted. Something like this (but in comparison >> with the first variant ISTM overly complicated): > > I agree., I do not think that it would be useful given that the same > thing is done on all meta-command error cases in the end. Ok! -- Marina Polyakova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Hello, hackers! This is the eleventh version of the patch for error handling and retrying of transactions with serialization/deadlock failures in pgbench (based on the commit 14e9b2a752efaa427ce1b400b9aaa5a636898a04) thanks to the comments of Fabien Coelho and Arthur Zakirov in this thread. v11-0001-Pgbench-errors-use-the-RandomState-structure-for.patch - a patch for the RandomState structure (this is used to reset a client's random seed during the repeating of transactions after serialization/deadlock failures). v11-0002-Pgbench-errors-use-the-Variables-structure-for-c.patch - a patch for the Variables structure (this is used to reset client variables during the repeating of transactions after serialization/deadlock failures). v11-0003-Pgbench-errors-and-serialization-deadlock-retrie.patch - the main patch for handling client errors and repetition of transactions with serialization/deadlock failures (see the detailed description in the file). v11-0004-Pgbench-errors-use-a-separate-function-to-report.patch - a patch for a separate error reporting function (this is used to report client failures that do not cause an aborts and this depends on the level of debugging). Although this is a try to fix a duplicate code for debug messages (see [1]), this may seem mostly refactoring and therefore may not seem very necessary for this set of patches (see [2], [3]), so this patch becomes the last as an optional. Any suggestions are welcome! [1] https://www.postgresql.org/message-id/20180405180807.0bc1114f%40wp.localdomain > There is a lot of checks like "if (debug_level >= DEBUG_FAILS)" with > corresponding fprintf(stderr..) I think it's time to do it like in the > main code, wrap with some function like log(level, msg). [2] https://www.postgresql.org/message-id/alpine.DEB.2.21.1808071823540.13466%40lancre > However ISTM that it is not as necessary as the previous one, i.e. we > could do without it to get the desired feature, so I see it more as a > refactoring done "in passing", and I'm wondering whether it is > really worth it because it adds some new complexity, so I'm not sure of > the net benefit. [3] https://www.postgresql.org/message-id/alpine.DEB.2.21.1808101027390.9120%40lancre > I'm still not over enthousiastic with these changes, and still think > that > it should be an independent patch, not submitted together with the > "retry > on error" feature. All that was fixed from the previous version: [4] https://www.postgresql.org/message-id/alpine.DEB.2.21.1808071823540.13466%40lancre > I'm at odds with the proposed levels. ISTM that pgbench internal > errors which warrant an immediate exit should be dubbed "FATAL", > I'm unsure about the "log_min_messages" variable name, I'd suggest > "log_level". > > I do not see the asserts on LOG >= log_min_messages as useful, because > the level can only be LOG or DEBUG anyway. > * PQExpBuffer > > I still do not see a positive value from importing PQExpBuffer > complexity and cost into pgbench, as the resulting code is not very > readable and it adds malloc/free cycles, so I'd try to avoid using > PQExpBuf as much as possible. ISTM that all usages could be avoided in > the patch, and most should be avoided even if ExpBuffer is imported > because it is really useful somewhere. > > - to call pgbench_error from pgbench_simple_error, you can do a > pgbench_log_va(level, format, va_list) version called both from > pgbench_error & pgbench_simple_error. 
> > - for PGBENCH_DEBUG function, do separate calls per type, the very > small partial code duplication is worth avoiding ExpBuf IMO. > > - for doCustom debug: I'd just let the printf as it is, with a > comment, as it is really very internal stuff for debug. Or I'd just > snprintf a something in a static buffer. > > ... > > - for listAvailableScript: I'd simply call "pgbench_error(LOG" several > time, once per line. > > I see building a string with a format (printfExpBuf..) and then > calling the pgbench_error function with just a "%s" format on the > result as not very elegant, because the second format is somehow > hacked around. [5] https://www.postgresql.org/message-id/alpine.DEB.2.21.1808101027390.9120%40lancre > I suggest that the called function does only one simple thing, > probably "DEBUG", and that the *caller* prints a message if it is > unhappy > about the failure of the called function, as it is currently done. This > allows to provide context as well from the caller, eg "setting variable > %s > failed while <some specific context>". The user call rerun under debug > for > precision if they need it. [6] https://www.postgresql.org/message-id/20180810125327.GA2374%40zakirov.localdomain > I agree with Fabien. Calling pgbench_error() inside pgbench_error() > could be dangerous. I think "fmt" checking could be removed, or we may > use Assert() or fprintf()+exit(1) at least. [7] https://www.postgresql.org/message-id/alpine.DEB.2.21.1808121057540.6189%40lancre > * typo in comments: "varaibles" > > * About enlargeVariables: > > multiple INT_MAX error handling looks strange, especially as this code > can > never be triggered because pgbench would be dead long before having > allocated INT_MAX variables. So I would not bother to add such checks. > I'm not sure that the size_t cast here and there are useful for any > practical values likely to be encountered by pgbench. > > The exponential allocation seems overkill. I'd simply add a constant > number of slots, with a simple rule: > > /* reallocated with a margin */ > if (max_vars < needed) max_vars = needed + 8; [8] https://www.postgresql.org/message-id/alpine.DEB.2.21.1808151046090.30050%40lancre > A few comments about the doc. > > According to the documentation, the feature is triggered by --max-tries > and > --latency-limit. I disagree with the later, because it means that > having > latency limit without retrying is not supported anymore. > > Maybe you can allow an "unlimited" max-tries, say with special value > zero, > and the latency limit does its job if set, over all tries. > > Doc: "error in meta commands" -> "meta command errors", for homogeneity > with > other cases? > Doc: "never occur.." -> "never occur", or eventually "...". > > Doc: "Directly client errors" -> "Direct client errors". > > I'm still in favor of asserting that the sql connection is idle (no tx > in > progress) at the beginning and/or end of a script, and report a user > error > if not, instead of writing complex caveats. > I do not think that the RETRIES_ENABLED macro is a good thing. I'd > suggest > to write the condition four times. > > ISTM that "skipped" transactions are NOT "successful" so there are a > problem > with comments. I believe that your formula are probably right, it has > more to do > with what is "success". For cnt decomposition, ISTM that "other > transactions" > are really "directly successful transactions". > > I'd suggest to put "ANOTHER_SQL_FAILURE" as the last option, otherwise > "another" > does not make sense yet. 
I'd suggest to name it "OTHER_SQL_FAILURE". > I'm not sure of > the LOG -> DEBUG_FAIL changes. I do not understand the name > "DEBUG_FAIL", has it > is not related to debug, they just seem to be internal errors. > inTransactionBlock: I disagree with any function other than doCustom > changing > the client state, because it makes understanding the state machine > harder. There > is already one exception to that (threadRun) that I wish to remove. All > state > changes must be performed explicitely in doCustom. > PQexec("ROOLBACK"): you are inserting a synchronous command, for which > the > thread will have to wait for the result, in a middle of a framework > which > takes great care to use only asynchronous stuff so that one thread can > manage several clients efficiently. You cannot call PQexec there. > From where I sit, I'd suggest to sendQuery("ROLLBACK"), then switch to > a new state CSTATE_WAIT_ABORT_RESULT which would be similar to > CSTATE_WAIT_RESULT, but on success would skip to RETRY or ABORT instead > of proceeding to the next command. > > ISTM that it would be more logical to only get into RETRY if there is a > retry, > i.e. move the test RETRY/ABORT in FAILURE. For that, instead of > "canRetry", > maybe you want "doRetry", which tells that a retry is possible (the > error > is serializable or deadlock) and that the current parameters allow it > (timeout, max retries). > > * Minor C style comments: > > if / else if / else if ... on *_FAILURE: I'd suggest a switch. > > The following line removal does not seem useful, I'd have kept it: > > stats->cnt++; > - > if (skipped) > > copyVariables: I'm not convinced that source_vars & nvars variables are > that > useful. > > memcpy(&(st->retry_state.random_state), &(st->random_state), > sizeof(RandomState)); > > Is there a problem with "st->retry_state.random_state = > st->random_state;" > instead of memcpy? ISTM that simple assignments work in C. Idem in the > reverse > copy under RETRY. > commandFailed: I'm not thrilled by the added boolean, which is > partially > redundant with the second argument. > > if (per_script_stats) > - accumStats(&sql_script[st->use_file].stats, skipped, > latency, lag); > + { > + accumStats(&sql_script[st->use_file].stats, skipped, > latency, lag, > + st->failure_status, st->retries); > + } > } > > I do not see the point of changing the style here. [9] https://www.postgresql.org/message-id/alpine.DEB.2.21.1808170917510.20841%40lancre > Here is an attempt at having a more precise and shorter version, not > sure > it is much better than yours, though: > > """ > Transactions are counted depending on their execution and outcome. > First > a transaction may have started or not: skipped transactions occur under > --rate and --latency-limit when the client is too late to execute them. > Secondly, a started transaction may ultimately succeed or fail on some > error, possibly after some retries when --max-tries is not one. Thus > """ -- Marina Polyakova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Attachment
Hello Marina, About the two first preparatory patches. > v11-0001-Pgbench-errors-use-the-RandomState-structure-for.patch > - a patch for the RandomState structure (this is used to reset a client's > random seed during the repeating of transactions after serialization/deadlock > failures). Same version as the previous one, which was ok. Still applies, compiles, passes tests. Fine with me. > v11-0002-Pgbench-errors-use-the-Variables-structure-for-c.patch > - a patch for the Variables structure (this is used to reset client variables > during the repeating of transactions after serialization/deadlock failures). Simpler version, applies cleanly on top of previous patch, compiles and global & local "make check" are ok. Fine with me as well. -- Fabien.
Hello Marina, > v11-0003-Pgbench-errors-and-serialization-deadlock-retrie.patch > - the main patch for handling client errors and repetition of transactions > with serialization/deadlock failures (see the detailed description in the > file). About patch v11-3. Patch applies cleanly on top of the other two. Compiles, global and local "make check" are ok. * Features As far as the actual retry feature is concerned, I'd say we are nearly there. However I have an issue with changing the behavior on meta command and other SQL errors, which I find undesirable. When a meta-command fails, before the patch the command is aborted and there is a convenient error message:

sh> pgbench -T 10 -f bad-meta.sql
bad-meta.sql:1: unexpected function name (false) in command "set"
[...] \set i false + 1 [...]

After the patch it is simply counted, pgbench loops on the same error till the time is completed, and there is no clue about the actual issue:

sh> pgbench -T 10 -f bad-meta.sql
starting vacuum...end.
transaction type: bad-meta.sql
duration: 10 s
number of transactions actually processed: 0
number of failures: 27993953 (100.000%)
...

Same thing about SQL errors, an immediate abort...

sh> pgbench -T 10 -f bad-sql.sql
starting vacuum...end.
client 0 aborted in command 0 of script 0; ERROR: syntax error at or near ";"
LINE 1: SELECT 1 + ;

... is turned into counting without aborting or error messages, so that there is no clue that the user was asking for something bad.

sh> pgbench -T 10 -f bad-sql.sql
starting vacuum...end.
transaction type: bad-sql.sql
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
duration: 10 s
number of transactions actually processed: 0
number of failures: 274617 (100.000%)
# no clue that there was a syntax error in the script

I do not think that these changes of behavior are desirable. Meta command and miscellaneous SQL errors should result in immediately aborting the whole run, because the client test code itself could not run correctly or the SQL sent was somehow wrong, which is also the client's fault, and the server performance benchmark does not make much sense in such conditions. ISTM that the focus of this patch should only be to handle some server runtime errors that can be retried, but not to change pgbench behavior on other kinds of errors. If these are to be changed, ISTM that it would be a distinct patch and would require some discussion, and possibly an option to enable it or not if some use case emerges. As far as this patch is concerned, I'd suggest to leave that out. Doc says "you cannot use an infinite number of retries without latency-limit..." Why should this be forbidden? At least if the -T timeout takes precedence and shortens the execution, ISTM that there could be a good reason to test that. Maybe it could be blocked only under -t if this would lead to a non-ending run. As "--print-errors" is really for debug, maybe it could be named "--debug-errors". I'm not sure that having "--debug" imply this option is useful: as there are two distinct options, the user may be allowed to trigger one or the other as they wish? * Code The following remarks are linked to the change of behavior discussed above: the makeVariableValue error message is not for debug, but must be kept in all cases, and the false returned must result in an immediate abort. Same thing about lookupCreateVariable, an invalid name is a user error which warrants an immediate abort. Same thing again about the coerce* functions or evalStandardFunc... 
Basically, most/all of the added "debug_level >= DEBUG_ERRORS" checks are not desirable. sendRollback(): I'd suggest to simplify. The prepare/extended statement stuff is really about the transaction script, not about dealing with errors, especially as there is no significant advantage in preparing a "ROLLBACK" statement which is short and has no parameters. I'd suggest to remove this function and just issue PQsendQuery("ROLLBACK;") in all cases. In copyVariables, I'd simplify

+    if (source_var->svalue == NULL)
+        dest_var->svalue = NULL;
+    else
+        dest_var->svalue = pg_strdup(source_var->svalue);

as:

    dest_var->svalue = (source_var->svalue == NULL) ? NULL : pg_strdup(source_var->svalue);

+ if (sqlState) -> if (sqlState != NULL) ? The name of function getTransactionStatus does not seem to correspond fully to what the function does. There is a passthru case which should be either avoided or clearly commented. About:

-    commandFailed(st, "SQL", "perhaps the backend died while processing");
+    clientAborted(st,
+                  "perhaps the backend died while processing");

keep on one line? About:

+    if (doRetry(st, &now))
+        st->state = CSTATE_RETRY;
+    else
+        st->state = CSTATE_FAILURE;

-> st->state = doRetry(st, &now) ? CSTATE_RETRY : CSTATE_FAILURE;

* Comments "There're different types..." -> "There are different types..." "after the errors and"... -> "after errors and"... "the default value of max_tries is set to 1" -> "the default value of max_tries is 1" "We cannot retry the transaction" -> "We cannot retry a transaction" "may ultimately succeed or get a failure," -> "may ultimately succeed or fail," Overall, the comment text in StatsData is very clear. However, the comments are not clearly linked to the struct fields. I'd suggest that each field, when used, should be quoted, so as to separate English from code, and the struct name should always be used explicitly when possible. I'd insist in a comment that "cnt" does not include "skipped" transactions (anymore). * Documentation: Some suggestions which may be improvements, although I'm not a native English speaker. ISTM that there are too many "the": - "turns on the option ..." -> "turns on option ..." - "When the option ..." -> "When option ..." - "By default the option ..." -> "By default option ..." - "only if the option ..." -> "only if option ..." - "combined with the option ..." -> "combined with option ..." - "without the option ..." -> "without option ..." - "is the sum of all the retries" -> "is the sum of all retries" "infinite" -> "unlimited" "not retried at all" -> "not retried" (maybe several times). "messages of all errors" -> "messages about all errors". "It is assumed that the scripts used do not contain" -> "It is assumed that pgbench scripts do not contain" About v11-4. I do not feel that these changes are very useful/important for now. I'd propose that you prioritize updating 11-3 so that we can have another round about it as soon as possible, and keep that one for later. -- Fabien.
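For reference, libpq already exposes the transaction state through PQtransactionStatus(); a hypothetical helper that distinguishes its cases (not the patch's actual getTransactionStatus) might look like the sketch below, which is also the kind of check the earlier "connection must be idle at script start/end" suggestion would need:

    #include <stdbool.h>
    #include <libpq-fe.h>

    static bool
    connectionIsIdle(PGconn *con)
    {
        switch (PQtransactionStatus(con))
        {
            case PQTRANS_IDLE:        /* no transaction block in progress */
                return true;
            case PQTRANS_INTRANS:     /* inside a valid transaction block */
            case PQTRANS_INERROR:     /* inside a failed transaction block */
                return false;
            case PQTRANS_ACTIVE:      /* a command is still running */
            case PQTRANS_UNKNOWN:     /* e.g. bad connection */
            default:
                return false;
        }
    }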
On 08-09-2018 10:17, Fabien COELHO wrote: > Hello Marina, Hello, Fabien! > About the two first preparatory patches. > >> v11-0001-Pgbench-errors-use-the-RandomState-structure-for.patch >> - a patch for the RandomState structure (this is used to reset a >> client's random seed during the repeating of transactions after >> serialization/deadlock failures). > > Same version as the previous one, which was ok. Still applies, > compiles, passes tests. Fine with me. > >> v11-0002-Pgbench-errors-use-the-Variables-structure-for-c.patch >> - a patch for the Variables structure (this is used to reset client >> variables during the repeating of transactions after >> serialization/deadlock failures). > > Simpler version, applies cleanly on top of previous patch, compiles > and global & local "make check" are ok. Fine with me as well. Glad to hear it :) -- Marina Polyakova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
On 08-09-2018 16:03, Fabien COELHO wrote: > Hello Marina, > >> v11-0003-Pgbench-errors-and-serialization-deadlock-retrie.patch >> - the main patch for handling client errors and repetition of >> transactions with serialization/deadlock failures (see the detailed >> description in the file). > > About patch v11-3. > > Patch applies cleanly on top of the other two. Compiles, global and > local > "make check" are ok. :-) > * Features > > As far as the actual retry feature is concerned, I'd say we are nearly > there. However I have an issue with changing the behavior on meta > command and other sql errors, which I find not desirable. > > When a meta-command fails, before the patch the command is aborted and > there is a convenient error message: > > sh> pgbench -T 10 -f bad-meta.sql > bad-meta.sql:1: unexpected function name (false) in command "set" > [...] > \set i false + 1 [...] > > After the patch it is simply counted, pgbench loops on the same error > till the time is completed, and there are no clue about the actual > issue: > > sh> pgbench -T 10 -f bad-meta.sql > starting vacuum...end. > transaction type: bad-meta.sql > duration: 10 s > number of transactions actually processed: 0 > number of failures: 27993953 (100.000%) > ... > > Same thing about SQL errors, an immediate abort... > > sh> pgbench -T 10 -f bad-sql.sql > starting vacuum...end. > client 0 aborted in command 0 of script 0; ERROR: syntax error at or > near ";" > LINE 1: SELECT 1 + ; > > ... is turned into counting without aborting nor error messages, so > that there is no clue that the user was asking for something bad. > > sh> pgbench -T 10 -f bad-sql.sql > starting vacuum...end. > transaction type: bad-sql.sql > scaling factor: 1 > query mode: simple > number of clients: 1 > number of threads: 1 > duration: 10 s > number of transactions actually processed: 0 > number of failures: 274617 (100.000%) > # no clue that there was a syntax error in the script > > I do not think that these changes of behavior are desirable. Meta > command and > miscellaneous SQL errors should result in immediatly aborting the whole > run, > because the client test code itself could not run correctly or the SQL > sent > was somehow wrong, which is also the client's fault, and the server > performance bench does not make much sense in such conditions. > > ISTM that the focus of this patch should only be to handle some server > runtime errors that can be retryed, but not to change pgbench behavior > on other kind of errors. If these are to be changed, ISTM that it > would be a distinct patch and would require some discussion, and > possibly an option to enable it or not if some use case emerge. AFA > this patch is concerned, I'd suggest to let that out. ... > The following remarks are linked to the change of behavior discussed > above: > makeVariableValue error message is not for debug, but must be kept in > all > cases, and the false returned must result in an immediate abort. Same > thing about > lookupCreateVariable, an invalid name is a user error which warrants > an immediate > abort. Same thing again about coerce* functions or evalStandardFunc... > Basically, most/all added "debug_level >= DEBUG_ERRORS" are not > desirable. Hmm, but we can say the same for serialization or deadlock errors that were not retried (the client test code itself could not run correctly or the SQL sent was somehow wrong, which is also the client's fault), can't we? Why not handle client errors that can occur (but they may also not occur) the same way? 
(For example, always abort the client, or conversely do not make aborts in these cases.) Here's an example of such error: starting vacuum...end. transaction type: pgbench_rare_sql_error.sql scaling factor: 1 query mode: simple number of clients: 10 number of threads: 1 number of transactions per client: 250 number of transactions actually processed: 2500/2500 maximum number of tries: 1 latency average = 0.375 ms tps = 26695.292848 (including connections establishing) tps = 27489.678525 (excluding connections establishing) statement latencies in milliseconds and failures: 0.001 0 \set divider random(-1000, 1000) 0.245 0 SELECT 1 / :divider; starting vacuum...end. client 5 got an error in command 1 (SQL) of script 0; ERROR: division by zero client 0 got an error in command 1 (SQL) of script 0; ERROR: division by zero client 7 got an error in command 1 (SQL) of script 0; ERROR: division by zero transaction type: pgbench_rare_sql_error.sql scaling factor: 1 query mode: simple number of clients: 10 number of threads: 1 number of transactions per client: 250 number of transactions actually processed: 2497/2500 number of failures: 3 (0.120%) number of serialization failures: 0 (0.000%) number of deadlock failures: 0 (0.000%) number of other SQL failures: 3 (0.120%) maximum number of tries: 1 latency average = 0.579 ms (including failures) tps = 17240.662547 (including connections establishing) tps = 17862.090137 (excluding connections establishing) statement latencies in milliseconds and failures: 0.001 0 \set divider random(-1000, 1000) 0.338 3 SELECT 1 / :divider; Maybe we can limit the number of failures in one statement, and abort the client if this limit is exceeded?... To get a clue about the actual issue you can use the options --failures-detailed (to find out out whether this is a serialization failure / deadlock failure / other SQL failure / meta command failure) and/or --print-errors (to get the complete error message). > Doc says "you cannot use an infinite number of retries without > latency-limit..." > > Why should this be forbidden? At least if -T timeout takes precedent > and > shortens the execution, ISTM that there could be good reason to test > that. > Maybe it could be blocked only under -t if this would lead to an > non-ending > run. ... > * Comments > > "There're different types..." -> "There are different types..." > > "after the errors and"... -> "after errors and"... > > "the default value of max_tries is set to 1" -> "the default value > of max_tries is 1" > > "We cannot retry the transaction" -> "We cannot retry a transaction" > > "may ultimately succeed or get a failure," -> "may ultimately succeed > or fail," ... > * Documentation: > > Some suggestions which may be improvements, although I'm not a native > English > speaker. > > ISTM that there are too many "the": > - "turns on the option ..." -> "turns on option ..." > - "When the option ..." -> "When option ..." > - "By default the option ..." -> "By default option ..." > - "only if the option ..." -> "only if option ..." > - "combined with the option ..." -> "combined with option ..." > - "without the option ..." -> "without option ..." > - "is the sum of all the retries" -> "is the sum of all retries" > > "infinite" -> "unlimited" > > "not retried at all" -> "not retried" (maybe several times). > > "messages of all errors" -> "messages about all errors". > > "It is assumed that the scripts used do not contain" -> > "It is assumed that pgbench scripts do not contain" Thank you, I'll fix this. 
If you use the option --latency-limit, the time of tries will be limited regardless of the use of the option -t. Therefore ISTM that an unlimited number of tries can be used only if the time of tries is limited by the options -T and/or -L. > As "--print-errors" is really for debug, maybe it could be named > "--debug-errors". Ok! > I'm not sure that having "--debug" implying this option > is useful: As there are two distinct options, the user may be allowed > to trigger one or the other as they wish? I'm not sure that the main debugging output will give a good clue of what's happened without full messages about errors, retries and failures... > * Code > > <...> > > sendRollback(): I'd suggest to simplify. The prepare/extended statement > stuff is > really about the transaction script, not dealing with errors, esp as > there is no > significant advantage in preparing a "ROLLBACK" statement which is > short and has > no parameters. I'd suggest to remove this function and just issue > PQsendQuery("ROLLBACK;") in all cases. Ok! > In copyVariables, I'd simplify > > + if (source_var->svalue == NULL) > + dest_var->svalue = NULL; > + else > + dest_var->svalue = pg_strdup(source_var->svalue); > > as: > > dest_var->value = (source_var->svalue == NULL) ? NULL : > pg_strdup(source_var->svalue); > About: > > + if (doRetry(st, &now)) > + st->state = CSTATE_RETRY; > + else > + st->state = CSTATE_FAILURE; > > -> st->state = doRetry(st, &now) ? CSTATE_RETRY : CSTATE_FAILURE; These lines are quite long - do you suggest to wrap them this way? + dest_var->svalue = ((source_var->svalue == NULL) ? NULL : + pg_strdup(source_var->svalue)); + st->state = (doRetry(st, &now) ? CSTATE_RETRY : + CSTATE_FAILURE); > + if (sqlState) -> if (sqlState != NULL) ? Ok! > Function getTransactionStatus name does not seem to correspond fully to > what the > function does. There is a passthru case which should be either avoided > or > clearly commented. I don't quite understand you - do you mean that in fact this function finds out whether we are in a (failed) transaction block or not? Or do you mean that the case of PQTRANS_INTRANS is also ok?... > About: > > - commandFailed(st, "SQL", "perhaps the backend died while > processing"); > + clientAborted(st, > + "perhaps the backend died while processing"); > > keep on one line? I tried not to break the limit of 80 characters, but if you think that this is better, I'll change it. > Overall, the comment text in StatsData is very clear. However they are > not > clearly linked to the struct fields. I'd suggest that earch field when > used > should be quoted, so as to separate English from code, and the struct > name > should always be used explicitely when possible. Ok! > I'd insist in a comment that "cnt" does not include "skipped" > transactions > (anymore). If you mean CState.cnt I'm not sure if this is practically useful because the code uses only the sum of all client transactions including skipped and failed... Maybe we can rename this field to nxacts or total_cnt? > About v11-4. I'm do not feel that these changes are very > useful/important for now. I'd propose that your prioritize on updating > 11-3 so that we can have another round about it as soon as possible, > and keep that one later. Ok! -- Marina Polyakova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
On 11-09-2018 16:47, Marina Polyakova wrote: > On 08-09-2018 16:03, Fabien COELHO wrote: >> Hello Marina, >> I'd insist in a comment that "cnt" does not include "skipped" >> transactions >> (anymore). > > If you mean CState.cnt I'm not sure if this is practically useful > because the code uses only the sum of all client transactions > including skipped and failed... Maybe we can rename this field to > nxacts or total_cnt? Sorry, I misread your proposal the first time. Ok! -- Marina Polyakova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Hello Marina, > Hmm, but we can say the same for serialization or deadlock errors that were > not retried (the client test code itself could not run correctly or the SQL > sent was somehow wrong, which is also the client's fault), can't we? I think not. If a client asks for something "legal", but some other client in parallel happens to make an incompatible change which result in a serialization or deadlock error, the clients are not responsible for the raised errors, it is just that they happen to ask for something incompatible at the same time. So there is no user error per se, but the server is reporting its (temporary) inability to process what was asked for. For these errors, retrying is fine. If the client was alone, there would be no such errors, you cannot deadlock with yourself. This is really an isolation issue linked to parallel execution. > Why not handle client errors that can occur (but they may also not > occur) the same way? (For example, always abort the client, or > conversely do not make aborts in these cases.) Here's an example of such > error: > client 5 got an error in command 1 (SQL) of script 0; ERROR: division by zero This is an interesting case. For me we must stop the script because the client is asking for something "stupid", and retrying the same won't change the outcome, the division will still be by zero. It is the client responsability not to ask for something stupid, the bench script is buggy, it should not submit illegal SQL queries. This is quite different from submitting something legal which happens to fail. > Maybe we can limit the number of failures in one statement, and abort the > client if this limit is exceeded?... I think this is quite debatable, and that the best option is to leavze this point out of the current patch, so that we could have retry on serial/deadlock errors. Then you can submit another patch for a feature about other errors if you feel that there is a use case for going on in some cases. I think that the previous behavior made sense, and that changing it should only be considered as an option. As it involves discussing and is not obvious, later is better. > To get a clue about the actual issue you can use the options > --failures-detailed (to find out out whether this is a serialization failure > / deadlock failure / other SQL failure / meta command failure) and/or > --print-errors (to get the complete error message). Yep, but for me it should haved stopped immediately, as it did before. > If you use the option --latency-limit, the time of tries will be limited > regardless of the use of the option -t. Therefore ISTM that an unlimited > number of tries can be used only if the time of tries is limited by the > options -T and/or -L. Indeed, I'm ok with forbidding unlimitted retries when under -t. >> I'm not sure that having "--debug" implying this option >> is useful: As there are two distinct options, the user may be allowed >> to trigger one or the other as they wish? > > I'm not sure that the main debugging output will give a good clue of what's > happened without full messages about errors, retries and failures... I'm more argumenting about letting the user decide what they want. > These lines are quite long - do you suggest to wrap them this way? Sure, if it is too long, then wrap. >> Function getTransactionStatus name does not seem to correspond fully to >> what the function does. There is a passthru case which should be either >> avoided or clearly commented. 
> > I don't quite understand you - do you mean that in fact this function finds > out whether we are in a (failed) transaction block or not? Or do you mean > that the case of PQTRANS_INTRANS is also ok?... The former: although the function is named "getTransactionStatus", it does not really return the "status" of the transaction (aka PQstatus()?). > I tried not to break the limit of 80 characters, but if you think that this > is better, I'll change it. Hmmm. 80 columns, indeed... >> I'd insist in a comment that "cnt" does not include "skipped" transactions >> (anymore). > > If you mean CState.cnt I'm not sure if this is practically useful because the > code uses only the sum of all client transactions including skipped and > failed... Maybe we can rename this field to nxacts or total_cnt? I'm fine with renaming the field if it makes thinks clearer. They are all counters, so naming them "cnt" or "total_cnt" does not help much. Maybe "succeeded" or "success" to show what is really counted? -- Fabien.
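For reference on the transaction-status point above (this is plain libpq, not code from the patch): what the retry logic has to distinguish is whether the session is sitting inside a failed transaction block, which libpq already reports through PQtransactionStatus(). A minimal stand-alone sketch, assuming only a server reachable through the usual PG* environment variables (build with something like "cc status.c -lpq", adding -I/-L paths for libpq if needed):

    #include <stdio.h>
    #include <libpq-fe.h>

    int
    main(void)
    {
        PGconn *conn = PQconnectdb("");     /* connection parameters from PG* env vars */

        if (PQstatus(conn) != CONNECTION_OK)
        {
            fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
            return 1;
        }

        PQclear(PQexec(conn, "BEGIN"));
        PQclear(PQexec(conn, "SELECT 1/0"));    /* force an error inside the block */

        switch (PQtransactionStatus(conn))
        {
            case PQTRANS_IDLE:
                puts("idle: not inside a transaction block");
                break;
            case PQTRANS_INTRANS:
                puts("inside a transaction block, no error so far");
                break;
            case PQTRANS_INERROR:
                puts("inside a failed transaction block: a ROLLBACK is needed");
                break;
            default:
                puts("active or unknown");
                break;
        }

        PQclear(PQexec(conn, "ROLLBACK"));
        PQfinish(conn);
        return 0;
    }

PQTRANS_INERROR is the state the discussion above is concerned with: the transaction has to be rolled back before anything else, including a retry, can be sent.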
On 11-09-2018 18:29, Fabien COELHO wrote: > Hello Marina, > >> Hmm, but we can say the same for serialization or deadlock errors that >> were not retried (the client test code itself could not run correctly >> or the SQL sent was somehow wrong, which is also the client's fault), >> can't we? > > I think not. > > If a client asks for something "legal", but some other client in > parallel happens to make an incompatible change which result in a > serialization or deadlock error, the clients are not responsible for > the raised errors, it is just that they happen to ask for something > incompatible at the same time. So there is no user error per se, but > the server is reporting its (temporary) inability to process what was > asked for. For these errors, retrying is fine. If the client was > alone, there would be no such errors, you cannot deadlock with > yourself. This is really an isolation issue linked to parallel > execution. You can get other errors that cannot happen for only one client if you use shell commands in meta commands: starting vacuum...end. transaction type: pgbench_meta_concurrent_error.sql scaling factor: 1 query mode: simple number of clients: 2 number of threads: 1 number of transactions per client: 10 number of transactions actually processed: 20/20 maximum number of tries: 1 latency average = 6.953 ms tps = 287.630161 (including connections establishing) tps = 303.232242 (excluding connections establishing) statement latencies in milliseconds and failures: 1.636 0 BEGIN; 1.497 0 \setshell var mkdir my_directory && echo 1 0.007 0 \sleep 1 us 1.465 0 \setshell var rmdir my_directory && echo 1 1.622 0 END; starting vacuum...end. mkdir: cannot create directory ‘my_directory’: File exists mkdir: could not read result of shell command client 1 got an error in command 1 (setshell) of script 0; execution of meta-command failed transaction type: pgbench_meta_concurrent_error.sql scaling factor: 1 query mode: simple number of clients: 2 number of threads: 1 number of transactions per client: 10 number of transactions actually processed: 19/20 number of failures: 1 (5.000%) number of meta-command failures: 1 (5.000%) maximum number of tries: 1 latency average = 11.782 ms (including failures) tps = 161.269033 (including connections establishing) tps = 167.733278 (excluding connections establishing) statement latencies in milliseconds and failures: 2.731 0 BEGIN; 2.909 1 \setshell var mkdir my_directory && echo 1 0.231 0 \sleep 1 us 2.366 0 \setshell var rmdir my_directory && echo 1 2.664 0 END; Or if you use untrusted procedural languages in SQL expressions (see the used file in the attachments): starting vacuum...ERROR: relation "pgbench_branches" does not exist (ignoring this error and continuing anyway) ERROR: relation "pgbench_tellers" does not exist (ignoring this error and continuing anyway) ERROR: relation "pgbench_history" does not exist (ignoring this error and continuing anyway) end. client 1 got an error in command 0 (SQL) of script 0; ERROR: could not create the directory "my_directory": File exists at line 3. CONTEXT: PL/Perl anonymous code block client 1 got an error in command 0 (SQL) of script 0; ERROR: could not create the directory "my_directory": File exists at line 3. 
CONTEXT: PL/Perl anonymous code block transaction type: pgbench_concurrent_error.sql scaling factor: 1 query mode: simple number of clients: 2 number of threads: 1 number of transactions per client: 10 number of transactions actually processed: 18/20 number of failures: 2 (10.000%) number of serialization failures: 0 (0.000%) number of deadlock failures: 0 (0.000%) number of other SQL failures: 2 (10.000%) maximum number of tries: 1 latency average = 3.282 ms (including failures) tps = 548.437196 (including connections establishing) tps = 637.662753 (excluding connections establishing) statement latencies in milliseconds and failures: 1.566 2 DO $$ starting vacuum...ERROR: relation "pgbench_branches" does not exist (ignoring this error and continuing anyway) ERROR: relation "pgbench_tellers" does not exist (ignoring this error and continuing anyway) ERROR: relation "pgbench_history" does not exist (ignoring this error and continuing anyway) end. transaction type: pgbench_concurrent_error.sql scaling factor: 1 query mode: simple number of clients: 2 number of threads: 1 number of transactions per client: 10 number of transactions actually processed: 20/20 maximum number of tries: 1 latency average = 2.760 ms tps = 724.746078 (including connections establishing) tps = 853.131985 (excluding connections establishing) statement latencies in milliseconds and failures: 1.893 0 DO $$ Or if you try to create a function and perhaps replace an existing one: starting vacuum...end. client 0 got an error in command 0 (SQL) of script 0; ERROR: duplicate key value violates unique constraint "pg_proc_proname_args_nsp_index" DETAIL: Key (proname, proargtypes, pronamespace)=(my_function, , 2200) already exists. client 0 got an error in command 0 (SQL) of script 0; ERROR: tuple concurrently updated client 1 got an error in command 0 (SQL) of script 0; ERROR: tuple concurrently updated client 1 got an error in command 0 (SQL) of script 0; ERROR: tuple concurrently updated client 1 got an error in command 0 (SQL) of script 0; ERROR: tuple concurrently updated client 1 got an error in command 0 (SQL) of script 0; ERROR: tuple concurrently updated client 0 got an error in command 0 (SQL) of script 0; ERROR: tuple concurrently updated client 1 got an error in command 0 (SQL) of script 0; ERROR: tuple concurrently updated client 1 got an error in command 0 (SQL) of script 0; ERROR: tuple concurrently updated client 0 got an error in command 0 (SQL) of script 0; ERROR: tuple concurrently updated transaction type: pgbench_create_function.sql scaling factor: 1 query mode: simple number of clients: 2 number of threads: 1 number of transactions per client: 10 number of transactions actually processed: 10/20 number of failures: 10 (50.000%) number of serialization failures: 0 (0.000%) number of deadlock failures: 0 (0.000%) number of other SQL failures: 10 (50.000%) maximum number of tries: 1 latency average = 82.881 ms (including failures) tps = 12.065492 (including connections establishing) tps = 12.092216 (excluding connections establishing) statement latencies in milliseconds and failures: 82.549 10 CREATE OR REPLACE FUNCTION my_function() RETURNS integer AS 'select 1;' LANGUAGE SQL; >> Why not handle client errors that can occur (but they may also not >> occur) the same way? (For example, always abort the client, or >> conversely do not make aborts in these cases.) 
Here's an example of >> such error: > >> client 5 got an error in command 1 (SQL) of script 0; ERROR: division >> by zero > > This is an interesting case. For me we must stop the script because > the client is asking for something "stupid", and retrying the same > won't change the outcome, the division will still be by zero. It is > the client responsability not to ask for something stupid, the bench > script is buggy, it should not submit illegal SQL queries. This is > quite different from submitting something legal which happens to fail. > ... >>> I'm not sure that having "--debug" implying this option >>> is useful: As there are two distinct options, the user may be allowed >>> to trigger one or the other as they wish? >> >> I'm not sure that the main debugging output will give a good clue of >> what's happened without full messages about errors, retries and >> failures... > > I'm more argumenting about letting the user decide what they want. > >> These lines are quite long - do you suggest to wrap them this way? > > Sure, if it is too long, then wrap. Ok! >>> Function getTransactionStatus name does not seem to correspond fully >>> to what the function does. There is a passthru case which should be >>> either avoided or clearly commented. >> >> I don't quite understand you - do you mean that in fact this function >> finds out whether we are in a (failed) transaction block or not? Or do >> you mean that the case of PQTRANS_INTRANS is also ok?... > > The former: although the function is named "getTransactionStatus", it > does not really return the "status" of the transaction (aka > PQstatus()?). Thank you, I'll think how to improve it. Perhaps the name checkTransactionStatus will be better... >>> I'd insist in a comment that "cnt" does not include "skipped" >>> transactions >>> (anymore). >> >> If you mean CState.cnt I'm not sure if this is practically useful >> because the code uses only the sum of all client transactions >> including skipped and failed... Maybe we can rename this field to >> nxacts or total_cnt? > > I'm fine with renaming the field if it makes thinks clearer. They are > all counters, so naming them "cnt" or "total_cnt" does not help much. > Maybe "succeeded" or "success" to show what is really counted? Perhaps renaming of StatsData.cnt is better than just adding a comment to this field. But IMO we have the same problem (They are all counters, so naming them "cnt" or "total_cnt" does not help much.) for CState.cnt which cannot be named in the same way because it also includes skipped and failed transactions. -- Marina Polyakova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Hello Marina, > You can get other errors that cannot happen for only one client if you use > shell commands in meta commands: > Or if you use untrusted procedural languages in SQL expressions (see the used > file in the attachments): > Or if you try to create a function and perhaps replace an existing one: Sure. Indeed there can be shell errors, perl errors, create functions conflicts... I do not understand what is your point wrt these. I'm mostly saying that your patch should focus on implementing the retry feature when appropriate, and avoid changing the behavior (error displayed, abort or not) on features unrelated to serialization & deadlock errors. Maybe there are inconsistencies, and "bug"/"feature" worth fixing, but if so that should be a separate patch, if possible, and if these are bugs they could be backpatched. For now I'm still convinced that pgbench should keep on aborting on "\set" or SQL syntax errors, and show clear error messages on these, and your examples have not changed my mind on that point. >> I'm fine with renaming the field if it makes thinks clearer. They are >> all counters, so naming them "cnt" or "total_cnt" does not help much. >> Maybe "succeeded" or "success" to show what is really counted? > > Perhaps renaming of StatsData.cnt is better than just adding a comment to > this field. But IMO we have the same problem (They are all counters, so > naming them "cnt" or "total_cnt" does not help much.) for CState.cnt which > cannot be named in the same way because it also includes skipped and failed > transactions. Hmmm. CState's cnt seems only used to implement -t anyway? I'm okay if it has a different name, esp if it has a different semantics. I think I was arguing only about cnt in StatsData. -- Fabien.
On 12-09-2018 17:04, Fabien COELHO wrote: > Hello Marina, > >> You can get other errors that cannot happen for only one client if you >> use shell commands in meta commands: > >> Or if you use untrusted procedural languages in SQL expressions (see >> the used file in the attachments): > >> Or if you try to create a function and perhaps replace an existing >> one: > > Sure. Indeed there can be shell errors, perl errors, create functions > conflicts... I do not understand what is your point wrt these. > > I'm mostly saying that your patch should focus on implementing the > retry feature when appropriate, and avoid changing the behavior (error > displayed, abort or not) on features unrelated to serialization & > deadlock errors. > > Maybe there are inconsistencies, and "bug"/"feature" worth fixing, but > if so that should be a separate patch, if possible, and if these are > bugs they could be backpatched. > > For now I'm still convinced that pgbench should keep on aborting on > "\set" or SQL syntax errors, and show clear error messages on these, > and your examples have not changed my mind on that point. > >>> I'm fine with renaming the field if it makes thinks clearer. They are >>> all counters, so naming them "cnt" or "total_cnt" does not help much. >>> Maybe "succeeded" or "success" to show what is really counted? >> >> Perhaps renaming of StatsData.cnt is better than just adding a comment >> to this field. But IMO we have the same problem (They are all >> counters, so naming them "cnt" or "total_cnt" does not help much.) for >> CState.cnt which cannot be named in the same way because it also >> includes skipped and failed transactions. > > Hmmm. CState's cnt seems only used to implement -t anyway? I'm okay if > it has a different name, esp if it has a different semantics. Ok! > I think > I was arguing only about cnt in StatsData. The discussion about this has become entangled from the beginning, because as I wrote in [1] at first I misread your original proposal... [1] https://www.postgresql.org/message-id/d318cdee8f96de6b1caf2ce684ffe4db%40postgrespro.ru -- Marina Polyakova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
On Wed, Sep 12, 2018 at 06:12:29PM +0300, Marina Polyakova wrote: > The discussion about this has become entangled from the beginning, because > as I wrote in [1] at first I misread your original proposal... The last emails are about the last reviews of Fabien, which have remained unanswered for the last couple of weeks. I am marking this patch as returned with feedback for now. -- Michael
On 2018-Sep-05, Marina Polyakova wrote: > v11-0001-Pgbench-errors-use-the-RandomState-structure-for.patch > - a patch for the RandomState structure (this is used to reset a client's > random seed during the repeating of transactions after > serialization/deadlock failures). Pushed this one with minor stylistic changes (the most notable of which is the move of initRandomState to where the rest of the random generator infrastructure is, instead of in a totally random place). Thanks, -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2018-11-16 22:59, Alvaro Herrera wrote: > On 2018-Sep-05, Marina Polyakova wrote: > >> v11-0001-Pgbench-errors-use-the-RandomState-structure-for.patch >> - a patch for the RandomState structure (this is used to reset a >> client's >> random seed during the repeating of transactions after >> serialization/deadlock failures). > > Pushed this one with minor stylistic changes (the most notable of which > is the move of initRandomState to where the rest of the random > generator > infrastructure is, instead of in a totally random place). Thanks, Thank you very much! I'm going to send a new patch set until the end of this week (I'm sorry I was very busy in the release of Postgres Pro 11...). -- Marina Polyakova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
On 2018-Nov-19, Marina Polyakova wrote: > On 2018-11-16 22:59, Alvaro Herrera wrote: > > On 2018-Sep-05, Marina Polyakova wrote: > > > > > v11-0001-Pgbench-errors-use-the-RandomState-structure-for.patch > > > - a patch for the RandomState structure (this is used to reset a > > > client's > > > random seed during the repeating of transactions after > > > serialization/deadlock failures). > > > > Pushed this one with minor stylistic changes (the most notable of which > > is the move of initRandomState to where the rest of the random generator > > infrastructure is, instead of in a totally random place). Thanks, > > Thank you very much! I'm going to send a new patch set until the end of this > week (I'm sorry I was very busy in the release of Postgres Pro 11...). Great, thanks. I also think that the pgbench_error() patch should go in before the main one. It seems a bit pointless to introduce code using a bad API only to fix the API together with all the new callers immediately afterwards. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hello Alvaro, > I also think that the pgbench_error() patch should go in before the main > one. It seems a bit pointless to introduce code using a bad API only to > fix the API together with all the new callers immediately afterwards. I'm not that keen on this part of the patch, because ISTM that it introduces significant and possibly costly malloc/free cycles when handling errors, which do not currently exist in pgbench. Previously an error was basically the end of the script, but with the feature being introduced by Marina some errors are handled, in which case we end up paying these costs in the test loop. Also, refactoring error handling is not necessary for the new feature. That is why I advised moving it away and possibly keeping it for later. Related to Marina's patch (triggered by reviewing the patches), I have submitted a refactoring patch which aims at cleaning up the internal state machine, so that additions and checking that all is well are simpler. https://commitfest.postgresql.org/20/1754/ It has been reviewed, I think I answered the reviewer's concerns, but the reviewer did not update the patch state on the cf app, so I do not know whether he is unsatisfied or if it was just forgotten. -- Fabien.
On 2018-Nov-19, Fabien COELHO wrote: > > Hello Alvaro, > > > I also think that the pgbench_error() patch should go in before the main > > one. It seems a bit pointless to introduce code using a bad API only to > > fix the API together with all the new callers immediately afterwards. > > I'm not that keen on this part of the patch, because ISTM that it introduces > significant and possibly costly malloc/free cycles when handling errors, > which do not currently exist in pgbench. Oh, I wasn't aware of that. > Related to Marina's patch (triggered by reviewing the patches), I have > submitted a refactoring patch which aims at cleaning up the internal state > machine, so that additions and checking that all is well are simpler. > > https://commitfest.postgresql.org/20/1754/ let me look at this one. > It has been reviewed, I think I answered the reviewer's concerns, but the > reviewer did not update the patch state on the cf app, so I do not know > whether he is unsatisfied or if it was just forgotten. Feel free to update a patch status to "needs review" yourself after submitting a new version that in your opinion responds to a reviewer's comments. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
> Feel free to update a patch status to "needs review" yourself after > submitting a new version that in your opinion responds to a reviewer's > comments. Sure, I do that. But I will not switch any of my patches to "Ready". AFAICR the concerns were mostly about imprecise comments in the code, and a few questions that I answered. -- Fabien.
On Mon, Mar 9, 2020 at 10:00 AM Marina Polyakova <m.polyakova@postgrespro.ru> wrote: > On 2018-11-16 22:59, Alvaro Herrera wrote: > > On 2018-Sep-05, Marina Polyakova wrote: > > > >> v11-0001-Pgbench-errors-use-the-RandomState-structure-for.patch > >> - a patch for the RandomState structure (this is used to reset a > >> client's > >> random seed during the repeating of transactions after > >> serialization/deadlock failures). > > > > Pushed this one with minor stylistic changes (the most notable of which > > is the move of initRandomState to where the rest of the random > > generator > > infrastructure is, instead of in a totally random place). Thanks, > > Thank you very much! I'm going to send a new patch set until the end of > this week (I'm sorry I was very busy in the release of Postgres Pro > 11...). Is anyone interested in rebasing this, and summarising what needs to be done to get it in? It's arguably a bug or at least quite unfortunate that pgbench doesn't work with SERIALIZABLE, and I heard that a couple of forks already ship Marina's patch set.
Hello Thomas, >> Thank you very much! I'm going to send a new patch set until the end of >> this week (I'm sorry I was very busy in the release of Postgres Pro >> 11...). > > Is anyone interested in rebasing this, and summarising what needs to > be done to get it in? It's arguably a bug or at least quite > unfortunate that pgbench doesn't work with SERIALIZABLE, and I heard > that a couple of forks already ship Marina's patch set. I'm a reviewer on this patch, that I find a good thing (tm), and which was converging to a reasonable and simple enough addition, IMHO. If I proceed in place of Marina, who is going to do the reviews? -- Fabien.
On Tue, Mar 10, 2020 at 8:43 AM Fabien COELHO <coelho@cri.ensmp.fr> wrote: > >> Thank you very much! I'm going to send a new patch set until the end of > >> this week (I'm sorry I was very busy in the release of Postgres Pro > >> 11...). > > > > Is anyone interested in rebasing this, and summarising what needs to > > be done to get it in? It's arguably a bug or at least quite > > unfortunate that pgbench doesn't work with SERIALIZABLE, and I heard > > that a couple of forks already ship Marina's patch set. > > I'm a reviewer on this patch, that I find a good thing (tm), and which was > converging to a reasonable and simple enough addition, IMHO. > > If I proceed in place of Marina, who is going to do the reviews? Hi Fabien, Cool. I'll definitely take it for a spin if you post a fresh patch set. Any place that we arbitrarily don't support SERIALIZABLE, I consider a bug, so I'd like to commit this if we can agree it's ready. It sounds like it's actually in pretty good shape.
Hi hackers, On Tue, 10 Mar 2020 09:48:23 +1300 Thomas Munro <thomas.munro@gmail.com> wrote: > On Tue, Mar 10, 2020 at 8:43 AM Fabien COELHO <coelho@cri.ensmp.fr> wrote: > > >> Thank you very much! I'm going to send a new patch set until the end of > > >> this week (I'm sorry I was very busy in the release of Postgres Pro > > >> 11...). > > > > > > Is anyone interested in rebasing this, and summarising what needs to > > > be done to get it in? It's arguably a bug or at least quite > > > unfortunate that pgbench doesn't work with SERIALIZABLE, and I heard > > > that a couple of forks already ship Marina's patch set. I got interested in this and now looking into the patch and the past discussion. If anyone other won't do it and there are no objection, I would like to rebase this. Is that okay? Regards, Yugo NAGATA > > > > I'm a reviewer on this patch, that I find a good thing (tm), and which was > > converging to a reasonable and simple enough addition, IMHO. > > > > If I proceed in place of Marina, who is going to do the reviews? > > Hi Fabien, > > Cool. I'll definitely take it for a spin if you post a fresh patch > set. Any place that we arbitrarily don't support SERIALIZABLE, I > consider a bug, so I'd like to commit this if we can agree it's ready. > It sounds like it's actually in pretty good shape. -- Yugo NAGATA <nagata@sraoss.co.jp>
Hi hackers, On Mon, 24 May 2021 11:29:10 +0900 Yugo NAGATA <nagata@sraoss.co.jp> wrote: > Hi hackers, > > On Tue, 10 Mar 2020 09:48:23 +1300 > Thomas Munro <thomas.munro@gmail.com> wrote: > > > On Tue, Mar 10, 2020 at 8:43 AM Fabien COELHO <coelho@cri.ensmp.fr> wrote: > > > >> Thank you very much! I'm going to send a new patch set until the end of > > > >> this week (I'm sorry I was very busy in the release of Postgres Pro > > > >> 11...). > > > > > > > > Is anyone interested in rebasing this, and summarising what needs to > > > > be done to get it in? It's arguably a bug or at least quite > > > > unfortunate that pgbench doesn't work with SERIALIZABLE, and I heard > > > > that a couple of forks already ship Marina's patch set. > > I got interested in this and now looking into the patch and the past discussion. > If anyone other won't do it and there are no objection, I would like to rebase > this. Is that okay? I rebased and fixed the previous patches (v11) rewtten by Marina Polyakova, and attached the revised version (v12). v12-0001-Pgbench-errors-use-the-Variables-structure-for-c.patch - a patch for the Variables structure (this is used to reset client variables during the repeating of transactions after serialization/deadlock failures). v12-0002-Pgbench-errors-and-serialization-deadlock-retrie.patch - the main patch for handling client errors and repetition of transactions with serialization/deadlock failures (see the detailed description in the file). These are the revised versions from v11-0002 and v11-0003. v11-0001 (for the RandomState structure) is not included because this has been already committed (40923191944). V11-0004 (for a separate error reporting function) is not included neither because pgbench now uses common logging APIs (30a3e772b40). In addition to rebase on master, I updated the patch according with the review from Fabien COELHO [1] and discussions after this. Also, I added some other fixes through my reviewing the previous patch. [1] https://www.postgresql.org/message-id/alpine.DEB.2.21.1809081450100.10506%40lancre Following are fixes according with Fabian's review. > * Features > As far as the actual retry feature is concerned, I'd say we are nearly > there. However I have an issue with changing the behavior on meta command > and other sql errors, which I find not desirable. ... > I do not think that these changes of behavior are desirable. Meta command and > miscellaneous SQL errors should result in immediatly aborting the whole run, > because the client test code itself could not run correctly or the SQL sent > was somehow wrong, which is also the client's fault, and the server > performance bench does not make much sense in such conditions. > > ISTM that the focus of this patch should only be to handle some server > runtime errors that can be retryed, but not to change pgbench behavior on > other kind of errors. If these are to be changed, ISTM that it would be a > distinct patch and would require some discussion, and possibly an option > to enable it or not if some use case emerge. AFA this patch is concerned, > I'd suggest to let that out. Previously, all SQL and meta command errors could be retried, but I fixed to allow only serialization & deadlock errors to be retried. > Doc says "you cannot use an infinite number of retries without latency-limit..." > > Why should this be forbidden? At least if -T timeout takes precedent and > shortens the execution, ISTM that there could be good reason to test that. 
> Maybe it could be blocked only under -t if this would lead to an non-ending > run. I fixed to allow to use --max-tries with -T option even if latency-limit is not used. > As "--print-errors" is really for debug, maybe it could be named > "--debug-errors". I'm not sure that having "--debug" implying this option > is useful: As there are two distinct options, the user may be allowed > to trigger one or the other as they wish? print-errors was renamed to debug-errors. > makeVariableValue error message is not for debug, but must be kept in all > cases, and the false returned must result in an immediate abort. Same thing about > lookupCreateVariable, an invalid name is a user error which warrants an immediate > abort. Same thing again about coerce* functions or evalStandardFunc... > Basically, most/all added "debug_level >= DEBUG_ERRORS" are not desirable. "DEBUG_ERRORS" messages unrelated to serialization & deadlock errors were removed. > sendRollback(): I'd suggest to simplify. The prepare/extended statement stuff is > really about the transaction script, not dealing with errors, esp as there is no > significant advantage in preparing a "ROLLBACK" statement which is short and has > no parameters. I'd suggest to remove this function and just issue > PQsendQuery("ROLLBACK;") in all cases. Now, we just issue PQsendQuery("ROLLBACK;"). > In copyVariables, I'd simplify > > + if (source_var->svalue == NULL) > + dest_var->svalue = NULL; > + else > + dest_var->svalue = pg_strdup(source_var->svalue); > >as: > dest_var->value = (source_var->svalue == NULL) ? NULL : pg_strdup(source_var->svalue); Fixed using a ternary operator. > + if (sqlState) -> if (sqlState != NULL) ? Fixed. > Function getTransactionStatus name does not seem to correspond fully to what the > function does. There is a passthru case which should be either avoided or > clearly commented. This was renamed to checkTransactionStatus according with [2]. [2] https://www.postgresql.org/message-id/c262e889315625e0fc0d77ca78fe2eac%40postgrespro.ru > - commandFailed(st, "SQL", "perhaps the backend died while processing"); > + clientAborted(st, > + "perhaps the backend died while processing"); > > keep on one line? This fix that replaced commandFailed with clientAborted was removed. (See below) > + if (doRetry(st, &now)) > + st->state = CSTATE_RETRY; > + else > + st->state = CSTATE_FAILURE; > > -> st->state = doRetry(st, &now) ? CSTATE_RETRY : CSTATE_FAILURE; Fixed using a ternary operator. > * Comments > "There're different types..." -> "There are different types..." > "after the errors and"... -> "after errors and"... > "the default value of max_tries is set to 1" -> "the default value > of max_tries is 1" > "We cannot retry the transaction" -> "We cannot retry a transaction" > "may ultimately succeed or get a failure," -> "may ultimately succeed or fail," Fixed. > Overall, the comment text in StatsData is very clear. However they are not > clearly linked to the struct fields. I'd suggest that earch field when used > should be quoted, so as to separate English from code, and the struct name > should always be used explicitely when possible. The comment in StatsData was fixed to clarify what each filed in this struct represents. > I'd insist in a comment that "cnt" does not include "skipped" transactions > (anymore). StatsData.cnt has a comment "number of successful transactions, not including 'skipped'", and CState.cnt has a comment "skipped and failed transactions are also counted here". 
> * Documentation: > ISTM that there are too many "the": > - "turns on the option ..." -> "turns on option ..." > - "When the option ..." -> "When option ..." > - "By default the option ..." -> "By default option ..." > - "only if the option ..." -> "only if option ..." > - "combined with the option ..." -> "combined with option ..." > - "without the option ..." -> "without option ..." The previous patch used a lot of "the option xxxx", but I fixed them to "the xxxx option" because I found that the documentation uses such way for referring to a certain option. For example, - You can (and, for most purposes, probably should) increase the number of rows by using the <option>-s</option> (scale factor) option. - The prefix can be changed by using the <option>--log-prefix</option> option. - If the <option>-j</option> option is 2 or higher, so that there are multiple worker threads, > - "is the sum of all the retries" -> "is the sum of all retries" > "infinite" -> "unlimited" > "not retried at all" -> "not retried" (maybe several times). > "messages of all errors" -> "messages about all errors". > "It is assumed that the scripts used do not contain" -> > "It is assumed that pgbench scripts do not contai Fixed. Following are additional fixes based on my review on the previous patch. * About error reporting In the previous patch, commandFailed() was changed to report an error that doesn't immediately abort the client, and clientAborted() was added to report an abortion of the client. In the attached patch, behaviors around errors other than serialization and deadlock are not changed and such errors cause the client to abort, so commandFaile() is used without any changes to report a client abortion, and commandError() is added to report an error that can be retried under --debug-error. * About progress reporting In the previous patch, the number of failures was reported only when any transaction was failed, and statistics of retry was reported only when any transaction was retried. This means, the number of columns in the reporting were different depending on the interval. This was odd and harder to parse the output. In the attached patch, the number of failures is always reported, and the retry statistic is reported when max-tries is not 1. * About result outputs In the previous patch, the number of failed transaction, the number of retried transaction, and the number of total retries were reported as: number of failures: 324 (3.240%) ... number of retried: 5629 (56.290%) number of retries: 103299 I think this was confusable. Especially, it was unclear for me what "retried" and "retries" represent repectively. Therefore, in the attached patch, they are reported as: number of transactions failed: 324 (3.240%) ... number of transactions retried: 5629 (56.290%) number of total retries: 103299 which clarify that first two are the numbers of transactions and the last one is the number of retries over all transactions. * Abourt average connection time In the previous patch, this was calculated as "conn_total_duration / total->cnt" where conn_total_duration is the cumulated connection time sumed over threads and total->cnt is the number of transaction that is successfully processed. However, the average connection time could be overestimated because conn_total_duration includes a connection time of failed transaction due to serialization and deadlock errors. So, in the attached patch, this is calculated as "conn_total_duration / total->cnt + failures". 
Regards, Yugo Nagata -- Yugo NAGATA <nagata@sraoss.co.jp>
Hello Yugo-san, Thanks a lot for continuing this work started by Marina! I'm planning to review it for the July CF. I've just added an entry there: https://commitfest.postgresql.org/33/3194/ -- Fabien.
Hello Fabien, On Tue, 22 Jun 2021 20:03:58 +0200 (CEST) Fabien COELHO <coelho@cri.ensmp.fr> wrote: > > Hello Yugo-san, > > Thanks a lot for continuing this work started by Marina! > > I'm planning to review it for the July CF. I've just added an entry there: > > https://commitfest.postgresql.org/33/3194/ Thanks! -- Yugo NAGATA <nagata@sraoss.co.jp>
Hello Yugo-san: # About v12.1 This is a refactoring patch, which creates a separate structure for holding variables. This will become handy in the next patch. There is also a benefit from a software engineering point of view, so it has merit on its own. ## Compilation Patch applies cleanly, compiles, global & local checks pass. ## About the code Fine. I'm wondering whether we could use "vars" instead of "variables" as a struct field name and function parameter name, so that it is shorter and more distinct from the type name "Variables". What do you think? ## About comments Remove the comment on enlargeVariables about "It is assumed …": the issue of trying MAXINT vars is more than remote and is not worth mentioning. In the same function, remove the comments about MARGIN; it is already on the macro declaration, once is enough. -- Fabien.
On Wed, 23 Jun 2021 10:38:43 +0200 (CEST) Fabien COELHO <coelho@cri.ensmp.fr> wrote: > > Hello Yugo-san: > > # About v12.1 > > This is a refactoring patch, which creates a separate structure for > holding variables. This will become handy in the next patch. There is also > a benefit from a software engineering point of view, so it has merit on > its own. > ## Compilation > > Patch applies cleanly, compiles, global & local checks pass. > > ## About the code > > Fine. > > I'm wondering whether we could use "vars" instead of "variables" as a > struct field name and function parameter name, so that it is shorter and > more distinct from the type name "Variables". What do you think? The struct "Variables" has a field named "vars" which is an array of "Variable" type. I guess this is the reason why "variables" is used instead of "vars" as the name of a "Variables"-type variable, so that we can tell whether a variable's type is Variable or Variables. Also, in order to refer to the field, we would use vars->vars[vars->nvars] and there are nested "vars". Could this confuse a code reader? > ## About comments > > Remove the comment on enlargeVariables about "It is assumed …": the issue > of trying MAXINT vars is more than remote and is not worth mentioning. In > the same function, remove the comments about MARGIN; it is already on the > macro declaration, once is enough. Sure. I'll remove them. Regards, Yugo Nagata -- Yugo NAGATA <nagata@sraoss.co.jp>
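For readers following the naming discussion, a rough sketch of the shape being talked about; only "vars", "nvars" and "svalue" are taken from the messages in this thread, the remaining members and comments are guesses added here for illustration:

    #include <stdbool.h>

    /* One client variable: a name plus its value in string form. */
    typedef struct Variable
    {
        char       *name;       /* variable's name */
        char       *svalue;     /* its value in string form, if known */
    } Variable;

    /* All variables of one client, bundled so that they can be saved and
     * restored as a unit when a transaction is retried. */
    typedef struct Variables
    {
        Variable   *vars;       /* array of variable definitions */
        int         nvars;      /* number of variables currently defined */
        int         max_vars;   /* allocated size of the array (guess) */
        bool        vars_sorted;    /* is the array kept sorted by name? (guess) */
    } Variables;

With this layout, naming the function parameter "vars" would indeed give the nested vars->vars[vars->nvars] mentioned above, which is why keeping "variables" reads better.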
Hello Yugo-san, >> I'm wondering whether we could use "vars" instead of "variables" as a >> struct field name and function parameter name, so that it is shorter and >> more distinct from the type name "Variables". What do you think? > The struct "Variables" has a field named "vars" which is an array of > "Variable" type. I guess this is the reason why "variables" is used instead > of "vars" as the name of a "Variables"-type variable, so that we can tell > whether a variable's type is Variable or Variables. Also, in order to refer to > the field, we would use > > vars->vars[vars->nvars] > > and there are nested "vars". Could this confuse a code reader? Hmmm… Probably. Let's keep "variables" then. -- Fabien.
Hello Yugo-san, # About v12.2 ## Compilation Patch seems to apply cleanly with "git apply", but does not compile on my host: "undefined reference to `conditional_stack_reset'". However it works better when using the "patch". I'm wondering why git apply fails silently… When compiling there are warnings about "pg_log_fatal", which does not expect a FILE* on pgbench.c:4453. Remove the "stderr" argument. Global and local checks ok. > number of transactions failed: 324 (3.240%) > ... > number of transactions retried: 5629 (56.290%) > number of total retries: 103299 I'd suggest: "number of failed transactions". "total number of retries" or just "number of retries"? ## Feature The overall code structure changes to implements the feature seems reasonable to me, as we are at the 12th iteration of the patch. Comments below are somehow about details and asking questions about choices, and commenting… ## Documentation There is a lot of documentation, which is good. I'll review these separatly. It looks good, but having a native English speaker/writer would really help! Some output examples do not correspond to actual output for the current version. In particular, there is always one TPS figure given now, instead of the confusing two shown before. ## Comments transactinos -> transactions. ## Code By default max_tries = 0. Should not the initialization be 1, as the documentation argues that it is the default? Counter comments, missing + in the formula on the skipped line. Given that we manage errors, ISTM that we should not necessarily stop on other not retried errors, but rather count/report them and possibly proceed. Eg with something like: -- server side random fail DO LANGUAGE plpgsql $$ BEGIN IF RANDOM() < 0.1 THEN RAISE EXCEPTION 'unlucky!'; END IF; END; $$; Or: -- client side random fail BEGIN; \if random(1, 10) <= 1 SELECT 1 +; \else SELECT 2; \endif COMMIT; We could count the fail, rollback if necessary, and go on. What do you think? Maybe such behavior would deserve an option. --report-latencies -> --report-per-command: should we keep supporting the previous option? --failures-detailed: if we bother to run with handling failures, should it always be on? --debug-errors: I'm not sure we should want a special debug mode for that, I'd consider integrating it with the standard debug, or just for development. Also, should it use pg_log_debug? doRetry: I'd separate the 3 no retries options instead of mixing max_tries and timer_exceeeded, for clarity. Tries vs retries: I'm at odds with having tries & retries and + 1 here and there to handle that, which is a little bit confusing. I'm wondering whether we could only count "tries" and adjust to report what we want later? advanceConnectionState: ISTM that ERROR should logically be before others which lead to it. Variables management: it looks expensive, with copying and freeing variable arrays. I'm wondering whether we should think of something more clever. Well, that would be for some other patch. "Accumulate the retries" -> "Count (re)tries"? Currently, ISTM that the retry on error mode is implicitely always on. Do we want that? I'd say yes, but maybe people could disagree. ## Tests There are tests, good! I'm wondering whether something simpler could be devised to trigger serialization or deadlock errors, eg with a SEQUENCE and an \if. See the attached files for generating deadlocks reliably (start with 2 clients). What do you think? The PL/pgSQL minimal, it is really client-code oriented. 
Given that deadlocks are detected about every seconds, the test runs would take some time. Let it be for now. -- Fabien.
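Since the attachments are not reproduced here, a sketch of the kind of script that produces deadlocks reliably with two clients; this is a guess at the idea, not the actual attached file, and it assumes a database initialized with pgbench -i plus the automatic :client_id variable:

    -- deadlock.sql: run with "pgbench -c 2 -T 10 -f deadlock.sql"
    -- client 0 locks row 1 then row 2, client 1 locks row 2 then row 1
    \set first (:client_id % 2) + 1
    \set second ((:client_id + 1) % 2) + 1
    BEGIN;
    UPDATE pgbench_accounts SET abalance = abalance + 1 WHERE aid = :first;
    \sleep 10 ms
    UPDATE pgbench_accounts SET abalance = abalance + 1 WHERE aid = :second;
    COMMIT;

With the default deadlock_timeout of 1s, one of the two clients gets "ERROR: deadlock detected" roughly once per second, which matches the run-time concern above.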
Hello Fabien, On Sat, 26 Jun 2021 12:15:38 +0200 (CEST) Fabien COELHO <coelho@cri.ensmp.fr> wrote: > > Hello Yugo-san, > > # About v12.2 > > ## Compilation > > Patch seems to apply cleanly with "git apply", but does not compile on my > host: "undefined reference to `conditional_stack_reset'". > > However it works better when using the "patch". I'm wondering why git > apply fails silently… Hmm, I don't know why your compiling fails... I can apply and complile successfully using git. > When compiling there are warnings about "pg_log_fatal", which does not > expect a FILE* on pgbench.c:4453. Remove the "stderr" argument. Ok. > Global and local checks ok. > > > number of transactions failed: 324 (3.240%) > > ... > > number of transactions retried: 5629 (56.290%) > > number of total retries: 103299 > > I'd suggest: "number of failed transactions". "total number of retries" or > just "number of retries"? Ok. I fixed to use "number of failed transactions" and "total number of retries". > ## Feature > > The overall code structure changes to implements the feature seems > reasonable to me, as we are at the 12th iteration of the patch. > > Comments below are somehow about details and asking questions > about choices, and commenting… > > ## Documentation > > There is a lot of documentation, which is good. I'll review these > separatly. It looks good, but having a native English speaker/writer > would really help! > > Some output examples do not correspond to actual output for > the current version. In particular, there is always one TPS figure > given now, instead of the confusing two shown before. Fixed. > ## Comments > > transactinos -> transactions. Fixed. > ## Code > > By default max_tries = 0. Should not the initialization be 1, > as the documentation argues that it is the default? Ok. I fixed the default value to 1. > Counter comments, missing + in the formula on the skipped line. Fixed. > Given that we manage errors, ISTM that we should not necessarily > stop on other not retried errors, but rather count/report them and > possibly proceed. Eg with something like: > > -- server side random fail > DO LANGUAGE plpgsql $$ > BEGIN > IF RANDOM() < 0.1 THEN > RAISE EXCEPTION 'unlucky!'; > END IF; > END; > $$; > > Or: > > -- client side random fail > BEGIN; > \if random(1, 10) <= 1 > SELECT 1 +; > \else > SELECT 2; > \endif > COMMIT; > > We could count the fail, rollback if necessary, and go on. What do you think? > Maybe such behavior would deserve an option. This feature to count failures that could occur at runtime seems nice. However, as discussed in [1], I think it is better to focus to only failures that can be retried in this patch, and introduce the feature to handle other failures in a separate patch. [1] https://www.postgresql.org/message-id/alpine.DEB.2.21.1809121519590.13887%40lancre > --report-latencies -> --report-per-command: should we keep supporting > the previous option? Ok. Although now the option is not only for latencies, considering users who are using the existing option, I'm fine with this. I got back this to the previous name. > --failures-detailed: if we bother to run with handling failures, should > it always be on? If we print other failures that cannot be retried in future, it could a lot of lines and might make some users who don't need details of failures annoyed. Moreover, some users would always need information of detailed failures in log, and others would need only total numbers of failures. 
Currently we handle only serialization and deadlock failures, so the number of lines printed and the number of columns of logging is not large even under the failures-detail, but if we have a chance to handle other failures in future, ISTM adding this option makes sense considering users who would like simple outputs. > --debug-errors: I'm not sure we should want a special debug mode for that, > I'd consider integrating it with the standard debug, or just for development. I think --debug is a debug option for telling users the pgbench's internal behaviors, that is, which client is doing what. On other hand, --debug-errors is for telling users what error caused a retry or a failure in detail. For users who are not interested in pgbench's internal behavior (sending a command, receiving a result, ... ) but interested in actual errors raised during running script, this option seems useful. > Also, should it use pg_log_debug? If we use pg_log_debug, the message is printed only under --debug. Therefore, I fixed to use pg_log_info instead of pg_log_error or fprintf. > doRetry: I'd separate the 3 no retries options instead of mixing max_tries and > timer_exceeeded, for clarity. Ok. I fixed to separate them. > Tries vs retries: I'm at odds with having tries & retries and + 1 here > and there to handle that, which is a little bit confusing. I'm wondering whether > we could only count "tries" and adjust to report what we want later? I fixed to use "tries" instead of "retries" in CState. However, we still use "retries" in StatsData and Command because the number of retries is printed in the final result. Is it less confusing than the previous? > advanceConnectionState: ISTM that ERROR should logically be before others which > lead to it. Sorry, I couldn't understand your suggestion. Is this about the order of case statements or pg_log_error? > Variables management: it looks expensive, with copying and freeing variable arrays. > I'm wondering whether we should think of something more clever. Well, that would be > for some other patch. Well.., indeed there may be more efficient way. For example, instead of clearing all vars in dest, it might be possible to copy or clear only the difference part between dest and source and remaining unchanged part in dest. Anyway, I think this work should be done in other patch. > "Accumulate the retries" -> "Count (re)tries"? Fixed. > Currently, ISTM that the retry on error mode is implicitely always on. > Do we want that? I'd say yes, but maybe people could disagree. The default values of max-tries is 1, so the retry on error is off. Failed transactions are retried only when the user wants it and specifies a valid value to max-treis. > ## Tests > > There are tests, good! > > I'm wondering whether something simpler could be devised to trigger > serialization or deadlock errors, eg with a SEQUENCE and an \if. > > See the attached files for generating deadlocks reliably (start with 2 clients). > What do you think? The PL/pgSQL minimal, it is really client-code > oriented. > > Given that deadlocks are detected about every seconds, the test runs > would take some time. Let it be for now. Sorry, but I cannot find the attached file. I don't have a good idea for a simpler test for now, but I can fix the test based on your idea after getting the file. I attached the patch updated according with your suggestion. Regards, Yugo Nagata -- Yugo NAGATA <nagata@sraoss.co.jp>
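To make the "separate the 3 no-retries conditions" change mentioned above concrete, a compilable toy sketch of the separated checks; the names max_tries, timer_exceeded and latency_limit follow the discussion, but the struct, the time handling and the function itself are invented here for illustration and are not the patch code:

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    static int    max_tries = 1;          /* 0 would mean unlimited */
    static bool   timer_exceeded = false; /* set once -T has expired */
    static double latency_limit = 0.0;    /* -L in microseconds, 0 means none */

    typedef struct
    {
        int     tries;          /* how many times the transaction has been run */
        int64_t txn_scheduled;  /* scheduled start of the transaction, in us */
    } ToyClient;

    /* Each reason not to retry is tested on its own, as suggested above. */
    static bool
    can_retry(const ToyClient *st, int64_t now)
    {
        if (max_tries != 0 && st->tries >= max_tries)
            return false;       /* maximum number of tries reached */
        if (timer_exceeded)
            return false;       /* benchmark duration (-T) is over */
        if (latency_limit > 0.0 &&
            (double) (now - st->txn_scheduled) > latency_limit)
            return false;       /* already over the latency limit (-L) */
        return true;
    }

    int
    main(void)
    {
        ToyClient st = { .tries = 1, .txn_scheduled = 0 };

        printf("retry? %s\n", can_retry(&st, 500) ? "yes" : "no");
        return 0;
    }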
Hello Yugo-san, Thanks for the update! >> Patch seems to apply cleanly with "git apply", but does not compile on my >> host: "undefined reference to `conditional_stack_reset'". >> >> However it works better when using the "patch". I'm wondering why git >> apply fails silently… > > Hmm, I don't know why your compiling fails... I can apply and complile > successfully using git. Hmmm. Strange! >> Given that we manage errors, ISTM that we should not necessarily stop >> on other not retried errors, but rather count/report them and possibly >> proceed. Eg with something like: [...] We could count the fail, >> rollback if necessary, and go on. What do you think? Maybe such >> behavior would deserve an option. > > This feature to count failures that could occur at runtime seems nice. However, > as discussed in [1], I think it is better to focus to only failures that can be > retried in this patch, and introduce the feature to handle other failures in a > separate patch. Ok. >> --report-latencies -> --report-per-command: should we keep supporting >> the previous option? > > Ok. Although now the option is not only for latencies, considering users who > are using the existing option, I'm fine with this. I got back this to the > previous name. Hmmm. I liked the new name! My point was whether we need to support the old one as well for compatibility, or whether we should not bother. I'm still wondering. As I think that the new name is better, I'd suggest to keep it. >> --failures-detailed: if we bother to run with handling failures, should >> it always be on? > > If we print other failures that cannot be retried in future, it could a lot > of lines and might make some users who don't need details of failures annoyed. > Moreover, some users would always need information of detailed failures in log, > and others would need only total numbers of failures. Ok. > Currently we handle only serialization and deadlock failures, so the number of > lines printed and the number of columns of logging is not large even under the > failures-detail, but if we have a chance to handle other failures in future, > ISTM adding this option makes sense considering users who would like simple > outputs. Hmmm. What kind of failures could be managed with retries? I guess that on a connection failure we can try to reconnect, but otherwise it is less clear that other failures make sense to retry. >> --debug-errors: I'm not sure we should want a special debug mode for that, >> I'd consider integrating it with the standard debug, or just for development. > > I think --debug is a debug option for telling users the pgbench's internal > behaviors, that is, which client is doing what. On other hand, --debug-errors > is for telling users what error caused a retry or a failure in detail. For > users who are not interested in pgbench's internal behavior (sending a command, > receiving a result, ... ) but interested in actual errors raised during running > script, this option seems useful. Ok. The this is not really about debug per se, but a verbosity setting? Maybe --verbose-errors would make more sense? I'm unsure. I'll think about it. >> Also, should it use pg_log_debug? > > If we use pg_log_debug, the message is printed only under --debug. > Therefore, I fixed to use pg_log_info instead of pg_log_error or fprintf. Ok, pg_log_info seems right. >> Tries vs retries: I'm at odds with having tries & retries and + 1 here >> and there to handle that, which is a little bit confusing. 
I'm wondering whether >> we could only count "tries" and adjust to report what we want later? > > I fixed to use "tries" instead of "retries" in CState. However, we still use > "retries" in StatsData and Command because the number of retries is printed > in the final result. Is it less confusing than the previous? I'm going to think about it. >> advanceConnectionState: ISTM that ERROR should logically be before others which >> lead to it. > > Sorry, I couldn't understand your suggestion. Is this about the order of case > statements or pg_log_error? My sentence got mixed up. My point was about the case order, so that they are put in a more logical order when reading all the cases. >> Currently, ISTM that the retry on error mode is implicitely always on. >> Do we want that? I'd say yes, but maybe people could disagree. > > The default values of max-tries is 1, so the retry on error is off. > Failed transactions are retried only when the user wants it and > specifies a valid value to max-treis. Ok. My point is that we do not stop on such errors, whereas before ISTM that we would have stopped, so somehow the default behavior has changed and the previous behavior cannot be reinstated with an option. Maybe that is not bad, but this is a behavioral change which needs to be documented and argumented. >> See the attached files for generating deadlocks reliably (start with 2 >> clients). What do you think? The PL/pgSQL minimal, it is really >> client-code oriented. > > Sorry, but I cannot find the attached file. Sorry. Attached to this mail. The serialization stuff does not seem to work as well as the deadlock one. Run with 2 clients. -- Fabien.
Attachment
> I attached the patch updated according with your suggestion. v13 patches gave a compiler warning... $ make >/dev/null pgbench.c: In function ‘commandError’: pgbench.c:3071:17: warning: unused variable ‘command’ [-Wunused-variable] const Command *command = sql_script[st->use_file].commands[st->command]; ^~~~~~~ Best regards, -- Tatsuo Ishii SRA OSS, Inc. Japan English: http://www.sraoss.co.jp/index_en.php Japanese:http://www.sraoss.co.jp
> v13 patches gave a compiler warning... > > $ make >/dev/null > pgbench.c: In function ‘commandError’: > pgbench.c:3071:17: warning: unused variable ‘command’ [-Wunused-variable] > const Command *command = sql_script[st->use_file].commands[st->command]; > ^~~~~~~ There is a typo in the doc (more over -> moreover). > of all transaction tries; more over, you cannot use an unlimited number of all transaction tries; moreover, you cannot use an unlimited number Best regards, -- Tatsuo Ishii SRA OSS, Inc. Japan English: http://www.sraoss.co.jp/index_en.php Japanese:http://www.sraoss.co.jp
I have found an interesting result from the patched pgbench (I have set the isolation level to REPEATABLE READ):

$ pgbench -p 11000 -c 10 -T 30 --max-tries=0 test
pgbench (15devel, server 13.3)
starting vacuum...end.
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 1
query mode: simple
number of clients: 10
number of threads: 1
duration: 30 s
number of transactions actually processed: 2586
number of failed transactions: 9 (0.347%)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
number of transactions retried: 1892 (72.909%)
total number of retries: 21819
latency average = 115.551 ms (including failures)
initial connection time = 35.268 ms
tps = 86.241799 (without initial connection time)

I ran pgbench with 10 concurrent sessions. In this case pgbench always reports 9 failed transactions regardless of the setting of the -T option. This is because at the end of a pgbench session, only 1 out of 10 transactions succeeded, while 9 transactions failed due to serialization errors without any chance to retry because -T expired.

This is a little bit disappointing because I wanted to see a result where all transactions succeeded with retries. I tried -t instead of -T, but -t cannot be used with --max-tries=0.

Also I think this behavior is somewhat inconsistent with the existing behavior of pgbench. When pgbench runs without the --max-tries option, pgbench continues to run transactions even after -T expires:

$ time pgbench -p 11000 -T 10 -f pgbench.sql test
pgbench (15devel, server 13.3)
starting vacuum...end.
transaction type: pgbench.sql
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
duration: 10 s
number of transactions actually processed: 2
maximum number of tries: 1
latency average = 7009.006 ms
initial connection time = 8.045 ms
tps = 0.142674 (without initial connection time)

real 0m14.067s
user 0m0.010s
sys 0m0.004s

$ cat pgbench.sql
SELECT pg_sleep(7);

So pgbench does not stop transactions after 10 seconds have passed but waits for the last transaction to complete. If we are to be consistent with this behavior when --max-tries=0, shouldn't we retry until the last transaction finishes?

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp
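Serialization failures of the kind shown above can also be provoked with a tiny custom script instead of changing the server's default isolation level. The sketch below is only an illustration (the table, file name, and option values are hypothetical, not taken from the patch or its tests); with the patch applied and several clients hammering a single hot row under repeatable read, every client that loses the race gets "could not serialize access due to concurrent update" and is retried up to the given number of tries instead of being aborted:

$ psql test -c "CREATE TABLE counter (id int PRIMARY KEY, v bigint); INSERT INTO counter VALUES (1, 0);"   # one-time setup
$ cat rr_update.sql
BEGIN ISOLATION LEVEL REPEATABLE READ;
UPDATE counter SET v = v + 1 WHERE id = 1;
END;
$ pgbench -n -c 10 -T 30 --max-tries=10 -f rr_update.sql test   # -n skips vacuuming the standard pgbench tables, which do not exist here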
Hello Ishii-san, On Thu, 01 Jul 2021 09:03:42 +0900 (JST) Tatsuo Ishii <ishii@sraoss.co.jp> wrote: > > v13 patches gave a compiler warning... > > > > $ make >/dev/null > > pgbench.c: In function ‘commandError’: > > pgbench.c:3071:17: warning: unused variable ‘command’ [-Wunused-variable] > > const Command *command = sql_script[st->use_file].commands[st->command]; > > ^~~~~~~ Hmm, we'll get the warning when --enable-cassert is not specified. I'll fix it. > There is a typo in the doc (more over -> moreover). > > > of all transaction tries; more over, you cannot use an unlimited number > > of all transaction tries; moreover, you cannot use an unlimited number > Thanks. I'll fix. Regards, Yugo Nagata -- Yugo NAGATA <nagata@sraoss.co.jp>
Hello Ishii-san, On Fri, 02 Jul 2021 09:25:03 +0900 (JST) Tatsuo Ishii <ishii@sraoss.co.jp> wrote: > I have found an interesting result from patched pgbench (I have set > the isolation level to REPEATABLE READ): > > $ pgbench -p 11000 -c 10 -T 30 --max-tries=0 test > pgbench (15devel, server 13.3) > starting vacuum...end. > transaction type: <builtin: TPC-B (sort of)> > scaling factor: 1 > query mode: simple > number of clients: 10 > number of threads: 1 > duration: 30 s > number of transactions actually processed: 2586 > number of failed transactions: 9 (0.347%) > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ > number of transactions retried: 1892 (72.909%) > total number of retries: 21819 > latency average = 115.551 ms (including failures) > initial connection time = 35.268 ms > tps = 86.241799 (without initial connection time) > > I ran pgbench with 10 concurrent sessions. In this case pgbench always > reports 9 failed transactions regardless the setting of -T > option. This is because at the end of a pgbench session, only 1 out of > 10 transaction succeeded but 9 transactions failed due to > serialization error without any chance to retry because -T expires. > > This is a little bit disappointed because I wanted to see a result of > all transactions succeeded with retries. I tried -t instead of -T but > -t cannot be used with --max-tries=0. > > Also I think this behavior is somewhat inconsistent with existing > behavior of pgbench. When pgbench runs without --max-tries option, > pgbench continues to run transactions even after -T expires: > > $ time pgbench -p 11000 -T 10 -f pgbench.sql test > pgbench (15devel, server 13.3) > starting vacuum...end. > transaction type: pgbench.sql > scaling factor: 1 > query mode: simple > number of clients: 1 > number of threads: 1 > duration: 10 s > number of transactions actually processed: 2 > maximum number of tries: 1 > latency average = 7009.006 ms > initial connection time = 8.045 ms > tps = 0.142674 (without initial connection time) > > real 0m14.067s > user 0m0.010s > sys 0m0.004s > > $ cat pgbench.sql > SELECT pg_sleep(7); > > So pgbench does not stop transactions after 10 seconds passed but > waits for the last transaction completes. If we consistent with > behavior when --max-tries=0, shouldn't we retry until the last > transaction finishes? I changed the previous patch to enable that the -T option can terminate a retrying transaction and that we can specify --max-tries=0 without --latency-limit if we have -T , according with the following comment. > Doc says "you cannot use an infinite number of retries without latency-limit..." > > Why should this be forbidden? At least if -T timeout takes precedent and > shortens the execution, ISTM that there could be good reason to test that. > Maybe it could be blocked only under -t if this would lead to an non-ending > run. Indeed, as Ishii-san pointed out, some users might not want to terminate retrying transactions due to -T. However, the actual negative effect is only printing the number of failed transactions. The other result that users want to know, such as tps, are almost not affected because they are measured for transactions processed successfully. Actually, the percentage of failed transaction is very little, only 0.347%. In the existing behaviour, running transactions are never terminated due to the -T option. However, ISTM that this would be based on an assumption that a latency of each transaction is small and that a timing when we can finish the benchmark would come soon. 
On the other hand, when transactions can be retried an unlimited number of times, the run may take much longer than expected, and we cannot guarantee that it will finish successfully in a limited time. Therefore, terminating the benchmark by giving up retrying the transaction after the time expires seems reasonable under unlimited retries. In the sense that we don't terminate running transactions forcibly, this doesn't change the existing behaviour.

If you don't want to print the number of transactions failed due to -T, we can forbid using -T without latency-limit under max-tries=0 to avoid a possibly never-ending benchmark. In this case, users would have to limit the number of transaction retries by specifying latency-limit or max-tries (>0). However, if some users would like to benchmark while simply allowing unlimited retries, using -T together with max-tries=0 seems the most straightforward way, so I think it is better that they can be used together.

Regards,
Yugo Nagata
--
Yugo NAGATA <nagata@sraoss.co.jp>
> Indeed, as Ishii-san pointed out, some users might not want to terminate
> retrying transactions due to -T. However, the actual negative effect is only
> printing the number of failed transactions. The other result that users want to
> know, such as tps, are almost not affected because they are measured for
> transactions processed successfully. Actually, the percentage of failed
> transaction is very little, only 0.347%.

Well, "that's very little, let's ignore it" is not technically the right direction IMO.

> In the existing behaviour, running transactions are never terminated due to
> the -T option. However, ISTM that this would be based on an assumption
> that a latency of each transaction is small and that a timing when we can
> finish the benchmark would come soon. On the other hand, when transactions can
> be retried unlimitedly, it may take a long time more than expected, and we can
> not guarantee that this would finish successfully in limited time. Therefore,
> terminating the benchmark by giving up to retry the transaction after time
> expiration seems reasonable under unlimited retries.

That's not necessarily true in practice. By the time -T is about to expire, transactions have all finished in finite time, as you can see from the result I showed. So it's reasonable to expect that the very last cycle of the benchmark will finish in finite time as well.

Of course, if a benchmark cycle takes infinite time, this will be a problem. However, the same thing can be said of non-retry benchmarks. Theoretically it is possible that *one* benchmark cycle takes forever. In this case the only solution will be just hitting ^C to terminate pgbench. Why can't we make the same assumption in the --max-tries=0 case?

> In the sense that we don't
> terminate running transactions forcibly, this don't change the existing behaviour.

This statement seems to depend on your personal assumption. I still don't understand why you think that the --max-tries non-zero case will *certainly* finish in finite time whereas the --max-tries=0 case will not.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp
On Wed, 07 Jul 2021 16:11:23 +0900 (JST)
Tatsuo Ishii <ishii@sraoss.co.jp> wrote:

> > Indeed, as Ishii-san pointed out, some users might not want to terminate
> > retrying transactions due to -T. However, the actual negative effect is only
> > printing the number of failed transactions. The other result that users want to
> > know, such as tps, are almost not affected because they are measured for
> > transactions processed successfully. Actually, the percentage of failed
> > transaction is very little, only 0.347%.
>
> Well, "that's very little, let's ignore it" is not technically a right
> direction IMO.

Hmmm, it seems to me these failures are ignorable because, with regard to failures due to -T, they occur only in the last transaction of each client and do not affect results such as TPS and latency of successfully processed transactions. (Although I am not sure in what sense you use the word "technically"...)

However, maybe I am missing something. Could you please tell me what you think the actual harm for users of failures due to -D is?

> > In the existing behaviour, running transactions are never terminated due to
> > the -T option. However, ISTM that this would be based on an assumption
> > that a latency of each transaction is small and that a timing when we can
> > finish the benchmark would come soon. On the other hand, when transactions can
> > be retried unlimitedly, it may take a long time more than expected, and we can
> > not guarantee that this would finish successfully in limited time. Therefore,
> > terminating the benchmark by giving up to retry the transaction after time
> > expiration seems reasonable under unlimited retries.
>
> That's not necessarily true in practice. By the time when -T is about to
> expire, transactions are all finished in finite time as you can see
> the result I showed. So it's reasonable that the very last cycle of
> the benchmark will finish in finite time as well.

Your script may finish in finite time, but others may not. However, considering only serialization and deadlock errors, almost all transactions would finish in finite time eventually. In the previous version of the patch, errors other than serialization or deadlock could also be retried, which easily caused unlimited retrying. Now, only these two kinds of errors can be retried; nevertheless, it is unclear to me whether we can assume that retrying will finish in finite time. If we can assume it, maybe we can remove the restriction that --max-tries=0 must be used with --latency-limit or -T.

> Of course if a benchmark cycle takes infinite time, this will be a
> problem. However same thing can be said to non-retry
> benchmarks. Theoretically it is possible that *one* benchmark cycle
> takes forever. In this case the only solution will be just hitting ^C
> to terminate pgbench. Why can't we have same assumption with
> --max-tries=0 case?

Indeed, it is possible that the execution of a query takes a long or infinite time. However, the cause would be a problematic query in the custom script or some other problem occurring on the server side. These are not problems of pgbench, and pgbench itself cannot control them. On the other hand, an unlimited number of tries is a behaviour specified by a pgbench option, so I think pgbench itself should internally avoid problems caused by its own behaviour. That is, if max-tries=0 could cause the benchmark to run much longer than the user expected, or even indefinitely, due to too many retries, I think pgbench should avoid that.
> > In the sense that we don't
> > terminate running transactions forcibly, this don't change the existing behaviour.
>
> This statement seems to be depending on your perosnal assumption.

Ok. If we regard a transaction as still running even while it is being retried after an error, terminating the retry may amount to forcibly terminating a running transaction.

> I still don't understand why you think that --max-tries non 0 case
> will *certainly* finish in finite time whereas --max-tries=0 case will
> not.

I just mean that --max-tries greater than zero will prevent pgbench from retrying a transaction forever.

Regards,
Yugo Nagata
--
Yugo NAGATA <nagata@sraoss.co.jp>
>> Well, "that's very little, let's ignore it" is not technically a right >> direction IMO. > > Hmmm, It seems to me these failures are ignorable because with regard to failures > due to -T they occur only the last transaction of each client and do not affect > the result such as TPS and latency of successfully processed transactions. > (although I am not sure for what sense you use the word "technically"...) "My application button does not respond once in 100 times. It's just 1% error rate. You should ignore it." I would say this attitude is not technically correct. > However, maybe I am missing something. Could you please tell me what do you think > the actual harm for users about failures due to -D is? I don't know why you are referring to -D. >> That's necessarily true in practice. By the time when -T is about to >> expire, transactions are all finished in finite time as you can see >> the result I showed. So it's reasonable that the very last cycle of >> the benchmark will finish in finite time as well. > > Your script may finish in finite time, but others may not. That's why I said "practically". In other words "in most cases the scenario will finish in finite time". > Indeed, it is possible an execution of a query takes a long or infinite > time. However, its cause would a problematic query in the custom script > or other problems occurs on the server side. These are not problem of > pgbench and, pgbench itself can't control either. On the other hand, the > unlimited number of tries is a behaviours specified by the pgbench option, > so I think pgbench itself should internally avoid problems caused from its > behaviours. That is, if max-tries=0 could cause infinite or much longer > benchmark time more than user expected due to too many retries, I think > pgbench should avoid it. I would say that's user's responsibility to avoid infinite running benchmarking. Remember, pgbench is a tool for serious users, not for novice users. Or, we should terminate the last cycle of benchmark regardless it is retrying or not if -T expires. This will make pgbench behaves much more consistent. Best regards, -- Tatsuo Ishii SRA OSS, Inc. Japan English: http://www.sraoss.co.jp/index_en.php Japanese:http://www.sraoss.co.jp
On Wed, 07 Jul 2021 21:50:16 +0900 (JST)
Tatsuo Ishii <ishii@sraoss.co.jp> wrote:

> >> Well, "that's very little, let's ignore it" is not technically a right
> >> direction IMO.
> >
> > Hmmm, It seems to me these failures are ignorable because with regard to failures
> > due to -T they occur only the last transaction of each client and do not affect
> > the result such as TPS and latency of successfully processed transactions.
> > (although I am not sure for what sense you use the word "technically"...)
>
> "My application button does not respond once in 100 times. It's just
> 1% error rate. You should ignore it." I would say this attitude is not
> technically correct.

I cannot understand what you want to say. Can reporting the number of transactions that failed intentionally be treated the same as the error rate of your application's button?

> > However, maybe I am missing something. Could you please tell me what do you think
> > the actual harm for users about failures due to -D is?
>
> I don't know why you are referring to -D.

Sorry, it's just a typo, as you can imagine. I am asking what you think the actual harm for users is when retrying is terminated by the -T option.

> >> That's not necessarily true in practice. By the time when -T is about to
> >> expire, transactions are all finished in finite time as you can see
> >> the result I showed. So it's reasonable that the very last cycle of
> >> the benchmark will finish in finite time as well.
> >
> > Your script may finish in finite time, but others may not.
>
> That's why I said "practically". In other words "in most cases the
> scenario will finish in finite time".

Sure.

> > Indeed, it is possible an execution of a query takes a long or infinite
> > time. However, its cause would a problematic query in the custom script
> > or other problems occurs on the server side. These are not problem of
> > pgbench and, pgbench itself can't control either. On the other hand, the
> > unlimited number of tries is a behaviours specified by the pgbench option,
> > so I think pgbench itself should internally avoid problems caused from its
> > behaviours. That is, if max-tries=0 could cause infinite or much longer
> > benchmark time more than user expected due to too many retries, I think
> > pgbench should avoid it.
>
> I would say that's user's responsibility to avoid infinite running
> benchmarking. Remember, pgbench is a tool for serious users, not for
> novice users.

Of course, users themselves should be careful about problematic scripts, but it would be better if pgbench itself avoided such problems beforehand where it can.

> Or, we should terminate the last cycle of benchmark regardless it is
> retrying or not if -T expires. This will make pgbench behaves much
> more consistent.

Hmmm, indeed this might make the behaviour a bit more consistent, but I am not sure such a behavioural change benefits users.

Regards,
Yugo Nagata
--
Yugo NAGATA <nagata@sraoss.co.jp>
Hello Fabien, I attached the updated patch (v14)! On Wed, 30 Jun 2021 17:33:24 +0200 (CEST) Fabien COELHO <coelho@cri.ensmp.fr> wrote: > >> --report-latencies -> --report-per-command: should we keep supporting > >> the previous option? > > > > Ok. Although now the option is not only for latencies, considering users who > > are using the existing option, I'm fine with this. I got back this to the > > previous name. > > Hmmm. I liked the new name! My point was whether we need to support the > old one as well for compatibility, or whether we should not bother. I'm > still wondering. As I think that the new name is better, I'd suggest to > keep it. Ok. I misunderstood it. I returned the option name to report-per-command. If we keep report-latencies, I can imagine the following choises: - use report-latencies to print only latency information - use report-latencies as alias of report-per-command for compatibility and remove at an appropriate timing. (that is, treat as deprecated) Among these, I prefer the latter because ISTM we would not need many options for reporting information per command. However, actually, I wander that we don't have to keep the previous one if we plan to remove it eventually. > >> --failures-detailed: if we bother to run with handling failures, should > >> it always be on? > > > > If we print other failures that cannot be retried in future, it could a lot > > of lines and might make some users who don't need details of failures annoyed. > > Moreover, some users would always need information of detailed failures in log, > > and others would need only total numbers of failures. > > Ok. > > > Currently we handle only serialization and deadlock failures, so the number of > > lines printed and the number of columns of logging is not large even under the > > failures-detail, but if we have a chance to handle other failures in future, > > ISTM adding this option makes sense considering users who would like simple > > outputs. > > Hmmm. What kind of failures could be managed with retries? I guess that on > a connection failure we can try to reconnect, but otherwise it is less > clear that other failures make sense to retry. Indeed, there would few failures that we should retry and I can not imagine other than serialization , deadlock, and connection failures for now. However, considering reporting the number of failed transaction and its causes in future, as you said > Given that we manage errors, ISTM that we should not necessarily > stop on other not retried errors, but rather count/report them and > possibly proceed. , we could define more a few kind of failures. At least we can consider meta-command and other SQL commands errors in addition to serialization, deadlock, connection failures. So, the total number of kind of failures would be five at least and reporting always all of them results a lot of lines and columns in logging. > >> --debug-errors: I'm not sure we should want a special debug mode for that, > >> I'd consider integrating it with the standard debug, or just for development. > > > > I think --debug is a debug option for telling users the pgbench's internal > > behaviors, that is, which client is doing what. On other hand, --debug-errors > > is for telling users what error caused a retry or a failure in detail. For > > users who are not interested in pgbench's internal behavior (sending a command, > > receiving a result, ... ) but interested in actual errors raised during running > > script, this option seems useful. > > Ok. 
The this is not really about debug per se, but a verbosity setting?

I think so.

> Maybe --verbose-errors would make more sense? I'm unsure. I'll think about
> it.

Agreed. This seems more appropriate than the previous one, so I changed the name to --verbose-errors.

> > Sorry, I couldn't understand your suggestion. Is this about the order of case
> > statements or pg_log_error?
>
> My sentence got mixed up. My point was about the case order, so that they
> are put in a more logical order when reading all the cases.

Ok. Considering the logical order, I moved WAIT_ROLLBACK_RESULT to between ERROR and RETRY, because WAIT_ROLLBACK_RESULT comes after the ERROR state, and RETRY comes after ERROR or WAIT_ROLLBACK_RESULT.

> >> Currently, ISTM that the retry on error mode is implicitely always on.
> >> Do we want that? I'd say yes, but maybe people could disagree.
> >
> > The default values of max-tries is 1, so the retry on error is off.
> > Failed transactions are retried only when the user wants it and
> > specifies a valid value to max-treis.
>
> Ok. My point is that we do not stop on such errors, whereas before ISTM
> that we would have stopped, so somehow the default behavior has changed
> and the previous behavior cannot be reinstated with an option. Maybe that
> is not bad, but this is a behavioral change which needs to be documented
> and argumented.

I understand. Indeed, there is a behavioural change in whether we abort the client after some types of errors or not. Now, serialization/deadlock errors don't cause the client to abort and are recorded as failures, whereas other errors cause the client to abort. If we want to record other errors as failures in the future, we will need a new option to specify which types of failures (or maybe all types of errors) should be reported. Until then, ISTM we can treat serialization and deadlock errors as special errors to be reported as failures. I rewrote the "Failures and Serialization/Deadlock Retries" section a bit to emphasize that such errors are treated differently from other errors.

> >> See the attached files for generating deadlocks reliably (start with 2
> >> clients). What do you think? The PL/pgSQL minimal, it is really
> >> client-code oriented.
> >
> > Sorry, but I cannot find the attached file.
>
> Sorry. Attached to this mail. The serialization stuff does not seem to
> work as well as the deadlock one. Run with 2 clients.

Hmmm, your test didn't work well for me. Both tests got stuck in pgbench_deadlock_wait() and pgbench didn't finish.

Regards,
Yugo Nagata
--
Yugo NAGATA <nagata@sraoss.co.jp>
Attachment
I have played with v14 patch. I previously complained that pgbench always reported 9 errors (actually the number is always the number specified by "-c" -1 in my case). $ pgbench -p 11000 -c 10 -T 10 --max-tries=0 test pgbench (15devel, server 13.3) starting vacuum...end. transaction type: <builtin: TPC-B (sort of)> scaling factor: 1 query mode: simple number of clients: 10 number of threads: 1 duration: 10 s number of transactions actually processed: 974 number of failed transactions: 9 (0.916%) number of transactions retried: 651 (66.226%) total number of retries: 8482 latency average = 101.317 ms (including failures) initial connection time = 44.440 ms tps = 97.796487 (without initial connection time) To reduce the number of errors I provide "--max-tries=9000" because pgbench reported 8482 errors. $ pgbench -p 11000 -c 10 -T 10 --max-tries=9000 test pgbench (15devel, server 13.3) starting vacuum...end. transaction type: <builtin: TPC-B (sort of)> scaling factor: 1 query mode: simple number of clients: 10 number of threads: 1 duration: 10 s number of transactions actually processed: 1133 number of failed transactions: 9 (0.788%) number of transactions retried: 755 (66.112%) total number of retries: 9278 maximum number of tries: 9000 latency average = 88.570 ms (including failures) initial connection time = 23.384 ms tps = 112.015219 (without initial connection time) Unfortunately this didn't work. Still 9 errors because pgbench terminated the last round of run. Then I gave up to use -T, and switched to use -t. Number of transactions for -t option was calculated by the total number of transactions actually processed (1133) / number of clients (10) = 11.33. I rouned up 11.33 to 12, then multiply number of clients (10) and got 120. The result: $ pgbench -p 11000 -c 10 -t 120 --max-tries=9000 test pgbench (15devel, server 13.3) starting vacuum...end. transaction type: <builtin: TPC-B (sort of)> scaling factor: 1 query mode: simple number of clients: 10 number of threads: 1 number of transactions per client: 120 number of transactions actually processed: 1200/1200 number of transactions retried: 675 (56.250%) total number of retries: 8524 maximum number of tries: 9000 latency average = 93.777 ms initial connection time = 14.120 ms tps = 106.635908 (without initial connection time) Finally I was able to get a result without any errors. This is not a super simple way to obtain pgbench results without errors, but probably I can live with it. Best regards, -- Tatsuo Ishii SRA OSS, Inc. Japan English: http://www.sraoss.co.jp/index_en.php Japanese:http://www.sraoss.co.jp
Hello,

> Of course, users themselves should be careful of problematic script, but it
> would be better that pgbench itself avoids problems if pgbench can beforehand.
>
>> Or, we should terminate the last cycle of benchmark regardless it is
>> retrying or not if -T expires. This will make pgbench behaves much
>> more consistent.

I would tend to agree with this behavior, that is, not to start any new transaction or transaction attempt once -T has expired.

I'm a little hesitant about how to count and report transactions left unfinished because of the benchmark timeout, though. Not counting them seems to be the best option.

> Hmmm, indeed this might make the behaviour a bit consistent, but I am not
> sure such behavioural change benefit users.

The user benefit would be that if they asked for a 100s benchmark, pgbench makes a reasonable effort not to overshoot that?

--
Fabien.
>>> Or, we should terminate the last cycle of benchmark regardless it is >>> retrying or not if -T expires. This will make pgbench behaves much >>> more consistent. > > I would tend to agree with this behavior, that is not to start any new > transaction or transaction attempt once -T has expired. > > I'm a little hesitant about how to count and report such unfinished > because of bench timeout transactions, though. Not counting them seems > to be the best option. I agree. >> Hmmm, indeed this might make the behaviour a bit consistent, but I am >> not >> sure such behavioural change benefit users. > > The user benefit would be that if they asked for a 100s benchmark, > pgbench does a reasonable effort not to overshot that? Right. Best regards, -- Tatsuo Ishii SRA OSS, Inc. Japan English: http://www.sraoss.co.jp/index_en.php Japanese:http://www.sraoss.co.jp
On Tue, 13 Jul 2021 13:00:49 +0900 (JST)
Tatsuo Ishii <ishii@sraoss.co.jp> wrote:

> >>> Or, we should terminate the last cycle of benchmark regardless it is
> >>> retrying or not if -T expires. This will make pgbench behaves much
> >>> more consistent.
> >
> > I would tend to agree with this behavior, that is not to start any new
> > transaction or transaction attempt once -T has expired.

That is the behavior in the latest patch. Once -T has expired, no new transaction or retry is started.

IIUC, Ishii-san's proposal was to change pgbench's behavior when -T has expired so that any running transaction is terminated immediately, regardless of whether it is retrying. I am not sure we should do that in this patch. If we want this change, it should be done in another patch as an improvement of the -T option.

> > I'm a little hesitant about how to count and report such unfinished
> > because of bench timeout transactions, though. Not counting them seems
> > to be the best option.
>
> I agree.

I also agree. Although I couldn't get an answer about what he thinks the actual harm for users is when retrying is terminated by the -T option, I guess the complaint was just about reporting the termination of retrying as failures. Therefore, I will fix it to finish the benchmark when the time is over during retrying, that is, to change the state to CSTATE_FINISHED instead of CSTATE_ERROR in such cases.

Regards,
Yugo Nagata
--
Yugo NAGATA <nagata@sraoss.co.jp>
>> > I would tend to agree with this behavior, that is not to start any new >> > transaction or transaction attempt once -T has expired. > > That is the behavior in the latest patch. Once -T has expired, any new > transaction or retry does not start. Actually v14 has not changed the behavior in this regard as explained in different email: > $ pgbench -p 11000 -c 10 -T 10 --max-tries=0 test > pgbench (15devel, server 13.3) > starting vacuum...end. > transaction type: <builtin: TPC-B (sort of)> > scaling factor: 1 > query mode: simple > number of clients: 10 > number of threads: 1 > duration: 10 s > number of transactions actually processed: 974 > number of failed transactions: 9 (0.916%) > number of transactions retried: 651 (66.226%) > total number of retries: 8482 > latency average = 101.317 ms (including failures) > initial connection time = 44.440 ms > tps = 97.796487 (without initial connection time) >> > I'm a little hesitant about how to count and report such unfinished >> > because of bench timeout transactions, though. Not counting them seems >> > to be the best option. >> >> I agree. > > I also agree. Although I couldn't get an answer what does he think the actual > harm for users due to termination of retrying by the -T option is, I guess it just > complained about reporting the termination of retrying as failures. Therefore, > I will fix to finish the benchmark when the time is over during retrying, that is, > change the state to CSTATE_FINISHED instead of CSTATE_ERROR in such cases. I guess Fabien wanted it differently. Suppose "-c 10 and -T 30" and we have 100 success transactions by time 25. At time 25 pgbench starts next benchmark cycle and by time 30 there are 10 failing transactions (because they are retrying). pgbench stops the execution at time 30. According your proposal (change the state to CSTATE_FINISHED instead of CSTATE_ERROR) the total number of success transactions will be 100 + 10 = 110, right? I guess Fabien wants to have the number to be 100 rather than 110. Fabien, Please correct me if you think differently. Also actually I have explained the harm number of times but you have kept on ignoring it because "it's subtle". My request has been pretty simple. > number of failed transactions: 9 (0.916%) I don't like this and want to have the failed transactions to be 0. Who wants a benchmark result having errors? Best regards, -- Tatsuo Ishii SRA OSS, Inc. Japan English: http://www.sraoss.co.jp/index_en.php Japanese:http://www.sraoss.co.jp
On Tue, 13 Jul 2021 14:35:00 +0900 (JST)
Tatsuo Ishii <ishii@sraoss.co.jp> wrote:

> >> > I would tend to agree with this behavior, that is not to start any new
> >> > transaction or transaction attempt once -T has expired.
> >
> > That is the behavior in the latest patch. Once -T has expired, any new
> > transaction or retry does not start.
>
> Actually v14 has not changed the behavior in this regard as explained
> in different email:

Right. Neither v13 nor v14 starts any new transaction or retry once -T has expired.

> >> > I'm a little hesitant about how to count and report such unfinished
> >> > because of bench timeout transactions, though. Not counting them seems
> >> > to be the best option.
> >>
> >> I agree.
> >
> > I also agree. Although I couldn't get an answer what does he think the actual
> > harm for users due to termination of retrying by the -T option is, I guess it just
> > complained about reporting the termination of retrying as failures. Therefore,
> > I will fix to finish the benchmark when the time is over during retrying, that is,
> > change the state to CSTATE_FINISHED instead of CSTATE_ERROR in such cases.
>
> I guess Fabien wanted it differently. Suppose "-c 10 and -T 30" and we
> have 100 success transactions by time 25. At time 25 pgbench starts
> next benchmark cycle and by time 30 there are 10 failing transactions
> (because they are retrying). pgbench stops the execution at time
> 30. According your proposal (change the state to CSTATE_FINISHED
> instead of CSTATE_ERROR) the total number of success transactions will
> be 100 + 10 = 110, right?

No. The last failed transaction is not counted because CSTATE_END_TX is bypassed, so please don't worry.

> Also actually I have explained the harm number of times but you have
> kept on ignoring it because "it's subtle". My request has been pretty
> simple.
>
> > number of failed transactions: 9 (0.916%)
>
> I don't like this and want to have the failed transactions to be 0.
> Who wants a benchmark result having errors?

I was asking because I would like to confirm what you really complained about: whether the problem is that a retrying transaction is terminated by the -T option, or that pgbench reports it in the number of failed transactions. But now I understand it is the latter: you don't want the termination of retrying to be counted as failures. Thanks.

Regards,
Yugo Nagata
--
Yugo NAGATA <nagata@sraoss.co.jp>
Hello,

I attached the updated patch.

On Tue, 13 Jul 2021 15:50:52 +0900
Yugo NAGATA <nagata@sraoss.co.jp> wrote:

> > >> > I'm a little hesitant about how to count and report such unfinished
> > >> > because of bench timeout transactions, though. Not counting them seems
> > >> > to be the best option.
> > > I will fix to finish the benchmark when the time is over during retrying, that is,
> > > change the state to CSTATE_FINISHED instead of CSTATE_ERROR in such cases.

Done. (I wrote CSTATE_ERROR, but correctly it is CSTATE_FAILURE.) Now, once the timer has expired while a failed transaction is being retried, pgbench never starts a new try. If the transaction succeeds, it will be counted in the result. Otherwise, if the transaction fails again, it is not counted.

In addition, I fixed the patch to work well with pipeline mode. Previously, pipeline mode was not considered enough and ROLLBACK was not sent correctly. I fixed it to handle errors in pipeline mode properly, and now it works.

Regards,
Yugo Nagata
--
Yugo NAGATA <nagata@sraoss.co.jp>
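To illustrate the kind of script that exercises this path (a hypothetical example, not taken from the patch or its tests; it assumes the standard pgbench tables created by pgbench -i and a made-up file name), a pipelined transaction run under repeatable read with several clients will periodically fail on the hot pgbench_branches row, and the failure has to be rolled back and retried after pipeline mode is exited:

$ cat pipeline_rr.sql
\startpipeline
BEGIN ISOLATION LEVEL REPEATABLE READ;
UPDATE pgbench_branches SET bbalance = bbalance + 1 WHERE bid = 1;
END;
\endpipeline
$ pgbench -M extended -c 10 -T 30 --max-tries=10 -f pipeline_rr.sql test   # pipelining requires the extended query protocol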
Attachment
> I attached the updated patch.

# About pgbench error handling v15

Patches apply cleanly. Compilation, global and local tests ok.

- v15.1: refactoring is a definite improvement. Good, even if it is not very useful (see below).
  While restructuring, maybe predefined variables could be made readonly so that a script which would update them would fail, which would be a good thing. This is probably material for an independent patch.

- v15.2: see detailed comments below

# Doc

Doc build is ok.

ISTM that the "number of tries" line would be better placed between the #threads and #transactions lines. What do you think?

Aggregate logging description: "{ failures | ... }" seems misleading because it suggests we have one or the other, whereas it can also be empty. I suggest: "{ | failures | ... }" to show the empty case.

Having a full example with retries in the doc is a good thing, and illustrates in passing that running with a number of clients on a small scale does not make much sense because of the contention on tellers/branches. I wonder whether the number of tries is set too high, though; ISTM that an application should give up before 100? I like that the feature is also limited by the latency limit.

Minor editing: "there're" -> "there are". "the --time" -> "the --time option".

The overall English seems good, but I'm not a native speaker. As I already said, proofreading by a native speaker would be nice. From a technical writing point of view, maybe the documentation could be improved a bit, but I'm not at ease on that subject. Some comments:

"The latency for failed transactions and commands is not computed separately." is unclear; please use a positive sentence to tell what is true instead of what is not, which the reader then has to guess. Maybe: "The latency figures include failed transactions which have reached the maximum number of tries or the transaction latency limit.".

"The main report contains the number of failed transactions if it is non-zero." ISTM that this is a pain for scripts which would like to process these report data, because the data may or may not be there. I'm sure to write such scripts, which explains my concern :-)

"If the total number of retried transactions is non-zero…" should it rather be "not one", because zero means unlimited retries?

The section describing the various types of errors that can occur is a good addition.

Option "--report-latencies" changed to "--report-per-commands": I'm fine with this change.

# FEATURES

--failures-detailed: I'm not convinced that this option should not always be on, but this is not very important, so let it be.

--verbose-errors: I still think this is only for debugging, but let it be.

Copying variables: ISTM that we should not need to save the variables' state… no clearing, no copying should be needed. The restarted transaction simply overrides the existing variables, which is what the previous version was doing anyway. The scripts should write their own variables before using them, and if they don't then it is the user's problem. This is important for performance, because it means that after a client has executed all scripts once, the variable array is stable and does not incur significant maintenance costs. The only thing that needs saving for retry is the pseudo-random generator state. This suggests simplifying or removing "RetryState".

# CODE

The semantics of "cnt" has changed. Ok, the overall counters and their relationships make sense, and it simplifies the reporting code. Good.
In readCommandResponse: ISTM that PGRES_NONFATAL_ERROR is not needed and could be dealt with in the default case. We are only interested in serialization/deadlocks, which are fatal errors?

doRetry: for consistency, given the assert, ISTM that it should return false if the duration has expired, by testing end_time or timer_exceeded.

checkTransactionStatus: this function does several things at once with 2 booleans, which makes it not very clear to me. Maybe it would be clearer if it just returned an enum (in trans, not in trans, conn error, other error). Another reason to do that is that on a connection error pgbench could try to reconnect, which would be an interesting later extension, so let's pave the way for that. Also, I do not think that the function should print out a message; it should be the caller's decision to do that.

verbose_errors: there is more or less repeated code under RETRY and FAILURE, which should be factored out into a separate function. The advanceConnectionState function is long enough already. Once this is done, there is no need for a getLatencyUsed function.

I'd put cleaning up the pipeline in a function. I do not understand why the pipeline mode is not exited in all cases; the code checks for the pipeline status twice within a few lines. I'd put this cleanup in the sync function as well, and report to the caller (advanceConnectionState) if there was an error, which would be managed there.

WAIT_ROLLBACK_RESULT: consuming results in a while loop could be a function to avoid code repetition (there and in the "error:" label in readCommandResponse). On the other hand, I'm not sure why the loop is needed: we can only get there by submitting a "ROLLBACK" command, so there should be only one result anyway?

report_per_command: please always count retries and failures of commands even if they will not be reported in the end; the code will be simpler and more efficient.

doLog: the format has changed, including a new string on failures which replaces the time field. Hmmm. I cannot say I like it much, but why not. ISTM that the string could be shortened to "deadlock" or "serialization". ISTM that the documentation example should include a line with a failure, to make it clear what to expect.

I'm okay with always computing thread stats.

# COMMENTS

struct StatsData comment is helpful.
- "failed transactions" -> "unsuccessfully retried transactions"?
- 'cnt' decomposition: first term is field 'retried'? if so say it explicitly?

"Complete the failed transaction" sounds strange: if it failed, it could not complete? I'd suggest "Record a failed transaction".

# TESTS

I suggested to simplify the tests by using conditionals & sequences. You reported that you got stuck. Hmmm.

I tried my tests again; they worked fine when started with 2 clients, otherwise they get stuck because the first client waits for the other one, which does not exist (the point is to generate deadlocks and other errors). Maybe this is your issue?

Could you try with:

psql < deadlock_prep.sql
pgbench -t 4 -c 2 -f deadlock.sql
# note: each deadlock detection takes 1 second

psql < deadlock_prep.sql
pgbench -t 10 -c 2 -f serializable.sql
# very quick 50% serialization errors

--
Fabien.
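The attached deadlock_prep.sql / deadlock.sql files are not reproduced in the archive; as a rough idea of the approach (a hypothetical reconstruction, not the actual attachment), two clients that lock the same two rows in opposite order, holding the first lock across a short sleep, will deadlock reliably. With the default deadlock_timeout of 1s, one of the two clients is chosen as the deadlock victim about once per second, which is exactly the error the patch retries:

$ cat deadlock_prep.sql
-- run once with psql: two rows that the pgbench clients lock in opposite order
DROP TABLE IF EXISTS dl;
CREATE TABLE dl (id int PRIMARY KEY, v int);
INSERT INTO dl VALUES (1, 0), (2, 0);
$ cat deadlock.sql
BEGIN;
\if :client_id % 2 = 0
UPDATE dl SET v = v + 1 WHERE id = 1;
\sleep 500 ms
UPDATE dl SET v = v + 1 WHERE id = 2;
\else
UPDATE dl SET v = v + 1 WHERE id = 2;
\sleep 500 ms
UPDATE dl SET v = v + 1 WHERE id = 1;
\endif
END;
$ pgbench -n -t 4 -c 2 -f deadlock.sql test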
Hello Fabien,

Thank you so much for your review. Sorry for the late reply. I had stopped working on it due to other jobs, but I have come back to it again. I attached the updated patch. I would appreciate it if you could review this again.

On Mon, 19 Jul 2021 20:04:23 +0200 (CEST)
Fabien COELHO <coelho@cri.ensmp.fr> wrote:

> # About pgbench error handling v15
>
> Patches apply cleanly. Compilation, global and local tests ok.
>
> - v15.1: refactoring is a definite improvement.
> Good, even if it is not very useful (see below).

Ok, we don't need to save variables in order to implement the retry feature in pgbench, as you suggested. Well, should we completely separate these two patches, and should I fix v15.2 so that it does not rely on v15.1?

> While restructuring, maybe predefined variables could be made readonly
> so that a script which would update them would fail, which would be a
> good thing. Maybe this is probably material for an independent patch.

Yes, it should be an independent patch.

> - v15.2: see detailed comments below
>
> # Doc
>
> Doc build is ok.
>
> ISTM that "number of tries" line would be better placed between the
> #threads and #transactions lines. What do you think?

Agreed. Fixed.

> Aggregate logging description: "{ failures | ... }" seems misleading
> because it suggests we have one or the other, whereas it can also be
> empty. I suggest: "{ | failures | ... }" to show the empty case.

The description is correct because either "failures" or both "serialization_failures" and "deadlock_failures" should always appear in the aggregate logging. If "failures" were printed only when a transaction failed, lines in the aggregate logging could have different numbers of columns, which would make it difficult to parse the results.

> I'd wonder whether the number of tries is set too high,
> though, ISTM that an application should give up before 100?

Indeed, max-tries=100 seems too high for a practical system. Also, I noticed that the sum of the latencies of each command (= 15.839 ms) is significantly larger than the latency average (= 10.870 ms) because the "per command" results in the documentation were fixed. So, I retook the measurements on my machine for more accurate documentation. I used max-tries=10.

> Minor editing:
>
> "there're" -> "there are".
>
> "the --time" -> "the --time option".

Fixed.

> "The latency for failed transactions and commands is not computed separately." is unclear,
> please use a positive sentence to tell what is true instead of what is not and the reader
> has to guess. Maybe: "The latency figures include failed transactions which have reached
> the maximum number of tries or the transaction latency limit.".

I'm not the original author of this description, but I guess this means "The latency is measured only for successful transactions and commands but not for failed transactions or commands.".

> "The main report contains the number of failed transactions if it is non-zero." ISTM that
> this is a pain for scripts which would like to process these reports data, because the data
> may or may not be there. I'm sure to write such scripts, which explains my concern:-)

I agree with you. I changed the behavior to always report the number of failed transactions regardless of whether it is non-zero or not.

> "If the total number of retried transactions is non-zero…" should it rather be "not one",
> because zero means unlimited retries?

I guess that this means the actual number of retried transactions, not max-tries, so "non-zero" was correct.
However, for the same reason as above, I changed the behavior to always report the retry statistics regardless of the actual number of retries.

> # FEATURES
> Copying variables: ISTM that we should not need to save the variables
> states… no clearing, no copying should be needed. The restarted
> transaction simply overrides the existing variables which is what the
> previous version was doing anyway. The scripts should write their own
> variables before using them, and if they don't then it is the user
> problem. This is important for performance, because it means that after a
> client has executed all scripts once the variable array is stable and does
> not incur significant maintenance costs. The only thing that needs saving
> for retry is the pseudo-random generator state. This suggests simplifying
> or removing "RetryState".

Yes. Saving the variables' state is not necessary because we retry the whole script. It was necessary in the initial patch because that patch planned to retry one transaction included in the script. I removed RetryState and copyVariables.

> # CODE
> In readCommandResponse: ISTM that PGRES_NONFATAL_ERROR is not needed and
> could be dealt with the default case. We are only interested in
> serialization/deadlocks which are fatal errors?

We need PGRES_NONFATAL_ERROR to save st->estatus. It is used outside readCommandResponse to determine whether we should abort or not.

> doRetry: for consistency, given the assert, ISTM that it should return
> false if duration has expired, by testing end_time or timer_exceeded.

Ok. I fixed doRetry to check timer_exceeded again.

> checkTransactionStatus: this function does several things at once with 2
> booleans, which make it not very clear to me. Maybe it would be clearer if
> it would just return an enum (in trans, not in trans, conn error, other
> error). Another reason to do that is that on connection error pgbench
> could try to reconnect, which would be an interesting later extension, so
> let's pave the way for that. Also, I do not think that the function
> should print out a message, it should be the caller decision to do that.

OK. I added a new enum type TStatus and fixed the function to return it. Also, I changed the function name to getTransactionStatus because the actual check is done by the caller.

> verbose_errors: there is more or less repeated code under RETRY and
> FAILURE, which should be factored out in a separate function. The
> advanceConnectionFunction is long enough. Once this is done, there is no
> need for a getLatencyUsed function.

OK. I made a function to print verbose error messages and removed the getLatencyUsed function.

> I'd put cleaning up the pipeline in a function. I do not understand why
> the pipeline mode is not exited in all cases, the code checks for the
> pipeline status twice in a few lines. I'd put this cleanup in the sync
> function as well, report to the caller (advanceConnectionState) if there
> was an error, which would be managed there.

I fixed the code to exit pipeline mode whenever we have an error in pipeline mode. Also, I added a PQpipelineSync call which was forgotten in the previous patch.

> WAIT_ROLLBACK_RESULT: consumming results in a while could be a function to
> avoid code repetition (there and in the "error:" label in
> readCommandResponse). On the other hand, I'm not sure why the loop is
> needed: we can only get there by submitting a "ROLLBACK" command, so there
> should be only one result anyway?

Right. We should receive just one PGRES_COMMAND_OK with a NULL following it. I eliminated the loop.
> report_per_command: please always count retries and failures of commands
> even if they will not be reported in the end, the code will be simpler and
> more efficient.

Ok. I fixed the code to count retries and failures of commands even if report_per_command is false.

> doLog: the format has changed, including a new string on failures which
> replace the time field. Hmmm. Cannot say I like it much, but why not. ISTM
> that the string could be shorten to "deadlock" or "serialization". ISTM
> that the documentation example should include a line with a failure, to
> make it clear what to expect.

I fixed getResultString to return "deadlock" or "serialization" instead of "deadlock_failure" or "serialization_failure". Also, I added an output example to the documentation.

> I'm okay with always getting computing thread stats.
>
> # COMMENTS
>
> struct StatsData comment is helpful.
> - "failed transactions" -> "unsuccessfully retried transactions"?

This seems an accurate description. However, "failed transaction" is short and simple, and it is used in several places, so instead of replacing them I added the following statement to define it: "a failed transaction is defined as an unsuccessfully retried transaction."

> - 'cnt' decomposition: first term is field 'retried'? if so say it
> explicitely?

No. 'retried' includes unsuccessfully retried transactions, but 'cnt' includes only successfully retried transactions.

> "Complete the failed transaction" sounds strange: If it failed, it could
> not complete? I'd suggest "Record a failed transaction".

Sounds good. Fixed.

> # TESTS
>
> I suggested to simplify the tests by using conditionals & sequences. You
> reported that you got stuck. Hmmm.
>
> I tried again my tests which worked fine when started with 2 clients,
> otherwise they get stuck because the first client waits for the other one
> which does not exist (the point is to generate deadlocks and other
> errors). Maybe this is your issue?

That seems to be right. It got stuck when I used the -T option rather than -t; I guess that was because the number of transactions on each thread was different.

> Could you try with:
>
> psql < deadlock_prep.sql
> pgbench -t 4 -c 2 -f deadlock.sql
> # note: each deadlock detection takes 1 second
>
> psql < deadlock_prep.sql
> pgbench -t 10 -c 2 -f serializable.sql
> # very quick 50% serialization errors

That works. However, it still hangs when --max-tries=2, so I don't think we can use it for testing the retry feature...

Regards,
Yugo Nagata
--
Yugo NAGATA <nagata@sraoss.co.jp>
Attachment
Hi Yugo and Fabien,

It seems the patch is ready for committer except for the items below. Do you guys want to do more on them?

>> # TESTS
>>
>> I suggested to simplify the tests by using conditionals & sequences. You
>> reported that you got stuck. Hmmm.
>>
>> I tried again my tests which worked fine when started with 2 clients,
>> otherwise they get stuck because the first client waits for the other one
>> which does not exists (the point is to generate deadlocks and other
>> errors). Maybe this is your issue?
>
> That seems to be right. It got stuck when I used -T option rather than -t,
> it was because, I guess, the number of transactions on each thread was
> different.
>
>> Could you try with:
>>
>> psql < deadlock_prep.sql
>> pgbench -t 4 -c 2 -f deadlock.sql
>> # note: each deadlock detection takes 1 second
>>
>> psql < deadlock_prep.sql
>> pgbench -t 10 -c 2 -f serializable.sql
>> # very quick 50% serialization errors
>
> That works. However, it still gets hang when --max-tries = 2,
> so maybe I would not think we can use it for testing the retry
> feature....

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp
Hello Tatsuo-san, > It seems the patch is ready for committer except below. Do you guys want > to do more on below? I'm planning a new review of this significant patch, possibly over the next week-end, or the next. -- Fabien.
Hello Yugo-san,

About Pgbench error handling v16:

This patch set needs a minor rebase because of 506035b0. Otherwise, patch compiles, global and local "make check" are ok. Doc generation is ok.

This patch is in good shape, the code and comments are clear. Some minor remarks below, including typos and a few small suggestions.

## About v16-1

This refactoring patch adds a struct for managing pgbench variables, instead of mixing fields into the client state (CState) struct. Patch compiles, global and local "make check" are both ok. Although this patch is not necessary to add the feature, I'm fine with it as it improves pgbench source code readability.

## About v16-2

This last patch adds handling of serialization and deadlock errors to pgbench transactions. This feature is desirable because it enlarges performance testing options, and makes pgbench behave more like a database client application. Possible future extensions enabled by this patch include handling disconnection errors by trying to reconnect, for instance.

The documentation is clear and well written, at least to my non-native speaker eyes and ears.

English: "he will be aborted" -> "it will be aborted".

I'm fine with renaming --report-latencies to --report-per-command as the latter is clearer about what the option does.

I'm still not sure I like the "failure detailed" option, ISTM that the report could always be detailed. That would remove some complexity and I do not think that people executing a bench with error handling would mind having the details. No big deal.

printVerboseErrorMessages: I'd make the buffer static and initialized only once so that there is no significant malloc/free cycle involved when calling the function.

advanceConnectionState: I'd really prefer not to add new variables (res, status) in the loop scope, and only declare them when actually needed in the state branches, so as to avoid any unwanted interaction between states.

typo: "fullowing" -> "following"

Pipeline cleaning: the advance function is already soooo long, I'd put that in a separate function and call it.

I think that the report should not remove data when they are 0, otherwise it makes it harder to script around it (in failures_detailed on line 6284).

The tests cover the different cases. I tried to suggest a simpler approach in a previous round, but it seems not so simple to do so. They could be simplified later, if possible.

-- 
Fabien.
Hello Fabien,

On Sat, 12 Mar 2022 15:54:54 +0100 (CET)
Fabien COELHO <coelho@cri.ensmp.fr> wrote:

> Hello Yugo-san,
>
> About Pgbench error handling v16:

Thank you for your review! I attached the updated patches.

> This patch set needs a minor rebase because of 506035b0. Otherwise, patch
> compiles, global and local "make check" are ok. Doc generation is ok.

I rebased it.

> ## About v16-2

> English: "he will be aborted" -> "it will be aborted".

Fixed.

> I'm still not sure I like the "failure detailed" option, ISTM that the report
> could be always detailed. That would remove some complexity and I do not think
> that people executing a bench with error handling would mind having the details.
> No big deal.

I didn't change it because I think those who don't expect any failures when using a well-designed script may not need failure details. I think reporting such details will be required only for benchmarks where failures are expected.

> printVerboseErrorMessages: I'd make the buffer static and initialized only once
> so that there is no significant malloc/free cycle involved when calling the function.

OK. I fixed printVerboseErrorMessages to use a static variable.

> advanceConnectionState: I'd really prefer not to add new variables (res, status)
> in the loop scope, and only declare them when actually needed in the state branches,
> so as to avoid any unwanted interaction between states.

I changed it to declare the variables in the case statement blocks.

> typo: "fullowing" -> "following"

Fixed.

> Pipeline cleaning: the advance function is already soooo long, I'd put that in a
> separate function and call it.

OK. I made a new function "discardUntilSync" for the pipeline cleaning.

> I think that the report should not remove data when they are 0, otherwise it makes
> it harder to script around it (in failures_detailed on line 6284).

I fixed it to always report both serialization and deadlock failures, even when they are 0.

Regards,
Yugo Nagata

-- 
Yugo NAGATA <nagata@sraoss.co.jp>
Attachment
Hi Yugo, I have looked into the patch and I noticed that <xref linkend=... endterm=...> is used in pgbench.sgml. e.g. <xref linkend="failures-and-retries" endterm="failures-and-retries-title"/> AFAIK this is the only place where "endterm" is used. In other places "link" tag is used instead: <link linkend="failures-and-retries">Failures and Serialization/Deadlock Retries</link> Note that the rendered result is identical. Do we want to use the link tag as well? Best reagards, -- Tatsuo Ishii SRA OSS, Inc. Japan English: http://www.sraoss.co.jp/index_en.php Japanese:http://www.sraoss.co.jp
Hi Yugo,

I tested the serialization error scenario by setting:
default_transaction_isolation = 'repeatable read'
The result was:

$ pgbench -t 10 -c 10 --max-tries=10 test
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 10
query mode: simple
number of clients: 10
number of threads: 1
maximum number of tries: 10
number of transactions per client: 10
number of transactions actually processed: 100/100
number of failed transactions: 0 (0.000%)
number of transactions retried: 35 (35.000%)
total number of retries: 74
latency average = 5.306 ms
initial connection time = 15.575 ms
tps = 1884.516810 (without initial connection time)

I had a hard time understanding what these numbers mean:
number of transactions retried: 35 (35.000%)
total number of retries: 74

It seems "total number of retries" matches the number of ERRORs reported by PostgreSQL. Good. What I am not sure about is "number of transactions retried". What does this mean?

Best regards,
-- 
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp
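For context on where these errors come from (a reading of the run above, not something stated in the thread): the built-in "TPC-B (sort of)" script updates one row in each of the three balance tables per transaction, and at scale factor 10 the branch and teller tables are small, so concurrent clients frequently update the same row; under repeatable read the later committer then fails with a serialization error and, with --max-tries=10, is rolled back and retried.

-- The update statements of the built-in TPC-B-like script:
UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;
UPDATE pgbench_tellers  SET tbalance = tbalance + :delta WHERE tid = :tid;
UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;
-- At scale factor 10 there are only 10 pgbench_branches rows and 100 pgbench_tellers
-- rows, so with 10 clients two transactions often hit the same row; each such collision
-- under repeatable read produces one "could not serialize access due to concurrent
-- update" error, which is what the "total number of retries" above appears to count.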
> Hi Yugo,
>
> I tested with serialization error scenario by setting:
> default_transaction_isolation = 'repeatable read'
> The result was:
>
> $ pgbench -t 10 -c 10 --max-tries=10 test
> transaction type: <builtin: TPC-B (sort of)>
> scaling factor: 10
> query mode: simple
> number of clients: 10
> number of threads: 1
> maximum number of tries: 10
> number of transactions per client: 10
> number of transactions actually processed: 100/100
> number of failed transactions: 0 (0.000%)
> number of transactions retried: 35 (35.000%)
> total number of retries: 74
> latency average = 5.306 ms
> initial connection time = 15.575 ms
> tps = 1884.516810 (without initial connection time)
>
> I had hard time to understand what those numbers mean:
> number of transactions retried: 35 (35.000%)
> total number of retries: 74
>
> It seems "total number of retries" matches with the number of ERRORs
> reported in PostgreSQL. Good. What I am not sure is "number of
> transactions retried". What does this mean?

Oh, OK. I see it now. It turns out that "number of transactions retried" does not actually mean the number of transactions retried. Suppose pgbench executes the following in a session:

BEGIN;    -- transaction A starts
 :
(ERROR)
ROLLBACK; -- transaction A aborts

(retry)

BEGIN;    -- transaction B starts
 :
(ERROR)
ROLLBACK; -- transaction B aborts

(retry)

BEGIN;    -- transaction C starts
 :
END;      -- finally succeeds

In this case "total number of retries:" = 2 and "number of transactions retried:" = 1. In this patch, transactions A, B and C are regarded as the "same" transaction, so the retried transaction count becomes 1. But it's confusing to use the word "transaction" here because A, B and C are different transactions. I think it would be better to use a different word instead of "transaction", something like "cycle"? i.e.

number of cycles retried: 35 (35.000%)

Best regards,
-- 
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp
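Applying that definition to the run quoted above (an interpretation, assuming the counters are read correctly): 74 is the total number of extra attempts, i.e. it matches the number of ERRORs in the server log, while 35 is the number of transaction "cycles" that needed at least one extra attempt. So a retried cycle needed on average 74/35 ≈ 2.1 retries, about 3.1 attempts in total, and since "number of failed transactions" is 0, every cycle succeeded within the --max-tries=10 limit.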
Hi Ishii-san,

On Sun, 20 Mar 2022 09:52:06 +0900 (JST)
Tatsuo Ishii <ishii@sraoss.co.jp> wrote:

> Hi Yugo,
>
> I have looked into the patch and I noticed that <xref
> linkend=... endterm=...> is used in pgbench.sgml. e.g.
>
> <xref linkend="failures-and-retries" endterm="failures-and-retries-title"/>
>
> AFAIK this is the only place where "endterm" is used. In other places
> "link" tag is used instead:

Thank you for pointing it out.

I've checked other places using <xref/> referring to <refsect2>, and found that "xreflabel"s are used in such <refsect2> tags. So, I'll fix it in this style.

Regards,
Yugo Nagata

-- 
Yugo NAGATA <nagata@sraoss.co.jp>
On Sun, 20 Mar 2022 16:11:43 +0900 (JST) Tatsuo Ishii <ishii@sraoss.co.jp> wrote: > > Hi Yugo, > > > > I tested with serialization error scenario by setting: > > default_transaction_isolation = 'repeatable read' > > The result was: > > > > $ pgbench -t 10 -c 10 --max-tries=10 test > > transaction type: <builtin: TPC-B (sort of)> > > scaling factor: 10 > > query mode: simple > > number of clients: 10 > > number of threads: 1 > > maximum number of tries: 10 > > number of transactions per client: 10 > > number of transactions actually processed: 100/100 > > number of failed transactions: 0 (0.000%) > > number of transactions retried: 35 (35.000%) > > total number of retries: 74 > > latency average = 5.306 ms > > initial connection time = 15.575 ms > > tps = 1884.516810 (without initial connection time) > > > > I had hard time to understand what those numbers mean: > > number of transactions retried: 35 (35.000%) > > total number of retries: 74 > > > > It seems "total number of retries" matches with the number of ERRORs > > reported in PostgreSQL. Good. What I am not sure is "number of > > transactions retried". What does this mean? > > Oh, ok. I see it now. It turned out that "number of transactions > retried" does not actually means the number of transactions > rtried. Suppose pgbench exectutes following in a session: > > BEGIN; -- transaction A starts > : > (ERROR) > ROLLBACK; -- transaction A aborts > > (retry) > > BEGIN; -- transaction B starts > : > (ERROR) > ROLLBACK; -- transaction B aborts > > (retry) > > BEGIN; -- transaction C starts > : > END; -- finally succeeds > > In this case "total number of retries:" = 2 and "number of > transactions retried:" = 1. In this patch transactions A, B and C are > regarded as "same" transaction, so the retried transaction count > becomes 1. But it's confusing to use the language "transaction" here > because A, B and C are different transactions. I would think it's > better to use different language instead of "transaction", something > like "cycle"? i.e. > > number of cycles retried: 35 (35.000%) In the original patch by Marina Polyakova it was "number of retried", but I changed it to "number of transactions retried" is because I felt it was confusing with "number of retries". I chose the word "transaction" because a transaction ends in any one of successful commit , skipped, or failure, after possible retries. Well, I agree with that it is somewhat confusing wording. If we can find nice word to resolve the confusion, I don't mind if we change the word. Maybe, we can use "executions" as well as "cycles". However, I am not sure that the situation is improved by using such word because what such word exactly means seems to be still unclear for users. Another idea is instead reporting only "the number of successfully retried transactions" that does not include "failed transactions", that is, transactions failed after retries, like this; number of transactions actually processed: 100/100 number of failed transactions: 0 (0.000%) number of successfully retried transactions: 35 (35.000%) total number of retries: 74 The meaning is clear and there seems to be no confusion. Regards, Yugo Nagata -- Yugo NAGATA <nagata@sraoss.co.jp>
> On Sun, 20 Mar 2022 16:11:43 +0900 (JST) > Tatsuo Ishii <ishii@sraoss.co.jp> wrote: > >> > Hi Yugo, >> > >> > I tested with serialization error scenario by setting: >> > default_transaction_isolation = 'repeatable read' >> > The result was: >> > >> > $ pgbench -t 10 -c 10 --max-tries=10 test >> > transaction type: <builtin: TPC-B (sort of)> >> > scaling factor: 10 >> > query mode: simple >> > number of clients: 10 >> > number of threads: 1 >> > maximum number of tries: 10 >> > number of transactions per client: 10 >> > number of transactions actually processed: 100/100 >> > number of failed transactions: 0 (0.000%) >> > number of transactions retried: 35 (35.000%) >> > total number of retries: 74 >> > latency average = 5.306 ms >> > initial connection time = 15.575 ms >> > tps = 1884.516810 (without initial connection time) >> > >> > I had hard time to understand what those numbers mean: >> > number of transactions retried: 35 (35.000%) >> > total number of retries: 74 >> > >> > It seems "total number of retries" matches with the number of ERRORs >> > reported in PostgreSQL. Good. What I am not sure is "number of >> > transactions retried". What does this mean? >> >> Oh, ok. I see it now. It turned out that "number of transactions >> retried" does not actually means the number of transactions >> rtried. Suppose pgbench exectutes following in a session: >> >> BEGIN; -- transaction A starts >> : >> (ERROR) >> ROLLBACK; -- transaction A aborts >> >> (retry) >> >> BEGIN; -- transaction B starts >> : >> (ERROR) >> ROLLBACK; -- transaction B aborts >> >> (retry) >> >> BEGIN; -- transaction C starts >> : >> END; -- finally succeeds >> >> In this case "total number of retries:" = 2 and "number of >> transactions retried:" = 1. In this patch transactions A, B and C are >> regarded as "same" transaction, so the retried transaction count >> becomes 1. But it's confusing to use the language "transaction" here >> because A, B and C are different transactions. I would think it's >> better to use different language instead of "transaction", something >> like "cycle"? i.e. >> >> number of cycles retried: 35 (35.000%) I realized that the same argument can be applied even to "number of transactions actually processed" because with the retry feature, "transaction" could comprise multiple transactions. But if we go forward and replace those "transactions" with "cycles" (or whatever) altogether, probably it could bring enough confusion to users who have been using pgbench. Probably we should give up the language changing and redefine "transaction" when the retry feature is enabled instead like "when retry feature is enabled, each transaction can be consisted of multiple transactions retried." > In the original patch by Marina Polyakova it was "number of retried", > but I changed it to "number of transactions retried" is because I felt > it was confusing with "number of retries". I chose the word "transaction" > because a transaction ends in any one of successful commit , skipped, or > failure, after possible retries. Ok. > Well, I agree with that it is somewhat confusing wording. If we can find > nice word to resolve the confusion, I don't mind if we change the word. > Maybe, we can use "executions" as well as "cycles". However, I am not sure > that the situation is improved by using such word because what such word > exactly means seems to be still unclear for users. 
>
> Another idea is instead reporting only "the number of successfully
> retried transactions" that does not include "failed transactions",
> that is, transactions failed after retries, like this;
>
> number of transactions actually processed: 100/100
> number of failed transactions: 0 (0.000%)
> number of successfully retried transactions: 35 (35.000%)
> total number of retries: 74
>
> The meaning is clear and there seems to be no confusion.

Thank you for the suggestion. But I think it would be better to leave it as it is, for the reason I mentioned above.

Best regards,
-- 
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp
On Tue, 22 Mar 2022 09:08:15 +0900 Yugo NAGATA <nagata@sraoss.co.jp> wrote: > Hi Ishii-san, > > On Sun, 20 Mar 2022 09:52:06 +0900 (JST) > Tatsuo Ishii <ishii@sraoss.co.jp> wrote: > > > Hi Yugo, > > > > I have looked into the patch and I noticed that <xref > > linkend=... endterm=...> is used in pgbench.sgml. e.g. > > > > <xref linkend="failures-and-retries" endterm="failures-and-retries-title"/> > > > > AFAIK this is the only place where "endterm" is used. In other places > > "link" tag is used instead: > > Thank you for pointing out it. > > I've checked other places using <xref/> referring to <refsect2>, and found > that "xreflabel"s are used in such <refsect2> tags. So, I'll fix it > in this style. I attached the updated patch. I also fixed the following paragraph which I had forgotten to fix in the previous patch. The first seven lines report some of the most important parameter settings. The sixth line reports the maximum number of tries for transactions with serialization or deadlock errors Regards, Yugo Nagata -- Yugo NAGATA <nagata@sraoss.co.jp>
Attachment
>> I've checked other places using <xref/> referring to <refsect2>, and found >> that "xreflabel"s are used in such <refsect2> tags. So, I'll fix it >> in this style. > > I attached the updated patch. I also fixed the following paragraph which I had > forgotten to fix in the previous patch. > > The first seven lines report some of the most important parameter settings. > The sixth line reports the maximum number of tries for transactions with > serialization or deadlock errors Thank you for the updated patch. I think the patches look good and now it's ready for commit. If there's no objection, I would like to commit/push the patches. Best reagards, -- Tatsuo Ishii SRA OSS, Inc. Japan English: http://www.sraoss.co.jp/index_en.php Japanese:http://www.sraoss.co.jp
>> I attached the updated patch. I also fixed the following paragraph which I had >> forgotten to fix in the previous patch. >> >> The first seven lines report some of the most important parameter settings. >> The sixth line reports the maximum number of tries for transactions with >> serialization or deadlock errors > > Thank you for the updated patch. I think the patches look good and now > it's ready for commit. If there's no objection, I would like to > commit/push the patches. The patch Pushed. Thank you! Best reagards, -- Tatsuo Ishii SRA OSS, Inc. Japan English: http://www.sraoss.co.jp/index_en.php Japanese:http://www.sraoss.co.jp
Tatsuo Ishii <ishii@sraoss.co.jp> writes:
> The patch Pushed. Thank you!

My hoary animal prairiedog doesn't like this [1]:

# Failed test 'concurrent update with retrying stderr /(?s-xim:client (0|1) got an error in command 3 \\(SQL\\) of script 0; ERROR: could not serialize access due to concurrent update\\b.*\\g1)/'
# at t/001_pgbench_with_server.pl line 1229.
# 'pgbench: pghost: /tmp/nhghgwAoki pgport: 58259 nclients: 2 nxacts: 1 dbName: postgres
...
# pgbench: client 0 got an error in command 3 (SQL) of script 0; ERROR: could not serialize access due to concurrent update
...
# '
# doesn't match '(?s-xim:client (0|1) got an error in command 3 \\(SQL\\) of script 0; ERROR: could not serialize access due to concurrent update\\b.*\\g1)'
# Looks like you failed 1 test of 425.

I'm not sure what the "\\b.*\\g1" part of this regex is meant to accomplish, but it seems to be assuming more than it should about the output format of TAP messages.

regards, tom lane

[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prairiedog&dt=2022-03-23%2013%3A21%3A44
On Wed, 23 Mar 2022 14:26:54 -0400
Tom Lane <tgl@sss.pgh.pa.us> wrote:

> Tatsuo Ishii <ishii@sraoss.co.jp> writes:
> > The patch Pushed. Thank you!
>
> My hoary animal prairiedog doesn't like this [1]:
>
> # Failed test 'concurrent update with retrying stderr /(?s-xim:client (0|1) got an error in command 3 \\(SQL\\) of script 0; ERROR: could not serialize access due to concurrent update\\b.*\\g1)/'
> # at t/001_pgbench_with_server.pl line 1229.
> # 'pgbench: pghost: /tmp/nhghgwAoki pgport: 58259 nclients: 2 nxacts: 1 dbName: postgres
> ...
> # pgbench: client 0 got an error in command 3 (SQL) of script 0; ERROR: could not serialize access due to concurrent update
> ...
> # '
> # doesn't match '(?s-xim:client (0|1) got an error in command 3 \\(SQL\\) of script 0; ERROR: could not serialize access due to concurrent update\\b.*\\g1)'
> # Looks like you failed 1 test of 425.
>
> I'm not sure what the "\\b.*\\g1" part of this regex is meant to
> accomplish, but it seems to be assuming more than it should
> about the output format of TAP messages.

I had edited the test code from the original patch by mistake, but I did not notice it because the test somehow works on my machine without any errors.

I attached a patch to fix the test so that it is as in the original patch, where backreferences are used to check that the same query is retried.

Regards,
Yugo Nagata

-- 
Yugo NAGATA <nagata@sraoss.co.jp>
Attachment
>> My hoary animal prairiedog doesn't like this [1]:
>>
>> # Failed test 'concurrent update with retrying stderr /(?s-xim:client (0|1) got an error in command 3 \\(SQL\\) of script 0; ERROR: could not serialize access due to concurrent update\\b.*\\g1)/'
>> # at t/001_pgbench_with_server.pl line 1229.
>> # 'pgbench: pghost: /tmp/nhghgwAoki pgport: 58259 nclients: 2 nxacts: 1 dbName: postgres
>> ...
>> # pgbench: client 0 got an error in command 3 (SQL) of script 0; ERROR: could not serialize access due to concurrent update
>> ...
>> # '
>> # doesn't match '(?s-xim:client (0|1) got an error in command 3 \\(SQL\\) of script 0; ERROR: could not serialize access due to concurrent update\\b.*\\g1)'
>> # Looks like you failed 1 test of 425.
>>
>> I'm not sure what the "\\b.*\\g1" part of this regex is meant to
>> accomplish, but it seems to be assuming more than it should
>> about the output format of TAP messages.
>
> I have edited the test code from the original patch by mistake, but
> I could not realize because the test works in my machine without any
> errors somehow.
>
> I attached a patch to fix the test as was in the original patch, where
> backreferences are used to check retry of the same query.

My machine (Ubuntu 20) did not complain either. Maybe a Perl version difference? Anyway, the fix is pushed. Let's see how prairiedog feels.

Best regards,
-- 
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp
Tatsuo Ishii <ishii@sraoss.co.jp> writes: >> My hoary animal prairiedog doesn't like this [1]: > My machine (Ubuntu 20) did not complain either. Maybe perl version > difference? Any way, the fix pushed. Let's see how prairiedog feels. Still not happy. After some digging in man pages, I believe the problem is that its old version of Perl does not understand "\gN" backreferences. Is there a good reason to be using that rather than the traditional "\N" backref notation? regards, tom lane
>> My machine (Ubuntu 20) did not complain either. Maybe perl version >> difference? Any way, the fix pushed. Let's see how prairiedog feels. > > Still not happy. After some digging in man pages, I believe the > problem is that its old version of Perl does not understand "\gN" > backreferences. Is there a good reason to be using that rather > than the traditional "\N" backref notation? I don't see a reason to use "\gN" either. Actually after applying attached patch, my machine is still happy with pgbench test. Yugo? -- Tatsuo Ishii SRA OSS, Inc. Japan English: http://www.sraoss.co.jp/index_en.php Japanese:http://www.sraoss.co.jp diff --git a/src/bin/pgbench/t/001_pgbench_with_server.pl b/src/bin/pgbench/t/001_pgbench_with_server.pl index 60cae1e843..22a23489e8 100644 --- a/src/bin/pgbench/t/001_pgbench_with_server.pl +++ b/src/bin/pgbench/t/001_pgbench_with_server.pl @@ -1224,7 +1224,7 @@ my $err_pattern = "(client (0|1) sending UPDATE xy SET y = y \\+ -?\\d+\\b).*" . "client \\g2 got an error in command 3 \\(SQL\\) of script 0; " . "ERROR: could not serialize access due to concurrent update\\b.*" - . "\\g1"; + . "\\1"; $node->pgbench( "-n -c 2 -t 1 -d --verbose-errors --max-tries 2",
Tatsuo Ishii <ishii@sraoss.co.jp> writes: > I don't see a reason to use "\gN" either. Actually after applying > attached patch, my machine is still happy with pgbench test. Note that the \\g2 just above also needs to be changed. regards, tom lane
> Note that the \\g2 just above also needs to be changed. Oops. Thanks. New patch attached. Test has passed on my machine. Best reagards, -- Tatsuo Ishii SRA OSS, Inc. Japan English: http://www.sraoss.co.jp/index_en.php Japanese:http://www.sraoss.co.jp diff --git a/src/bin/pgbench/t/001_pgbench_with_server.pl b/src/bin/pgbench/t/001_pgbench_with_server.pl index 60cae1e843..ca71f968dc 100644 --- a/src/bin/pgbench/t/001_pgbench_with_server.pl +++ b/src/bin/pgbench/t/001_pgbench_with_server.pl @@ -1222,9 +1222,9 @@ local $ENV{PGOPTIONS} = "-c default_transaction_isolation=repeatable\\ read"; # delta variable in the next try my $err_pattern = "(client (0|1) sending UPDATE xy SET y = y \\+ -?\\d+\\b).*" - . "client \\g2 got an error in command 3 \\(SQL\\) of script 0; " + . "client \\2 got an error in command 3 \\(SQL\\) of script 0; " . "ERROR: could not serialize access due to concurrent update\\b.*" - . "\\g1"; + . "\\1"; $node->pgbench( "-n -c 2 -t 1 -d --verbose-errors --max-tries 2",
Tatsuo Ishii <ishii@sraoss.co.jp> writes: > Oops. Thanks. New patch attached. Test has passed on my machine. I reproduced the failure on another machine with perl 5.8.8, and I can confirm that this patch fixes it. regards, tom lane
On Fri, 25 Mar 2022 09:14:00 +0900 (JST) Tatsuo Ishii <ishii@sraoss.co.jp> wrote: > > Note that the \\g2 just above also needs to be changed. > > Oops. Thanks. New patch attached. Test has passed on my machine. This patch works for me. I think it is ok to use \N instead of \gN. Regards, Yugo Nagata -- Yugo NAGATA <nagata@sraoss.co.jp>
> I reproduced the failure on another machine with perl 5.8.8, > and I can confirm that this patch fixes it. Thank you for the test. I have pushed the patch. Best reagards, -- Tatsuo Ishii SRA OSS, Inc. Japan English: http://www.sraoss.co.jp/index_en.php Japanese:http://www.sraoss.co.jp
>> Oops. Thanks. New patch attached. Test has passed on my machine. > > This patch works for me. I think it is ok to use \N instead of \gN. Thanks. Patch pushed. Best reagards, -- Tatsuo Ishii SRA OSS, Inc. Japan English: http://www.sraoss.co.jp/index_en.php Japanese:http://www.sraoss.co.jp
Tatsuo Ishii <ishii@sraoss.co.jp> writes:
> Thanks. Patch pushed.

This patch has caused the PDF documentation to fail to build cleanly:

[WARN] FOUserAgent - The contents of fo:block line 1 exceed the available area in the inline-progression direction by more than 50 points. (See position 125066:375)

It's complaining about this:

<synopsis>
<replaceable>interval_start</replaceable> <replaceable>num_transactions</replaceable> <replaceable>sum_latency</replaceable> <replaceable>sum_latency_2</replaceable> <replaceable>min_latency</replaceable> <replaceable>max_latency</replaceable> { <replaceable>failures</replaceable> | <replaceable>serialization_failures</replaceable> <replaceable>deadlock_failures</replaceable> } <optional> <replaceable>sum_lag</replaceable> <replaceable>sum_lag_2</replaceable> <replaceable>min_lag</replaceable> <replaceable>max_lag</replaceable> <optional> <replaceable>skipped</replaceable> </optional> </optional> <optional> <replaceable>retried</replaceable> <replaceable>retries</replaceable> </optional>
</synopsis>

which runs much too wide in HTML format too, even though that toolchain doesn't tell you so.

We could silence the warning by inserting an arbitrary line break or two, or refactoring the syntax description into multiple parts. Either way seems to create a risk of confusion.

TBH, I think the *real* problem is that the complexity of this log format has blown past "out of hand". Can't we simplify it? Who is really going to use all these numbers? I pity the poor sucker who tries to write a log analysis tool that will handle all the variants.

regards, tom lane
> This patch has caused the PDF documentation to fail to build cleanly:
>
> [WARN] FOUserAgent - The contents of fo:block line 1 exceed the available area in the inline-progression direction by more than 50 points. (See position 125066:375)
>
> It's complaining about this:
>
> <synopsis>
> <replaceable>interval_start</replaceable> <replaceable>num_transactions</replaceable> <replaceable>sum_latency</replaceable> <replaceable>sum_latency_2</replaceable> <replaceable>min_latency</replaceable> <replaceable>max_latency</replaceable> { <replaceable>failures</replaceable> | <replaceable>serialization_failures</replaceable> <replaceable>deadlock_failures</replaceable> } <optional> <replaceable>sum_lag</replaceable> <replaceable>sum_lag_2</replaceable> <replaceable>min_lag</replaceable> <replaceable>max_lag</replaceable> <optional> <replaceable>skipped</replaceable> </optional> </optional> <optional> <replaceable>retried</replaceable> <replaceable>retries</replaceable> </optional>
> </synopsis>
>
> which runs much too wide in HTML format too, even though that toolchain
> doesn't tell you so.

Yeah.

> We could silence the warning by inserting an arbitrary line break or two,
> or refactoring the syntax description into multiple parts. Either way
> seems to create a risk of confusion.

I think we can fold the line nicely. Here is the rendered image.

Before:
interval_start num_transactions sum_latency sum_latency_2 min_latency max_latency { failures | serialization_failures deadlock_failures } [ sum_lag sum_lag_2 min_lag max_lag [ skipped ] ] [ retried retries ]

After:
interval_start num_transactions sum_latency sum_latency_2 min_latency max_latency
  { failures | serialization_failures deadlock_failures } [ sum_lag sum_lag_2 min_lag max_lag [ skipped ] ] [ retried retries ]

Note that before it was like this:

interval_start num_transactions sum_latency sum_latency_2 min_latency max_latency [ sum_lag sum_lag_2 min_lag max_lag [ skipped ] ]

So the newly added items are "{ failures | serialization_failures deadlock_failures }" and "[ retried retries ]".

> TBH, I think the *real* problem is that the complexity of this log format
> has blown past "out of hand". Can't we simplify it? Who is really going
> to use all these numbers? I pity the poor sucker who tries to write a
> log analysis tool that will handle all the variants.

Well, the extra logging items above only appear when the retry feature is enabled. For those who do not use the feature, the only new logging item is "failures". For those who use the feature, the extra logging items are apparently necessary. For example, if we write an application using the repeatable read or serializable transaction isolation level, retrying transactions that failed due to serialization errors is an essential technique. Also, the retry rate of transactions will deeply affect performance, and in such use cases the newly added items will be precious information. I would suggest leaving the log items as they are.

Patch attached.

Best regards,
-- 
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp

diff --git a/doc/src/sgml/ref/pgbench.sgml b/doc/src/sgml/ref/pgbench.sgml
index ebdb4b3f46..b65b813ebe 100644
--- a/doc/src/sgml/ref/pgbench.sgml
+++ b/doc/src/sgml/ref/pgbench.sgml
@@ -2398,10 +2398,11 @@ END;
  <para>
   With the <option>--aggregate-interval</option> option, a different
-  format is used for the log files:
+  format is used for the log files (note that the actual log line is not folded).
<synopsis> -<replaceable>interval_start</replaceable> <replaceable>num_transactions</replaceable> <replaceable>sum_latency</replaceable><replaceable>sum_latency_2</replaceable> <replaceable>min_latency</replaceable> <replaceable>max_latency</replaceable>{ <replaceable>failures</replaceable> | <replaceable>serialization_failures</replaceable><replaceable>deadlock_failures</replaceable> } <optional> <replaceable>sum_lag</replaceable><replaceable>sum_lag_2</replaceable> <replaceable>min_lag</replaceable> <replaceable>max_lag</replaceable><optional> <replaceable>skipped</replaceable> </optional> </optional> <optional> <replaceable>retried</replaceable><replaceable>retries</replaceable> </optional> + <replaceable>interval_start</replaceable> <replaceable>num_transactions</replaceable> <replaceable>sum_latency</replaceable><replaceable>sum_latency_2</replaceable> <replaceable>min_latency</replaceable> <replaceable>max_latency</replaceable> + { <replaceable>failures</replaceable> | <replaceable>serialization_failures</replaceable> <replaceable>deadlock_failures</replaceable>} <optional> <replaceable>sum_lag</replaceable> <replaceable>sum_lag_2</replaceable><replaceable>min_lag</replaceable> <replaceable>max_lag</replaceable> <optional> <replaceable>skipped</replaceable></optional> </optional> <optional> <replaceable>retried</replaceable> <replaceable>retries</replaceable></optional> </synopsis> where
On Sun, 27 Mar 2022 15:28:41 +0900 (JST) Tatsuo Ishii <ishii@sraoss.co.jp> wrote: > > This patch has caused the PDF documentation to fail to build cleanly: > > > > [WARN] FOUserAgent - The contents of fo:block line 1 exceed the available area in the inline-progression direction bymore than 50 points. (See position 125066:375) > > > > It's complaining about this: > > > > <synopsis> > > <replaceable>interval_start</replaceable> <replaceable>num_transactions</replaceable> <replaceable>sum_latency</replaceable><replaceable>sum_latency_2</replaceable> <replaceable>min_latency</replaceable> <replaceable>max_latency</replaceable>{ <replaceable>failures</replaceable> | <replaceable>serialization_failures</replaceable><replaceable>deadlock_failures</replaceable> } <optional> <replaceable>sum_lag</replaceable><replaceable>sum_lag_2</replaceable> <replaceable>min_lag</replaceable> <replaceable>max_lag</replaceable><optional> <replaceable>skipped</replaceable> </optional> </optional> <optional> <replaceable>retried</replaceable><replaceable>retries</replaceable> </optional> > > </synopsis> > > > > which runs much too wide in HTML format too, even though that toolchain > > doesn't tell you so. > > Yeah. > > > We could silence the warning by inserting an arbitrary line break or two, > > or refactoring the syntax description into multiple parts. Either way > > seems to create a risk of confusion. > > I think we can fold the line nicely. Here is the rendered image. > > Before: > interval_start num_transactions sum_latency sum_latency_2 min_latency max_latency { failures | serialization_failures deadlock_failures} [ sum_lag sum_lag_2 min_lag max_lag [ skipped ] ] [ retried retries ] > > After: > interval_start num_transactions sum_latency sum_latency_2 min_latency max_latency > { failures | serialization_failures deadlock_failures } [ sum_lag sum_lag_2 min_lag max_lag [ skipped ] ] [ retried retries] > > Note that before it was like this: > > interval_start num_transactions sum_latency sum_latency_2 min_latency max_latency [ sum_lag sum_lag_2 min_lag max_lag[ skipped ] ] > > So newly added items are "{ failures | serialization_failures deadlock_failures }" and " [ retried retries ]". > > > TBH, I think the *real* problem is that the complexity of this log format > > has blown past "out of hand". Can't we simplify it? Who is really going > > to use all these numbers? I pity the poor sucker who tries to write a > > log analysis tool that will handle all the variants. > > Well, the extra logging items above only appear when the retry feature > is enabled. For those who do not use the feature the only new logging > item is "failures". For those who use the feature, the extra logging > items are apparently necessary. For example if we write an application > using repeatable read or serializable transaction isolation mode, > retrying failed transactions due to srialization error is an essential > technique. Also the retry rate of transactions will deeply affect the > performance and in such use cases the newly added items will be > precisou information. I would suggest leave the log items as it is. > > Patch attached. Even applying this patch, "make postgres-A4.pdf" arises the warning on my machine. After some investigations, I found that previous document had a break after 'num_transactions', but it has been removed due to this commit. So, I would like to get back this as it was. I attached the patch. Regards, Yugo Nagata -- Yugo NAGATA <nagata@sraoss.co.jp>
Attachment
> Even applying this patch, "make postgres-A4.pdf" arises the warning on my > machine. After some investigations, I found that previous document had a break > after 'num_transactions', but it has been removed due to this commit. Yes, your patch removed "&zwsp;". > So, > I would like to get back this as it was. I attached the patch. This produces errors. Needs ";" postfix? ref/pgbench.sgml:2404: parser error : EntityRef: expecting ';' le>interval_start</replaceable> <replaceable>num_transactions</replaceable>&zwsp ^ ref/pgbench.sgml:2781: parser error : chunk is not well balanced ^ reference.sgml:251: parser error : Failure to process entity pgbench &pgbench; ^ reference.sgml:251: parser error : Entity 'pgbench' not defined &pgbench; ^ reference.sgml:296: parser error : chunk is not well balanced ^ postgres.sgml:240: parser error : Failure to process entity reference &reference; ^ postgres.sgml:240: parser error : Entity 'reference' not defined &reference; ^ make: *** [Makefile:135: html-stamp] エラー 1 Best reagards, -- Tatsuo Ishii SRA OSS, Inc. Japan English: http://www.sraoss.co.jp/index_en.php Japanese:http://www.sraoss.co.jp
On Mon, 28 Mar 2022 12:17:13 +0900 (JST) Tatsuo Ishii <ishii@sraoss.co.jp> wrote: > > Even applying this patch, "make postgres-A4.pdf" arises the warning on my > > machine. After some investigations, I found that previous document had a break > > after 'num_transactions', but it has been removed due to this commit. > > Yes, your patch removed "&zwsp;". > > > So, > > I would like to get back this as it was. I attached the patch. > > This produces errors. Needs ";" postfix? Oops. Yes, it needs ';'. Also, I found another "&zwsp;" dropped. I attached the fixed patch. Regards, Yugo Nagata -- Yugo NAGATA <nagata@sraoss.co.jp>
Attachment
>> > Even applying this patch, "make postgres-A4.pdf" arises the warning on my
>> > machine. After some investigations, I found that previous document had a break
>> > after 'num_transactions', but it has been removed due to this commit.
>>
>> Yes, your patch removed "&zwsp;".
>>
>> > So,
>> > I would like to get back this as it was. I attached the patch.
>>
>> This produces errors. Needs ";" postfix?
>
> Oops. Yes, it needs ';'. Also, I found another "&zwsp;" dropped.
> I attached the fixed patch.

The basic problem with this patch is that it may solve the issue with PDF generation, but it does not solve the issue with HTML generation. The PDF manual of pgbench has a ridiculously long line, which Tom Lane complained about too:

interval_start num_transactions sum_latency sum_latency_2 min_latency max_latency { failures | serialization_failures deadlock_failures } [ sum_lag sum_lag_2 min_lag max_lag [ skipped ] ] [ retried retries ]

Why can't we use just line feeds instead of &zwsp;? Although it's not a command usage, the SELECT manual already uses line feeds to nicely break the command usage into multiple lines.

Best regards,
-- 
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp
>>> > Even applying this patch, "make postgres-A4.pdf" arises the warning on my >>> > machine. After some investigations, I found that previous document had a break >>> > after 'num_transactions', but it has been removed due to this commit. >>> >>> Yes, your patch removed "&zwsp;". >>> >>> > So, >>> > I would like to get back this as it was. I attached the patch. >>> >>> This produces errors. Needs ";" postfix? >> >> Oops. Yes, it needs ';'. Also, I found another "&zwsp;" dropped. >> I attached the fixed patch. > > Basic problem with this patch is, this may solve the issue with pdf > generation but this does not solve the issue with HTML generation. The > PDF manual of pgbench has ridiculously long line, which Tom Lane I meant "HTML manual" here. > complained too: > > interval_start num_transactions sum_latency sum_latency_2 min_latency max_latency { failures | serialization_failuresdeadlock_failures } [ sum_lag sum_lag_2 min_lag max_lag [ skipped ] ] [ retried retries ] > > Why can't we use just line feeds instead of &zwsp;? Although it's not > a command usage but the SELECT manual already uses line feeds to > nicely break into multiple lines of command usage. > > Best reagards, > -- > Tatsuo Ishii > SRA OSS, Inc. Japan > English: http://www.sraoss.co.jp/index_en.php > Japanese:http://www.sraoss.co.jp
Hello,

On 2022-Mar-27, Tatsuo Ishii wrote:

> After:
> interval_start num_transactions sum_latency sum_latency_2 min_latency max_latency
>   { failures | serialization_failures deadlock_failures } [ sum_lag sum_lag_2 min_lag max_lag [ skipped ] ] [ retried retries ]

You're showing an indentation, but looking at the HTML output there is no such. Is the HTML processor eating leading whitespace or something like that?

I think that the explanatory paragraph is way too long now, particularly since it explains --failures-detailed starting in the middle. Also, the example output doesn't include the failures-detailed mode. I suggest that this should be broken down even more; first to explain the output without failures-detailed, including an example, and then the output with failures-detailed, and an example of that. Something like this, perhaps:

Aggregated Logging

With the --aggregate-interval option, a different format is used for the log files (note that the actual log line is not folded).

interval_start num_transactions sum_latency sum_latency_2 min_latency max_latency failures [ sum_lag sum_lag_2 min_lag max_lag [ skipped ] ] [ retried retries ]

where interval_start is the start of the interval (as a Unix epoch time stamp), num_transactions is the number of transactions within the interval, sum_latency is the sum of the transaction latencies within the interval, sum_latency_2 is the sum of squares of the transaction latencies within the interval, min_latency is the minimum latency within the interval, and max_latency is the maximum latency within the interval, failures is the number of transactions that ended with a failed SQL command within the interval.

The next fields, sum_lag, sum_lag_2, min_lag, and max_lag, are only present if the --rate option is used. They provide statistics about the time each transaction had to wait for the previous one to finish, i.e., the difference between each transaction's scheduled start time and the time it actually started. The next field, skipped, is only present if the --latency-limit option is used, too. It counts the number of transactions skipped because they would have started too late. The retried and retries fields are present only if the --max-tries option is not equal to 1. They report the number of retried transactions and the sum of all retries after serialization or deadlock errors within the interval. Each transaction is counted in the interval when it was committed.

Notice that while the plain (unaggregated) log file shows which script was used for each transaction, the aggregated log does not. Therefore if you need per-script data, you need to aggregate the data on your own.
Here is some example output:

1345828501 5601 1542744 483552416 61 2573 0
1345828503 7884 1979812 565806736 60 1479 0
1345828505 7208 1979422 567277552 59 1391 0
1345828507 7685 1980268 569784714 60 1398 0
1345828509 7073 1979779 573489941 236 1411 0

If you use option --failures-detailed, instead of the sum of all failed transactions you will get more detailed statistics for the failed transactions:

interval_start num_transactions sum_latency sum_latency_2 min_latency max_latency serialization_failures deadlock_failures [ sum_lag sum_lag_2 min_lag max_lag [ skipped ] ] [ retried retries ]

This is similar to the above, but here the single 'failures' figure is replaced by serialization_failures which is the number of transactions that got a serialization error and were not retried after this, deadlock_failures which is the number of transactions that got a deadlock error and were not retried after this. The other fields are as above. Here is some example output:

[example with detailed failures]

-- 
Álvaro Herrera         48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"If you have nothing to say, maybe you need just the right tool to help you
not say it." (New York Times, about Microsoft PowerPoint)