From: Marina Polyakova <m.polyakova@postgrespro.ru>
Subject: Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors
Msg-id: a1bd32671a6777b78dd67b95eb68ff82@postgrespro.ru
In response to: Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors (Marina Polyakova <m.polyakova@postgrespro.ru>)
List: pgsql-hackers

Hello, hackers!

This is the eleventh version of the patch for error handling and retrying of
transactions with serialization/deadlock failures in pgbench (based on commit
14e9b2a752efaa427ce1b400b9aaa5a636898a04), thanks to the comments of Fabien
Coelho and Arthur Zakirov in this thread.

v11-0001-Pgbench-errors-use-the-RandomState-structure-for.patch
- a patch for the RandomState structure (used to reset a client's random seed
  when repeating transactions after serialization/deadlock failures).

v11-0002-Pgbench-errors-use-the-Variables-structure-for-c.patch
- a patch for the Variables structure (used to reset client variables when
  repeating transactions after serialization/deadlock failures).

v11-0003-Pgbench-errors-and-serialization-deadlock-retrie.patch
- the main patch for handling client errors and repeating transactions with
  serialization/deadlock failures (see the detailed description in the file).

v11-0004-Pgbench-errors-use-a-separate-function-to-report.patch
- a patch for a separate error-reporting function (used to report client
  failures that do not cause an abort; what it prints depends on the
  debugging level). Although this is an attempt to remove the duplicated
  code for debug messages (see [1]), it may look like mere refactoring and
  therefore not strictly necessary for this patch set (see [2], [3]), so it
  now comes last as an optional patch.

Any suggestions are welcome!

[1] https://www.postgresql.org/message-id/20180405180807.0bc1114f%40wp.localdomain

> There is a lot of checks like "if (debug_level >= DEBUG_FAILS)" with
> corresponding fprintf(stderr..) I think it's time to do it like in the
> main code, wrap with some function like log(level, msg).

[2] https://www.postgresql.org/message-id/alpine.DEB.2.21.1808071823540.13466%40lancre

> However ISTM that it is not as necessary as the previous one, i.e. we
> could do without it to get the desired feature, so I see it more as a
> refactoring done "in passing", and I'm wondering whether it is really
> worth it because it adds some new complexity, so I'm not sure of the
> net benefit.

[3] https://www.postgresql.org/message-id/alpine.DEB.2.21.1808101027390.9120%40lancre

> I'm still not over enthousiastic with these changes, and still think
> that it should be an independent patch, not submitted together with the
> "retry on error" feature.

Everything fixed since the previous version:

[4] https://www.postgresql.org/message-id/alpine.DEB.2.21.1808071823540.13466%40lancre

> I'm at odds with the proposed levels. ISTM that pgbench internal
> errors which warrant an immediate exit should be dubbed "FATAL",
>
> I'm unsure about the "log_min_messages" variable name, I'd suggest
> "log_level".
>
> I do not see the asserts on LOG >= log_min_messages as useful, because
> the level can only be LOG or DEBUG anyway.
>
> * PQExpBuffer
>
> I still do not see a positive value from importing PQExpBuffer
> complexity and cost into pgbench, as the resulting code is not very
> readable and it adds malloc/free cycles, so I'd try to avoid using
> PQExpBuf as much as possible. ISTM that all usages could be avoided in
> the patch, and most should be avoided even if ExpBuffer is imported
> because it is really useful somewhere.
>
> - to call pgbench_error from pgbench_simple_error, you can do a
> pgbench_log_va(level, format, va_list) version called both from
> pgbench_error & pgbench_simple_error.
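For illustration, that suggestion boils down to one shared va_list worker
plus thin variadic wrappers around it. A minimal standalone sketch (the
names pgbench_log_va and pgbench_error and the level values are assumptions
for illustration here, not necessarily what the attached patches contain):

    #include <stdarg.h>
    #include <stdio.h>
    #include <stdlib.h>

    typedef enum DebugLevel
    {
        DEBUG,  /* debugging output */
        LOG,    /* failures and progress messages */
        FATAL   /* internal errors that warrant an immediate exit */
    } DebugLevel;

    static DebugLevel log_level = LOG;

    /* Shared worker: filter by level, write to stderr, exit on FATAL. */
    static void
    pgbench_log_va(DebugLevel level, const char *fmt, va_list args)
    {
        if (level >= log_level)
            vfprintf(stderr, fmt, args);
        if (level == FATAL)
            exit(1);
    }

    /* Variadic wrapper; pgbench_simple_error would be a second thin one. */
    static void
    pgbench_error(DebugLevel level, const char *fmt, ...)
    {
        va_list args;

        va_start(args, fmt);
        pgbench_log_va(level, fmt, args);
        va_end(args);
    }

    int
    main(void)
    {
        pgbench_error(LOG, "client %d repeats the transaction (try %d)\n",
                      0, 2);
        return 0;
    }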
> - for PGBENCH_DEBUG function, do separate calls per type, the very
> small partial code duplication is worth avoiding ExpBuf IMO.
>
> - for doCustom debug: I'd just let the printf as it is, with a
> comment, as it is really very internal stuff for debug. Or I'd just
> snprintf a something in a static buffer.
>
> ...
>
> - for listAvailableScript: I'd simply call "pgbench_error(LOG" several
> time, once per line.
>
> I see building a string with a format (printfExpBuf..) and then
> calling the pgbench_error function with just a "%s" format on the
> result as not very elegant, because the second format is somehow
> hacked around.

[5] https://www.postgresql.org/message-id/alpine.DEB.2.21.1808101027390.9120%40lancre

> I suggest that the called function does only one simple thing,
> probably "DEBUG", and that the *caller* prints a message if it is
> unhappy about the failure of the called function, as it is currently
> done. This allows to provide context as well from the caller, eg
> "setting variable %s failed while <some specific context>". The user
> call rerun under debug for precision if they need it.

[6] https://www.postgresql.org/message-id/20180810125327.GA2374%40zakirov.localdomain

> I agree with Fabien. Calling pgbench_error() inside pgbench_error()
> could be dangerous. I think "fmt" checking could be removed, or we may
> use Assert() or fprintf()+exit(1) at least.

[7] https://www.postgresql.org/message-id/alpine.DEB.2.21.1808121057540.6189%40lancre

> * typo in comments: "varaibles"
>
> * About enlargeVariables:
>
> multiple INT_MAX error handling looks strange, especially as this code
> can never be triggered because pgbench would be dead long before having
> allocated INT_MAX variables. So I would not bother to add such checks.
> I'm not sure that the size_t cast here and there are useful for any
> practical values likely to be encountered by pgbench.
>
> The exponential allocation seems overkill. I'd simply add a constant
> number of slots, with a simple rule:
>
>     /* reallocated with a margin */
>     if (max_vars < needed) max_vars = needed + 8;
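Following that rule, the reallocation stays linear. A sketch of what the
reworked enlargeVariables might look like (the Variable/Variables layouts
here are simplified assumptions, bare realloc stands in for pgbench's
pg_realloc, and error handling is omitted):

    #include <stdlib.h>

    typedef struct Variable
    {
        char *name;    /* variable's name */
        char *svalue;  /* its value in string form, if known */
    } Variable;

    typedef struct Variables
    {
        Variable *vars;     /* array of variables */
        int       nvars;    /* number of variables in use */
        int       max_vars; /* allocated length of vars */
    } Variables;

    #define VARIABLES_MARGIN 8

    /* Make room for at least needed entries, with a constant margin. */
    void
    enlargeVariables(Variables *variables, int needed)
    {
        if (variables->max_vars < needed)
        {
            variables->max_vars = needed + VARIABLES_MARGIN;
            variables->vars = realloc(variables->vars,
                                      variables->max_vars * sizeof(Variable));
        }
    }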
[8] https://www.postgresql.org/message-id/alpine.DEB.2.21.1808151046090.30050%40lancre

> A few comments about the doc.
>
> According to the documentation, the feature is triggered by --max-tries
> and --latency-limit. I disagree with the later, because it means that
> having latency limit without retrying is not supported anymore.
>
> Maybe you can allow an "unlimited" max-tries, say with special value
> zero, and the latency limit does its job if set, over all tries.
>
> Doc: "error in meta commands" -> "meta command errors", for homogeneity
> with other cases?
>
> Doc: "never occur.." -> "never occur", or eventually "...".
>
> Doc: "Directly client errors" -> "Direct client errors".
>
> I'm still in favor of asserting that the sql connection is idle (no tx
> in progress) at the beginning and/or end of a script, and report a user
> error if not, instead of writing complex caveats.
>
> I do not think that the RETRIES_ENABLED macro is a good thing. I'd
> suggest to write the condition four times.
>
> ISTM that "skipped" transactions are NOT "successful" so there are a
> problem with comments. I believe that your formula are probably right,
> it has more to do with what is "success". For cnt decomposition, ISTM
> that "other transactions" are really "directly successful
> transactions".
>
> I'd suggest to put "ANOTHER_SQL_FAILURE" as the last option, otherwise
> "another" does not make sense yet. I'd suggest to name it
> "OTHER_SQL_FAILURE".
>
> I'm not sure of the LOG -> DEBUG_FAIL changes. I do not understand the
> name "DEBUG_FAIL", has it is not related to debug, they just seem to be
> internal errors.
>
> inTransactionBlock: I disagree with any function other than doCustom
> changing the client state, because it makes understanding the state
> machine harder. There is already one exception to that (threadRun) that
> I wish to remove. All state changes must be performed explicitely in
> doCustom.
>
> PQexec("ROOLBACK"): you are inserting a synchronous command, for which
> the thread will have to wait for the result, in a middle of a framework
> which takes great care to use only asynchronous stuff so that one
> thread can manage several clients efficiently. You cannot call PQexec
> there. From where I sit, I'd suggest to sendQuery("ROLLBACK"), then
> switch to a new state CSTATE_WAIT_ABORT_RESULT which would be similar
> to CSTATE_WAIT_RESULT, but on success would skip to RETRY or ABORT
> instead of proceeding to the next command.
>
> ISTM that it would be more logical to only get into RETRY if there is a
> retry, i.e. move the test RETRY/ABORT in FAILURE. For that, instead of
> "canRetry", maybe you want "doRetry", which tells that a retry is
> possible (the error is serializable or deadlock) and that the current
> parameters allow it (timeout, max retries).
>
> * Minor C style comments:
>
> if / else if / else if ... on *_FAILURE: I'd suggest a switch.
>
> The following line removal does not seem useful, I'd have kept it:
>
>       stats->cnt++;
>   -
>       if (skipped)
>
> copyVariables: I'm not convinced that source_vars & nvars variables are
> that useful.
>
>     memcpy(&(st->retry_state.random_state), &(st->random_state),
>            sizeof(RandomState));
>
> Is there a problem with "st->retry_state.random_state =
> st->random_state;" instead of memcpy? ISTM that simple assignments work
> in C. Idem in the reverse copy under RETRY.

(a small standalone check of this point is at the end of this message)

> commandFailed: I'm not thrilled by the added boolean, which is
> partially redundant with the second argument.
>
>       if (per_script_stats)
>   -       accumStats(&sql_script[st->use_file].stats, skipped, latency, lag);
>   +   {
>   +       accumStats(&sql_script[st->use_file].stats, skipped, latency, lag,
>   +                  st->failure_status, st->retries);
>   +   }
>     }
>
> I do not see the point of changing the style here.

[9] https://www.postgresql.org/message-id/alpine.DEB.2.21.1808170917510.20841%40lancre

> Here is an attempt at having a more precise and shorter version, not
> sure it is much better than yours, though:
>
> """
> Transactions are counted depending on their execution and outcome.
> First a transaction may have started or not: skipped transactions occur
> under --rate and --latency-limit when the client is too late to execute
> them. Secondly, a started transaction may ultimately succeed or fail on
> some error, possibly after some retries when --max-tries is not one.
> Thus
> """
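On the memcpy question quoted above: plain struct assignment does copy every
member, so the memcpy calls can indeed be replaced. A tiny standalone check
(the struct layouts are simplified stand-ins, not pgbench's actual
definitions):

    #include <stdio.h>

    typedef struct RandomState
    {
        unsigned short xseed[3];
    } RandomState;

    typedef struct RetryState
    {
        RandomState random_state;
    } RetryState;

    int
    main(void)
    {
        RandomState random_state = {{1, 2, 3}};
        RetryState  retry_state;

        /*
         * Simple assignment copies all members, equivalent to
         * memcpy(&retry_state.random_state, &random_state,
         *        sizeof(RandomState)), with type checking on top.
         */
        retry_state.random_state = random_state;

        printf("%u %u %u\n",
               retry_state.random_state.xseed[0],
               retry_state.random_state.xseed[1],
               retry_state.random_state.xseed[2]);
        return 0;
    }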
--
Marina Polyakova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company