Re: WIP Patch: Pgbench Serialization and deadlock errors - Mailing list pgsql-hackers
From: Fabien COELHO
Subject: Re: WIP Patch: Pgbench Serialization and deadlock errors
Msg-id: alpine.DEB.2.20.1801031720270.20034@lancre
In response to: WIP Patch: Pgbench Serialization and deadlock errors (Marina Polyakova <m.polyakova@postgrespro.ru>)
Responses: Re: WIP Patch: Pgbench Serialization and deadlock errors
List: pgsql-hackers
Hello Marina,

Some comments about WIP pgbench error handling v4. Patch applies, compiles; global & local "make check" are ok; doc compiles.

I'm generally okay with having such a feature, but I'd like it to be *MUCH* simpler, otherwise it is going to make pgbench unmaintainable :-(

Also, ISTM that a new patch should somehow address (in the code, or with an explanation of why you disagree) all comments from the previous review, which is not really the case here, as I have to repeat some of the comments I made on v3. You should answer the previous review mail and tag all comments with "ok", "no, because this or that", "done", "postponed because this or that"...

About the documentation:

Again, as already said in the previous review, please take comments into account or justify why you do not do so.

I do not think that this feature should be given any pre-eminence: most of the time, performance testing is about all-is-well transactions which do not display any error. I do not claim that it is not a useful feature; on the contrary, I do think that testing under error conditions is a good capability. But it is a side feature and should not be put forward. As a consequence, the "maximum number of transaction tries" should certainly not be the first output of a summary run.

I'm unclear about the example added to the documentation: there are 71% errors, but 100% of transactions are reported as processed. If a transaction ended in an error, it was not a success, so it was not processed? To me it looks inconsistent.

About the feature:

While testing, it seems that failed transactions are counted in tps, which I think is not appropriate:

  sh> PGOPTIONS='-c default_transaction_isolation=serializable' \
        ./pgbench -P 1 -T 3 -r -M prepared -j 2 -c 4
  starting vacuum...end.
  progress: 1.0 s, 10845.8 tps, lat 0.091 ms stddev 0.491, 10474 failed   # NOT 10845.8 TPS...
  progress: 2.0 s, 10534.6 tps, lat 0.094 ms stddev 0.658, 10203 failed
  progress: 3.0 s, 10643.4 tps, lat 0.096 ms stddev 0.568, 10290 failed
  ...
  number of transactions actually processed: 32028   # NO!
  number of errors: 30969 (96.694 %)
  latency average = 2.833 ms
  latency stddev = 1.508 ms
  tps = 10666.720870 (including connections establishing)   # NO
  tps = 10683.034369 (excluding connections establishing)   # NO
  ...

For me this is all wrong: the tps report should be about transactions that succeeded, not mere attempts. I cannot say that a transaction which aborted was "actually processed"... as it was not.

The order of reported elements is not logical:

  maximum number of transaction tries: 100
  scaling factor: 10
  query mode: prepared
  number of clients: 4
  number of threads: 2
  duration: 3 s
  number of transactions actually processed: 967
  number of errors: 152 (15.719 %)
  latency average = 9.630 ms
  latency stddev = 13.366 ms
  number of transactions retried: 623 (64.426 %)
  number of retries: 32272

I would suggest grouping everything about error handling in one block, eg something like:

  scaling factor: 10
  query mode: prepared
  number of clients: 4
  number of threads: 2
  duration: 3 s
  number of transactions actually processed: 967
  number of errors: 152 (15.719 %)
  number of transactions retried: 623 (64.426 %)
  number of retries: 32272
  maximum number of transaction tries: 100
  latency average = 9.630 ms
  latency stddev = 13.366 ms

Also, the percent character should be stuck to its number ("15.719%") to make the style more homogeneous (although there seem to be pre-existing inhomogeneities).

I would replace "transaction tries/retried" by "tries/retried": everything is about transactions in the report anyway. Without reading the documentation, the overall report semantics is unclear, especially given the absurd tps results I got with my first attempt, as failing transactions are counted as "processed".

For the detailed report, ISTM that I already said in the previous review that the 22-column (?) indentation is much too large:

  0.078  149  30886  UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;

When giving the number of retries, I'd also like to have the average number of retries per attempted transaction, whether they succeeded or failed in the end.

About the code:

I'm at a loss with the 7 states added to the automaton, where I would have hoped that only 2 (eg RETRY & FAIL, or even fewer) would be enough. I would hope for something like "if (some error) state = ERROR", then in "ERROR" do the stats, check whether the transaction should be retried, and if so state = START... and we are done (a rough sketch follows below).

I'm wondering whether the whole feature could be simplified by considering that one script is one "transaction" (it is, from the report point of view at least), and that any retry is for the full script only, from its beginning. That would remove the need to guess at transaction begin or end, avoid scanning manually for subcommands, and so on.
- Would it make sense?
- Would it be ok for your use case?
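To make this concrete, here is the kind of control flow I would hope for. This is only a rough sketch under my own naming (SketchState, SketchClient, handle_error, max_tries are all hypothetical, not the patch's identifiers), and it glosses over the existing pgbench states:

  #include <stdbool.h>

  typedef enum
  {
      STATE_START_TX,    /* (re)start the script from its beginning */
      STATE_COMMAND,     /* run commands, watching for SQL errors */
      STATE_ERROR,       /* an error occurred: do stats, then decide */
      STATE_DONE         /* script over: success or definitive failure */
  } SketchState;

  typedef struct
  {
      SketchState state;
      int         tries;      /* attempts so far for this script */
      int         max_tries;  /* eg from some --max-tries option */
  } SketchClient;

  /* all error bookkeeping and the retry decision in one place */
  static void
  handle_error(SketchClient *st, bool retriable)
  {
      st->tries++;
      /* ... accumulate error statistics here ... */
      if (retriable && st->tries < st->max_tries)
          st->state = STATE_START_TX;   /* replay the whole script */
      else
          st->state = STATE_DONE;       /* count one failed transaction */
  }

The point is that a retry is nothing more than a jump back to the start state, so one or two added states should suffice.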
The proposed version of the code looks unmaintainable to me. There are 3 levels of nested "switch/case" with state changes at the deepest level; I cannot even see it on my screen, which is not wide enough.

There should be a typedef for "random_state", eg something like:

  typedef struct { unsigned short data[3]; } RandomState;
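As an illustration only (getrand_from is a hypothetical helper, not the patch's code; pg_erand48() is the existing src/port function that pgbench already uses), such a typedef makes each call site explicit about which random stream it consumes:

  /* from src/port, declared in port.h */
  extern double pg_erand48(unsigned short xseed[3]);

  typedef struct
  {
      unsigned short data[3];   /* xseed handed to pg_erand48() */
  } RandomState;

  /* uniform random integer in [min, max], drawn from a given state */
  static int
  getrand_from(RandomState *rs, int min, int max)
  {
      return min + (int) ((max - min + 1) * pg_erand48(rs->data));
  }

Keeping distinct per-thread and per-client RandomState variables would then make it obvious which stream each draw uses, which matters for the script-choice point below.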
Please keep "const" declarations, eg "commandFailed".

I think that choosing the script should depend on the thread random state, not the client random state, so that a run would generate the same pattern per thread, independently of which client finishes first.

I'm sceptical of the "--debug-fails" option. ISTM that --debug is already there and should just be reused. Also, you have changed some existing unrelated error messages with this option, especially in evalFunc, which is clearly not appropriate:

  -    fprintf(stderr, "random range is too large\n");
  +    if (debug_fails)
  +        fprintf(stderr, "random range is too large\n");

Please leave unrelated code as is.

I agree that function naming style is already a mess, but I think that the new functions you add should use a common style, eg "is_compound" vs "canRetry".

Translating error strings to their enum should be put in a function.

I do not believe in the normalize_whitespace function: ISTM that it would turn "SELECT LENGTH('   ');" into "SELECT LENGTH(' ');", which is not desirable. I do not think that it should be needed.

I do not understand the "get subcommands of a compound command" strange re-parsing phase. There should be comments explaining what you want to achieve. I'm not sure what happens if several begin/end appear on the same line. I'm not sure this whole thing should be done anyway.

About the tests:

The 828-line perl script seems over-complicated, with pgbench & psql interacting together... Is all that really necessary? Isn't some (much) simpler test possible that would be sufficient? The "node" is started but never stopped. For file contents, maybe the << 'EOF' here-document syntax would help instead of using concatenated backslashed strings everywhere.

--
Fabien.