Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors - Mailing list pgsql-hackers

From: Fabien COELHO
Subject: Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors
Msg-id: alpine.DEB.2.22.394.2107190710500.3459242@pseudo
In response to: Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors (Yugo NAGATA <nagata@sraoss.co.jp>)
Responses: Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors
List: pgsql-hackers
> I attached the updated patch.

# About pgbench error handling v15

Patches apply cleanly. Compilation, global and local tests ok.

  - v15.1: refactoring is a definite improvement.
    Good, even if it is not very useful (see below).

    While restructuring, maybe predefined variables could be made read-only
    so that a script which updates them would fail, which would be a good
    thing (see the sketch after this list). This is probably material for an
    independent patch.

  - v15.2: see detailed comments below
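
A minimal sketch of the read-only idea from the v15.1 item above (the
"readonly" flag and the guard function are hypothetical, not code from the
patch):

    typedef struct
    {
        char       *name;       /* variable's name */
        char       *svalue;     /* its value in string form, if known */
        bool        readonly;   /* hypothetical: true for predefined variables */
    } Variable;

    /* hypothetical guard in the variable-assignment path */
    static bool
    assignVariable(Variable *var, const char *context, const char *value)
    {
        if (var->readonly)
        {
            pg_log_error("%s: cannot set predefined variable \"%s\"",
                         context, var->name);
            return false;       /* the offending script fails */
        }
        /* ... store the new value as before ... */
        return true;
    }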

# Doc

Doc build is ok.

ISTM that "number of tries" line would be better placed between the 
#threads and #transactions lines. What do you think?

Aggregate logging description: "{ failures | ... }" seems misleading 
because it suggests we have one or the other, whereas it can also be 
empty. I suggest: "{ | failures | ... }" to show the empty case.

Having a full example with retries in the doc is a good thing, and 
illustrates in passing that running with a number of clients on a small 
scale does not make much sense because of the contention on 
tellers/branches. I wonder whether the number of tries is set too high, 
though; ISTM that an application should give up before 100 tries? I like 
that the feature is also limited by the latency limit.

Minor editing:

"there're" -> "there are".

"the --time" -> "the --time option".

The overall English seems good, but I'm not a native speaker. As I already said,
proofreading by a native speaker would be nice.

From a technical writing point of view, maybe the documentation could be improved a bit,
but I'm not at ease with that subject. Some comments:

"The latency for failed transactions and commands is not computed separately." is unclear,
please use a positive sentence to tell what is true instead of what is not and the reader
has to guess. Maybe: "The latency figures include failed transactions which have reached
the maximum number of tries or the transaction latency limit.".

"The main report contains the number of failed transactions if it is non-zero." ISTM that
this is a pain for scripts which would like to process these reports data, because the data
may or may not be there. I'm sure to write such scripts, which explains my concern:-)

"If the total number of retried transactions is non-zero…" should it rather be "not one",
because zero means unlimited retries?

The section describing the various types of errors that can occur is a good addition.

Option "--report-latencies" changed to "--report-per-commands": I'm fine with this change.

# FEATURES

--failures-detailed: I'm not convinced that this option should not always be on, but
this is not very important, so let it be.

--verbose-errors: I still think this is only for debugging, but let it be.

Copying variables: ISTM that we should not need to save the variables' 
states… no clearing, no copying should be needed. The restarted 
transaction simply overwrites the existing variables, which is what the 
previous version was doing anyway. The scripts should write their own 
variables before using them, and if they don't then it is the user's 
problem. This is important for performance, because it means that after a 
client has executed all scripts once, the variable array is stable and does 
not incur significant maintenance costs. The only thing that needs saving 
for a retry is the pseudo-random generator state. This suggests simplifying 
or removing "RetryState".

# CODE

The semantics of "cnt" is changed. Ok, the overall counters and their 
relationships make sense, and it simplifies the reporting code. Good.

In readCommandResponse: ISTM that PGRES_NONFATAL_ERROR is not needed and 
could be dealt with by the default case. We are only interested in 
serialization/deadlock failures, which are fatal errors?

doRetry: for consistency, given the assert, ISTM that it should return 
false if the duration has expired, by testing end_time or timer_exceeded.
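
For instance (a sketch, assuming the global timer_exceeded flag is the
relevant end-of-run condition here):

    /* never start a retry once the benchmark duration has expired */
    if (timer_exceeded)
        return false;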

checkTransactionStatus: this function does several things at once with 2 
booleans, which makes it not very clear to me. Maybe it would be clearer if 
it just returned an enum (in trans, not in trans, conn error, other 
error). Another reason to do that is that on a connection error pgbench 
could try to reconnect, which would be an interesting later extension, so 
let's pave the way for that. Also, I do not think that the function 
should print out a message; it should be the caller's decision to do that.
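
Something along these lines (a sketch; the enum and function names are
illustrative, not taken from the patch):

    typedef enum TStatus
    {
        TSTATUS_IDLE,           /* not inside a transaction block */
        TSTATUS_IN_BLOCK,       /* inside a (possibly aborted) block */
        TSTATUS_CONN_ERROR,     /* connection lost, could reconnect later */
        TSTATUS_OTHER_ERROR     /* anything else, caller reports it */
    } TStatus;

    static TStatus
    getTransactionStatus(PGconn *con)
    {
        switch (PQtransactionStatus(con))
        {
            case PQTRANS_IDLE:
                return TSTATUS_IDLE;
            case PQTRANS_INTRANS:
            case PQTRANS_INERROR:
                return TSTATUS_IN_BLOCK;
            case PQTRANS_UNKNOWN:
                if (PQstatus(con) == CONNECTION_BAD)
                    return TSTATUS_CONN_ERROR;
                /* fall through */
            default:
                return TSTATUS_OTHER_ERROR;
        }
    }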

verbose_errors: there is more or less repeated code under RETRY and 
FAILURE, which should be factored out into a separate function. The 
advanceConnectionState function is long enough. Once this is done, there is 
no need for a getLatencyUsed function.
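
The factored-out helper could look roughly like this (a sketch; the error
kind and retry counters from the patch would be passed in or read from st):

    /* hypothetical helper shared by the RETRY and FAILURE branches */
    static void
    verboseErrorReport(CState *st, bool will_retry)
    {
        pg_log_info("client %d got a serialization or deadlock failure%s",
                    st->id, will_retry ? ", retrying" : ", giving up");
    }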

I'd put cleaning up the pipeline in a function. I do not understand why 
pipeline mode is not exited in all cases; the code checks the 
pipeline status twice within a few lines. I'd put this cleanup in the sync 
function as well, and report to the caller (advanceConnectionState) if there 
was an error, which would be managed there.
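
A possible shape for that cleanup, using the existing libpq pipeline API
(sketch only, error management left to the caller):

    /* leave pipeline mode if it is still on or aborted */
    static bool
    exitPipelineIfNeeded(CState *st)
    {
        if (PQpipelineStatus(st->con) != PQ_PIPELINE_OFF &&
            PQexitPipelineMode(st->con) != 1)
        {
            pg_log_error("client %d failed to exit pipeline mode: %s",
                         st->id, PQerrorMessage(st->con));
            return false;       /* advanceConnectionState decides what to do */
        }
        return true;
    }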

WAIT_ROLLBACK_RESULT: consuming results in a while loop could be a function to 
avoid code repetition (there and in the "error:" label in 
readCommandResponse). On the other hand, I'm not sure why the loop is 
needed: we can only get there by submitting a "ROLLBACK" command, so there 
should be only one result anyway?
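
If the loop is kept, it could live in a small helper reused from the
"error:" label as well (sketch):

    /* discard every remaining result on the connection */
    static void
    discardResults(PGconn *con)
    {
        PGresult   *res;

        while ((res = PQgetResult(con)) != NULL)
            PQclear(res);
    }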

report_per_command: please always count retries and failures of commands, 
even if they will not be reported in the end; the code will be simpler and 
more efficient.

doLog: the format has changed, including a new string on failures which 
replaces the time field. Hmmm. Cannot say I like it much, but why not. ISTM 
that the string could be shortened to "deadlock" or "serialization". ISTM 
that the documentation example should include a line with a failure, to 
make it clear what to expect.

I'm okay with always computing thread stats.

# COMMENTS

struct StatsData comment is helpful.
  - "failed transactions" -> "unsuccessfully retried transactions"?
  - 'cnt' decomposition: first term is field 'retried'? if so, say it
    explicitly.

"Complete the failed transaction" sounds strange: If it failed, it could 
not complete? I'd suggest "Record a failed transaction".

# TESTS

I suggested simplifying the tests by using conditionals & sequences. You 
reported that you got stuck. Hmmm.

I tried my tests again; they work fine when started with 2 clients, 
otherwise they get stuck because the first client waits for the other one, 
which does not exist (the point is to generate deadlocks and other 
errors). Maybe this is your issue?

Could you try with:

   psql < deadlock_prep.sql
   pgbench -t 4 -c 2 -f deadlock.sql
   # note: each deadlock detection takes 1 second

   psql < deadlock_prep.sql
   pgbench -t 10 -c 2 -f serializable.sql
   # very quick 50% serialization errors

-- 
Fabien.
