Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors - Mailing list pgsql-hackers

From Fabien COELHO
Subject Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors
Date
Msg-id alpine.DEB.2.22.394.2106231053090.1117163@pseudo
In response to Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors  (Yugo NAGATA <nagata@sraoss.co.jp>)
Responses Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors
List pgsql-hackers
Hello Yugo-san,

# About v12.2

## Compilation

Patch seems to apply cleanly with "git apply", but does not compile on my 
host: "undefined reference to `conditional_stack_reset'".

However, it works better when using the "patch" command. I'm wondering why 
git apply fails silently…

When compiling, there are warnings about "pg_log_fatal" at pgbench.c:4453, 
which does not expect a FILE*. Remove the "stderr" argument.

Global and local checks ok.

> number of transactions failed: 324 (3.240%)
> ...
> number of transactions retried: 5629 (56.290%)
> number of total retries: 103299

I'd suggest: "number of failed transactions". Then "total number of retries", 
or just "number of retries"?

## Feature

The overall code structure changes to implement the feature seem 
reasonable to me, as we are at the 12th iteration of the patch.

The comments below are mostly about details, asking questions
about choices, and commenting…

## Documentation

There is a lot of documentation, which is good. I'll review it
separately. It looks good, but having a native English speaker/writer
would really help!

Some output examples do not correspond to the actual output of
the current version. In particular, there is always one TPS figure
given now, instead of the confusing two shown before.

## Comments

transactinos -> transactions.

## Code

By default max_tries = 0. Should not the initialization be 1,
since the documentation says that this is the default?

Counter comments: there is a missing "+" in the formula on the skipped line.

Given that we manage errors, ISTM that we should not necessarily
stop on other non-retried errors, but rather count/report them and
possibly proceed.  Eg with something like:

   -- server side random fail
   DO LANGUAGE plpgsql $$
   BEGIN
     IF RANDOM() < 0.1 THEN
       RAISE EXCEPTION 'unlucky!';
     END IF;
   END;
   $$;

Or:

   -- client side random fail
   BEGIN;
   \if random(1, 10) <= 1
   SELECT 1 +;
   \else
   SELECT 2;
   \endif
   COMMIT;

We could count the failure, roll back if necessary, and go on.  What do you think?
Maybe such behavior would deserve an option.

--report-latencies -> --report-per-command: should we keep supporting
the previous option?

--failures-detailed: if we bother to run with handling failures, should
it always be on?

--debug-errors: I'm not sure we should want a special debug mode for that;
I'd consider integrating it with the standard debug, or keeping it just for development.
Also, should it use pg_log_debug?

doRetry: I'd separate the 3 no-retry conditions instead of mixing max_tries and
timer_exceeded, for clarity.

Tries vs retries: I'm at odds with having both tries & retries, with "+ 1" here
and there to convert between them, which is a little bit confusing. I'm wondering
whether we could only count "tries" and adjust to report what we want later?

advanceConnectionState: ISTM that the ERROR state should logically be placed
before the other states which lead to it.

Variable management: it looks expensive, with copying and freeing variable arrays.
I'm wondering whether we should think of something more clever. Well, that would be
for some other patch.

"Accumulate the retries" -> "Count (re)tries"?

Currently, ISTM that the retry-on-error mode is implicitly always on.
Do we want that? I'd say yes, but maybe people could disagree.

## Tests

There are tests, good!

I'm wondering whether something simpler could be devised to trigger
serialization or deadlock errors, e.g. with a SEQUENCE and an \if.
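
For serialization failures, even the \if may not be needed: a minimal sketch
(assuming the standard pgbench_accounts table; this is only to illustrate the
idea, not taken from the patch), run with at least 2 clients:

   -- sketch: serialization failures under repeatable read (2+ clients)
   BEGIN ISOLATION LEVEL REPEATABLE READ;
   UPDATE pgbench_accounts SET abalance = abalance + 1 WHERE aid = 1;
   \sleep 10 ms
   COMMIT;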

See the attached files for generating deadlocks reliably (start with 2 clients).
What do you think? The PL/pgSQL part is minimal; it is really client-code 
oriented.
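
For the record, the client-side idea is roughly the following (a hypothetical
sketch, not the attached file itself; it assumes a pre-created sequence, e.g.
"CREATE SEQUENCE deadlock_seq;", the standard pgbench_accounts table, and at
least 2 clients):

   -- sketch: with 2+ clients, opposite lock orders on two rows lead to
   -- deadlocks, detected by the server after deadlock_timeout (1s default)
   SELECT nextval('deadlock_seq') % 2 AS parity \gset
   BEGIN;
   \if :parity = 0
   UPDATE pgbench_accounts SET abalance = abalance + 1 WHERE aid = 1;
   UPDATE pgbench_accounts SET abalance = abalance + 1 WHERE aid = 2;
   \else
   UPDATE pgbench_accounts SET abalance = abalance + 1 WHERE aid = 2;
   UPDATE pgbench_accounts SET abalance = abalance + 1 WHERE aid = 1;
   \endif
   COMMIT;

Something like "pgbench -n -f deadlocks.sql -c 2 -T 10" (after creating the
sequence) should then produce deadlock errors regularly.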

Given that deadlocks are detected about every second, the test runs
would take some time. Let it be for now.

-- 
Fabien.
