Re: Test 031_recovery_conflict fails when a conflict counted twice

From: Dmitry Dolgov
Subject: Re: Test 031_recovery_conflict fails when a conflict counted twice
Msg-id: 7wcw42bhu6n5bmne6j7776qxooqvn2wdxho7wrxlr6x3alhwjm@nkaeflilqnq2
In response to: Test 031_recovery_conflict fails when a conflict counted twice (Alexander Lakhin <exclusion@gmail.com>)
List: pgsql-hackers
> On Mon, May 27, 2024 at 06:00:00PM GMT, Alexander Lakhin wrote:
> Hello hackers,
>
> As a recent buildfarm test failure on olingo (which builds postgres with
> -O1 and address sanitizer) [1] shows:
> ...
> [23:12:02.127](0.166s) not ok 6 - snapshot conflict: stats show conflict on standby
> [23:12:02.130](0.003s) #   Failed test 'snapshot conflict: stats show conflict on standby'
> #   at /home/bf/bf-build/olingo/HEAD/pgsql/src/test/recovery/t/031_recovery_conflict.pl line 332.
> [23:12:02.130](0.000s) #          got: '2'
> #     expected: '1'
> ...
> [23:12:06.848](1.291s) not ok 17 - 5 recovery conflicts shown in pg_stat_database
> [23:12:06.887](0.040s) #   Failed test '5 recovery conflicts shown in pg_stat_database'
> #   at /home/bf/bf-build/olingo/HEAD/pgsql/src/test/recovery/t/031_recovery_conflict.pl line 286.
> [23:12:06.887](0.000s) #          got: '6'
> #     expected: '5'
> Waiting for replication conn standby's replay_lsn to pass 0/3459160 on primary
> done
> ...
>
> pgsql.build/testrun/recovery/031_recovery_conflict/log/031_recovery_conflict_standby.log:
> 2024-05-15 23:12:01.616 UTC [1299981][client backend][2/2:0] LOG: statement:
> FETCH FORWARD FROM test_recovery_conflict_cursor;
> 2024-05-15 23:12:01.617 UTC [1299981][client backend][2/2:0] LOG: statement: ;
> 2024-05-15 23:12:01.910 UTC [1297595][startup][34/0:0] LOG: recovery still
> waiting after 15.289 ms: recovery conflict on snapshot
> 2024-05-15 23:12:01.910 UTC [1297595][startup][34/0:0] DETAIL: Conflicting process: 1299981.
> 2024-05-15 23:12:01.910 UTC [1297595][startup][34/0:0] CONTEXT:  WAL redo at
> 0/344F468 for Heap2/PRUNE_VACUUM_SCAN: snapshotConflictHorizon: 746,
> isCatalogRel: F, nplans: 2, nredirected: 18, ndead: 0, nunused: 0, plans: [{
> xmax: 0, infomask: 2816, infomask2: 2, ntuples: 2, offsets: [21, 22] }, {
> xmax: 0, infomask: 11008, infomask2: 32770, ntuples: 18, offsets: [41, 42,
> 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58] }],
> redirected: [23->41, 24->42, 25->43, 26->44, 27->45, 28->46, 29->47, 30->48,
> 31->49, 32->50, 33->51, 34->52, 35->53, 36->54, 37->55, 38->56, 39->57,
> 40->58]; blkref #0: rel 1663/16385/16386, blk 0
> 2024-05-15 23:12:01.959 UTC [1299981][client backend][2/2:0] FATAL: terminating connection due to conflict with recovery
> 2024-05-15 23:12:01.959 UTC [1299981][client backend][2/2:0] DETAIL:  User
> query might have needed to see row versions that must be removed.
> 2024-05-15 23:12:01.959 UTC [1299981][client backend][2/2:0] HINT: In a
> moment you should be able to reconnect to the database and repeat your
> command.
> vvv an important difference from a successful test run
> 2024-05-15 23:12:01.966 UTC [1299981][client backend][2/2:0] LOG: could not send data to client: Broken pipe
> 2024-05-15 23:12:01.966 UTC [1299981][client backend][2/2:0] FATAL: terminating connection due to conflict with recovery
> 2024-05-15 23:12:01.966 UTC [1299981][client backend][2/2:0] DETAIL:  User
> query might have needed to see row versions that must be removed.
> 2024-05-15 23:12:01.966 UTC [1299981][client backend][2/2:0] HINT: In a
> moment you should be able to reconnect to the database and repeat your
> command.
> ^^^
>
> Test 031_recovery_conflict may fail in the following scenario:
>
> 031_recovery_conflict.pl:
>     executes a query, which produces a conflict:
>     ## RECOVERY CONFLICT 2: Snapshot conflict
>     ...
>     $psql_standby->query_safe(...)
>
> startup process:
>         detects a snapshot conflict and sends
>         PROCSIG_RECOVERY_CONFLICT_SNAPSHOT:
>         (ResolveRecoveryConflictWithVirtualXIDs ->
>         CancelVirtualTransaction) to the client backend
>
> client backend:
>             receives and processes the signal:
>             HandleRecoveryConflictInterrupt; ProcessClientReadInterrupt ->
>             ProcessInterrupts -> ProcessRecoveryConflictInterrupts ->
>             ProcessRecoveryConflictInterrupt,
>
>             reports the recovery conflict:
>             pgstat_report_recovery_conflict(reason);
>
>             and reports the error:
>             ereport(FATAL, ... "terminating connection due to conflict with
>             recovery" ...)
>             sends the message to the server log:
>             errfinish -> EmitErrorReport -> send_message_to_server_log
>
> 031_recovery_conflict.pl:
>     # finds the message in the log and resets psql connection:
>     check_conflict_log(
>       "User query might have needed to see row versions that must
>        be removed");
>     $psql_standby->reconnect_and_clear();
>
> startup process:
>         keeps sending PROCSIG_RECOVERY_CONFLICT_SNAPSHOT to the client
>         backend in a loop inside ResolveRecoveryConflictWithVirtualXIDs
>
> client backend:
>             tries to send the message to the client:
>             send_message_to_frontend -> socket_flush ->
>             internal_flush_buffer,
>             gets an error (EPIPE) from secure_write, and calls
>             ereport(COMMERROR,
>               (errcode_for_socket_access(),
>                errmsg("could not send data to client: %m")));
>
>            receives the next PROCSIG_RECOVERY_CONFLICT_SNAPSHOT
>            signal and processes it the same way:
>
>            HandleRecoveryConflictInterrupt; ProcessClientReadInterrupt ->
>            ProcessInterrupts -> ProcessRecoveryConflictInterrupts ->
>            ProcessRecoveryConflictInterrupt,
>
>            reports the recovery conflict:
>            pgstat_report_recovery_conflict(reason);
>            // now the conflict is counted twice
>
>            and reports the error:
>            ereport(FATAL, ... "terminating connection due to conflict with
>            recovery" ...)
>            sends the message to the server log:
>            errfinish -> EmitErrorReport -> send_message_to_server_log
>
> 031_recovery_conflict.pl:
>     calls
>     check_conflict_stat("snapshot");
>     and gets 2 instead of 1.

Thanks for the breakdown of the steps, a very interesting case. If I
understand everything correctly, it could be summarized as: the startup
process tries to stop the client backend, which struggles with error
reporting just long enough to receive a second signal. The issue seems
to be rare; even on olingo it has appeared only once. Yet it raises the
question of whether this is incorrect behaviour or the test is simply
not robust enough.

From what I see in ProcessRecoveryConflictInterrupt, there is no defence
against a duplicated recovery conflict being reported, except the
expectation that ereport(FATAL) will terminate the session (the
limitations on recursive error reporting do not seem to stop that).
I guess this points in the direction that the behaviour is expected,
and check_conflict_stat might have to accept a number of conflicts
greater than 1.
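
To illustrate, here is a toy standalone C program (not PostgreSQL code;
everything apart from the call sequence is invented) showing how an
unguarded per-signal report counts the same conflict twice, and how a
hypothetical session-local flag would make the report idempotent:

    #include <stdbool.h>
    #include <stdio.h>

    /* Stand-in for pgstat_report_recovery_conflict(): just counts calls. */
    static int reported = 0;

    /* Current behaviour: every signal bumps the counter. */
    static void
    report_conflict_unguarded(void)
    {
        reported++;
    }

    /* Hypothetical guard: report at most once per session. */
    static bool conflict_reported = false;

    static void
    report_conflict_guarded(void)
    {
        if (!conflict_reported)
        {
            reported++;
            conflict_reported = true;
        }
    }

    int
    main(void)
    {
        /* The backend takes PROCSIG_RECOVERY_CONFLICT_SNAPSHOT twice,
         * because the first FATAL stalls on the broken client socket. */
        report_conflict_unguarded();
        report_conflict_unguarded();
        printf("unguarded: %d\n", reported);    /* 2 -- the observed failure */

        reported = 0;
        report_conflict_guarded();
        report_conflict_guarded();
        printf("guarded:   %d\n", reported);    /* 1 */

        return 0;
    }

Whether such a guard would be wanted at all depends, of course, on the
answer to the question above.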

If that's not the case, and recovery conflicts have to be reported only
once, then either the startup process or the client backend has to take
care of that. My first idea was to experiment with longer waiting
intervals for an unresponsive client backend (currently it's a constant
of 5000 us) -- with the attached patch applied, that seems to reduce the
chances of reproducing the error, but I'm not sure it's an appropriate
solution.
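
For reference, the experiment boils down to stretching the delay between
signalling attempts. A toy standalone sketch of the backoff idea (not the
attached patch itself; the exit check and the 100 ms cap are invented for
illustration):

    #include <stdio.h>
    #include <unistd.h>     /* usleep */

    /* Stand-in for "has the conflicting backend exited yet?" */
    static int
    backend_still_alive(int attempt)
    {
        return attempt < 4;     /* pretend it exits after a few attempts */
    }

    int
    main(void)
    {
        long    wait_us = 5000;     /* the current fixed interval */
        int     attempt = 0;

        while (backend_still_alive(attempt))
        {
            /* (re)send the conflict signal here, then back off instead
             * of sleeping a constant 5000 us, so a backend stuck
             * flushing a dead socket has time to exit before it is
             * signalled again */
            printf("attempt %d: sleeping %ld us\n", attempt, wait_us);
            usleep(wait_us);
            if (wait_us < 100000)       /* cap at 100 ms, arbitrary */
                wait_us *= 2;
            attempt++;
        }
        return 0;
    }

The real change would of course live around the 5000 us sleep in
ResolveRecoveryConflictWithVirtualXIDs.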


