Thread: Test 031_recovery_conflict fails when a conflict is counted twice

Test 031_recovery_conflict fails when a conflict is counted twice

From: Alexander Lakhin
Date: Mon, May 27, 2024
Hello hackers,

As a recent buildfarm test failure on olingo (which builds postgres with
-O1 and address sanitizer) [1] shows:
...
[23:12:02.127](0.166s) not ok 6 - snapshot conflict: stats show conflict on standby
[23:12:02.130](0.003s) #   Failed test 'snapshot conflict: stats show conflict on standby'
#   at /home/bf/bf-build/olingo/HEAD/pgsql/src/test/recovery/t/031_recovery_conflict.pl line 332.
[23:12:02.130](0.000s) #          got: '2'
#     expected: '1'
...
[23:12:06.848](1.291s) not ok 17 - 5 recovery conflicts shown in pg_stat_database
[23:12:06.887](0.040s) #   Failed test '5 recovery conflicts shown in pg_stat_database'
#   at /home/bf/bf-build/olingo/HEAD/pgsql/src/test/recovery/t/031_recovery_conflict.pl line 286.
[23:12:06.887](0.000s) #          got: '6'
#     expected: '5'
Waiting for replication conn standby's replay_lsn to pass 0/3459160 on primary
done
...

pgsql.build/testrun/recovery/031_recovery_conflict/log/031_recovery_conflict_standby.log:
2024-05-15 23:12:01.616 UTC [1299981][client backend][2/2:0] LOG: statement: FETCH FORWARD FROM test_recovery_conflict_cursor;
2024-05-15 23:12:01.617 UTC [1299981][client backend][2/2:0] LOG: statement: ;
2024-05-15 23:12:01.910 UTC [1297595][startup][34/0:0] LOG: recovery still waiting after 15.289 ms: recovery conflict on snapshot
2024-05-15 23:12:01.910 UTC [1297595][startup][34/0:0] DETAIL: Conflicting process: 1299981.
2024-05-15 23:12:01.910 UTC [1297595][startup][34/0:0] CONTEXT:  WAL redo at 0/344F468 for Heap2/PRUNE_VACUUM_SCAN: snapshotConflictHorizon: 746, isCatalogRel: F, nplans: 2, nredirected: 18, ndead: 0, nunused: 0, plans: [{ xmax: 0, infomask: 2816, infomask2: 2, ntuples: 2, offsets: [21, 22] }, { xmax: 0, infomask: 11008, infomask2: 32770, ntuples: 18, offsets: [41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58] }], redirected: [23->41, 24->42, 25->43, 26->44, 27->45, 28->46, 29->47, 30->48, 31->49, 32->50, 33->51, 34->52, 35->53, 36->54, 37->55, 38->56, 39->57, 40->58]; blkref #0: rel 1663/16385/16386, blk 0
2024-05-15 23:12:01.959 UTC [1299981][client backend][2/2:0] FATAL: terminating connection due to conflict with recovery
2024-05-15 23:12:01.959 UTC [1299981][client backend][2/2:0] DETAIL:  User query might have needed to see row versions that must be removed.
2024-05-15 23:12:01.959 UTC [1299981][client backend][2/2:0] HINT: In a moment you should be able to reconnect to the database and repeat your command.
vvv an important difference from a successful test run
2024-05-15 23:12:01.966 UTC [1299981][client backend][2/2:0] LOG: could not send data to client: Broken pipe
2024-05-15 23:12:01.966 UTC [1299981][client backend][2/2:0] FATAL: terminating connection due to conflict with recovery
2024-05-15 23:12:01.966 UTC [1299981][client backend][2/2:0] DETAIL:  User query might have needed to see row versions that must be removed.
2024-05-15 23:12:01.966 UTC [1299981][client backend][2/2:0] HINT: In a moment you should be able to reconnect to the database and repeat your command.
^^^

Test 031_recovery_conflict may fail in the following scenario (a paraphrased sketch of the backend code path involved follows the walkthrough):

031_recovery_conflict.pl:
     executes a query that produces a conflict:
     ## RECOVERY CONFLICT 2: Snapshot conflict
     ...
     $psql_standby->query_safe(...)

startup process:
         detects a snapshot conflict and sends
         PROCSIG_RECOVERY_CONFLICT_SNAPSHOT:
         (ResolveRecoveryConflictWithVirtualXIDs ->
         CancelVirtualTransaction) to the client backend

client backend:
             receives and processes the signal:
             HandleRecoveryConflictInterrupt; ProcessClientReadInterrupt ->
             ProcessInterrupts -> ProcessRecoveryConflictInterrupts ->
             ProcessRecoveryConflictInterrupt,

             reports the recovery conflict:
             pgstat_report_recovery_conflict(reason);

             and reports the error:
             ereport(FATAL, ... "terminating connection due to conflict with
             recovery" ...)
             sends the message to the server log:
             errfinish -> EmitErrorReport -> send_message_to_server_log

031_recovery_conflict.pl:
     # finds the message in the log and resets psql connection:
     check_conflict_log(
       "User query might have needed to see row versions that must
        be removed");
     $psql_standby->reconnect_and_clear();

startup process:
         keeps sending PROCSIG_RECOVERY_CONFLICT_SNAPSHOT to the client
         backend in a loop inside ResolveRecoveryConflictWithVirtualXIDs

client backend:
             tries to send the message to the client:
             send_message_to_frontend -> socket_flush ->
             internal_flush_buffer,
             gets an error (EPIPE) from secure_write, and calls
             ereport(COMMERROR,
               (errcode_for_socket_access(),
                errmsg("could not send data to client: %m")));

            receives the next PROCSIG_RECOVERY_CONFLICT_SNAPSHOT
            signal and processes it the same way:

            HandleRecoveryConflictInterrupt; ProcessClientReadInterrupt ->
            ProcessInterrupts -> ProcessRecoveryConflictInterrupts ->
            ProcessRecoveryConflictInterrupt,

            reports the recovery conflict:
            pgstat_report_recovery_conflict(reason);
            // now the conflict is counted twice

            and reports the error:
            ereport(FATAL, ... "terminating connection due to conflict with
            recovery" ...)
            sends the message to the server log:
            errfinish -> EmitErrorReport -> send_message_to_server_log

031_recovery_conflict.pl:
     calls
     check_conflict_stat("snapshot");
     and gets 2 instead of 1.
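
To make the ordering easier to see, here is a sketch of the client-backend side of this path, paraphrased from the snapshot-conflict branch of ProcessRecoveryConflictInterrupt (simplified, not the exact source): the stats counter is bumped before the FATAL ereport, so nothing stops a second signal, processed while the first FATAL report is still being delivered, from bumping it again.

/*
 * Paraphrased sketch of the snapshot-conflict branch of
 * ProcessRecoveryConflictInterrupt() in src/backend/tcop/postgres.c;
 * simplified, not the exact source.
 */
static void
ProcessRecoveryConflictInterrupt(ProcSignalReason reason)
{
    switch (reason)
    {
        case PROCSIG_RECOVERY_CONFLICT_SNAPSHOT:
            /* ... conclude that the session cannot survive the conflict ... */

            /* 1. count the conflict in pg_stat_database */
            pgstat_report_recovery_conflict(reason);

            /*
             * 2. terminate the session.  Emitting this report to a dead
             * client socket is where the COMMERROR above comes from, and a
             * second PROCSIG_RECOVERY_CONFLICT_SNAPSHOT processed around
             * that point goes through step 1 again.
             */
            ereport(FATAL,
                    (errmsg("terminating connection due to conflict with recovery"),
                     errdetail("User query might have needed to see row versions that must be removed."),
                     errhint("In a moment you should be able to reconnect to the database and repeat your command.")));
            break;

        /* ... other conflict reasons ... */
        default:
            break;
    }
}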

The patch adding delays to reproduce the issue is attached.

With the patch applied, I run the test (against an "-O0" build) in a loop:
for i in `seq 20`; do echo "I $i"; make check -s -C \
src/test/recovery/ PROVE_TESTS="t/031*"; grep ' not ok 6 ' \
src/test/recovery/tmp_check/log/regress_log_031_recovery_conflict && break;
done

and get exactly the same failure as on olingo:
I 1
# +++ tap check in src/test/recovery +++
t/031_recovery_conflict.pl .. 6/?
#   Failed test 'snapshot conflict: stats show conflict on standby'
#   at t/031_recovery_conflict.pl line 333.
#          got: '2'
#     expected: '1'
t/031_recovery_conflict.pl .. 13/?
#   Failed test '5 recovery conflicts shown in pg_stat_database'
#   at t/031_recovery_conflict.pl line 287.
#          got: '6'
#     expected: '5'
# Looks like you failed 2 tests of 18.
t/031_recovery_conflict.pl .. Dubious, test returned 2 (wstat 512, 0x200)
Failed 2/18 subtests

Test Summary Report
-------------------
t/031_recovery_conflict.pl (Wstat: 512 Tests: 18 Failed: 2)
   Failed tests:  6, 17
   Non-zero exit status: 2

(Similar failures can be seen with other sections of 031_recovery_conflict.)

[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=olingo&dt=2024-05-15%2023%3A03%3A30

Best regards,
Alexander

Re: Test 031_recovery_conflict fails when a conflict is counted twice

From: Dmitry Dolgov
> On Mon, May 27, 2024 at 06:00:00PM GMT, Alexander Lakhin wrote:
> Hello hackers,
>
> As a recent buildfarm test failure on olingo (which builds postgres with
> -O1 and address sanitizer) [1] shows:
> ...
> [23:12:02.127](0.166s) not ok 6 - snapshot conflict: stats show conflict on standby
> [23:12:02.130](0.003s) #   Failed test 'snapshot conflict: stats show conflict on standby'
> #   at /home/bf/bf-build/olingo/HEAD/pgsql/src/test/recovery/t/031_recovery_conflict.pl line 332.
> [23:12:02.130](0.000s) #          got: '2'
> #     expected: '1'
> ...
> [23:12:06.848](1.291s) not ok 17 - 5 recovery conflicts shown in pg_stat_database
> [23:12:06.887](0.040s) #   Failed test '5 recovery conflicts shown in pg_stat_database'
> #   at /home/bf/bf-build/olingo/HEAD/pgsql/src/test/recovery/t/031_recovery_conflict.pl line 286.
> [23:12:06.887](0.000s) #          got: '6'
> #     expected: '5'
> Waiting for replication conn standby's replay_lsn to pass 0/3459160 on primary
> done
> ...
>
> pgsql.build/testrun/recovery/031_recovery_conflict/log/031_recovery_conflict_standby.log:
> 2024-05-15 23:12:01.616 UTC [1299981][client backend][2/2:0] LOG: statement:
> FETCH FORWARD FROM test_recovery_conflict_cursor;
> 2024-05-15 23:12:01.617 UTC [1299981][client backend][2/2:0] LOG: statement: ;
> 2024-05-15 23:12:01.910 UTC [1297595][startup][34/0:0] LOG: recovery still
> waiting after 15.289 ms: recovery conflict on snapshot
> 2024-05-15 23:12:01.910 UTC [1297595][startup][34/0:0] DETAIL: Conflicting process: 1299981.
> 2024-05-15 23:12:01.910 UTC [1297595][startup][34/0:0] CONTEXT:  WAL redo at
> 0/344F468 for Heap2/PRUNE_VACUUM_SCAN: snapshotConflictHorizon: 746,
> isCatalogRel: F, nplans: 2, nredirected: 18, ndead: 0, nunused: 0, plans: [{
> xmax: 0, infomask: 2816, infomask2: 2, ntuples: 2, offsets: [21, 22] }, {
> xmax: 0, infomask: 11008, infomask2: 32770, ntuples: 18, offsets: [41, 42,
> 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58] }],
> redirected: [23->41, 24->42, 25->43, 26->44, 27->45, 28->46, 29->47, 30->48,
> 31->49, 32->50, 33->51, 34->52, 35->53, 36->54, 37->55, 38->56, 39->57,
> 40->58]; blkref #0: rel 1663/16385/16386, blk 0
> 2024-05-15 23:12:01.959 UTC [1299981][client backend][2/2:0] FATAL: terminating connection due to conflict with recovery
> 2024-05-15 23:12:01.959 UTC [1299981][client backend][2/2:0] DETAIL:  User
> query might have needed to see row versions that must be removed.
> 2024-05-15 23:12:01.959 UTC [1299981][client backend][2/2:0] HINT: In a
> moment you should be able to reconnect to the database and repeat your
> command.
> vvv an important difference from a successful test run
> 2024-05-15 23:12:01.966 UTC [1299981][client backend][2/2:0] LOG: could not send data to client: Broken pipe
> 2024-05-15 23:12:01.966 UTC [1299981][client backend][2/2:0] FATAL: terminating connection due to conflict with recovery
> 2024-05-15 23:12:01.966 UTC [1299981][client backend][2/2:0] DETAIL:  User
> query might have needed to see row versions that must be removed.
> 2024-05-15 23:12:01.966 UTC [1299981][client backend][2/2:0] HINT: In a
> moment you should be able to reconnect to the database and repeat your
> command.
> ^^^
>
> test  031_recovery_conflict may fail in the following scenario:
>
> 031_recovery_conflict.pl:
>     executes  a query, which produces a conflict:
>     ## RECOVERY CONFLICT 2: Snapshot conflict
>     ...
>     $psql_standby->query_safe(...)
>
> startup process:
>         detects a snapshot conflict and sends
>         PROCSIG_RECOVERY_CONFLICT_SNAPSHOT:
>         (ResolveRecoveryConflictWithVirtualXIDs ->
>         CancelVirtualTransaction) to the client backend
>
> client backend:
>             receives and processes the signal:
>             HandleRecoveryConflictInterrupt; ProcessClientReadInterrupt ->
>             ProcessInterrupts -> ProcessRecoveryConflictInterrupts ->
>             ProcessRecoveryConflictInterrupt,
>
>             reports the recovery conflict:
>             pgstat_report_recovery_conflict(reason);
>
>             and reports the error:
>             ereport(FATAL, ... "terminating connection due to conflict with
>             recovery" ...)
>             sends the message to the server log:
>             errfinish -> EmitErrorReport -> send_message_to_server_log
>
> 031_recovery_conflict.pl:
>     # finds the message in the log and resets psql connection:
>     check_conflict_log(
>       "User query might have needed to see row versions that must
>        be removed");
>     $psql_standby->reconnect_and_clear();
>
> startup process:
>         keeps sending PROCSIG_RECOVERY_CONFLICT_SNAPSHOT to the client
>         backend in a loop inside ResolveRecoveryConflictWithVirtualXIDs
>
> client backend:
>             tries to send the message to the client:
>             send_message_to_frontend -> socket_flush ->
>             internal_flush_buffer,
>             gets an error (EPIPE) from secure_write, and calls
>             ereport(COMMERROR,
>               (errcode_for_socket_access(),
>                errmsg("could not send data to client: %m")));
>
>            receives the following PROCSIG_RECOVERY_CONFLICT_SNAPSHOT
>            signal and processes it the same way:
>
>            HandleRecoveryConflictInterrupt; ProcessClientReadInterrupt ->
>            ProcessInterrupts -> ProcessRecoveryConflictInterrupts ->
>            ProcessRecoveryConflictInterrupt,
>
>            reports the recovery conflict:
>            pgstat_report_recovery_conflict(reason);
>            // now the conflict is counted twice
>
>            and reports the error:
>            ereport(FATAL, ... "terminating connection due to conflict with
>            recovery" ...)
>            sends the message to the server log:
>            errfinish -> EmitErrorReport -> send_message_to_server_log
>
> 031_recovery_conflict.pl:
>     calls
>     check_conflict_stat("snapshot");
>     and gets 2 instead of 1.

Thanks for the breakdown of the steps, a very interesting case. If I
understand everything correctly, it can be summarized as: the startup
process tries to stop the client backend, which struggles with error
reporting just long enough to receive a second signal. The issue seems
rare; even on olingo it has appeared only once. Yet it raises the question
of whether this is incorrect behaviour or the test is simply not robust
enough.

From what I see in ProcessRecoveryConflictInterrupt, there is no defence
against a duplicated recovery conflict being reported, except the
expectation that the FATAL ereport will terminate the session (the limits
on recursive error reporting do not seem to prevent that). I guess this
points in the direction that the behaviour is expected, and
check_conflict_stat might have to check for a number of conflicts >= 1
rather than exactly 1.
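
For illustration, the relaxed check might look something like this (just a sketch; the names $node_standby, $test_db and $sect approximate what 031_recovery_conflict.pl already uses, this is not a proposed diff):

sub check_conflict_stat
{
    my $conflict_type = shift;

    my $count = $node_standby->safe_psql($test_db,
        qq[SELECT confl_$conflict_type FROM pg_stat_database
           WHERE datname = '$test_db';]);

    # Accept "at least one" conflict: the backend may report the same
    # conflict more than once if it is signalled again while struggling
    # to deliver the first FATAL message.
    cmp_ok($count, '>=', 1, "$sect: stats show conflict on standby");
}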

If that's not the case, and recovery conflicts have to be reported only
once, then either the startup process or the client backend has to take
care of that. My first idea was to experiment with longer waiting
intervals for an unresponsive client backend (currently it's a constant
of 5000 us) -- this seems to reduce the chances of reproducing the error
(with the attached patch applied), but I'm not sure it's an appropriate
solution.
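
For context, the constant in question is the pg_usleep() in the signalling loop of ResolveRecoveryConflictWithVirtualXIDs; roughly paraphrased (simplified, not the exact source from src/backend/storage/ipc/standby.c), the loop looks like this:

while (VirtualTransactionIdIsValid(*waitlist))
{
    /* wait until this virtual xid is gone */
    while (!VirtualXactLock(*waitlist, false))
    {
        /* once the max_standby_*_delay budget is exhausted, cancel it */
        if (WaitExceedsMaxStandbyDelay(wait_event_info))
        {
            pid_t       pid;

            pid = CancelVirtualTransaction(*waitlist, reason);

            /*
             * Wait a little bit for it to die so that we avoid flooding
             * an unresponsive backend when the system is heavily loaded
             * -- this is the 5000 us constant mentioned above.  Waiting
             * longer here, or backing off exponentially, makes
             * re-signalling a backend that is still busy emitting its
             * first FATAL report less likely, but does not rule it out.
             */
            if (pid != 0)
                pg_usleep(5000L);
        }
        /* ... */
    }

    /* this transaction is gone, move on to the next conflicting one */
    waitlist++;
}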