From: Dmitry Dolgov
Subject: Re: Test 031_recovery_conflict fails when a conflict counted twice
Msg-id: 7wcw42bhu6n5bmne6j7776qxooqvn2wdxho7wrxlr6x3alhwjm@nkaeflilqnq2
In response to: Test 031_recovery_conflict fails when a conflict counted twice (Alexander Lakhin <exclusion@gmail.com>)
List: pgsql-hackers

> On Mon, May 27, 2024 at 06:00:00PM GMT, Alexander Lakhin wrote:
> Hello hackers,
>
> As a recent buildfarm test failure on olingo (which builds postgres with
> -O1 and address sanitizer) [1] shows:
> ...
> [23:12:02.127](0.166s) not ok 6 - snapshot conflict: stats show conflict on standby
> [23:12:02.130](0.003s) # Failed test 'snapshot conflict: stats show conflict on standby'
> # at /home/bf/bf-build/olingo/HEAD/pgsql/src/test/recovery/t/031_recovery_conflict.pl line 332.
> [23:12:02.130](0.000s) # got: '2'
> #   expected: '1'
> ...
> [23:12:06.848](1.291s) not ok 17 - 5 recovery conflicts shown in pg_stat_database
> [23:12:06.887](0.040s) # Failed test '5 recovery conflicts shown in pg_stat_database'
> # at /home/bf/bf-build/olingo/HEAD/pgsql/src/test/recovery/t/031_recovery_conflict.pl line 286.
> [23:12:06.887](0.000s) # got: '6'
> #   expected: '5'
> Waiting for replication conn standby's replay_lsn to pass 0/3459160 on primary
> done
> ...
>
> pgsql.build/testrun/recovery/031_recovery_conflict/log/031_recovery_conflict_standby.log:
> 2024-05-15 23:12:01.616 UTC [1299981][client backend][2/2:0] LOG: statement: FETCH FORWARD FROM test_recovery_conflict_cursor;
> 2024-05-15 23:12:01.617 UTC [1299981][client backend][2/2:0] LOG: statement: ;
> 2024-05-15 23:12:01.910 UTC [1297595][startup][34/0:0] LOG: recovery still waiting after 15.289 ms: recovery conflict on snapshot
> 2024-05-15 23:12:01.910 UTC [1297595][startup][34/0:0] DETAIL: Conflicting process: 1299981.
> 2024-05-15 23:12:01.910 UTC [1297595][startup][34/0:0] CONTEXT: WAL redo at
>   0/344F468 for Heap2/PRUNE_VACUUM_SCAN: snapshotConflictHorizon: 746,
>   isCatalogRel: F, nplans: 2, nredirected: 18, ndead: 0, nunused: 0, plans: [{
>   xmax: 0, infomask: 2816, infomask2: 2, ntuples: 2, offsets: [21, 22] }, {
>   xmax: 0, infomask: 11008, infomask2: 32770, ntuples: 18, offsets: [41, 42,
>   43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58] }],
>   redirected: [23->41, 24->42, 25->43, 26->44, 27->45, 28->46, 29->47, 30->48,
>   31->49, 32->50, 33->51, 34->52, 35->53, 36->54, 37->55, 38->56, 39->57,
>   40->58]; blkref #0: rel 1663/16385/16386, blk 0
> 2024-05-15 23:12:01.959 UTC [1299981][client backend][2/2:0] FATAL: terminating connection due to conflict with recovery
> 2024-05-15 23:12:01.959 UTC [1299981][client backend][2/2:0] DETAIL: User query might have needed to see row versions that must be removed.
> 2024-05-15 23:12:01.959 UTC [1299981][client backend][2/2:0] HINT: In a moment you should be able to reconnect to the database and repeat your command.
> vvv an important difference from a successful test run
> 2024-05-15 23:12:01.966 UTC [1299981][client backend][2/2:0] LOG: could not send data to client: Broken pipe
> 2024-05-15 23:12:01.966 UTC [1299981][client backend][2/2:0] FATAL: terminating connection due to conflict with recovery
> 2024-05-15 23:12:01.966 UTC [1299981][client backend][2/2:0] DETAIL: User query might have needed to see row versions that must be removed.
> 2024-05-15 23:12:01.966 UTC [1299981][client backend][2/2:0] HINT: In a moment you should be able to reconnect to the database and repeat your command.
> ^^^
>
> test 031_recovery_conflict may fail in the following scenario:
>
> 031_recovery_conflict.pl:
>     executes a query, which produces a conflict:
>     ## RECOVERY CONFLICT 2: Snapshot conflict
>     ...
>     $psql_standby->query_safe(...)
>
> startup process:
>     detects a snapshot conflict and sends PROCSIG_RECOVERY_CONFLICT_SNAPSHOT:
>     (ResolveRecoveryConflictWithVirtualXIDs -> CancelVirtualTransaction)
>     to the client backend
>
> client backend:
>     receives and processes the signal:
>     HandleRecoveryConflictInterrupt; ProcessClientReadInterrupt ->
>     ProcessInterrupts -> ProcessRecoveryConflictInterrupts ->
>     ProcessRecoveryConflictInterrupt,
>
>     reports the recovery conflict:
>     pgstat_report_recovery_conflict(reason);
>
>     and reports the error:
>     ereport(FATAL, ... "terminating connection due to conflict with recovery" ...)
>     sends the message to the server log:
>     errfinish -> EmitErrorReport -> send_message_to_server_log
>
> 031_recovery_conflict.pl:
>     # finds the message in the log and resets psql connection:
>     check_conflict_log(
>         "User query might have needed to see row versions that must be removed");
>     $psql_standby->reconnect_and_clear();
>
> startup process:
>     keeps sending PROCSIG_RECOVERY_CONFLICT_SNAPSHOT to the client backend
>     in a loop inside ResolveRecoveryConflictWithVirtualXIDs
>
> client backend:
>     tries to send the message to the client:
>     send_message_to_frontend -> socket_flush -> internal_flush_buffer,
>     gets an error (EPIPE) from secure_write, and calls
>     ereport(COMMERROR,
>             (errcode_for_socket_access(),
>              errmsg("could not send data to client: %m")));
>
>     receives the following PROCSIG_RECOVERY_CONFLICT_SNAPSHOT signal
>     and processes it the same way:
>
>     HandleRecoveryConflictInterrupt; ProcessClientReadInterrupt ->
>     ProcessInterrupts -> ProcessRecoveryConflictInterrupts ->
>     ProcessRecoveryConflictInterrupt,
>
>     reports the recovery conflict:
>     pgstat_report_recovery_conflict(reason);
>     // now the conflict is counted twice
>
>     and reports the error:
>     ereport(FATAL, ... "terminating connection due to conflict with recovery" ...)
>     sends the message to the server log:
>     errfinish -> EmitErrorReport -> send_message_to_server_log
>
> 031_recovery_conflict.pl:
>     calls
>     check_conflict_stat("snapshot");
>     and gets 2 instead of 1.

Thanks for the breakdown of the steps, a very interesting case. If I
understand everything correctly, it could be summarized as: the startup
process tries to stop the client backend, which struggles with error
reporting just long enough to receive a second signal.

It seems this issue is rare; even on olingo it has appeared only once. Yet it
raises the question of whether this is incorrect behaviour, or the test is
simply not robust enough.

From what I see in ProcessRecoveryConflictInterrupt, there is no defence
against a duplicated recovery conflict being reported, except the expectation
that ereport(FATAL) will terminate the session (the limitations on recursive
error reporting do not seem to stop that). I guess this points in the
direction that the behaviour is expected, and check_conflict_stat might
better check for the number of conflicts being >= 1 rather than exactly 1.
If that's not the case, and recovery conflicts have to be reported only once,
then either the startup process or the client backend has to take care of
that.

My first idea was to experiment with longer waiting intervals for an
unresponsive client backend (currently it's a constant of 5000 us) -- it
seems to reduce the chances of reproducing the error (with the attached patch
applied), but I'm not sure it's an appropriate solution.
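To illustrate the direction (a simplified sketch of the idea, not necessarily
the exact shape of the attached patch; the surrounding loop is the one in
ResolveRecoveryConflictWithVirtualXIDs in standby.c, and cancel_wait_us with
its 1 s cap are made-up names/values), the fixed sleep after a cancellation
attempt could grow progressively, giving an unresponsive backend more time to
finish its FATAL exit before the next signal arrives:

    /* inner wait loop, simplified; currently the sleep is a fixed
     * pg_usleep(5000L) */
    long        cancel_wait_us = 5000;

    while (!VirtualXactLock(*waitlist, false))
    {
        /* Is it time to kill it? */
        if (WaitExceedsMaxStandbyDelay(wait_event_info))
        {
            pid_t       pid;

            pid = CancelVirtualTransaction(*waitlist, reason);

            /*
             * Wait a little bit for it to die, progressively longer on
             * each attempt, so that we avoid flooding an unresponsive
             * backend with repeated conflict signals.
             */
            if (pid != 0)
            {
                pg_usleep(cancel_wait_us);
                cancel_wait_us = Min(cancel_wait_us * 2, 1000000L);
            }
        }
    }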
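If, on the other hand, recovery conflicts really have to be counted only once
per session, a minimal sketch of the client backend side of it could look as
follows (the flag name and its exact placement inside
ProcessRecoveryConflictInterrupt are only illustrative):

    /* hypothetical flag: remember that this session already counted a
     * recovery conflict, so a second PROCSIG_RECOVERY_CONFLICT_* signal
     * arriving while the FATAL report is still being sent does not bump
     * the pg_stat_database counters again */
    static bool recovery_conflict_reported = false;

    ...

    if (!recovery_conflict_reported)
    {
        recovery_conflict_reported = true;
        pgstat_report_recovery_conflict(reason);
    }

    ereport(FATAL, ... "terminating connection due to conflict with recovery" ...);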