Re: Implement waiting for wal lsn replay: reloaded - Mailing list pgsql-hackers
| From | Alexander Korotkov |
|---|---|
| Subject | Re: Implement waiting for wal lsn replay: reloaded |
| Date | |
| Msg-id | CAPpHfduOE+P3YPfSWF8vSQEj+gBEBVq_N6w0PVZdMqO2sA7Grw@mail.gmail.com |
| In response to | Re: Implement waiting for wal lsn replay: reloaded (Xuneng Zhou <xunengzhou@gmail.com>) |
| Responses | Re: Implement waiting for wal lsn replay: reloaded |
| List | pgsql-hackers |
On Tue, Jan 6, 2026 at 3:12 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:
>
> Hi,
>
> On Tue, Jan 6, 2026 at 7:54 PM Alexander Korotkov <aekorotkov@gmail.com> wrote:
> >
> > On Tue, Jan 6, 2026 at 9:29 AM Xuneng Zhou <xunengzhou@gmail.com> wrote:
> > > On Tue, Jan 6, 2026 at 1:43 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> > > > Could this be causing the recent flapping failures on CI/macOS in
> > > > recovery/031_recovery_conflict? I didn't have time to dig personally
> > > > but f30848cb looks relevant:
> > > >
> > > > Waiting for replication conn standby's replay_lsn to pass 0/03467F58 on primary
> > > > error running SQL: 'psql:<stdin>:1: ERROR: canceling statement due to
> > > > conflict with recovery
> > > > DETAIL: User was or might have been using tablespace that must be dropped.'
> > > > while running 'psql --no-psqlrc --no-align --tuples-only --quiet
> > > > --dbname port=25195
> > > > host=/var/folders/g9/7rkt8rt1241bwwhd3_s8ndp40000gn/T/LqcCJnsueI
> > > > dbname='postgres' --file - --variable ON_ERROR_STOP=1' with sql 'WAIT
> > > > FOR LSN '0/03467F58' WITH (MODE 'standby_replay', timeout '180s',
> > > > no_throw);' at /Users/admin/pgsql/src/test/perl/PostgreSQL/Test/Cluster.pm
> > > > line 2300.
> > > >
> > > > https://cirrus-ci.com/task/5771274900733952
> > > >
> > > > The master branch in time-descending order, macOS tasks only:
> > > >
> > > >      task_id      | substring |  status
> > > > ------------------+-----------+-----------
> > > >  6460882231754752 | c970bdc0  | FAILED
> > > >  5771274900733952 | 6ca8506e  | FAILED
> > > >  6217757068361728 | 63ed3bc7  | FAILED
> > > >  5980650261446656 | ae283736  | FAILED
> > > >  6585898394976256 | 5f13999a  | COMPLETED
> > > >  4527474786172928 | 7f9acc9b  | COMPLETED
> > > >  4826100842364928 | e8d4e94a  | COMPLETED
> > > >  4540563027918848 | b9ee5f2d  | FAILED
> > > >  6358528648019968 | c5af141c  | FAILED
> > > >  5998005284765696 | e212a0f8  | COMPLETED
> > > >  6488580526178304 | b85d5dc0  | FAILED
> > > >  5034091344560128 | 7dc95cc3  | ABORTED
> > > >  5688692477526016 | bb048e31  | COMPLETED
> > > >  5481187977723904 | d351063e  | COMPLETED
> > > >  5101831568752640 | f30848cb  | COMPLETED <-- the change
> > > >  6395317408497664 | 3f33b63d  | COMPLETED
> > > >  6741325208354816 | 877ae5db  | COMPLETED
> > > >  4594007789010944 | de746e0d  | COMPLETED
> > > >  6497208998035456 | 461b8cc9  | COMPLETED
> > >
> > > Thanks for raising this issue. I think it is related to f30848cb after
> > > some analysis. I'll prepare a follow-up patch to fix it.
> >
> > Sorry, I've mistakenly referenced this report from commit [1]. I
> > thought it was related, but it appears to be not. [1] is related to
> > the report I've got from Ruikai Peng off-list.
> >
> > Regarding the present failure, could it happen before ExecWaitStmt()
> > calls PopActiveSnapshot() and InvalidateCatalogSnapshot()? If so, we
> > should do preliminary efforts to release these snapshots.
> >
> > 1. https://git.postgresql.org/pg/commitdiff/bf308639bfcfa38541e24733e074184153a8ab7f
> >
>
> I agree that moving PopActiveSnapshot() and
> InvalidateCatalogSnapshot() to the very beginning of ExecWaitStmt()
> appears to be a sensible optimization. However, in this particular
> failure scenario, it may not address the issue.
>
> For tablespace conflicts, recovery conflict resolution uses
> GetConflictingVirtualXIDs(InvalidTransactionId, InvalidOid), which
> returns all active backends, regardless of their snapshot state. As a
> result, even if all snapshots are released at the start of
> ExecWaitStmt(), the session would still be canceled during replay of
> DROP TABLESPACE.

GetConflictingVirtualXIDs() uses proc->xmin to detect the conflicts.
ExecWaitStmt() asserts MyProc->xmin == InvalidTransactionId after
releasing all the snapshots. I still think this happens because
conflict handling happens before ExecWaitStmt() manages to release the
snapshots.

------
Regards,
Alexander Korotkov
Supabase
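
A minimal sketch of the ordering argued above, assuming (as the message states, and as the quoted reply disputes) that conflict detection looks at the backend's xmin. Every `toy_` name is hypothetical and none of this is PostgreSQL source; it only models two interleavings of "conflict handling" and "snapshot release" for a single backend.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy stand-ins; these are NOT PostgreSQL's definitions. */
typedef uint32_t ToyXid;
#define TOY_INVALID_XID ((ToyXid) 0)

typedef struct ToyBackend
{
    ToyXid xmin;     /* nonzero while the backend still holds a snapshot */
    bool   canceled; /* set when the toy conflict check fires */
} ToyBackend;

/*
 * Hypothetical conflict test. Assumption taken from the message: a backend
 * only counts as conflicting while its xmin is valid.
 */
static bool
toy_backend_conflicts(const ToyBackend *b)
{
    return b->xmin != TOY_INVALID_XID;
}

/* Models recovery replaying a conflicting record and canceling the backend. */
static void
toy_resolve_conflict(ToyBackend *b)
{
    if (toy_backend_conflicts(b))
        b->canceled = true;
}

/* Models the wait statement releasing its snapshots, then asserting on xmin. */
static void
toy_release_snapshots(ToyBackend *b)
{
    b->xmin = TOY_INVALID_XID;
    assert(b->xmin == TOY_INVALID_XID);
}

/* Runs one interleaving and reports whether the backend was canceled. */
static bool
toy_run(bool conflict_first)
{
    ToyBackend b = { .xmin = 100, .canceled = false };

    if (conflict_first)
    {
        toy_resolve_conflict(&b);   /* snapshot still held */
        toy_release_snapshots(&b);
    }
    else
    {
        toy_release_snapshots(&b);  /* snapshot already gone */
        toy_resolve_conflict(&b);
    }
    return b.canceled;
}
```

In this model `toy_run(true)` reports a cancel and `toy_run(false)` does not, which is the window the message describes. If the check ignores xmin for tablespace conflicts, as the quoted reply argues, both orderings would cancel and the model does not apply.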