Re: Implement waiting for wal lsn replay: reloaded - Mailing list pgsql-hackers
| From | Xuneng Zhou |
|---|---|
| Subject | Re: Implement waiting for wal lsn replay: reloaded |
| Date | |
| Msg-id | CABPTF7UtCZW4EcOaTDnBgMxmdsx9RS_d5Q+LbfroQYLzK2g__A@mail.gmail.com |
| In response to | Re: Implement waiting for wal lsn replay: reloaded (Alexander Korotkov <aekorotkov@gmail.com>) |
| Responses | Re: Implement waiting for wal lsn replay: reloaded; Re: Implement waiting for wal lsn replay: reloaded |
| List | pgsql-hackers |
Hi,

On Tue, Jan 6, 2026 at 7:54 PM Alexander Korotkov <aekorotkov@gmail.com> wrote:
>
> On Tue, Jan 6, 2026 at 9:29 AM Xuneng Zhou <xunengzhou@gmail.com> wrote:
> > On Tue, Jan 6, 2026 at 1:43 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> > > Could this be causing the recent flapping failures on CI/macOS in
> > > recovery/031_recovery_conflict? I didn't have time to dig personally
> > > but f30848cb looks relevant:
> > >
> > > Waiting for replication conn standby's replay_lsn to pass 0/03467F58 on primary
> > > error running SQL: 'psql:<stdin>:1: ERROR: canceling statement due to
> > > conflict with recovery
> > > DETAIL: User was or might have been using tablespace that must be dropped.'
> > > while running 'psql --no-psqlrc --no-align --tuples-only --quiet
> > > --dbname port=25195
> > > host=/var/folders/g9/7rkt8rt1241bwwhd3_s8ndp40000gn/T/LqcCJnsueI
> > > dbname='postgres' --file - --variable ON_ERROR_STOP=1' with sql 'WAIT
> > > FOR LSN '0/03467F58' WITH (MODE 'standby_replay', timeout '180s',
> > > no_throw);' at /Users/admin/pgsql/src/test/perl/PostgreSQL/Test/Cluster.pm
> > > line 2300.
> > >
> > > https://cirrus-ci.com/task/5771274900733952
> > >
> > > The master branch in time-descending order, macOS tasks only:
> > >
> > >      task_id      | substring |  status
> > > ------------------+-----------+-----------
> > >  6460882231754752 | c970bdc0  | FAILED
> > >  5771274900733952 | 6ca8506e  | FAILED
> > >  6217757068361728 | 63ed3bc7  | FAILED
> > >  5980650261446656 | ae283736  | FAILED
> > >  6585898394976256 | 5f13999a  | COMPLETED
> > >  4527474786172928 | 7f9acc9b  | COMPLETED
> > >  4826100842364928 | e8d4e94a  | COMPLETED
> > >  4540563027918848 | b9ee5f2d  | FAILED
> > >  6358528648019968 | c5af141c  | FAILED
> > >  5998005284765696 | e212a0f8  | COMPLETED
> > >  6488580526178304 | b85d5dc0  | FAILED
> > >  5034091344560128 | 7dc95cc3  | ABORTED
> > >  5688692477526016 | bb048e31  | COMPLETED
> > >  5481187977723904 | d351063e  | COMPLETED
> > >  5101831568752640 | f30848cb  | COMPLETED  <-- the change
> > >  6395317408497664 | 3f33b63d  | COMPLETED
> > >  6741325208354816 | 877ae5db  | COMPLETED
> > >  4594007789010944 | de746e0d  | COMPLETED
> > >  6497208998035456 | 461b8cc9  | COMPLETED
> >
> > Thanks for raising this issue. I think it is related to f30848cb after
> > some analysis. I'll prepare a follow-up patch to fix it.
>
> Sorry, I've mistakenly referenced this report from commit [1]. I
> thought it was related, but it appears to be not. [1] is related to
> the report I've got from Ruikai Peng off-list.
>
> Regarding the present failure, could it happen before ExecWaitStmt()
> calls PopActiveSnapshot() and InvalidateCatalogSnapshot()? If so, we
> should do preliminary efforts to release these snapshots.
>
> 1. https://git.postgresql.org/pg/commitdiff/bf308639bfcfa38541e24733e074184153a8ab7f

I agree that moving PopActiveSnapshot() and InvalidateCatalogSnapshot()
to the very beginning of ExecWaitStmt() looks like a sensible
optimization. However, in this particular failure scenario, it may not
address the issue. For tablespace conflicts, recovery conflict
resolution uses GetConflictingVirtualXIDs(InvalidTransactionId,
InvalidOid), which returns all active backends regardless of their
snapshot state. As a result, even if all snapshots were released at the
start of ExecWaitStmt(), the session would still be canceled during
replay of DROP TABLESPACE.
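For reference, the relevant calls in ResolveRecoveryConflictWithTablespace()
(src/backend/storage/ipc/standby.c) look roughly like this on current master
(quoting from memory, so minor details may differ):

    VirtualTransactionId *temp_file_users;

    /*
     * Both arguments are "invalid", so every backend with an active
     * virtual xid matches, whether or not it holds any snapshot.
     */
    temp_file_users = GetConflictingVirtualXIDs(InvalidTransactionId,
                                                InvalidOid);
    ResolveRecoveryConflictWithVirtualXIDs(temp_file_users,
                                           PROCSIG_RECOVERY_CONFLICT_TABLESPACE,
                                           WAIT_EVENT_RECOVERY_CONFLICT_TABLESPACE,
                                           true);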
Given this, I am considering handling this conflict class explicitly:
if the WAIT FOR statement is terminated and the error indicates a
recovery conflict, we fall back to the existing polling-based approach.

The comment in ResolveRecoveryConflictWithTablespace() says:

     * Ask everybody to cancel their queries immediately so we can ensure no
     * temp files remain and we can remove the tablespace. Nuke the entire
     * site from orbit, it's the only way to be sure.
     *
     * XXX: We could work out the pids of active backends using this
     * tablespace by examining the temp filenames in the directory. We would
     * then convert the pids into VirtualXIDs before attempting to cancel
     * them.

I am also wondering whether this optimization would be helpful; a rough
sketch of what it might look like is below.
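This is untested and only lightly checked against the tree; the
directory layout, the pid-to-vxid conversion, and the function name
CancelTablespaceTempFileUsers() are my assumptions:

    #include "postgres.h"

    #include "common/relpath.h"
    #include "storage/fd.h"

    /*
     * Untested sketch of the XXX idea above: scan the tablespace's temp
     * directory and cancel only the backends whose pids appear in temp
     * file names, instead of nuking every active backend.
     */
    static void
    CancelTablespaceTempFileUsers(Oid tsid)
    {
        char        dirpath[MAXPGPATH];
        DIR        *dir;
        struct dirent *de;

        /* pg_tblspc/<tsid>/<version directory>/pgsql_tmp */
        snprintf(dirpath, sizeof(dirpath), "pg_tblspc/%u/%s/%s",
                 tsid, TABLESPACE_VERSION_DIRECTORY, PG_TEMP_FILES_DIR);

        dir = AllocateDir(dirpath);
        while ((de = ReadDirExtended(dir, dirpath, LOG)) != NULL)
        {
            int         pid;

            /* temp files are named pgsql_tmp<PID>.<counter> */
            if (sscanf(de->d_name, PG_TEMP_FILE_PREFIX "%d", &pid) != 1)
                continue;

            /*
             * Here we would map pid to a VirtualTransactionId (e.g. via
             * BackendPidGetProc() and GET_VXID_FROM_PGPROC()), collect
             * the vxids into a waitlist, and hand that to
             * ResolveRecoveryConflictWithVirtualXIDs() instead of the
             * result of GetConflictingVirtualXIDs(InvalidTransactionId,
             * InvalidOid).
             */
            elog(LOG, "tablespace %u has temp files from pid %d",
                 tsid, pid);
        }
        FreeDir(dir);
    }

One open question is the race against backends creating new temp files
(or pids being recycled) between the directory scan and the
cancellation.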
--
Best,
Xuneng