Re: 001_rep_changes.pl fails due to publisher stuck on shutdown - Mailing list pgsql-hackers
From | Peter Smith |
---|---|
Subject | Re: 001_rep_changes.pl fails due to publisher stuck on shutdown |
Date | |
Msg-id | CAHut+Pv0KcOpOEDL8cY7Si4AVfsCFhD--4V7HvMpitXcv6PEiA@mail.gmail.com Whole thread Raw |
In response to | Re: 001_rep_changes.pl fails due to publisher stuck on shutdown (vignesh C <vignesh21@gmail.com>) |
List | pgsql-hackers |
On Thu, May 30, 2024 at 2:09 AM vignesh C <vignesh21@gmail.com> wrote: > > On Wed, 29 May 2024 at 16:30, Alexander Lakhin <exclusion@gmail.com> wrote: > > > > Hello hackers, > > > > As a recent buildfarm test failure [1] shows: > > [14:33:02.374](0.333s) ok 23 - update works with dropped subscriber column > > ### Stopping node "publisher" using mode fast > > # Running: pg_ctl -D > > /home/bf/bf-build/adder/HEAD/pgsql.build/testrun/subscription/001_rep_changes/data/t_001_rep_changes_publisher_data/pgdata > > -m fast stop > > waiting for server to shut down.. ... ... ... .. failed > > pg_ctl: server does not shut down > > # pg_ctl stop failed: 256 > > # Postmaster PID for node "publisher" is 2222549 > > [14:39:04.375](362.001s) Bail out! pg_ctl stop failed > > > > 001_rep_changes_publisher.log > > 2024-05-16 14:33:02.907 UTC [2238704][client backend][4/22:0] LOG: statement: DELETE FROM tab_rep > > 2024-05-16 14:33:02.925 UTC [2238704][client backend][:0] LOG: disconnection: session time: 0:00:00.078 user=bf > > database=postgres host=[local] > > 2024-05-16 14:33:02.939 UTC [2222549][postmaster][:0] LOG: received fast shutdown request > > 2024-05-16 14:33:03.000 UTC [2222549][postmaster][:0] LOG: aborting any active transactions > > 2024-05-16 14:33:03.049 UTC [2222549][postmaster][:0] LOG: background worker "logical replication launcher" (PID > > 2223110) exited with exit code 1 > > 2024-05-16 14:33:03.062 UTC [2222901][checkpointer][:0] LOG: shutting down > > 2024-05-16 14:39:04.377 UTC [2222549][postmaster][:0] LOG: received immediate shutdown request > > 2024-05-16 14:39:04.382 UTC [2222549][postmaster][:0] LOG: database system is shut down > > > > the publisher node may hang on stopping. > > > > I reproduced the failure (with aggressive autovacuum) locally and > > discovered that it happens because: > > 1) checkpointer calls WalSndInitStopping() (which sends > > PROCSIG_WALSND_INIT_STOPPING to walsender), and then spins inside > > WalSndWaitStopping() indefinitely, because: > > 2) walsender receives the signal, sets got_STOPPING = true, but can't exit > > WalSndLoop(): > > 3) it never sets got_SIGUSR2 (to get into WalSndDone()) in > > XLogSendLogical(): > > 4) it never sets WalSndCaughtUp to true: > > 5) logical_decoding_ctx->reader->EndRecPtr can't reach flushPtr in > > XLogSendLogical(): > > 6) EndRecPtr doesn't advance in XLogNextRecord(): > > 7) XLogDecodeNextRecord() fails do decode a record that crosses a page > > boundary: > > 8) ReadPageInternal() (commented "Wait for the next page to become > > available") constantly returns XLREAD_FAIL: > > 9) state->routine.page_read()/logical_read_xlog_page() constantly returns > > -1: > > 10) flushptr = WalSndWaitForWal() stops advancing, because > > got_STOPPING == true (see 2). > > > > That is, walsender doesn't let itself to catch up, if it gets the stop > > signal when it's lagging behind and decoding a record requires reading > > the next wal page. > > > > Please look at the reproducing test (based on 001_rep_changes.pl) attached. > > If fails for me as below: > > # 17 > > Bailout called. Further testing stopped: pg_ctl stop failed > > FAILED--Further testing stopped: pg_ctl stop failed > > make: *** [Makefile:21: check] Ошибка 255 > > Thank you, Alexander, for sharing the script. I was able to reproduce > the issue using the provided script. Furthermore, while investigating > its origins, I discovered that this problem persists across all > branches up to PG10 (the script needs slight adjustments to run it on > older versions). It's worth noting that this issue isn't a result of > recent version changes. > Hi, FWIW using the provided scripting I was also able to reproduce the problem on HEAD but for me, it was more rare. -- the script passed ok 3 times all 100 iterations; it eventually failed on the 4th run on the 75th iteration. ====== Kind Regards, Peter Smith. Fujitsu Australia
pgsql-hackers by date: