Re: Startup process deadlock: WaitForProcSignalBarriers vs aux process - Mailing list pgsql-hackers
| From | Masahiko Sawada |
|---|---|
| Subject | Re: Startup process deadlock: WaitForProcSignalBarriers vs aux process |
| Date | |
| Msg-id | CAD21AoA6EVJ6eNq6xSgHffu9R-kHFhas2BO3jt9JjpTYLi3+Jg@mail.gmail.com |
| In response to | Re: Startup process deadlock: WaitForProcSignalBarriers vs aux process (Alexander Lakhin <exclusion@gmail.com>) |
| List | pgsql-hackers |
On Fri, May 1, 2026 at 1:00 AM Alexander Lakhin <exclusion@gmail.com> wrote:
>
> Dear Sawada-san,
>
> 01.05.2026 01:08, Masahiko Sawada wrote:
> > On Wed, Apr 29, 2026 at 11:00 AM Alexander Lakhin <exclusion@gmail.com> wrote:
> > > I was wondering why that failure is the only one of this kind on the
> > > buildfarm (in the last two years, at least), so I've tried to reproduce
> > > it on REL_18_STABLE... and failed.
> > >
> > > Then I've bisected it on the master branch and found (your) commit that
> > > introduced this behavior: 67c20979c from 2025-12-23.
> >
> > I've confirmed that this race condition issue is present from v15 to
> > master. In v14, we have the procsignal barrier code but don't use it
> > anywhere. In v18 or older, it could happen when executing DROP DATABASE,
> > DROP TABLESPACE, etc., whereas on master it could happen in more cases,
> > as we're using the procsignal barrier in more places. In any case, if a
> > process emits a signal barrier while another process is between
> > initializing slot->pss_barrierGeneration and initializing slot->pss_pid,
> > the subsequent WaitForProcSignalBarrier() ends up waiting for that
> > process forever. So I think the patch should be backpatched to v15.
> > Please review these patches.
>
> Yes, you're right -- it's not reproduced on REL_18_STABLE with
> test_oat_hooks, which simply starts a postgres node (as many other tests
> do), but when I tried the full test suite with the sleep inserted before
> setting pss_pid, I discovered the following vulnerable tests:
>
> 030_stats_cleanup_replica_standby.log
> 2026-05-01 06:00:58.789 UTC [2086579] LOG: still waiting for backend with PID 2086578 to accept ProcSignalBarrier
> 2026-05-01 06:00:58.789 UTC [2086579] CONTEXT: WAL redo at 0/3410B00 for Database/DROP: dir 1663/16393
>
> 033_replay_tsp_drops_standby2_FILE_COPY.log
> 2026-05-01 05:45:12.969 UTC [2030902] LOG: still waiting for backend with PID 2030901 to accept ProcSignalBarrier
> 2026-05-01 05:45:12.969 UTC [2030902] CONTEXT: WAL redo at 0/30006A8 for Database/CREATE_FILE_COPY: copy dir 1663/1 to 16384/16389
>
> 040_standby_failover_slots_sync_publisher.log
> 2026-05-01 02:16:00.107 UTC [1538468] 040_standby_failover_slots_sync.pl LOG: still waiting for backend with PID 1538477 to accept ProcSignalBarrier
> 2026-05-01 02:16:00.107 UTC [1538468] 040_standby_failover_slots_sync.pl STATEMENT: DROP DATABASE slotsync_test_db;
>
> 002_compare_backups_pitr1.log
> 2026-05-01 04:50:46.638 UTC [1829328] LOG: still waiting for backend with PID 1829396 to accept ProcSignalBarrier
> 2026-05-01 04:50:46.638 UTC [1829328] CONTEXT: WAL redo at 0/30A1DE0 for Database/DROP: dir 1663/16414
>
> I've tried my repro with 033_replay_tsp_drops and it really fails on
> REL_15_STABLE..master and doesn't fail on REL_14_STABLE.
>
> > FYI, I found that we had a similar report[1] last year; I'm not sure
> > it hit the exact same issue, though.
> > Regards,
> >
> > [1] https://www.postgresql.org/message-id/CAGQGyDTaVkG3DbTEbtyxZLM48jMZR2BcvTeYBsWLV5HvwSb+2Q@mail.gmail.com
>
> Yeah, and probably this one:
> https://www.postgresql.org/message-id/EF98BB5B-CA83-443E-B8A6-AA58EE4A06BB%40yandex-team.ru
>
> By the way, mamba produced the same failure just yesterday:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mamba&dt=2026-04-30%2005%3A10%3A39
>
> # Running: pg_ctl --wait --pgdata /home/buildfarm/bf-data/HEAD/pgsql.build/src/test/modules/commit_ts/tmp_check/t_004_restart_primary_data/pgdata --log /home/buildfarm/bf-data/HEAD/pgsql.build/src/test/modules/commit_ts/tmp_check/log/004_restart_primary.log --options --cluster-name=primary start
> waiting for server to start............................................................................................................................................................................................................ stopped waiting
> pg_ctl: server did not start in time
>
> 004_restart_primary.log
> 2026-04-30 04:09:04.025 EDT [17814:2] LOG: still waiting for backend with PID 11506 to accept ProcSignalBarrier
> ...
> 2026-04-30 04:19:55.336 EDT [17814:132] LOG: still waiting for backend with PID 11506 to accept ProcSignalBarrier
>
> The proposed patches make the test pass reliably for me in all affected
> branches. Thank you for working on this!
>

Thank you for checking this issue on the stable branches too! Considering
that this issue is not very visible in practice and we're going to release
new minor versions next week, I'm planning to push these fixes to master
and the back-branches after the minor releases. That way, we can fix the
issue on master relatively soon and still have enough time to verify that
the fix works well on the back-branches.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
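For readers following the race described in the quoted paragraphs above, here is a minimal, single-process sketch of the interleaving. It is not the PostgreSQL source: the struct and field names only loosely mirror procsignal.c, the shared generation is an ordinary variable rather than an atomic in shared memory, the competing barrier emitter is collapsed into a single increment, and the numeric values (41, 12345) are made up for illustration.

```c
/*
 * A minimal, single-process sketch of the race described above -- not the
 * PostgreSQL source.  Names only loosely mirror procsignal.c.
 */
#include <stdint.h>
#include <stdio.h>

typedef struct
{
	int			pss_pid;				/* 0 = slot not yet claimed */
	uint64_t	pss_barrierGeneration;	/* last generation this slot absorbed */
} FakeProcSignalSlot;

/* illustrative starting value for the shared barrier generation */
static uint64_t shared_generation = 41;
static FakeProcSignalSlot slot = {0, 0};

int
main(void)
{
	/* New process, step 1: copy the current shared generation into its slot. */
	slot.pss_barrierGeneration = shared_generation;

	/*
	 * Race window: another process emits a barrier here.  It bumps the shared
	 * generation and signals every slot whose pss_pid is set -- our slot still
	 * has pss_pid == 0, so it gets no signal.
	 */
	shared_generation++;

	/* New process, step 2: only now publish its PID (12345 is made up). */
	slot.pss_pid = 12345;

	/*
	 * The waiter keeps waiting while any slot's generation lags the shared
	 * one.  This slot never received the signal telling it to catch up, so
	 * the condition below never becomes false on its own.
	 */
	if (slot.pss_barrierGeneration < shared_generation)
		printf("waiter would block: slot at %llu, shared at %llu\n",
			   (unsigned long long) slot.pss_barrierGeneration,
			   (unsigned long long) shared_generation);
	return 0;
}
```

The sketch only shows why the waiter's exit condition can never be satisfied on its own; it deliberately says nothing about how the patches attached in this thread close the window.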