Core dumps from recovery/017_shm - Mailing list pgsql-hackers

From Thomas Munro
Subject Core dumps from recovery/017_shm
Date
Msg-id CA+hUKGKzfkN6re3yboQ+9qbhV3+f8Qk__ZCApSKY+NoC1Y1thA@mail.gmail.com
List pgsql-hackers
While looking for something else, I noticed that we occasionally see
assertion failures like this:

TRAP: failed Assert("latch->maybe_sleeping == false"), File: "latch.c", Line: 378, PID: 28023

Here's one in the build farm:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mamba&dt=2025-08-05%2005:52:51

And here are some recent cases on CI. Those runs also fail somewhere
else, but that might be expected, as these are cfbot branches built
from patches on the mailing list:

     task_id      |            task_name
------------------+---------------------------------
 6347210574528512 | Linux - Debian Bookworm - Meson
 6420333948829696 | FreeBSD - Meson
 5616450825617408 | FreeBSD - Meson
 4515661445070848 | Linux - Debian Bookworm - Meson
 4945927242252288 | Linux - Debian Bookworm - Meson
 5133563223343104 | Linux - Debian Bookworm - Meson

You can drop those task IDs into these URLs:

https://cirrus-ci.com/task/$TASK_ID
https://api.cirrus-ci.com/v1/artifact/task/$TASK_ID/testrun/build/testrun/recovery/017_shm/log/017_shm_gnat.log

My current theory is that backends are exiting when the test kills the
postmaster, but a backend that is concurrently starting up takes over
an exiting backend's latch, and then the new backend's first
ResetLatch(MyLatch) fails that assertion because maybe_sleeping was
never cleared.  So I suppose it should be cleared in ... DisownLatch()?

That sails close to the topic in these threads:

https://www.postgresql.org/message-id/flat/B3C69B86-7F82-4111-B97F-0005497BB745%40yandex-team.ru
https://www.postgresql.org/message-id/flat/CA+hUKGKp0kTpummCPa97+WFJTm+uYzQ9Ex8UMdH8ZXkLwO0QgA@mail.gmail.com

If we didn't use proc_exit(), we wouldn't recycle the latch, so the
problem would go away with the new emergency cleanup solution I'm
working on (which incidentally also gets rid of the other source of
core dump spam that clogs up BF and CI systems: archive scripts and
other subprocesses of backends).  More about that soon on that last
thread, but...

That would still leave versions 15-18 with these rare assertion
failures, since they have commit c8f3bc24.  So I think the thing to do
is change DisownLatch() to clear maybe_sleeping just where it also
clears owner_pid, and backpatch that.  Another idea would be to do it
in WaitEventSetWaitBlock() before exiting, but that'd be duplicated in
several places.
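
To make the proposal concrete, here is a minimal, self-contained sketch
of the scenario described above. It is a toy model, not PostgreSQL
code: the ToyLatch struct and every toy_* function are hypothetical
names invented for illustration, and only the fields owner_pid,
maybe_sleeping and is_set mirror names from the real Latch. It assumes,
as the theory does, that the exiting backend leaves its wait with
maybe_sleeping still set and that disowning currently clears only
owner_pid.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for the latch fields discussed in this thread. */
typedef struct ToyLatch
{
	bool		is_set;
	bool		maybe_sleeping;
	int			owner_pid;		/* 0 means "not owned" */
} ToyLatch;

/* Hypothetical helper: a process takes ownership of a free latch. */
static void
toy_own_latch(ToyLatch *latch, int pid)
{
	assert(latch->owner_pid == 0);
	latch->owner_pid = pid;
}

/* Models a wait that has started and is never finished normally. */
static void
toy_begin_wait(ToyLatch *latch)
{
	latch->maybe_sleeping = true;
}

/* Disowning as assumed today: only owner_pid is cleared. */
static void
toy_disown_latch_current(ToyLatch *latch)
{
	latch->owner_pid = 0;
}

/* Disowning as proposed: clear maybe_sleeping alongside owner_pid. */
static void
toy_disown_latch_proposed(ToyLatch *latch)
{
	latch->owner_pid = 0;
	latch->maybe_sleeping = false;
}

/*
 * Returns true if the two conditions asserted on reset would hold,
 * instead of aborting the way the real Assert() does.
 */
static bool
toy_reset_latch_ok(ToyLatch *latch, int pid)
{
	if (latch->owner_pid != pid || latch->maybe_sleeping)
		return false;
	latch->is_set = false;
	return true;
}

/* Old backend (pid 100) exits mid-wait; new backend (pid 200) reuses the latch. */
static bool
toy_recycle_scenario(bool use_proposed_fix)
{
	ToyLatch	latch = {false, false, 0};

	toy_own_latch(&latch, 100);
	toy_begin_wait(&latch);		/* exit happens before the flag is cleared */
	if (use_proposed_fix)
		toy_disown_latch_proposed(&latch);
	else
		toy_disown_latch_current(&latch);

	toy_own_latch(&latch, 200);
	return toy_reset_latch_ok(&latch, 200);
}
```

In this model, toy_recycle_scenario(false) returns false, which
corresponds to the assertion failure, and toy_recycle_scenario(true)
returns true. Clearing the flag at disown time keeps the fix in one
place, whereas clearing it before each exit in the wait code would
have to be repeated at every exit path.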


