Dividing line between before_shmem_exit and on_shmem_exit? - Mailing list pgsql-hackers

From Andres Freund
Subject Dividing line between before_shmem_exit and on_shmem_exit?
Date
Msg-id 20210803023612.iziacxk5syn2r4ut@alap3.anarazel.de
Whole thread Raw
List pgsql-hackers
Hi,

I don't think I really understand what decides what should use
before_shmem_exit() and what into on_shmem_exit(). The existing comments say:

/* ----------------------------------------------------------------
 *        before_shmem_exit
 *
 *        Register early callback to perform user-level cleanup,
 *        e.g. transaction abort, before we begin shutting down
 *        low-level subsystems.
 * ----------------------------------------------------------------
 */

/* ----------------------------------------------------------------
 *        on_shmem_exit
 *
 *        Register ordinary callback to perform low-level shutdown
 *        (e.g. releasing our PGPROC); run after before_shmem_exit
 *        callbacks and before on_proc_exit callbacks.
 * ----------------------------------------------------------------
 */

but I don't find it super obvious when a subsystem is "low-level" or
"user-level". Which is not helped by a fairly inconsistent use of the
mechanism. E.g. ParallelWorkerShutdown() is on_*, ShutdownPostgres,
ShutdownAuxiliaryProcess are before_* and I can't really make sense of that.

Nor do I really understand why dsm_backend_shutdown() is scheduled to be
exactly between those phases? That seems to indicate that no "low-level"
subsystem could ever use DSM, but I don't really understand that either. I
realize it has to be reliably be re-invoked in case of errors as the comment
in shmem_exit() explains - but does not really explain why that's the right
moment to shut down DSM.

I suspect it's mostly pragmatic, because it has to happen before we tear down
"really fundamental" subsystems, but there's no easy way to call
dsm_backend_shutdown() repeatedly at that point?


The concrete reason I am asking is that for the shared memory stats patch I
needed to adjust a few on/before determinations. The fundamental reason for
that is that the shared memory stats patch uses DSM (since stats are variably
sized), which means that anything that could report stats needs to happen in
before_shmem_exit().

To make things work in a halfway reasonable fashion I needed to adjust:

1) Single user mode using on_shmem_exit(ShutdownXLOG()) to before*. From the
   stats side that makes sense - clearly checkpoints can produce stat worthy
   stuff, so it has to happen before the stats subsystem is inactive. I don't
   know if it's "low-level" in the sense referenced in the comments above
   though.

2) Change ParallelWorkerShutdown to before*. That seems consistent with other
   *Shutdown* routines, so ...

3) Have ParallelWorkerShutdown() also trigger detaching from its dsm
   segment. Some on_dsm_detach() callbacks trigger stats (e.g. sharedfileset.c
   causes fd.c to do ReportTemporaryFileUsage()), which still need to be
   reported.

   Unfortunately ordering on_dsm_detach() callbacks is *not* sufficient,
   because dsa.c needs to be able to create new dsm segments if the amount of
   stats grow - and those will be put at the head of dsm_segment_list, which
   in turn will lead to dsm_backend_shutdown() to detach from that newly added
   segment last. Oops.

Pragmatically I can live with all of these changes and I don't think they
should be blockers for the shmem stats patch.


However, this situation feels deeply unsatisfying. We have to be very careful
about nesting initialization of subsystems correctly (see e.g. comments for
ShutdownPostgres's registration), there's no way to order dsm callbacks in a
proper order, on_shmem_exit is used for things ranging from
pgss_shmem_shutdown over ProcKill to IpcMemoryDelete, ...

IOW, the reason that I am (and I think others) are struggling with
before/on_shmem_exit is that the division is not really good enough.

I'm not even convinced that dynamic registration of such callbacks is a good
idea. At least for the important in core stuff it might be better to have a
centrally maintained ordering of callbacks, plus a bit of state describing how
far teardown has progressed (to deal with failing callback issues).

Greetings,

Andres Freund



pgsql-hackers by date:

Previous
From: Thomas Munro
Date:
Subject: Re: Background writer and checkpointer in crash recovery
Next
From: Kyotaro Horiguchi
Date:
Subject: Re: archive status ".ready" files may be created too early