Thread: Re: Reorder shutdown sequence, to flush pgstats later

Re: Reorder shutdown sequence, to flush pgstats later

From
Heikki Linnakangas
Date:
On 08/01/2025 04:10, Andres Freund wrote:
> I instead opted to somewhat rearrange the shutdown sequence:
> 
>      This commit changes the shutdown sequence so checkpointer is signalled to
>      trigger writing the shutdown checkpoint without terminating it. Once
>      checkpointer wrote the checkpoint it will wait for a termination
>      signal.
> 
>      Postmaster now triggers the shutdown checkpoint where we previously did so by
>      terminating checkpointer. Checkpointer is terminated after all other children
>      have been terminated, tracked using the new PM_SHUTDOWN_CHECKPOINTER PMState.
> 
>      In addition, the existing PM_SHUTDOWN_2 state is renamed to
>      PM_XLOG_IS_SHUTDOWN, that seems a lot more descriptive.

Sounds good.

> This patch is not fully polished, there are a few details I'm not sure about:
> 
> - Previously the shutdown checkpoint and process exit were triggered from
>    HandleCheckpointerInterrupts().  I could have continued doing so in the
>    patch, but it seemed quite weird to have a wait loop in a function called
>    HandleCheckpointerInterrupts().
> 
>    Instead I made the main loop in CheckpointerMain() terminate if checkpointer
>    was asked to write the shutdown checkpoint or exit
> 
> - In process_pm_pmsignal() the code now triggers a PostmasterStateMachine() if
>    checkpointer signalled ShutdownXLOG having completed. We didn't really have
>    a need for that before. process_pm_child_exit() does call
>    PostmasterStateMachine(), but somehow calling it in process_pm_pmsignal()
>    feels slightly off

Looks good to me.

> - I don't love the naming of the various PMState values (existing and new),
>    but a larger renaming should probably be done separately?

We currently have PM_WAIT_BACKENDS and PM_WAIT_DEAD_END. Maybe follow 
that example and name all the shutdown states after what you're waiting for:

     PM_WAIT_BACKENDS, /* waiting for live backends to exit */
     PM_WAIT_XLOG_SHUTDOWN, /* waiting for checkpointer to do shutdown 
ckpt */
     PM_WAIT_XLOG_ARCHIVAL, /* waiting for archiver and walsenders to
                               finish */
     PM_WAIT_DEAD_END,      /* waiting for dead-end children to exit */
     PM_WAIT_CHECKPOINTER,  /* waiting for checkpointer to shut down */
     PM_NO_CHILDREN,        /* all important children have exited */

> - There's logging in ShutdownXLOG() that's only triggered if called from
>    checkpointer. Seems kinda odd - shouldn't we just move that to
>    checkpointer.c?

What logging do you mean? The "LOG: shutting down" message? No 
objections to moving it, although it doesn't bother me either way.

> The first two patches are just logging improvements that I found helpful to
> write this patch:
> 
> 0001: Instead of modifying pmState directly, do so via a new function
>        UpdatePMState(), which logs the state transition at DEBUG2.
> 
>        Requires a helper to stringify PMState.
> 
>        I've written one-off versions of this patch quite a few times. Without
>        knowing in what state postmaster is in, it's quite hard to debug the
>        state machine.

+1

>     elog(DEBUG1, "updating PMState from %d/%s to %d/%s",
>          pmState, pmstate_name(pmState), newState, pmstate_name(newState));

I think the state names are enough, no need to print the numerical values.

> 0002: Make logging of postmaster signalling child processes more consistent
> 
>        I found it somewhat hard to understand what's happening during state
>        changes without being able to see the signals being sent. While we did
>        have logging in SignalChildren(), we didn't in signal_child(), and most
>        signals that are important for the shutdown sequence are sent via
>        signal_child().

+1

I don't think the cases where DEBUG2 or DEBUG4 are used currently are 
very well thought out. To save a few lines of code, I'd be happy with 
merging signal_child_ext() and signal_child() and just always use DEBUG2 
or DEBUG4. Or DEBUG3 as a compromise :-).

> The next is a small prerequisite:
> 
> 0003: postmaster: Introduce variadic btmask_all_except()
> 
>        I proposed this in another thread, where I also needed
>        btmask_all_except3() but then removed the only call to
>        btmask_all_except2(). Instead of adding/removing code, it seems better
>        to just use a variadic function.

Nice. A variadic btmask_add() might be handy too.

> And then 0004, the reason for this thread.

Overall, looks good to me.

-- 
Heikki Linnakangas
Neon (https://neon.tech)



Re: Reorder shutdown sequence, to flush pgstats later

From
Bertrand Drouvot
Date:
Hi,

On Wed, Jan 08, 2025 at 03:10:03PM +0200, Heikki Linnakangas wrote:
> On 08/01/2025 04:10, Andres Freund wrote:
> 
> >     elog(DEBUG1, "updating PMState from %d/%s to %d/%s",
> >          pmState, pmstate_name(pmState), newState, pmstate_name(newState));
> 
> I think the state names are enough, no need to print the numerical values.

+1

> > 0002: Make logging of postmaster signalling child processes more consistent
> > 
> >        I found it somewhat hard to understand what's happening during state
> >        changes without being able to see the signals being sent. While we did
> >        have logging in SignalChildren(), we didn't in signal_child(), and most
> >        signals that are important for the shutdown sequence are sent via
> >        signal_child().
> 
> +1

Random comments:

=== 1

+static const char *
+pm_signame(int signal)
+{
+#define PM_CASE(state) case state: return #state
+       switch (signal)

What about?
"
#define PM_SIGNAL_CASE(sig) case sig: return #sig
"

to better reflect its context.

=== 2

+               default:
+                       /* all signals sent by postmaster should be listed here */
+                       Assert(false);
+                       return "(unknown)";
+       }
+#undef PM_CASE
+       pg_unreachable();

Shouldn't we remove the "return "(unknown)"? (if not the pg_unreachable() looks
like dead code).

> I don't think the cases where DEBUG2 or DEBUG4 are used currently are very
> well thought out. To save a few lines of code, I'd be happy with merging
> signal_child_ext() and signal_child()

+1

> > The next is a small prerequisite:
> > 
> > 0003: postmaster: Introduce variadic btmask_all_except()
> > 
> >        I proposed this in another thread, where I also needed
> >        btmask_all_except3() but then removed the only call to
> >        btmask_all_except2(). Instead of adding/removing code, it seems better
> >        to just use a variadic function.

LGTM.

> Nice. A variadic btmask_add() might be handy too.
> 
> > And then 0004, the reason for this thread.

Did not give it a detailled review, but IIUC it now provides this flow:

PM_SHUTDOWN->PM_XLOG_IS_SHUTDOWN->PM_WAIT_DEAD_END->PM_SHUTDOWN_CHECKPOINTER->PM_NO_CHILDREN

and that looks good to me to fix the issue.

A few random comments:

=== 3

+ postmaster.c will only
+ * signal checkpointer to exit after all processes that could emit stats
+ * have been shut down.

s/postmaster.c/PostmasterStateMachine()/?

=== 4

+ * too. That allows checkpointer to perform some last bits of of

Typo "of of"

I'll give 0004 a closer look once it leaves the WIP state but want to share that
it looks like a good way to fix the issue.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: Reorder shutdown sequence, to flush pgstats later

From
Andres Freund
Date:
Hi,

On 2025-01-08 14:48:21 +0000, Bertrand Drouvot wrote:
> On Wed, Jan 08, 2025 at 03:10:03PM +0200, Heikki Linnakangas wrote:
> > On 08/01/2025 04:10, Andres Freund wrote:
> > > 0002: Make logging of postmaster signalling child processes more consistent
> > >
> > >        I found it somewhat hard to understand what's happening during state
> > >        changes without being able to see the signals being sent. While we did
> > >        have logging in SignalChildren(), we didn't in signal_child(), and most
> > >        signals that are important for the shutdown sequence are sent via
> > >        signal_child().
> >
> > +1
>
> Random comments:
>
> === 1
>
> +static const char *
> +pm_signame(int signal)
> +{
> +#define PM_CASE(state) case state: return #state
> +       switch (signal)
>
> What about?
> "
> #define PM_SIGNAL_CASE(sig) case sig: return #sig
> "
> to better reflect its context.

I went for PM_TOSTR_CASE - that way the same name can be used in
pmstate_name().


> === 2
>
> +               default:
> +                       /* all signals sent by postmaster should be listed here */
> +                       Assert(false);
> +                       return "(unknown)";
> +       }
> +#undef PM_CASE
> +       pg_unreachable();
>
> Shouldn't we remove the "return "(unknown)"? (if not the pg_unreachable() looks
> like dead code).

I don't think so - we're not listing all signals here, just ones that
postmaster is currently using. They're also typically not defined in an enum
allowing the compiler to warn about uncovered values. It seemed better to
actually return something in a production build rather than aborting.

Some compilers tend to be pretty annoying about missing return values, that's
why I added the pg_unreachable(). Perhaps I should have done a
return "" /* silence compilers */;

or such.



> > Nice. A variadic btmask_add() might be handy too.
> >
> > > And then 0004, the reason for this thread.
>
> Did not give it a detailled review, but IIUC it now provides this flow:
>
> PM_SHUTDOWN->PM_XLOG_IS_SHUTDOWN->PM_WAIT_DEAD_END->PM_SHUTDOWN_CHECKPOINTER->PM_NO_CHILDREN
>
> and that looks good to me to fix the issue.
>
> A few random comments:
>
> === 3
>
> + postmaster.c will only
> + * signal checkpointer to exit after all processes that could emit stats
> + * have been shut down.
>
> s/postmaster.c/PostmasterStateMachine()/?

I just went for 'postmaster' (without the .c) - I don't think it's really
useful to specifically reference s/postmaster.c/PostmasterStateMachine() in
checkpointer.c. But I could be swayed if you feel strongly.

> === 4
>
> + * too. That allows checkpointer to perform some last bits of of
>
> Typo "of of"

Fixed.


> I'll give 0004 a closer look once it leaves the WIP state but want to share that
> it looks like a good way to fix the issue.

Cool.

Thanks for the review,

Andres Freund



Re: Reorder shutdown sequence, to flush pgstats later

From
Bertrand Drouvot
Date:
Hi,

On Wed, Jan 08, 2025 at 02:32:24PM -0500, Andres Freund wrote:
> On 2025-01-08 14:48:21 +0000, Bertrand Drouvot wrote:
> > === 2
> >
> > +               default:
> > +                       /* all signals sent by postmaster should be listed here */
> > +                       Assert(false);
> > +                       return "(unknown)";
> > +       }
> > +#undef PM_CASE
> > +       pg_unreachable();
> >
> > Shouldn't we remove the "return "(unknown)"? (if not the pg_unreachable() looks
> > like dead code).
> 
> I don't think so - we're not listing all signals here, just ones that
> postmaster is currently using. They're also typically not defined in an enum
> allowing the compiler to warn about uncovered values.

Oh right, the parameter is an "int" not an "enum" (and anyway, as you said, we're
not listing all the signals) (did not pay attention to that). So we may need to
"help" some compilers regarding missing return values. 

> It seemed better to
> actually return something in a production build rather than aborting.

Yeah, fully agree.

> Perhaps I should have done a
> return "" /* silence compilers */;

Using pg_unreachable() is fine but a comment (like the return value you're 
suggesting above) could help.

Thanks for the explanation!

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: Reorder shutdown sequence, to flush pgstats later

From
Bertrand Drouvot
Date:
Hi,

On Wed, Jan 08, 2025 at 02:26:15PM -0500, Andres Freund wrote:
> I'm currently to plan the four patches relatively soon, unless somebody speaks
> up, of course. They seem fairly uncontroversial.  The renaming of the phases
> doesn't need to wait much longer, I think.

Thanks for the patches.

A few comments:

0001 LGTM.

0002,

=== 1

+static const char *
+pm_signame(int signal)
+{
+#define PM_TOSTR_CASE(state) case state: return #state
+   switch (signal)

s/state/signal/? (seems better in the pm_signame() context)

0003 and 0004 LGTM.

0005,

=== 2

+ PM_WAIT_XLOG_ARCHIVAL,          /* waiting for archiver and walsenders to

> I don't love PM_WAIT_XLOG_ARCHIVAL, but I can't think of anything better.

PM_WAIT_ARCHIVER_WALSENDERS maybe? (that would follow the pattern of naming
the processes like PM_WAIT_BACKENDS, PM_WAIT_CHECKPOINTER for example).

That said, I'm not 100% convinced it makes it more clear though...

> The last two (0006: trigger checkpoints via SetLatch, 0007: change the
> shutdown sequence), probably can use a few more eyes.

0006,

=== 3

+        * when it does start, with or without getting a signal.

s/getting a signal/getting a latch set/ or "getting notified"?

=== 4

+  if (checkpointerProc == INVALID_PROC_NUMBER)
   {
      if (ntries >= MAX_SIGNAL_TRIES || !(flags & CHECKPOINT_WAIT))
      {
          elog((flags & CHECKPOINT_WAIT) ? ERROR : LOG,
-               "could not signal for checkpoint: checkpointer is not running");
+                "could not notify checkpoint: checkpointer is not running");

Worth renaming MAX_SIGNAL_TRIES with MAX_NOTIFY_TRIES then?

0007,

=== 5

+       pqsignal(SIGINT, ReqShutdownXLOG);

Worth a comment like:

pqsignal(SIGINT, ReqShutdownXLOG);  /* trigger shutdown checkpoint */


=== 6

+ * Wait until we're asked to shut down. By seperating the writing of the

Typo: s/seperating/separating/

I don't really anything else in 0006 and 0007 but as you said it's probably
worth a few more eyes as the code change is pretty substantial.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: Reorder shutdown sequence, to flush pgstats later

From
Nazir Bilal Yavuz
Date:
Hi,

Thanks for working on this!

On Wed, 8 Jan 2025 at 22:26, Andres Freund <andres@anarazel.de> wrote:
>
> I'm currently to plan the four patches relatively soon, unless somebody speaks
> up, of course. They seem fairly uncontroversial.  The renaming of the phases
> doesn't need to wait much longer, I think.
>
> The last two (0006: trigger checkpoints via SetLatch, 0007: change the
> shutdown sequence), probably can use a few more eyes.

Some minor comments:

=== 0001

LGTM.

=== 0002

+    }
+#undef PM_TOSTR_CASE
+    pg_unreachable();
+}

Maybe a blank line after '#undef PM_TOSTR_CASE' (or no blank line at
the end of the pmstate_name() at 0001)?

+    ereport(DEBUG3,
+            (errmsg_internal("sending signal %d/%s to %s process %d",

I am not sure if that makes sense but with the addition of the backend
type here, I think 'sending signal %d/%s to %s process with the pid of
%d' would be clearer.

=== 0003

LGTM.

=== 0004

LGTM.

=== 0005

> I think this is much better than before. I don't love PM_WAIT_XLOG_ARCHIVAL,
> but I can't think of anything better.

I liked this, I think it is good enough.

-    PM_SHUTDOWN,                /* waiting for checkpointer to do shutdown
+    PM_WAIT_XLOG_SHUTDOWN,        /* waiting for checkpointer to do shutdown
                                  * ckpt */

There are couple of variables and functions which include pm_shutdown
in their names:
    pending_pm_shutdown_request
    handle_pm_shutdown_request_signal()
    process_pm_shutdown_request()

Do you think these need to be updated as well?

=== 0006

Please see below.

=== 0007

-                 * told to shut down.  We expect that it wrote a shutdown
-                 * checkpoint.  (If for some reason it didn't, recovery will
-                 * occur on next postmaster start.)
+                 * told to shut down.  We know it wrote a shutdown checkpoint,
+                 * otherwise we'd still be in PM_SHUTDOWN.

s/PM_SHUTDOWN/PM_WAIT_XLOG_SHUTDOWN/

I have these comments for now. I need to study 0006 and 0007 more.

--
Regards,
Nazir Bilal Yavuz
Microsoft