Thread: Non-robustness in pmsignal.c

Non-robustness in pmsignal.c

From
Tom Lane
Date:
As I mentioned in another thread, I came across a reproducible
situation in which a memory clobber in a child backend crashes
the postmaster too, at least on FreeBSD/arm64.  Needless to say,
this is Not Cool.  I've now traced down what is happening,
and it's this:

1. Careless coding in aset.c causes it to decide to wipe_mem
the universe.  (I'll have more to say about that separately;
the point of this thread is keeping the postmaster alive
afterwards.)  Apparently, there's not any non-live memory
space between process-local memory and shared memory on this
platform, so the failing backend manages to trash shared memory
too before it finally hits SIGSEGV.

2. Most of the background processes die on something like

TRAP: FailedAssertion("latch->owner_pid == MyProcPid", File: "latch.c", Line: 686, PID: 5916)

or they encounter what seems to be a stuck spinlock.  The postmaster,
however, SIGSEGVs.  It's not supposed to do that; it is supposed to
be sufficiently arms-length from shared memory that it can recover
despite a backend trashing shared memory contents.

3. The cause of the SIGSEGV is that AssignPostmasterChildSlot
naively believes that it can trust PMSignalState->next_child_flag
to be a valid array index, so after that's been clobbered with
something like 0x7f7f7f7f we index off the end of memory.
I see no good reason for that state variable to be in shared memory
at all, so the attached patch just moves it to postmaster static
data.  We also need a less-exposed copy of the array size variable.

4. That's enough to stop the SIGSEGV crash, but the postmaster
still fails to recover, because then it hits

    elog(FATAL, "no free slots in PMChildFlags array");

since all of the array entries have been clobbered as well.
In the attached patch, I fixed this by treating the case similarly
to failure to fork a new child process.  This seems to be enough
to let the postmaster survive, and recover after it starts noticing
crashing children.

5. It's possible that we should take some proactive steps to get out
of the "no free slots" situation, rather than just wait for some
child to crash.  I'm inclined not to, however.  It'd be hard-to-test
corner-case code, and given the lack of field reports like this,
the situation must be awfully rare.

Thoughts?

            regards, tom lane

diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 383bc4776e..4a50b24b57 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -4173,7 +4173,17 @@ BackendStartup(Port *port)
      * Unless it's a dead_end child, assign it a child slot number
      */
     if (!bn->dead_end)
+    {
         bn->child_slot = MyPMChildSlot = AssignPostmasterChildSlot();
+        if (bn->child_slot == 0)
+        {
+            /* No child slots ... treat like fork failure */
+            free(bn);
+            /* error is already logged, but send something to the client */
+            report_fork_failure_to_client(port, ENOMEM);
+            return STATUS_ERROR;
+        }
+    }
     else
         bn->child_slot = 0;

@@ -5509,23 +5519,27 @@ StartAutovacuumWorker(void)
             bn->child_slot = MyPMChildSlot = AssignPostmasterChildSlot();
             bn->bgworker_notify = false;

-            bn->pid = StartAutoVacWorker();
-            if (bn->pid > 0)
+            /* if we fail to get a child slot, treat it like fork failure */
+            if (bn->child_slot > 0)
             {
-                bn->bkend_type = BACKEND_TYPE_AUTOVAC;
-                dlist_push_head(&BackendList, &bn->elem);
+                bn->pid = StartAutoVacWorker();
+                if (bn->pid > 0)
+                {
+                    bn->bkend_type = BACKEND_TYPE_AUTOVAC;
+                    dlist_push_head(&BackendList, &bn->elem);
 #ifdef EXEC_BACKEND
-                ShmemBackendArrayAdd(bn);
+                    ShmemBackendArrayAdd(bn);
 #endif
-                /* all OK */
-                return;
-            }
+                    /* all OK */
+                    return;
+                }

-            /*
-             * fork failed, fall through to report -- actual error message was
-             * logged by StartAutoVacWorker
-             */
-            (void) ReleasePostmasterChildSlot(bn->child_slot);
+                /*
+                 * fork failed, fall through to report -- actual error message
+                 * was logged by StartAutoVacWorker
+                 */
+                (void) ReleasePostmasterChildSlot(bn->child_slot);
+            }
             free(bn);
         }
         else
@@ -5908,6 +5922,12 @@ assign_backendlist_entry(RegisteredBgWorker *rw)

     bn->cancel_key = MyCancelKey;
     bn->child_slot = MyPMChildSlot = AssignPostmasterChildSlot();
+    if (bn->child_slot == 0)
+    {
+        /* couldn't get a slot; failure is already logged */
+        free(bn);
+        return false;
+    }
     bn->bkend_type = BACKEND_TYPE_BGWORKER;
     bn->dead_end = false;
     bn->bgworker_notify = false;
diff --git a/src/backend/storage/ipc/pmsignal.c b/src/backend/storage/ipc/pmsignal.c
index 3f0ec5e6b8..51776779f4 100644
--- a/src/backend/storage/ipc/pmsignal.c
+++ b/src/backend/storage/ipc/pmsignal.c
@@ -75,12 +75,21 @@ struct PMSignalData
     QuitSignalReason sigquit_reason;    /* why SIGQUIT was sent */
     /* per-child-process flags */
     int            num_child_flags;    /* # of entries in PMChildFlags[] */
-    int            next_child_flag;    /* next slot to try to assign */
     sig_atomic_t PMChildFlags[FLEXIBLE_ARRAY_MEMBER];
 };

+/* PMSignalState pointer is valid in both postmaster and child processes */
 NON_EXEC_STATIC volatile PMSignalData *PMSignalState = NULL;

+/*
+ * These static variables are valid only in the postmaster.  The postmaster
+ * should always use num_child_flags, not the copy in shared memory, so
+ * that it will not go insane if some failing child clobbers the copy
+ * in shared memory.
+ */
+static int    num_child_flags;    /* # of entries in PMChildFlags[] */
+static int    next_child_flag;    /* next slot to try to assign */
+
 /*
  * Signal handler to be notified if postmaster dies.
  */
@@ -142,7 +151,8 @@ PMSignalShmemInit(void)
     {
         /* initialize all flags to zeroes */
         MemSet(unvolatize(PMSignalData *, PMSignalState), 0, PMSignalShmemSize());
-        PMSignalState->num_child_flags = MaxLivePostmasterChildren();
+        num_child_flags = MaxLivePostmasterChildren();
+        PMSignalState->num_child_flags = num_child_flags;
     }
 }

@@ -210,7 +220,7 @@ GetQuitSignalReason(void)
 /*
  * AssignPostmasterChildSlot - select an unused slot for a new postmaster
  * child process, and set its state to ASSIGNED.  Returns a slot number
- * (one to N).
+ * (one to N).  On failure, log a message and return zero.
  *
  * Only the postmaster is allowed to execute this routine, so we need no
  * special locking.
@@ -218,28 +228,31 @@ GetQuitSignalReason(void)
 int
 AssignPostmasterChildSlot(void)
 {
-    int            slot = PMSignalState->next_child_flag;
+    int            slot = next_child_flag;
     int            n;

     /*
      * Scan for a free slot.  We track the last slot assigned so as not to
      * waste time repeatedly rescanning low-numbered slots.
      */
-    for (n = PMSignalState->num_child_flags; n > 0; n--)
+    for (n = num_child_flags; n > 0; n--)
     {
         if (--slot < 0)
-            slot = PMSignalState->num_child_flags - 1;
+            slot = num_child_flags - 1;
         if (PMSignalState->PMChildFlags[slot] == PM_CHILD_UNUSED)
         {
             PMSignalState->PMChildFlags[slot] = PM_CHILD_ASSIGNED;
-            PMSignalState->next_child_flag = slot;
+            next_child_flag = slot;
             return slot + 1;
         }
     }

-    /* Out of slots ... should never happen, else postmaster.c messed up */
-    elog(FATAL, "no free slots in PMChildFlags array");
-    return 0;                    /* keep compiler quiet */
+    /*
+     * Out of slots.  Should never happen if postmaster.c counts children
+     * accurately, but conceivably a failing child has clobbered the array.
+     */
+    elog(LOG, "no free slots in PMChildFlags array");
+    return 0;
 }

 /*
@@ -254,7 +267,7 @@ ReleasePostmasterChildSlot(int slot)
 {
     bool        result;

-    Assert(slot > 0 && slot <= PMSignalState->num_child_flags);
+    Assert(slot > 0 && slot <= num_child_flags);
     slot--;

     /*
@@ -269,12 +282,12 @@ ReleasePostmasterChildSlot(int slot)

 /*
  * IsPostmasterChildWalSender - check if given slot is in use by a
- * walsender process.
+ * walsender process.  This is called only by the postmaster.
  */
 bool
 IsPostmasterChildWalSender(int slot)
 {
-    Assert(slot > 0 && slot <= PMSignalState->num_child_flags);
+    Assert(slot > 0 && slot <= num_child_flags);
     slot--;

     if (PMSignalState->PMChildFlags[slot] == PM_CHILD_WALSENDER)

Re: Non-robustness in pmsignal.c

From
Andres Freund
Date:
Hi,

On 2022-10-07 19:57:35 -0400, Tom Lane wrote:
> As I mentioned in another thread, I came across a reproducible
> situation in which a memory clobber in a child backend crashes
> the postmaster too, at least on FreeBSD/arm64.  Needless to say,
> this is Not Cool.

Ugh.


> I've now traced down what is happening, and it's this:
> 
> 1. Careless coding in aset.c causes it to decide to wipe_mem
> the universe.  (I'll have more to say about that separately;
> the point of this thread is keeping the postmaster alive
> afterwards.)  Apparently, there's not any non-live memory
> space between process-local memory and shared memory on this
> platform, so the failing backend manages to trash shared memory
> too before it finally hits SIGSEGV.

Perhaps it'd be worth mark a page or two inaccessible, via
mprotect(PROT_NONE), at the start and end of shared memory. I've wondered
about a debugging mode where we do that after separate shared memory
allocations even. But start/end would be something we could conceivably always
enable.


> 2. Most of the background processes die on something like
> 
> TRAP: FailedAssertion("latch->owner_pid == MyProcPid", File: "latch.c", Line: 686, PID: 5916)
> 
> or they encounter what seems to be a stuck spinlock.  The postmaster,
> however, SIGSEGVs.  It's not supposed to do that; it is supposed to
> be sufficiently arms-length from shared memory that it can recover
> despite a backend trashing shared memory contents.

> 3. The cause of the SIGSEGV is that AssignPostmasterChildSlot
> naively believes that it can trust PMSignalState->next_child_flag
> to be a valid array index, so after that's been clobbered with
> something like 0x7f7f7f7f we index off the end of memory.
> I see no good reason for that state variable to be in shared memory
> at all, so the attached patch just moves it to postmaster static
> data.  We also need a less-exposed copy of the array size variable.

Those make sense to me.


> 4. That's enough to stop the SIGSEGV crash, but the postmaster
> still fails to recover, because then it hits
> 
>     elog(FATAL, "no free slots in PMChildFlags array");
> 
> since all of the array entries have been clobbered as well.
> In the attached patch, I fixed this by treating the case similarly
> to failure to fork a new child process.  This seems to be enough
> to let the postmaster survive, and recover after it starts noticing
> crashing children.

Why are we even tracking PM_CHILD_UNUSED / PM_CHILD_ASSIGNED in shared memory?
ISTM those should live in postmaster local memory (maybe copied to shared
memory). PM_CHILD_ACTIVE and PM_CHILD_WALSENDER do have to live in shared
memory, but ...

Your fix seems ok. We really ought to deduplicate the way we start postmaster
children, but that's obviously work for another day.


> 5. It's possible that we should take some proactive steps to get out
> of the "no free slots" situation, rather than just wait for some
> child to crash.  I'm inclined not to, however.  It'd be hard-to-test
> corner-case code, and given the lack of field reports like this,
> the situation must be awfully rare.

Agreed.


Are you thinking these should be backpatched?

Greetings,

Andres Freund



Re: Non-robustness in pmsignal.c

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> Why are we even tracking PM_CHILD_UNUSED / PM_CHILD_ASSIGNED in shared memory?

Because those flags are set by the child processes too, cf
MarkPostmasterChildActive and MarkPostmasterChildInactive.

> Are you thinking these should be backpatched?

I am, but I'm not inclined to push this immediately before a wrap.
If we intend to wrap 15.0 on Monday then I'll wait till after that.
OTOH, if we slip that a week, I'd be okay with pushing in the
next day or two.

            regards, tom lane



Re: Non-robustness in pmsignal.c

From
Andres Freund
Date:
Hi,

On 2022-10-07 20:35:58 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > Why are we even tracking PM_CHILD_UNUSED / PM_CHILD_ASSIGNED in shared memory?
> 
> Because those flags are set by the child processes too, cf
> MarkPostmasterChildActive and MarkPostmasterChildInactive.

Only PM_CHILD_ACTIVE and PM_CHILD_WALSENDER though. We could afford another
MaxLivePostmasterChildren() sized array...


> > Are you thinking these should be backpatched?
> 
> I am, but I'm not inclined to push this immediately before a wrap.

+1


> If we intend to wrap 15.0 on Monday then I'll wait till after that.
> OTOH, if we slip that a week, I'd be okay with pushing in the
> next day or two.

Makes sense.

- Andres



Re: Non-robustness in pmsignal.c

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> On 2022-10-07 20:35:58 -0400, Tom Lane wrote:
>> Andres Freund <andres@anarazel.de> writes:
>>> Why are we even tracking PM_CHILD_UNUSED / PM_CHILD_ASSIGNED in shared memory?

>> Because those flags are set by the child processes too, cf
>> MarkPostmasterChildActive and MarkPostmasterChildInactive.

> Only PM_CHILD_ACTIVE and PM_CHILD_WALSENDER though. We could afford another
> MaxLivePostmasterChildren() sized array...

Oh, I see what you mean --- one private and one public array.
Maybe that makes more sense than what I did, not sure.

>> I am, but I'm not inclined to push this immediately before a wrap.

> +1

OK, I'll take a little more time on this and maybe code it up as
you suggest.

            regards, tom lane



Re: Non-robustness in pmsignal.c

From
Tom Lane
Date:
I wrote:
> Andres Freund <andres@anarazel.de> writes:
>> Only PM_CHILD_ACTIVE and PM_CHILD_WALSENDER though. We could afford another
>> MaxLivePostmasterChildren() sized array...

> Oh, I see what you mean --- one private and one public array.
> Maybe that makes more sense than what I did, not sure.

Yeah, that's definitely a better way.  I'll push this after the
release freeze lifts.

            regards, tom lane

diff --git a/src/backend/storage/ipc/pmsignal.c b/src/backend/storage/ipc/pmsignal.c
index 3f0ec5e6b8..c85521d364 100644
--- a/src/backend/storage/ipc/pmsignal.c
+++ b/src/backend/storage/ipc/pmsignal.c
@@ -26,6 +26,7 @@
 #include "replication/walsender.h"
 #include "storage/pmsignal.h"
 #include "storage/shmem.h"
+#include "utils/memutils.h"


 /*
@@ -75,12 +76,21 @@ struct PMSignalData
     QuitSignalReason sigquit_reason;    /* why SIGQUIT was sent */
     /* per-child-process flags */
     int            num_child_flags;    /* # of entries in PMChildFlags[] */
-    int            next_child_flag;    /* next slot to try to assign */
     sig_atomic_t PMChildFlags[FLEXIBLE_ARRAY_MEMBER];
 };

+/* PMSignalState pointer is valid in both postmaster and child processes */
 NON_EXEC_STATIC volatile PMSignalData *PMSignalState = NULL;

+/*
+ * These static variables are valid only in the postmaster.  We keep a
+ * duplicative private array so that we can trust its state even if some
+ * failing child has clobbered the PMSignalData struct in shared memory.
+ */
+static int    num_child_inuse;    /* # of entries in PMChildInUse[] */
+static int    next_child_inuse;    /* next slot to try to assign */
+static bool *PMChildInUse;        /* true if i'th flag slot is assigned */
+
 /*
  * Signal handler to be notified if postmaster dies.
  */
@@ -142,7 +152,25 @@ PMSignalShmemInit(void)
     {
         /* initialize all flags to zeroes */
         MemSet(unvolatize(PMSignalData *, PMSignalState), 0, PMSignalShmemSize());
-        PMSignalState->num_child_flags = MaxLivePostmasterChildren();
+        num_child_inuse = MaxLivePostmasterChildren();
+        PMSignalState->num_child_flags = num_child_inuse;
+
+        /*
+         * Also allocate postmaster's private PMChildInUse[] array.  We
+         * might've already done that in a previous shared-memory creation
+         * cycle, in which case free the old array to avoid a leak.  (Do it
+         * like this to support the possibility that MaxLivePostmasterChildren
+         * changed.)  In a standalone backend, we do not need this.
+         */
+        if (PostmasterContext != NULL)
+        {
+            if (PMChildInUse)
+                pfree(PMChildInUse);
+            PMChildInUse = (bool *)
+                MemoryContextAllocZero(PostmasterContext,
+                                       num_child_inuse * sizeof(bool));
+        }
+        next_child_inuse = 0;
     }
 }

@@ -218,21 +246,24 @@ GetQuitSignalReason(void)
 int
 AssignPostmasterChildSlot(void)
 {
-    int            slot = PMSignalState->next_child_flag;
+    int            slot = next_child_inuse;
     int            n;

     /*
-     * Scan for a free slot.  We track the last slot assigned so as not to
-     * waste time repeatedly rescanning low-numbered slots.
+     * Scan for a free slot.  Notice that we trust nothing about the contents
+     * of PMSignalState, but use only postmaster-local data for this decision.
+     * We track the last slot assigned so as not to waste time repeatedly
+     * rescanning low-numbered slots.
      */
-    for (n = PMSignalState->num_child_flags; n > 0; n--)
+    for (n = num_child_inuse; n > 0; n--)
     {
         if (--slot < 0)
-            slot = PMSignalState->num_child_flags - 1;
-        if (PMSignalState->PMChildFlags[slot] == PM_CHILD_UNUSED)
+            slot = num_child_inuse - 1;
+        if (!PMChildInUse[slot])
         {
+            PMChildInUse[slot] = true;
             PMSignalState->PMChildFlags[slot] = PM_CHILD_ASSIGNED;
-            PMSignalState->next_child_flag = slot;
+            next_child_inuse = slot;
             return slot + 1;
         }
     }
@@ -254,7 +285,7 @@ ReleasePostmasterChildSlot(int slot)
 {
     bool        result;

-    Assert(slot > 0 && slot <= PMSignalState->num_child_flags);
+    Assert(slot > 0 && slot <= num_child_inuse);
     slot--;

     /*
@@ -264,17 +295,18 @@ ReleasePostmasterChildSlot(int slot)
      */
     result = (PMSignalState->PMChildFlags[slot] == PM_CHILD_ASSIGNED);
     PMSignalState->PMChildFlags[slot] = PM_CHILD_UNUSED;
+    PMChildInUse[slot] = false;
     return result;
 }

 /*
  * IsPostmasterChildWalSender - check if given slot is in use by a
- * walsender process.
+ * walsender process.  This is called only by the postmaster.
  */
 bool
 IsPostmasterChildWalSender(int slot)
 {
-    Assert(slot > 0 && slot <= PMSignalState->num_child_flags);
+    Assert(slot > 0 && slot <= num_child_inuse);
     slot--;

     if (PMSignalState->PMChildFlags[slot] == PM_CHILD_WALSENDER)

Re: Non-robustness in pmsignal.c

From
Andres Freund
Date:
Hi,

On 2022-10-08 13:15:07 -0400, Tom Lane wrote:
> I wrote:
> > Andres Freund <andres@anarazel.de> writes:
> >> Only PM_CHILD_ACTIVE and PM_CHILD_WALSENDER though. We could afford another
> >> MaxLivePostmasterChildren() sized array...
> 
> > Oh, I see what you mean --- one private and one public array.
> > Maybe that makes more sense than what I did, not sure.
> 
> Yeah, that's definitely a better way.  I'll push this after the
> release freeze lifts.

Cool, thanks for exploring.


>  /*
>   * Signal handler to be notified if postmaster dies.
>   */
> @@ -142,7 +152,25 @@ PMSignalShmemInit(void)
>      {
>          /* initialize all flags to zeroes */
>          MemSet(unvolatize(PMSignalData *, PMSignalState), 0, PMSignalShmemSize());
> -        PMSignalState->num_child_flags = MaxLivePostmasterChildren();
> +        num_child_inuse = MaxLivePostmasterChildren();
> +        PMSignalState->num_child_flags = num_child_inuse;
> +
> +        /*
> +         * Also allocate postmaster's private PMChildInUse[] array.  We
> +         * might've already done that in a previous shared-memory creation
> +         * cycle, in which case free the old array to avoid a leak.  (Do it
> +         * like this to support the possibility that MaxLivePostmasterChildren
> +         * changed.)  In a standalone backend, we do not need this.
> +         */
> +        if (PostmasterContext != NULL)
> +        {
> +            if (PMChildInUse)
> +                pfree(PMChildInUse);
> +            PMChildInUse = (bool *)
> +                MemoryContextAllocZero(PostmasterContext,
> +                                       num_child_inuse * sizeof(bool));
> +        }
> +        next_child_inuse = 0;
>      }
>  }

When can PostmasterContext be NULL here, and why can we just continue without
(re-)allocating PMChildInUse?

Greetings,

Andres Freund



Re: Non-robustness in pmsignal.c

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> When can PostmasterContext be NULL here, and why can we just continue without
> (re-)allocating PMChildInUse?

We'd only get into the !found stanza in a postmaster or a
standalone backend.  A standalone backend isn't ever going to call
AssignPostmasterChildSlot or ReleasePostmasterChildSlot, so it
does not need the array; and it also doesn't have a PostmasterContext,
so there's not a good place to allocate the array either.

Perhaps there's a better way to distinguish am-I-a-postmaster,
but I thought checking PostmasterContext is fine since that ties
directly to what the code needs to do.

Yes, the code would malfunction if the PostmasterContext != NULL
condition changed from one cycle to the next, but that shouldn't
happen.

            regards, tom lane