Re: Crash during backend start when low on memory - Mailing list pgsql-bugs

From Alvaro Herrera
Subject Re: Crash during backend start when low on memory
Date
Msg-id 20230113115352.sbuv5egpjyoxn52i@alvherre.pgsql
Whole thread Raw
In response to Re: Crash during backend start when low on memory  (Mats Kindahl <mats@timescale.com>)
Responses Re: Crash during backend start when low on memory  (Tom Lane <tgl@sss.pgh.pa.us>)
Re: Crash during backend start when low on memory  (Mats Kindahl <mats@timescale.com>)
Re: Crash during backend start when low on memory  (Mats Kindahl <mats@timescale.com>)
List pgsql-bugs
On 2023-Jan-13, Mats Kindahl wrote:

> I have an improved patch based on raising an error when malloc fails. I
> used the version of mallocfail that is available at
> https://github.com/ralight/mallocfail which allows you to run the program
> with incrementally failing memory allocations but since the count is around
> 6886 or so when the postmaster starts to allow backend starts and other
> processes are starting in between, this count is not always accurate and
> something "smarter" is needed to get a reliable failure test.

Hmm, but I'm not sure I like what you did to BackendStartup:

@@ -4054,14 +4073,7 @@ BackendStartup(Port *port)
      * Create backend data structure.  Better before the fork() so we can
      * handle failure cleanly.
      */
-    bn = (Backend *) malloc(sizeof(Backend));
-    if (!bn)
-    {
-        ereport(LOG,
-                (errcode(ERRCODE_OUT_OF_MEMORY),
-                 errmsg("out of memory")));
-        return STATUS_ERROR;
-    }
+    CHECKED_CALL(bn = malloc(sizeof(Backend)));

With this code, now we will have postmaster raise ERROR if this malloc
fails, rather than returning STATUS_ERROR as previously.  How does
postmaster react to this?  I think it won't go very well.

(Stylistically, I'd rather have the test and corresponding reaction at
each call site rather than using a macro to hide it.  I especially don't
like the fact that the errdetail you propose would report a line of C
code to the user, rather than an intelligible explanation of what that
code is trying to do.)

> It is quite straightforward to test it by attaching a debugger and setting
> the return value of malloc, but to be able to have some "smarter" failures
> only when, for example, starting a backend it would be necessary to have
> something better integrated with the postgres code. Being able to attach a
> uprobe using bpftrace and override the return value would be the most
> elegant solution, but that is currently not possible, so considering
> whether to add something like the mallocfail above to the regular testing.

Hmm, I'm not sure that this type of testing is something to do on an
ongoing basis.

> Note: there are a bunch of other core dumps that occur during startup
> before the postmaster is ready to accept connections, but I think that is
> less of a problem since it fails during startup before the system is ready
> to accept connections.

That sounds reasonable.  The service watcher should restart postmaster
appropriately in that case (be it old SysV init, systemd, your container
management layer, or whatever.)

-- 
Álvaro Herrera         PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"Ellos andaban todos desnudos como su madre los parió, y también las mujeres,
aunque no vi más que una, harto moza, y todos los que yo vi eran todos
mancebos, que ninguno vi de edad de más de XXX años" (Cristóbal Colón)



pgsql-bugs by date:

Previous
From: "wangw.fnst@fujitsu.com"
Date:
Subject: RE: Logical Replica ReorderBuffer Size Accounting Issues
Next
From: Alex Richman
Date:
Subject: Re: Logical Replica ReorderBuffer Size Accounting Issues