Re: Crash during backend start when low on memory - Mailing list pgsql-bugs

From Mats Kindahl
Subject Re: Crash during backend start when low on memory
Date
Msg-id CA+14424mP36yJ=tUmJ-EBiQqUTc+7g6p_1+gBFymGoFxc-Lsow@mail.gmail.com
Whole thread Raw
In response to Re: Crash during backend start when low on memory  (Mats Kindahl <mats@timescale.com>)
Responses Re: Crash during backend start when low on memory  (Alvaro Herrera <alvherre@alvh.no-ip.org>)
List pgsql-bugs


On Fri, Dec 16, 2022 at 9:46 AM Mats Kindahl <mats@timescale.com> wrote:
On Thu, Dec 15, 2022 at 6:20 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
On 2022-Dec-14, Mats Kindahl wrote:

> PostmasterContext is set quite early in the procedure so it should not be a
> problem to use it. I think we can use PostmasterContext rather than
> TopMemoryContext. The memory for the allocations are passed down to
> PostmasterMain() and used inside InitPostgres() but as far as I can tell,
> they are not used after the InitPostgres() call. The PostmasterContex is
> destroyed right after the call to InitPostgres() inside PostmasterMain()
> and then the rest is allocated in other contexts.

Well, this is what PostgresMain has to say about it:

        /*
         * If the PostmasterContext is still around, recycle the space; we don't
         * need it anymore after InitPostgres completes.  Note this does not trash
         * *MyProcPort, because ConnCreate() allocated that space with malloc()
         * ... else we'd need to copy the Port data first.  Also, subsidiary data
         * such as the username isn't lost either; see ProcessStartupPacket().
         */
        if (PostmasterContext)
        {
                MemoryContextDelete(PostmasterContext);


Ugh. Ok. Missed that comment.
 
So I think blowing away part of struct Port independently of the
containing struct is a bad idea.  Why isn't more appropriate to do what
ConnCreate does?

        if (!(port = (Port *) calloc(1, sizeof(Port))))
        {
                ereport(LOG,
                                (errcode(ERRCODE_OUT_OF_MEMORY),
                                 errmsg("out of memory")));
                ExitPostmaster(1);
        }

... ugh.  I have to admit that killing postmaster due to a transient
out-of-memory makes me a bit nervous -- do note that this
ExitPostmaster() happens before we've forked the backend, so this code
does bring the server down (!!).  If we want to make postmaster more
robust in the face of OOM, we should also change this code somehow, yes?

Seems reasonable, yes. I'll see what I can throw together.
 

I agree with Mats' premise when starting the thread -- even if the
server is not in great shape, it doesn't seem like taking the whole
service down is a great response.  It would be better to just refuse the
connection with some error and let existing connections chug along,
poorly though they can.  Maybe they'll also die soon, freeing memory;
but that'll be no good if Postmaster is gone altogether.

Now, it would be better if we could test this more thoroughly.  Perhaps
it's possible to use something like the mallocfail.c lib that Heikki
wrote a few years ago, teaching it to handle calloc
https://postgr.es/m/547480DE.4040408@vmware.com

 Thanks. Will take a look at that.

I have an improved patch based on raising an error when malloc fails. I used the version of mallocfail that is available at https://github.com/ralight/mallocfail which allows you to run the program with incrementally failing memory allocations but since the count is around 6886 or so when the postmaster starts to allow backend starts and other processes are starting in between, this count is not always accurate and something "smarter" is needed to get a reliable failure test.

It is quite straightforward to test it by attaching a debugger and setting the return value of malloc, but to be able to have some "smarter" failures only when, for example, starting a backend it would be necessary to have something better integrated with the postgres code. Being able to attach a uprobe using bpftrace and override the return value would be the most elegant solution, but that is currently not possible, so considering whether to add something like the mallocfail above to the regular testing.

Note: there are a bunch of other core dumps that occur during startup before the postmaster is ready to accept connections, but I think that is less of a problem since it fails during startup before the system is ready to accept connections.

Best wishes,
Mats Kindahl, Timescale




--
Álvaro Herrera        Breisgau, Deutschland  —  https://www.EnterpriseDB.com/
"Oh, great altar of passive entertainment, bestow upon me thy discordant images
at such speed as to render linear thought impossible" (Calvin a la TV)
Attachment

pgsql-bugs by date:

Previous
From: Kyotaro Horiguchi
Date:
Subject: Re: BUG #17744: Fail Assert while recoverying from pg_basebackup
Next
From: "wangw.fnst@fujitsu.com"
Date:
Subject: RE: Logical Replica ReorderBuffer Size Accounting Issues