Re: recent failures on lorikeet - Mailing list pgsql-hackers

From Tom Lane
Subject Re: recent failures on lorikeet
Msg-id 241120.1623691123@sss.pgh.pa.us
In response to Re: recent failures on lorikeet  (Andrew Dunstan <andrew@dunslane.net>)
List pgsql-hackers
Andrew Dunstan <andrew@dunslane.net> writes:
> The line in lmgr.c is where the process title gets changed to "waiting".
> I recently stopped setting process title on this animal on REL_13_STABLE
> and its similar errors have largely gone away.

Oooh, that certainly seems like a smoking gun.

> I can do the same on
> HEAD. But it does make me wonder what the heck has changed to make this
> code fragile.

So what we've got there is

        old_status = get_ps_display(&len);
        new_status = (char *) palloc(len + 8 + 1);
        memcpy(new_status, old_status, len);
        strcpy(new_status + len, " waiting");
        set_ps_display(new_status);
        new_status[len] = '\0'; /* truncate off " waiting" */

Line 1831 is the strcpy, but it seems entirely impossible that that
could fail, unless palloc has shirked its job.  I'm thinking that
the crash is really in the memcpy --- looking at the other lines
in your trace, it seems to commonly finger the line after the actual call.

What that'd have to imply is that get_ps_display() messed up,
returning a bad pointer or a bad length.
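
To make that failure mode concrete, here is a minimal standalone sketch,
not PostgreSQL code.  It assumes a hypothetical fake_get_ps_display()
with the same shape as get_ps_display(), uses malloc in place of palloc,
and adds a sanity check (with an assumed max_len bound) that the real
fragment does not have.  The point is only that memcpy trusts both the
pointer and the length: a NULL pointer or a negative length (which
becomes a huge size_t) would crash there, not in the strcpy.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the process-title buffer; not ps_status.c. */
static char fake_ps_buffer[64] = "postgres: demo SELECT";

/* Same shape as get_ps_display(): returns a pointer and sets *displen. */
static const char *
fake_get_ps_display(int *displen)
{
    *displen = (int) strlen(fake_ps_buffer);
    return fake_ps_buffer;
}

/*
 * Build "<old status> waiting" the way the quoted fragment does, but
 * refuse to copy when the pointer or length looks bogus.  Returns a
 * malloc'd string, or NULL if the inputs fail the check.  max_len is an
 * assumed upper bound supplied by the caller.
 */
static char *
append_waiting(const char *old_status, int len, int max_len)
{
    static const char suffix[] = " waiting";
    char       *new_status;

    /* Matches the "len + 8 + 1" allocation in the quoted fragment. */
    assert(sizeof(suffix) == 8 + 1);

    /* Without this check, a bad pointer or length would blow up in memcpy. */
    if (old_status == NULL || len < 0 || len > max_len)
        return NULL;

    new_status = malloc((size_t) len + sizeof(suffix));
    if (new_status == NULL)
        return NULL;

    memcpy(new_status, old_status, (size_t) len);
    strcpy(new_status + len, suffix);   /* cannot overrun: space was reserved */
    return new_status;
}
```

Calling append_waiting() with the pointer and length returned by
fake_get_ps_display() yields the title with " waiting" appended, while a
NULL pointer or an out-of-range length is rejected before the memcpy.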

A platform-specific problem in get_ps_display() seems plausible
enough.  The apparent connection to a concurrent VACUUM FULL seems
pretty hard to explain that way ... but maybe that's a mirage.

            regards, tom lane


