Thread: recent failures on lorikeet
I've been looking at the recent spate of intermittent failures on my Cygwin animal lorikeet. Most of them look something like this, where there's 'VACUUM FULL pg_class' and an almost simultaneous "CREATE TABLE' which fails. 2021-06-14 05:04:00.220 EDT [60c71b7f.e8bf:60] pg_regress/vacuum LOG: statement: VACUUM FULL pg_class; 2021-06-14 05:04:00.222 EDT [60c71b80.e8c0:7] pg_regress/typed_table LOG: statement: CREATE TABLE persons OF person_type; 2021-06-14 05:04:00.232 EDT [60c71b80.e8c1:3] pg_regress/inherit LOG: statement: CREATE TABLE a (aa TEXT); *** starting debugger for pid 59584, tid 9640 2021-06-14 05:04:14.678 EDT [60c71b53.e780:4] LOG: server process (PID 59584) exited with exit code 127 2021-06-14 05:04:14.678 EDT [60c71b53.e780:5] DETAIL: Failed process was running: CREATE TABLE persons OF person_type; 2021-06-14 05:04:14.678 EDT [60c71b53.e780:6] LOG: terminating any other active server processes Getting stack traces in this platform can be very difficult. I'm going to try forcing complete serialization of the regression tests (MAX_CONNECTIONS=1) to see if the problem goes away. Any other suggestions might be useful. Note that we're not getting the same issue on REL_13_STABLE, where the same group pf tests run together (inherit typed_table, vacuum) cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
Andrew Dunstan <andrew@dunslane.net> writes:
> I've been looking at the recent spate of intermittent failures on my
> Cygwin animal lorikeet. Most of them look something like this, where
> there's a 'VACUUM FULL pg_class' and an almost simultaneous 'CREATE
> TABLE' which fails.

Do you have any idea what "exit code 127" signifies on that platform?
(BTW, not all of them look like that; many are reported as plain
segfaults.)  I hadn't spotted the association with a concurrent
"VACUUM FULL pg_class" before; that does seem interesting.

> Getting stack traces on this platform can be very difficult. I'm going
> to try forcing complete serialization of the regression tests
> (MAX_CONNECTIONS=1) to see if the problem goes away. Any other
> suggestions would be welcome. Note that we're not getting the same issue
> on REL_13_STABLE, where the same group of tests (inherit, typed_table,
> vacuum) runs together.

If it does go away, that'd be interesting, but I don't see how it gets
us any closer to a fix.  Seems like a stack trace is a necessity to
narrow it down.

			regards, tom lane
On 6/14/21 9:39 AM, Tom Lane wrote:
> Andrew Dunstan <andrew@dunslane.net> writes:
>> I've been looking at the recent spate of intermittent failures on my
>> Cygwin animal lorikeet. Most of them look something like this, where
>> there's a 'VACUUM FULL pg_class' and an almost simultaneous 'CREATE
>> TABLE' which fails.
> Do you have any idea what "exit code 127" signifies on that platform?
> (BTW, not all of them look like that; many are reported as plain
> segfaults.)  I hadn't spotted the association with a concurrent
> "VACUUM FULL pg_class" before; that does seem interesting.
>
>> Getting stack traces on this platform can be very difficult. I'm going
>> to try forcing complete serialization of the regression tests
>> (MAX_CONNECTIONS=1) to see if the problem goes away. Any other
>> suggestions would be welcome. Note that we're not getting the same issue
>> on REL_13_STABLE, where the same group of tests (inherit, typed_table,
>> vacuum) runs together.
> If it does go away, that'd be interesting, but I don't see how it gets
> us any closer to a fix.  Seems like a stack trace is a necessity to
> narrow it down.

Some have given stack traces and some not, not sure why. The one from
June 13 has this:

---- backtrace ----
??
??:0
WaitOnLock
src/backend/storage/lmgr/lock.c:1831
LockAcquireExtended
src/backend/storage/lmgr/lock.c:1119
LockRelationOid
src/backend/storage/lmgr/lmgr.c:135
relation_open
src/backend/access/common/relation.c:59
table_open
src/backend/access/table/table.c:43
ScanPgRelation
src/backend/utils/cache/relcache.c:322
RelationBuildDesc
src/backend/utils/cache/relcache.c:1039
RelationIdGetRelation
src/backend/utils/cache/relcache.c:2045
relation_open
src/backend/access/common/relation.c:59
table_open
src/backend/access/table/table.c:43
ExecInitPartitionInfo
src/backend/executor/execPartition.c:510
ExecPrepareTupleRouting
src/backend/executor/nodeModifyTable.c:2311
ExecModifyTable
src/backend/executor/nodeModifyTable.c:2559
ExecutePlan
src/backend/executor/execMain.c:1557

The line in lmgr.c is where the process title gets changed to "waiting".
I recently stopped setting process title on this animal on REL_13_STABLE
and its similar errors have largely gone away. I can do the same on
HEAD. But it does make me wonder what the heck has changed to make this
code fragile.

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com
On 6/14/21 12:33 PM, Andrew Dunstan wrote:
>
> The line in lmgr.c is where the process title gets changed to "waiting".
> I recently stopped setting process title on this animal on REL_13_STABLE
> and its similar errors have largely gone away. I can do the same on
> HEAD. But it does make me wonder what the heck has changed to make this
> code fragile.

Of course I meant the line (1831) in lock.c.

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com
Andrew Dunstan <andrew@dunslane.net> writes:
> The line in lmgr.c is where the process title gets changed to "waiting".
> I recently stopped setting process title on this animal on REL_13_STABLE
> and its similar errors have largely gone away.

Oooh, that certainly seems like a smoking gun.

> I can do the same on HEAD. But it does make me wonder what the heck has
> changed to make this code fragile.

So what we've got there is

    old_status = get_ps_display(&len);
    new_status = (char *) palloc(len + 8 + 1);
    memcpy(new_status, old_status, len);
    strcpy(new_status + len, " waiting");
    set_ps_display(new_status);
    new_status[len] = '\0';		/* truncate off " waiting" */

Line 1831 is the strcpy, but it seems entirely impossible that that
could fail, unless palloc has shirked its job.  I'm thinking that the
crash is really in the memcpy --- looking at the other lines in your
trace, fingering the line after the call seems common.

What that'd have to imply is that get_ps_display() messed up,
returning a bad pointer or a bad length.

A platform-specific problem in get_ps_display() seems plausible
enough.  The apparent connection to a concurrent VACUUM FULL seems
pretty hard to explain that way ... but maybe that's a mirage.

			regards, tom lane
I wrote:
> What that'd have to imply is that get_ps_display() messed up,
> returning a bad pointer or a bad length.

> A platform-specific problem in get_ps_display() seems plausible
> enough.  The apparent connection to a concurrent VACUUM FULL seems
> pretty hard to explain that way ... but maybe that's a mirage.

If I understand correctly that you're only seeing this in v13 and
HEAD, then it seems like bf68b79e5 (Refactor ps_status.c API) deserves
a hard look.

			regards, tom lane