Thread: recent failures on lorikeet
I've been looking at the recent spate of intermittent failures on my Cygwin animal lorikeet. Most of them look something like this, where there's 'VACUUM FULL pg_class' and an almost simultaneous "CREATE TABLE' which fails. 2021-06-14 05:04:00.220 EDT [60c71b7f.e8bf:60] pg_regress/vacuum LOG: statement: VACUUM FULL pg_class; 2021-06-14 05:04:00.222 EDT [60c71b80.e8c0:7] pg_regress/typed_table LOG: statement: CREATE TABLE persons OF person_type; 2021-06-14 05:04:00.232 EDT [60c71b80.e8c1:3] pg_regress/inherit LOG: statement: CREATE TABLE a (aa TEXT); *** starting debugger for pid 59584, tid 9640 2021-06-14 05:04:14.678 EDT [60c71b53.e780:4] LOG: server process (PID 59584) exited with exit code 127 2021-06-14 05:04:14.678 EDT [60c71b53.e780:5] DETAIL: Failed process was running: CREATE TABLE persons OF person_type; 2021-06-14 05:04:14.678 EDT [60c71b53.e780:6] LOG: terminating any other active server processes Getting stack traces in this platform can be very difficult. I'm going to try forcing complete serialization of the regression tests (MAX_CONNECTIONS=1) to see if the problem goes away. Any other suggestions might be useful. Note that we're not getting the same issue on REL_13_STABLE, where the same group pf tests run together (inherit typed_table, vacuum) cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
Andrew Dunstan <andrew@dunslane.net> writes:
> I've been looking at the recent spate of intermittent failures on my
> Cygwin animal lorikeet. Most of them look something like this, where
> there's a 'VACUUM FULL pg_class' and an almost simultaneous 'CREATE
> TABLE' which fails.

Do you have any idea what "exit code 127" signifies on that platform?
(BTW, not all of them look like that; many are reported as plain
segfaults.)  I hadn't spotted the association with a concurrent
"VACUUM FULL pg_class" before; that does seem interesting.

> Getting stack traces on this platform can be very difficult. I'm going
> to try forcing complete serialization of the regression tests
> (MAX_CONNECTIONS=1) to see if the problem goes away. Any other
> suggestions would be welcome. Note that we're not getting the same issue
> on REL_13_STABLE, where the same group of tests (inherit, typed_table,
> vacuum) runs together.

If it does go away, that'd be interesting, but I don't see how it gets
us any closer to a fix.  Seems like a stack trace is a necessity to
narrow it down.

			regards, tom lane
On 6/14/21 9:39 AM, Tom Lane wrote:
> Andrew Dunstan <andrew@dunslane.net> writes:
>> I've been looking at the recent spate of intermittent failures on my
>> Cygwin animal lorikeet. Most of them look something like this, where
>> there's a 'VACUUM FULL pg_class' and an almost simultaneous 'CREATE
>> TABLE' which fails.
> Do you have any idea what "exit code 127" signifies on that platform?
> (BTW, not all of them look like that; many are reported as plain
> segfaults.)  I hadn't spotted the association with a concurrent
> "VACUUM FULL pg_class" before; that does seem interesting.
>
>> Getting stack traces on this platform can be very difficult. I'm going
>> to try forcing complete serialization of the regression tests
>> (MAX_CONNECTIONS=1) to see if the problem goes away. Any other
>> suggestions would be welcome. Note that we're not getting the same issue
>> on REL_13_STABLE, where the same group of tests (inherit, typed_table,
>> vacuum) runs together.
> If it does go away, that'd be interesting, but I don't see how it gets
> us any closer to a fix.  Seems like a stack trace is a necessity to
> narrow it down.

Some have given stack traces and some not, not sure why. The one from
June 13 has this:

---- backtrace ----
??
??:0
WaitOnLock
src/backend/storage/lmgr/lock.c:1831
LockAcquireExtended
src/backend/storage/lmgr/lock.c:1119
LockRelationOid
src/backend/storage/lmgr/lmgr.c:135
relation_open
src/backend/access/common/relation.c:59
table_open
src/backend/access/table/table.c:43
ScanPgRelation
src/backend/utils/cache/relcache.c:322
RelationBuildDesc
src/backend/utils/cache/relcache.c:1039
RelationIdGetRelation
src/backend/utils/cache/relcache.c:2045
relation_open
src/backend/access/common/relation.c:59
table_open
src/backend/access/table/table.c:43
ExecInitPartitionInfo
src/backend/executor/execPartition.c:510
ExecPrepareTupleRouting
src/backend/executor/nodeModifyTable.c:2311
ExecModifyTable
src/backend/executor/nodeModifyTable.c:2559
ExecutePlan
src/backend/executor/execMain.c:1557

The line in lmgr.c is where the process title gets changed to "waiting".
I recently stopped setting process title on this animal on REL_13_STABLE
and its similar errors have largely gone away. I can do the same on
HEAD. But it does make me wonder what the heck has changed to make this
code fragile.

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com
On 6/14/21 12:33 PM, Andrew Dunstan wrote:
>
> The line in lmgr.c is where the process title gets changed to "waiting".
> I recently stopped setting process title on this animal on REL_13_STABLE
> and its similar errors have largely gone away. I can do the same on
> HEAD. But it does make me wonder what the heck has changed to make this
> code fragile.

Of course I meant the line (1831) in lock.c.

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com
Andrew Dunstan <andrew@dunslane.net> writes:
> The line in lmgr.c is where the process title gets changed to "waiting".
> I recently stopped setting process title on this animal on REL_13_STABLE
> and its similar errors have largely gone away.

Oooh, that certainly seems like a smoking gun.

> I can do the same on HEAD. But it does make me wonder what the heck has
> changed to make this code fragile.

So what we've got there is

    old_status = get_ps_display(&len);
    new_status = (char *) palloc(len + 8 + 1);
    memcpy(new_status, old_status, len);
    strcpy(new_status + len, " waiting");
    set_ps_display(new_status);
    new_status[len] = '\0';		/* truncate off " waiting" */

Line 1831 is the strcpy, but it seems entirely impossible that that
could fail, unless palloc has shirked its job.  I'm thinking that the
crash is really in the memcpy --- looking at the other lines in your
trace, fingering the line after the call seems common.

What that'd have to imply is that get_ps_display() messed up,
returning a bad pointer or a bad length.

A platform-specific problem in get_ps_display() seems plausible
enough.  The apparent connection to a concurrent VACUUM FULL seems
pretty hard to explain that way ... but maybe that's a mirage.

			regards, tom lane
I wrote:
> What that'd have to imply is that get_ps_display() messed up,
> returning a bad pointer or a bad length.

> A platform-specific problem in get_ps_display() seems plausible
> enough.  The apparent connection to a concurrent VACUUM FULL seems
> pretty hard to explain that way ... but maybe that's a mirage.

If I understand correctly that you're only seeing this in v13 and
HEAD, then it seems like bf68b79e5 (Refactor ps_status.c API) deserves
a hard look.

			regards, tom lane