Thread: Why is lorikeet so unstable in v14 branch only?

Why is lorikeet so unstable in v14 branch only?

From
Tom Lane
Date:
I chanced to notice that buildfarm member lorikeet has been
failing an awful lot lately in the v14 branch, but hardly
at all in other branches.  Here's a log extract from its
latest run [1]:

2022-03-26 06:31:47.245 EDT [623eeb93.d202:131] pg_regress/inherit LOG:  statement: create table mlparted_tab (a int, b char,c text) partition by list (a);
2022-03-26 06:31:47.247 EDT [623eeb93.d202:132] pg_regress/inherit LOG:  statement: create table mlparted_tab_part1 partition of mlparted_tab for values in (1);
2022-03-26 06:31:47.254 EDT [623eeb93.d203:60] pg_regress/vacuum LOG:  statement: VACUUM FULL pg_class;
2022-03-26 06:31:47.258 EDT [623eeb92.d201:90] pg_regress/typed_table LOG:  statement: SELECT a.attname,
      pg_catalog.format_type(a.atttypid, a.atttypmod),
      (SELECT pg_catalog.pg_get_expr(d.adbin, d.adrelid, true)
       FROM pg_catalog.pg_attrdef d
       WHERE d.adrelid = a.attrelid AND d.adnum = a.attnum AND a.atthasdef),
      a.attnotnull,
      (SELECT c.collname FROM pg_catalog.pg_collation c, pg_catalog.pg_type t
       WHERE c.oid = a.attcollation AND t.oid = a.atttypid AND a.attcollation <> t.typcollation) AS attcollation,
      a.attidentity,
      a.attgenerated
    FROM pg_catalog.pg_attribute a
    WHERE a.attrelid = '21770' AND a.attnum > 0 AND NOT a.attisdropped
    ORDER BY a.attnum;
*** starting debugger for pid 53762, tid 10536
2022-03-26 06:32:02.158 EDT [623eeb6c.d0c2:4] LOG:  server process (PID 53762) exited with exit code 127
2022-03-26 06:32:02.158 EDT [623eeb6c.d0c2:5] DETAIL:  Failed process was running: create table mlparted_tab_part1 partition of mlparted_tab for values in (1);
2022-03-26 06:32:02.158 EDT [623eeb6c.d0c2:6] LOG:  terminating any other active server processes

The failures are not all exactly like this one, but they're mostly in
CREATE TABLE operations near this one.  I speculate that what is happening
is that the "VACUUM FULL pg_class" is triggering some misbehavior in
concurrent partitioned-table creation.  The lack of failures in other
branches could be due to changes in the relative timing of the "vacuum"
and "inherit" test scripts.

Any chance we could get a stack trace from one of these crashes?

            regards, tom lane

[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=lorikeet&dt=2022-03-26%2010%3A17%3A22



Re: Why is lorikeet so unstable in v14 branch only?

From
Andres Freund
Date:
On 2022-03-26 14:47:07 -0400, Tom Lane wrote:
> I chanced to notice that buildfarm member lorikeet has been
> failing an awful lot lately in the v14 branch, but hardly
> at all in other branches.  Here's a log extract from its
> latest run [1]:

One interesting bit in the config is:

                   'extra_config' => {
                   ...
                                       'HEAD' => [
                                                   'update_process_title = off'
                                                 ],
                                       'REL_13_STABLE' => [
                                                            'update_process_title = off'
                                                          ]


> *** starting debugger for pid 53762, tid 10536
> 2022-03-26 06:32:02.158 EDT [623eeb6c.d0c2:4] LOG:  server process (PID 53762) exited with exit code 127
> 2022-03-26 06:32:02.158 EDT [623eeb6c.d0c2:5] DETAIL:  Failed process was running: create table mlparted_tab_part1 partition of mlparted_tab for values in (1);
> 2022-03-26 06:32:02.158 EDT [623eeb6c.d0c2:6] LOG:  terminating any other active server processes

I wonder where the output of "starting debugger for pid 53762" ends up? I
assume it's triggered by
                                    'CYGWIN' => 'server error_start=c:\\cygwin64\\bin\\dumper.exe -d %1 %2',

https://cygwin.org/cygwin-ug-net/using-cygwinenv.html
says "The filename of the executing program and it's Windows process id are appended to the command as arguments. "

but nothing about %1 and %2 :(. I assume those are just "executing program" and
"Windows process id" respectively?

Greetings,

Andres Freund



Re: Why is lorikeet so unstable in v14 branch only?

From
Andrew Dunstan
Date:
On 3/26/22 15:49, Andres Freund wrote:
> On 2022-03-26 14:47:07 -0400, Tom Lane wrote:
>> I chanced to notice that buildfarm member lorikeet has been
>> failing an awful lot lately in the v14 branch, but hardly
>> at all in other branches.  Here's a log extract from its
>> latest run [1]:
> One interesting bit in the config is:
>
>                    'extra_config' => {
>                    ...
>                                        'HEAD' => [
>                                                    'update_process_title = off'
>                                                  ],
>                                        'REL_13_STABLE' => [
>                                                             'update_process_title = off'
>                                                           ]
>

I'd forgotten about that. Let me do that for REL_14_STABLE and see where
we get to.


>> *** starting debugger for pid 53762, tid 10536
>> 2022-03-26 06:32:02.158 EDT [623eeb6c.d0c2:4] LOG:  server process (PID 53762) exited with exit code 127
>> 2022-03-26 06:32:02.158 EDT [623eeb6c.d0c2:5] DETAIL:  Failed process was running: create table mlparted_tab_part1 partition of mlparted_tab for values in (1);
>> 2022-03-26 06:32:02.158 EDT [623eeb6c.d0c2:6] LOG:  terminating any other active server processes
> I wonder where the output of "starting debugger for pid 53762" ends up? I
> assume it's triggered by
>                                     'CYGWIN' => 'server error_start=c:\\cygwin64\\bin\\dumper.exe -d %1 %2',
>
> https://cygwin.org/cygwin-ug-net/using-cygwinenv.html
> says "The filename of the executing program and it's Windows process id are appended to the command as arguments. "
>
> but nothing about %1 and %2 :(. I assume those are just "executing program" and
> "Windows process id" respectively?



I don't remember where I got this invocation from. But see for example
<https://stackoverflow.com/questions/320001/using-a-stackdump-from-cygwin-executable>


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com




Re: Why is lorikeet so unstable in v14 branch only?

From
Tom Lane
Date:
Andrew Dunstan <andrew@dunslane.net> writes:
> On 3/26/22 15:49, Andres Freund wrote:
>> One interesting bit in the config is:
>> [ lack of ]
>> 'update_process_title = off'

> I'd forgotten about that. Let me do that for REL_14_STABLE and see where
> we get to.

Hm.  But if that does mitigate it, it still seems like a bug, no?
Why would that be preferentially crashing partitioned-table creation?

            regards, tom lane



Re: Why is lorikeet so unstable in v14 branch only?

From
Andrew Dunstan
Date:
On 3/26/22 17:19, Tom Lane wrote:
> Andrew Dunstan <andrew@dunslane.net> writes:
>> On 3/26/22 15:49, Andres Freund wrote:
>>> One interesting bit in the config is:
>>> [ lack of ]
>>> 'update_process_title = off'
>> I'd forgotten about that. Let me do that for REL_14_STABLE and see where
>> we get to.
> Hm.  But if that does mitigate it, it still seems like a bug, no?
> Why would that be preferentially crashing partitioned-table creation?


Yes, it seems like a bug, but hard to diagnose. It seemed like a bug back
in May:  see
<https://postgr.es/m/4baee39d-0ebe-8327-7878-5bc11c95effa@dunslane.net>

I vaguely theorize about a buffer overrun somewhere that scribbles on
the stack.

The answer to Andres's question about where the stackdumps go is that
they go in the data directory, AFAIK. You can see the buildfarm logic
for collecting them at
<https://github.com/PGBuildFarm/client-code/blob/main/PGBuild/Utils.pm>
starting at line 149. There are various appropriate invocations of
get_stack_trace() in run_build.pl.


cheers


andrew



--
Andrew Dunstan
EDB: https://www.enterprisedb.com




Re: Why is lorikeet so unstable in v14 branch only?

From
Tom Lane
Date:
Andrew Dunstan <andrew@dunslane.net> writes:
> Yes, it seems like a bug, but hard to diagnose. It seemed like a bug back
> in May:  see
> <https://postgr.es/m/4baee39d-0ebe-8327-7878-5bc11c95effa@dunslane.net>

Ah, right, but that link is busted.  Here's the correct link:

https://www.postgresql.org/message-id/flat/e6f1fb3e-1e08-0188-9c71-2b5b894571de%40dunslane.net

> I vaguely theorize about a buffer overrun somewhere that scribbles on
> the stack.

I said in the earlier thread

> A platform-specific problem in get_ps_display() seems plausible
> enough.  The apparent connection to a concurrent VACUUM FULL seems
> pretty hard to explain that way ... but maybe that's a mirage.

but your one stack trace showed a crash while trying to lock pg_class for
ScanPgRelation, which'd potentially have blocked because of the VACUUM ---
and that'd result in a process title change, if not disabled.  So now
I feel like "something rotten in ps_status.c" is a theory that can fit
the available facts.
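
For reference, the title change in question happens in WaitOnLock(); a
paraphrased sketch of that code path (approximating the v13/v14 lock.c,
not quoted verbatim) looks about like this, and it only works if
get_ps_display() hands back a sane length:

    /* With update_process_title enabled, save the current title,
     * append " waiting", and restore the original once the lock
     * is granted.  Surrounding lock-wait logic elided. */
    const char *old_status;
    char       *new_status = NULL;
    int         len;

    if (update_process_title)
    {
        old_status = get_ps_display(&len);          /* len is an output parameter */
        new_status = (char *) palloc(len + 8 + 1);  /* room for " waiting" + NUL */
        memcpy(new_status, old_status, len);
        strcpy(new_status + len, " waiting");
        set_ps_display(new_status);
        new_status[len] = '\0';                     /* truncate off " waiting" */
    }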

> If I understand correctly that you're only seeing this in v13 and
> HEAD, then it seems like bf68b79e5 (Refactor ps_status.c API)
> deserves a hard look.

I still stand by this opinion.  Can you verify which of the ps_status.c
code paths gets used on this build?

            regards, tom lane



Re: Why is lorikeet so unstable in v14 branch only?

From
Andrew Dunstan
Date:
On 3/26/22 18:10, Tom Lane wrote:
>
>> If I understand correctly that you're only seeing this in v13 and
>> HEAD, then it seems like bf68b79e5 (Refactor ps_status.c API)
>> deserves a hard look.
> I still stand by this opinion.  Can you verify which of the ps_status.c
> code paths gets used on this build?
>
>             



It appears that it is using PS_USE_NONE, as it doesn't have any of the
defines required for the other paths. I note that the branch for that in
get_ps_display() doesn't set *displen, which looks a tad suspicious. It
could be left with any old junk. And maybe there's a good case for also
surrounding some of the code in WaitOnLock() with "if (len) ..."
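
For concreteness, the suspect branch looks roughly like this (paraphrased
from ps_status.c after the bf68b79e5 refactor; the exact preprocessor
layout may differ).  The PS_USE_NONE path returns without ever writing to
*displen, so a caller's local length variable keeps whatever junk happens
to be on the stack:

    const char *
    get_ps_display(int *displen)
    {
    #ifdef PS_USE_CLOBBER_ARGV
        /* If ps_buffer is a pointer, it might still be NULL */
        if (!ps_buffer)
        {
            *displen = 0;
            return "";
        }
    #endif

    #ifndef PS_USE_NONE
        *displen = (int) (ps_buffer_cur_len - ps_buffer_fixed_size);
        return ps_buffer + ps_buffer_fixed_size;
    #else
        return "";      /* note: *displen is never assigned on this path */
    #endif
    }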


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com




Re: Why is lorikeet so unstable in v14 branch only?

From
Tom Lane
Date:
Andrew Dunstan <andrew@dunslane.net> writes:
> It appears that it is using PS_USE_NONE, as it doesn't have any of the
> defines required for the other paths. I note that the branch for that in
> get_ps_display() doesn't set *displen, which looks a tad suspicious.

Indeed.  I forced it to use PS_USE_NONE on my Linux machine, and got
a core dump on the first try of the regression tests:

Program terminated with signal SIGSEGV, Segmentation fault.
#0  __memmove_avx_unaligned_erms ()
    at ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:516
516             VMOVNT  %VEC(0), (%r9)
(gdb) bt
#0  __memmove_avx_unaligned_erms ()
    at ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:516
#1  0x00000000008299b3 in WaitOnLock (locallock=locallock@entry=0x2a5e700, 
    owner=owner@entry=0x2aba8f0) at lock.c:1831
#2  0x000000000082adc6 in LockAcquireExtended (
    locktag=locktag@entry=0x7ffc864fad90, lockmode=lockmode@entry=1, 
    sessionLock=sessionLock@entry=false, dontWait=dontWait@entry=false, 
    reportMemoryError=reportMemoryError@entry=true, 
    locallockp=locallockp@entry=0x7ffc864fad88) at lock.c:1101
#3  0x000000000082861f in LockRelationOid (relid=1259, lockmode=1)
    at lmgr.c:117
#4  0x000000000051c5ed in relation_open (relationId=1259, 
    lockmode=lockmode@entry=1) at relation.c:56
...

(gdb) f 1
#1  0x00000000008299b3 in WaitOnLock (locallock=locallock@entry=0x2a5e700, 
    owner=owner@entry=0x2aba8f0) at lock.c:1831
1831                    memcpy(new_status, old_status, len);
(gdb) p len
$1 = -1

Problem explained, good detective work!

> And maybe there's a good case for also
> surrounding some of the code in WaitOnLock() with "if (len) ..."

+1.  I'll make it so, and check the other callers too.

Once I push this, you should remove the update_process_title hack
from lorikeet's config, since that was just a workaround until
we tracked down the problem, which I think we just did.

            regards, tom lane



Re: Why is lorikeet so unstable in v14 branch only?

From
Tom Lane
Date:
I wrote:
> Andrew Dunstan <andrew@dunslane.net> writes:
>> And maybe there's a good case for also
>> surrounding some of the code in WaitOnLock() with "if (len) ..."

> +1.  I'll make it so, and check the other callers too.

I had second thoughts about that part after realizing that callers
cannot tell the difference between "ps_display is disabled" and
"the activity part of the display is currently empty".  In the latter
case I think we'd rather have WaitOnLock still append " waiting";
and it's not like PS_USE_NONE is so common as to be worth optimizing
for.  (Else we'd have identified this problem sooner.)

> Once I push this, you should remove the update_process_title hack
> from lorikeet's config, since that was just a workaround until
> we tracked down the problem, which I think we just did.

Minimal fix pushed, so please adjust that animal's config.
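
For the archives, the fix amounts to making the PS_USE_NONE branch of
get_ps_display() initialize its output length, presumably along these
lines (a sketch, not the committed diff):

    #ifndef PS_USE_NONE
        *displen = (int) (ps_buffer_cur_len - ps_buffer_fixed_size);
        return ps_buffer + ps_buffer_fixed_size;
    #else
        *displen = 0;       /* previously left uninitialized */
        return "";
    #endif

With that, WaitOnLock() copies zero bytes and simply appends " waiting"
to an empty display string, instead of memcpy'ing with a garbage length.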

            regards, tom lane



Re: Why is lorikeet so unstable in v14 branch only?

From
Andrew Dunstan
Date:
On 3/27/22 13:01, Tom Lane wrote:
>
>> Once I push this, you should remove the update_process_title hack
>> from lorikeet's config, since that was just a workaround until
>> we tracked down the problem, which I think we just did.
> Minimal fix pushed, so please adjust that animal's config.
>

Done.


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com