Thread: Why is lorikeet so unstable in v14 branch only?
I chanced to notice that buildfarm member lorikeet has been failing an awful
lot lately in the v14 branch, but hardly at all in other branches.  Here's a
log extract from its latest run [1]:

2022-03-26 06:31:47.245 EDT [623eeb93.d202:131] pg_regress/inherit LOG: statement: create table mlparted_tab (a int, b char, c text) partition by list (a);
2022-03-26 06:31:47.247 EDT [623eeb93.d202:132] pg_regress/inherit LOG: statement: create table mlparted_tab_part1 partition of mlparted_tab for values in (1);
2022-03-26 06:31:47.254 EDT [623eeb93.d203:60] pg_regress/vacuum LOG: statement: VACUUM FULL pg_class;
2022-03-26 06:31:47.258 EDT [623eeb92.d201:90] pg_regress/typed_table LOG: statement: SELECT a.attname, pg_catalog.format_type(a.atttypid, a.atttypmod), (SELECT pg_catalog.pg_get_expr(d.adbin, d.adrelid, true) FROM pg_catalog.pg_attrdef d WHERE d.adrelid = a.attrelid AND d.adnum = a.attnum AND a.atthasdef), a.attnotnull, (SELECT c.collname FROM pg_catalog.pg_collation c, pg_catalog.pg_type t WHERE c.oid = a.attcollation AND t.oid = a.atttypid AND a.attcollation <> t.typcollation) AS attcollation, a.attidentity, a.attgenerated FROM pg_catalog.pg_attribute a WHERE a.attrelid = '21770' AND a.attnum > 0 AND NOT a.attisdropped ORDER BY a.attnum;
*** starting debugger for pid 53762, tid 10536
2022-03-26 06:32:02.158 EDT [623eeb6c.d0c2:4] LOG: server process (PID 53762) exited with exit code 127
2022-03-26 06:32:02.158 EDT [623eeb6c.d0c2:5] DETAIL: Failed process was running: create table mlparted_tab_part1 partition of mlparted_tab for values in (1);
2022-03-26 06:32:02.158 EDT [623eeb6c.d0c2:6] LOG: terminating any other active server processes

The failures are not all exactly like this one, but they're mostly in CREATE
TABLE operations near this one.  I speculate that the "VACUUM FULL pg_class"
is triggering some misbehavior in concurrent partitioned-table creation.  The
lack of failures in other branches could be due to changes in the relative
timing of the "vacuum" and "inherit" test scripts.

Any chance we could get a stack trace from one of these crashes?

			regards, tom lane

[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=lorikeet&dt=2022-03-26%2010%3A17%3A22
On 2022-03-26 14:47:07 -0400, Tom Lane wrote:
> I chanced to notice that buildfarm member lorikeet has been
> failing an awful lot lately in the v14 branch, but hardly
> at all in other branches.  Here's a log extract from its
> latest run [1]:

One interesting bit in the config is:

'extra_config' => {
    ...
    'HEAD' => [
        'update_process_title = off'
    ],
    'REL_13_STABLE' => [
        'update_process_title = off'
    ]

> *** starting debugger for pid 53762, tid 10536
> 2022-03-26 06:32:02.158 EDT [623eeb6c.d0c2:4] LOG: server process (PID 53762) exited with exit code 127
> 2022-03-26 06:32:02.158 EDT [623eeb6c.d0c2:5] DETAIL: Failed process was running: create table mlparted_tab_part1 partition of mlparted_tab for values in (1);
> 2022-03-26 06:32:02.158 EDT [623eeb6c.d0c2:6] LOG: terminating any other active server processes

I wonder where the output of "starting debugger for pid 53762" ends up?  I
assume it's triggered by

'CYGWIN' => 'server error_start=c:\\cygwin64\\bin\\dumper.exe -d %1 %2',

https://cygwin.org/cygwin-ug-net/using-cygwinenv.html says "The filename of
the executing program and it's Windows process id are appended to the command
as arguments.", but nothing about %1 and %2 :(.  I assume those are just
"executing program" and "Windows process id" respectively?

Greetings,

Andres Freund
On 3/26/22 15:49, Andres Freund wrote:
> On 2022-03-26 14:47:07 -0400, Tom Lane wrote:
>> I chanced to notice that buildfarm member lorikeet has been
>> failing an awful lot lately in the v14 branch, but hardly
>> at all in other branches.  Here's a log extract from its
>> latest run [1]:
>
> One interesting bit in the config is:
>
> 'extra_config' => {
>     ...
>     'HEAD' => [
>         'update_process_title = off'
>     ],
>     'REL_13_STABLE' => [
>         'update_process_title = off'
>     ]

I'd forgotten about that. Let me do that for REL_14_STABLE and see where we
get to.

>> *** starting debugger for pid 53762, tid 10536
>> 2022-03-26 06:32:02.158 EDT [623eeb6c.d0c2:4] LOG: server process (PID 53762) exited with exit code 127
>> 2022-03-26 06:32:02.158 EDT [623eeb6c.d0c2:5] DETAIL: Failed process was running: create table mlparted_tab_part1 partition of mlparted_tab for values in (1);
>> 2022-03-26 06:32:02.158 EDT [623eeb6c.d0c2:6] LOG: terminating any other active server processes
>
> I wonder where the output of "starting debugger for pid 53762" ends up?  I
> assume it's triggered by
>
> 'CYGWIN' => 'server error_start=c:\\cygwin64\\bin\\dumper.exe -d %1 %2',
>
> https://cygwin.org/cygwin-ug-net/using-cygwinenv.html says "The filename of
> the executing program and it's Windows process id are appended to the command
> as arguments.", but nothing about %1 and %2 :(.  I assume those are just
> "executing program" and "Windows process id" respectively?

I don't remember where I got this invocation from. But see for example
<https://stackoverflow.com/questions/320001/using-a-stackdump-from-cygwin-executable>

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com
Andrew Dunstan <andrew@dunslane.net> writes:
> On 3/26/22 15:49, Andres Freund wrote:
>> One interesting bit in the config is:
>> [ lack of ]
>> 'update_process_title = off'

> I'd forgotten about that. Let me do that for REL_14_STABLE and see where
> we get to.

Hm.  But if that does mitigate it, it still seems like a bug no?
Why would that be preferentially crashing partitioned-table creation?

			regards, tom lane
On 3/26/22 17:19, Tom Lane wrote:
> Andrew Dunstan <andrew@dunslane.net> writes:
>> On 3/26/22 15:49, Andres Freund wrote:
>>> One interesting bit in the config is:
>>> [ lack of ]
>>> 'update_process_title = off'

>> I'd forgotten about that. Let me do that for REL_14_STABLE and see where
>> we get to.

> Hm.  But if that does mitigate it, it still seems like a bug no?
> Why would that be preferentially crashing partitioned-table creation?

Yes it seems like a bug, but hard to diagnose. It seemed like a bug back in
May: see
<https://postgr.es/m/4baee39d-0ebe-8327-7878-5bc11c95effa@dunslane.net>

I vaguely theorize about a buffer overrun somewhere that scribbles on the
stack.

The answer to Andres's question about where the stackdumps go is that they
go in the data directory, AFAIK. You can see the buildfarm logic for
collecting them at
<https://github.com/PGBuildFarm/client-code/blob/main/PGBuild/Utils.pm>
starting at line 149. There are various appropriate invocations of
get_stack_trace() in run_build.pl.

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com
Andrew Dunstan <andrew@dunslane.net> writes:
> Yes it seems like a bug, but hard to diagnose. It seemed like a bug back
> in May: see
> <https://postgr.es/m/4baee39d-0ebe-8327-7878-5bc11c95effa@dunslane.net>

Ah, right, but that link is busted.  Here's the correct link:

https://www.postgresql.org/message-id/flat/e6f1fb3e-1e08-0188-9c71-2b5b894571de%40dunslane.net

> I vaguely theorize about a buffer overrun somewhere that scribbles on
> the stack.

I said in the earlier thread

> A platform-specific problem in get_ps_display() seems plausible
> enough.  The apparent connection to a concurrent VACUUM FULL seems
> pretty hard to explain that way ... but maybe that's a mirage.

but your one stack trace showed a crash while trying to lock pg_class for
ScanPgRelation, which'd potentially have blocked because of the VACUUM ---
and that'd result in a process title change, if not disabled.  So now I feel
like "something rotten in ps_status.c" is a theory that can fit the
available facts.

> If I understand correctly that you're only seeing this in v13 and
> HEAD, then it seems like bf68b79e5 (Refactor ps_status.c API)
> deserves a hard look.

I still stand by this opinion.  Can you verify which of the ps_status.c
code paths gets used on this build?

			regards, tom lane
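For context, ps_status.c selects exactly one title-setting strategy at
compile time via a chain of preprocessor tests, and PS_USE_NONE is the
fallback when none of the platform-specific mechanisms applies.  The chain
looks roughly like the sketch below; this is paraphrased and abbreviated,
not the real file's full list of platform tests:

    #if defined(HAVE_SETPROCTITLE)
    #define PS_USE_SETPROCTITLE          /* BSD-style setproctitle() */
    #elif defined(__linux__) || defined(__darwin__)
    #define PS_USE_CLOBBER_ARGV          /* overwrite argv[] in place */
    #elif defined(WIN32)
    #define PS_USE_WIN32                 /* native Windows */
    #else
    #define PS_USE_NONE                  /* no known way to set the title */
    #endif

A build that lands in the PS_USE_NONE branch compiles out the title-setting
machinery entirely, which is what Andrew confirms next.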
On 3/26/22 18:10, Tom Lane wrote:
>> If I understand correctly that you're only seeing this in v13 and
>> HEAD, then it seems like bf68b79e5 (Refactor ps_status.c API)
>> deserves a hard look.

> I still stand by this opinion.  Can you verify which of the ps_status.c
> code paths gets used on this build?

It appears that it is using PS_USE_NONE, as it doesn't have any of the
defines required for the other paths. I note that the branch for that in
get_ps_display() doesn't set *displen, which looks a tad suspicious. It
could be left with any old junk.

And maybe there's a good case for also surrounding some of the code in
WaitOnLock() with "if (len) ..."

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com
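To make the suspicion concrete, here is an abridged paraphrase of the two
pieces of code under discussion (not the verbatim source): the
get_ps_display() branch that returns without writing its out-parameter, and
the WaitOnLock()-style caller that trusts the returned length.

    /* ps_status.c, abridged: the PS_USE_NONE build returns early */
    const char *
    get_ps_display(int *displen)
    {
    #ifndef PS_USE_NONE
        *displen = (int) (ps_buffer_cur_len - ps_buffer_fixed_size);
        return ps_buffer + ps_buffer_fixed_size;
    #else
        /* *displen is never assigned; the caller sees stack garbage */
        return "";
    #endif
    }

    /* lock.c, abridged: WaitOnLock() appends " waiting" to the title */
    int         len;
    const char *old_status = get_ps_display(&len);
    char       *new_status = (char *) palloc(len + 8 + 1);  /* " waiting" + NUL */

    memcpy(new_status, old_status, len);    /* bogus len => memcpy misbehaves */
    strcpy(new_status + len, " waiting");
    set_ps_display(new_status);

If len happens to come out negative, the memcpy() length wraps around to a
huge size_t, much the way the gdb session in the next message shows.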
Andrew Dunstan <andrew@dunslane.net> writes:
> It appears that it is using PS_USE_NONE, as it doesn't have any of the
> defines required for the other paths. I note that the branch for that in
> get_ps_display() doesn't set *displen, which looks a tad suspicious.

Indeed.  I forced it to use PS_USE_NONE on my Linux machine, and got a core
dump on the first try of the regression tests:

Program terminated with signal SIGSEGV, Segmentation fault.
#0  __memmove_avx_unaligned_erms () at ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:516
516             VMOVNT  %VEC(0), (%r9)
(gdb) bt
#0  __memmove_avx_unaligned_erms () at ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:516
#1  0x00000000008299b3 in WaitOnLock (locallock=locallock@entry=0x2a5e700, owner=owner@entry=0x2aba8f0) at lock.c:1831
#2  0x000000000082adc6 in LockAcquireExtended (locktag=locktag@entry=0x7ffc864fad90, lockmode=lockmode@entry=1, sessionLock=sessionLock@entry=false, dontWait=dontWait@entry=false, reportMemoryError=reportMemoryError@entry=true, locallockp=locallockp@entry=0x7ffc864fad88) at lock.c:1101
#3  0x000000000082861f in LockRelationOid (relid=1259, lockmode=1) at lmgr.c:117
#4  0x000000000051c5ed in relation_open (relationId=1259, lockmode=lockmode@entry=1) at relation.c:56
...
(gdb) f 1
#1  0x00000000008299b3 in WaitOnLock (locallock=locallock@entry=0x2a5e700, owner=owner@entry=0x2aba8f0) at lock.c:1831
1831                    memcpy(new_status, old_status, len);
(gdb) p len
$1 = -1

Problem explained, good detective work!

> And maybe there's a good case for also
> surrounding some of the code in WaitOnLock() with "if (len) ..."

+1.  I'll make it so, and check the other callers too.

Once I push this, you should remove the update_process_title hack from
lorikeet's config, since that was just a workaround until we tracked down
the problem, which I think we just did.

			regards, tom lane
I wrote:
> Andrew Dunstan <andrew@dunslane.net> writes:
>> And maybe there's a good case for also
>> surrounding some of the code in WaitOnLock() with "if (len) ..."

> +1.  I'll make it so, and check the other callers too.

I had second thoughts about that part after realizing that callers cannot
tell the difference between "ps_display is disabled" and "the activity part
of the display is currently empty".  In the latter case I think we'd rather
have WaitOnLock still append " waiting"; and it's not like PS_USE_NONE is so
common as to be worth optimizing for.  (Else we'd have identified this
problem sooner.)

> Once I push this, you should remove the update_process_title hack
> from lorikeet's config, since that was just a workaround until
> we tracked down the problem, which I think we just did.

Minimal fix pushed, so please adjust that animal's config.

			regards, tom lane
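The minimal fix Tom describes amounts to making the disabled branch report a
zero-length display instead of leaving the out-parameter untouched.  A
sketch of that shape (again a paraphrase, not the committed patch text):

    const char *
    get_ps_display(int *displen)
    {
    #ifndef PS_USE_NONE
        *displen = (int) (ps_buffer_cur_len - ps_buffer_fixed_size);
        return ps_buffer + ps_buffer_fixed_size;
    #else
        *displen = 0;           /* the previously missing assignment */
        return "";
    #endif
    }

With that in place, a WaitOnLock()-style caller just copies zero bytes and
appends " waiting" to an empty string, which is harmless even when the title
machinery is compiled out, so the "disabled" vs. "currently empty"
distinction never has to surface in the callers.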
On 3/27/22 13:01, Tom Lane wrote:
>> Once I push this, you should remove the update_process_title hack
>> from lorikeet's config, since that was just a workaround until
>> we tracked down the problem, which I think we just did.

> Minimal fix pushed, so please adjust that animal's config.

Done.

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com