Thread: Why is infinite_recurse test suddenly failing?
In the past week, four different buildfarm members have shown non-reproducing segfaults in the "select infinite_recurse()" test case, rather than the expected detection of stack overrun before we get to the point of a segfault.

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=bonito&dt=2019-05-01%2023%3A05%3A36
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=takin&dt=2019-05-01%2008%3A16%3A48
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=buri&dt=2019-04-27%2023%3A54%3A46
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=demoiselle&dt=2019-04-27%2014%3A55%3A52

They're all on HEAD, and they all look like

2019-05-01 23:11:00.145 UTC [13933:65] LOG: server process (PID 17161) was terminated by signal 11: Segmentation fault
2019-05-01 23:11:00.145 UTC [13933:66] DETAIL: Failed process was running: select infinite_recurse();

I scraped the buildfarm database and verified that there are no similar failures for at least three months back; nor, offhand, can I remember ever having seen this test fail in many years. So it seems we broke something recently. No idea what though.

(Another possibility, seeing that these are all members of Mark's PPC64 flotilla, is that there's some common misconfiguration --- but it's hard to credit that such a problem would only affect HEAD not the back branches.)

Thoughts?

regards, tom lane
Hi,

On 2019-05-02 11:02:03 -0400, Tom Lane wrote:
> In the past week, four different buildfarm members have shown
> non-reproducing segfaults in the "select infinite_recurse()"
> test case, rather than the expected detection of stack overrun
> before we get to the point of a segfault.

I was just staring at bonito's failure in confusion.

> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=bonito&dt=2019-05-01%2023%3A05%3A36
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=takin&dt=2019-05-01%2008%3A16%3A48
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=buri&dt=2019-04-27%2023%3A54%3A46
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=demoiselle&dt=2019-04-27%2014%3A55%3A52
>
> They're all on HEAD, and they all look like
>
> 2019-05-01 23:11:00.145 UTC [13933:65] LOG: server process (PID 17161) was terminated by signal 11: Segmentation fault
> 2019-05-01 23:11:00.145 UTC [13933:66] DETAIL: Failed process was running: select infinite_recurse();
>
> I scraped the buildfarm database and verified that there are no similar
> failures for at least three months back; nor, offhand, can I remember ever
> having seen this test fail in many years. So it seems we broke something
> recently. No idea what though.

I can't recall any recent changes to the relevant area of code.

> (Another possibility, seeing that these are all members of Mark's PPC64
> flotilla, is that there's some common misconfiguration --- but it's hard
> to credit that such a problem would only affect HEAD not the back
> branches.)

Hm, I just noticed:

'HEAD' => [
    'force_parallel_mode = regress'
]

on all those animals. So it's not necessarily the case that HEAD and backbranch runs are behaving all that identically. Note that isn't a recent config change, so it's not an explanation as to why they started to fail only recently.

Greetings,

Andres Freund
Andres Freund <andres@anarazel.de> writes: > Hm, I just noticed: > 'HEAD' => [ > 'force_parallel_mode = regress' > ] Oooh, I didn't see that. > on all those animals. So it's not necessarily the case that HEAD and > backbranch runs are behaving all that identical. Note that isn't a > recent config change, so it's not an explanation as to why they started > to fail only recently. No, but it does point at another area of the code in which a relevant change could've occurred. While we're looking at this --- Mark, if you could install gdb on your buildfarm hosts, that would be really handy. I think that's the only extra thing the buildfarm script needs to extract stack traces from core dumps. We'd likely already know where the problem is if we had a stack trace ... regards, tom lane
On Thu, May 02, 2019 at 11:45:34AM -0400, Tom Lane wrote: > Andres Freund <andres@anarazel.de> writes: > > Hm, I just noticed: > > 'HEAD' => [ > > 'force_parallel_mode = regress' > > ] > > Oooh, I didn't see that. > > > on all those animals. So it's not necessarily the case that HEAD and > > backbranch runs are behaving all that identical. Note that isn't a > > recent config change, so it's not an explanation as to why they started > > to fail only recently. > > No, but it does point at another area of the code in which a relevant > change could've occurred. > > While we're looking at this --- Mark, if you could install gdb > on your buildfarm hosts, that would be really handy. I think that's > the only extra thing the buildfarm script needs to extract stack > traces from core dumps. We'd likely already know where the problem > is if we had a stack trace ... Ok, I think I have gdb installed now... Regards, Mark -- Mark Wong 2ndQuadrant - PostgreSQL Solutions for the Enterprise https://www.2ndQuadrant.com/
Mark Wong <mark@2ndquadrant.com> writes: > On Thu, May 02, 2019 at 11:45:34AM -0400, Tom Lane wrote: >> While we're looking at this --- Mark, if you could install gdb >> on your buildfarm hosts, that would be really handy. I think that's >> the only extra thing the buildfarm script needs to extract stack >> traces from core dumps. We'd likely already know where the problem >> is if we had a stack trace ... > Ok, I think I have gdb installed now... Thanks! regards, tom lane
Hi, On 2019-05-03 10:08:59 -0700, Mark Wong wrote: > Ok, I think I have gdb installed now... Thanks! Any chance you could turn on force_parallel_mode for the other branches it applies to too? Makes it easier to figure out whether breakage is related to that, or independent. Greetings, Andres Freund
We just got another one of these, on yet another member of Mark's ppc64 armada:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=shoveler&dt=2019-05-10%2014%3A04%3A34

Now we have a stack trace (thanks Mark!), but it is pretty unsurprising:

Core was generated by `postgres: debian regression [local] SELECT '.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  sysmalloc (nb=8208, av=0x3fff916e0d28 <main_arena>) at malloc.c:2748
2748    malloc.c: No such file or directory.
#0  sysmalloc (nb=8208, av=0x3fff916e0d28 <main_arena>) at malloc.c:2748
#1  0x00003fff915bedc8 in _int_malloc (av=0x3fff916e0d28 <main_arena>, bytes=8192) at malloc.c:3865
#2  0x00003fff915c1064 in __GI___libc_malloc (bytes=8192) at malloc.c:2928
#3  0x00000000106acfd8 in AllocSetContextCreateInternal (parent=0x1000babdad0, name=0x1085508c "inline_function", minContextSize=<optimized out>, initBlockSize=<optimized out>, maxBlockSize=8388608) at aset.c:477
#4  0x00000000103d5e00 in inline_function (funcid=20170, result_type=<optimized out>, result_collid=<optimized out>, input_collid=<optimized out>, funcvariadic=<optimized out>, func_tuple=<optimized out>, context=0x3fffe3da15d0, args=<optimized out>) at clauses.c:4459
#5  simplify_function (funcid=<optimized out>, result_type=<optimized out>, result_typmod=<optimized out>, result_collid=<optimized out>, input_collid=<optimized out>, args_p=<optimized out>, funcvariadic=<optimized out>, process_args=<optimized out>, allow_non_const=<optimized out>, context=<optimized out>) at clauses.c:4040
#6  0x00000000103d2e74 in eval_const_expressions_mutator (node=0x1000babe968, context=0x3fffe3da15d0) at clauses.c:2474
#7  0x00000000103511bc in expression_tree_mutator (node=<optimized out>, mutator=0x103d2b10 <eval_const_expressions_mutator>, context=0x3fffe3da15d0) at nodeFuncs.c:2893
#8  0x00000000103d2cbc in eval_const_expressions_mutator (node=0x1000babe9c0, context=0x3fffe3da15d0) at clauses.c:3606
#9  0x00000000103510c8 in expression_tree_mutator (node=<optimized out>, mutator=<optimized out>, context=<optimized out>) at nodeFuncs.c:2942
#10 0x00000000103d2cbc in eval_const_expressions_mutator (node=0x1000babea40, context=0x3fffe3da15d0) at clauses.c:3606
#11 0x00000000103d2ae8 in eval_const_expressions (root=<optimized out>, node=<optimized out>) at clauses.c:2266
#12 0x00000000103b6264 in preprocess_expression (root=0x1000babee18, expr=<optimized out>, kind=1) at planner.c:1087
#13 0x00000000103b496c in subquery_planner (glob=<optimized out>, parse=<optimized out>, parent_root=<optimized out>, hasRecursion=<optimized out>, tuple_fraction=<optimized out>) at planner.c:769
#14 0x00000000103b3c58 in standard_planner (parse=<optimized out>, cursorOptions=<optimized out>, boundParams=<optimized out>) at planner.c:406
#15 0x00000000103b3a68 in planner (parse=<optimized out>, cursorOptions=<optimized out>, boundParams=<optimized out>) at planner.c:275
#16 0x00000000104cc2cc in pg_plan_query (querytree=0x1000babe7f8, cursorOptions=256, boundParams=0x0) at postgres.c:878
#17 0x00000000102ef850 in init_execution_state (lazyEvalOK=<optimized out>, queryTree_list=<optimized out>, fcache=<optimized out>) at functions.c:507
#18 init_sql_fcache (finfo=<optimized out>, collation=<optimized out>, lazyEvalOK=<optimized out>) at functions.c:770
#19 fmgr_sql (fcinfo=<optimized out>) at functions.c:1053
#20 0x00000000102cef24 in ExecInterpExpr (state=<optimized out>, econtext=<optimized out>, isnull=<optimized out>) at execExprInterp.c:625
#21 0x00000000102cddb8 in ExecInterpExprStillValid (state=0x1000bab41e8, econtext=0x1000bab3ed8, isNull=<optimized out>) at execExprInterp.c:1769
#22 0x0000000010314f10 in ExecEvalExprSwitchContext (state=0x1000bab41e8, econtext=0x1000bab3ed8, isNull=<optimized out>) at ../../../src/include/executor/executor.h:307
#23 ExecProject (projInfo=0x1000bab41e0) at ../../../src/include/executor/executor.h:341
#24 ExecResult (pstate=<optimized out>) at nodeResult.c:136
#25 0x00000000102e319c in ExecProcNodeFirst (node=0x1000bab3dc0) at execProcnode.c:445
#26 0x00000000102d9c94 in ExecProcNode (node=<optimized out>) at ../../../src/include/executor/executor.h:239
#27 ExecutePlan (estate=<optimized out>, planstate=<optimized out>, use_parallel_mode=false, operation=<optimized out>, numberTuples=<optimized out>, direction=<optimized out>, dest=<optimized out>, execute_once=<optimized out>, sendTuples=<optimized out>) at execMain.c:1648
#28 standard_ExecutorRun (queryDesc=<optimized out>, direction=<optimized out>, count=<optimized out>, execute_once=<optimized out>) at execMain.c:365
#29 0x00000000102d9ac8 in ExecutorRun (queryDesc=<optimized out>, direction=<optimized out>, count=<optimized out>, execute_once=<optimized out>) at execMain.c:309
#30 0x00000000102efe84 in postquel_getnext (es=<optimized out>, fcache=<optimized out>) at functions.c:867
#31 fmgr_sql (fcinfo=<optimized out>) at functions.c:1153
#32 0x00000000102cef24 in ExecInterpExpr (state=<optimized out>, econtext=<optimized out>, isnull=<optimized out>) at execExprInterp.c:625
#33 0x00000000102cddb8 in ExecInterpExprStillValid (state=0x1000baa8158, econtext=0x1000baa7e48, isNull=<optimized out>) at execExprInterp.c:1769
#34 0x0000000010314f10 in ExecEvalExprSwitchContext (state=0x1000baa8158, econtext=0x1000baa7e48, isNull=<optimized out>) at ../../../src/include/executor/executor.h:307
#35 ExecProject (projInfo=0x1000baa8150) at ../../../src/include/executor/executor.h:341
#36 ExecResult (pstate=<optimized out>) at nodeResult.c:136
#37 0x00000000102e319c in ExecProcNodeFirst (node=0x1000baa7d30) at execProcnode.c:445
... lots and lots of repetitions ...
#11809 0x00000000102e319c in ExecProcNodeFirst (node=0x10008c01e90) at execProcnode.c:445
#11810 0x00000000102d9c94 in ExecProcNode (node=<optimized out>) at ../../../src/include/executor/executor.h:239
#11811 ExecutePlan (estate=<optimized out>, planstate=<optimized out>, use_parallel_mode=false, operation=<optimized out>, numberTuples=<optimized out>, direction=<optimized out>, dest=<optimized out>, execute_once=<optimized out>, sendTuples=<optimized out>) at execMain.c:1648
#11812 standard_ExecutorRun (queryDesc=<optimized out>, direction=<optimized out>, count=<optimized out>, execute_once=<optimized out>) at execMain.c:365
#11813 0x00000000102d9ac8 in ExecutorRun (queryDesc=<optimized out>, direction=<optimized out>, count=<optimized out>, execute_once=<optimized out>) at execMain.c:309
#11814 0x00000000104d39ec in PortalRunSelect (portal=0x10008be9de8, forward=<optimized out>, count=0, dest=<optimized out>) at pquery.c:929
#11815 0x00000000104d34c0 in PortalRun (portal=0x10008be9de8, count=<optimized out>, isTopLevel=<optimized out>, run_once=<optimized out>, dest=<optimized out>, altdest=<optimized out>, completionTag=0x3fffe3ecd6c0 "") at pquery.c:770
#11816 0x00000000104d1bc4 in exec_simple_query (query_string=<optimized out>) at postgres.c:1215
#11817 0x00000000104ced50 in PostgresMain (argc=<optimized out>, argv=<optimized out>, dbname=<optimized out>, username=<optimized out>) at postgres.c:4249
#11818 0x00000000104110fc in BackendRun (port=<optimized out>) at postmaster.c:4430
#11819 BackendStartup (port=<optimized out>) at postmaster.c:4121
#11820 ServerLoop () at postmaster.c:1704
#11821 PostmasterMain (argc=<optimized out>, argv=<optimized out>) at postmaster.c:1377
#11822 0x000000001034def4 in main (argc=8, argv=0x10008b7efb0) at main.c:228

So that lets out any theory that somehow we're getting into a weird control path that misses calling check_stack_depth; expression_tree_mutator does so for one, and it was called just nine stack frames down from the crash.

Casting about for possible explanations, I notice that the failure seems to have occurred at a nesting depth of 982 SQL-function calls ((11809 - 25)/12). I'd previously scraped the buildfarm database to find out what nesting depths we normally get to before erroring out, by counting the number of "infinite_recurse" context messages in the postmaster log. Here's the entry for shoveler:

shoveler | 2019-05-03 14:19:19 | 1674

So this failed at substantially less stack depth than it's successfully consumed in other runs, and *very* substantially less stack than ought to be there, considering we pay attention to getrlimit in setting max_stack_depth and add quite a generous amount of slop too.

I am wondering if, somehow, the stack depth limit seen by the postmaster sometimes doesn't apply to its children. That would be pretty wacko kernel behavior, especially if it's only intermittently true. But we're running out of other explanations.

regards, tom lane
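[The guard referred to above works, in essence, by recording a reference address near the bottom of the stack at process start and comparing the address of a local variable against it at convenient recursion points. The following is a minimal standalone sketch of that idea, with invented names and an invented 2MB limit; it is not the actual PostgreSQL source, just an illustration of the mechanism whose limit is under suspicion here.]

/*
 * Sketch of a stack-depth guard: record a reference address near the
 * bottom of the stack once, then compare the address of a local
 * variable against it before recursing further.  Names and the 2MB
 * limit are invented for this example.
 */
#include <stdio.h>
#include <stdlib.h>

static char *stack_base;                         /* set once in main() */
static long max_stack_bytes = 2L * 1024 * 1024;  /* stand-in for max_stack_depth */

static int
stack_too_deep(void)
{
    char here;                       /* its address marks the current depth */
    long used = stack_base - &here;

    if (used < 0)
        used = -used;                /* the stack may grow in either direction */
    return used > max_stack_bytes;
}

static int
recurse(int depth)
{
    char pad[1024];                  /* consume some stack; defeats tail-call optimization */

    pad[0] = (char) depth;
    if (stack_too_deep())
    {
        fprintf(stderr, "stack depth limit exceeded at depth %d\n", depth);
        exit(1);                     /* a real server raises an ERROR instead */
    }
    return recurse(depth + 1) + pad[0];
}

int
main(void)
{
    char base;

    stack_base = &base;
    return recurse(0);               /* terminates via exit(1) in recurse() */
}

[If the guard's limit is correct and the checks are reached, as the trace shows they were, the recursion should stop with an error well before the kernel has to refuse a stack page, which is what makes the crash above so odd.]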
Hi,

On 2019-05-10 11:38:57 -0400, Tom Lane wrote:
> Core was generated by `postgres: debian regression [local] SELECT '.
> Program terminated with signal SIGSEGV, Segmentation fault.
> #0  sysmalloc (nb=8208, av=0x3fff916e0d28 <main_arena>) at malloc.c:2748
> 2748    malloc.c: No such file or directory.
> #0  sysmalloc (nb=8208, av=0x3fff916e0d28 <main_arena>) at malloc.c:2748
> #1  0x00003fff915bedc8 in _int_malloc (av=0x3fff916e0d28 <main_arena>, bytes=8192) at malloc.c:3865
> #2  0x00003fff915c1064 in __GI___libc_malloc (bytes=8192) at malloc.c:2928
> #3  0x00000000106acfd8 in AllocSetContextCreateInternal (parent=0x1000babdad0, name=0x1085508c "inline_function", minContextSize=<optimized out>, initBlockSize=<optimized out>, maxBlockSize=8388608) at aset.c:477
> #4  0x00000000103d5e00 in inline_function (funcid=20170, result_type=<optimized out>, result_collid=<optimized out>, input_collid=<optimized out>, funcvariadic=<optimized out>, func_tuple=<optimized out>, context=0x3fffe3da15d0, args=<optimized out>) at clauses.c:4459
> #5  simplify_function (funcid=<optimized out>, result_type=<optimized out>, result_typmod=<optimized out>, result_collid=<optimized out>, input_collid=<optimized out>, args_p=<optimized out>, funcvariadic=<optimized out>, process_args=<optimized out>, allow_non_const=<optimized out>, context=<optimized out>) at clauses.c:4040
> #6  0x00000000103d2e74 in eval_const_expressions_mutator (node=0x1000babe968, context=0x3fffe3da15d0) at clauses.c:2474
> #7  0x00000000103511bc in expression_tree_mutator (node=<optimized out>, mutator=0x103d2b10 <eval_const_expressions_mutator>, context=0x3fffe3da15d0) at nodeFuncs.c:2893

> So that lets out any theory that somehow we're getting into a weird
> control path that misses calling check_stack_depth;
> expression_tree_mutator does so for one, and it was called just nine
> stack frames down from the crash.

Right. There's plenty places checking it...

> I am wondering if, somehow, the stack depth limit seen by the postmaster
> sometimes doesn't apply to its children. That would be pretty wacko
> kernel behavior, especially if it's only intermittently true.
> But we're running out of other explanations.

I wonder if this is a SIGSEGV that actually signals an OOM situation. Linux, if it can't actually extend the stack on-demand due to OOM, sends a SIGSEGV. The signal has that information, but unfortunately the buildfarm code doesn't print it. p $_siginfo would show us some of that...

Mark, how tight is the memory on that machine? Does dmesg have any other information (often segfaults are logged by the kernel with the code IIRC).

Greetings, Andres Freund
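[For reference, the detail Andres wants out of "p $_siginfo" is the si_code/si_addr pair the kernel attaches to the SIGSEGV. The toy program below — not anything PostgreSQL itself does, since the backend deliberately doesn't trap SIGSEGV — just shows what those fields look like when delivered to a handler registered with SA_SIGINFO; gdb displays the same structure recovered from a core file.]

/*
 * Standalone illustration (not PostgreSQL code) of the extra detail a
 * SIGSEGV carries: si_code distinguishes "no mapping at that address"
 * (SEGV_MAPERR) from "mapping exists but access not permitted"
 * (SEGV_ACCERR), and si_addr is the faulting address.
 */
#define _GNU_SOURCE
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static void
segv_handler(int signo, siginfo_t *info, void *context)
{
    (void) context;
    /* fprintf isn't async-signal-safe, but this is a throwaway demo */
    fprintf(stderr, "signal %d, si_code=%d (%s), si_addr=%p\n",
            signo, info->si_code,
            info->si_code == SEGV_MAPERR ? "SEGV_MAPERR" :
            info->si_code == SEGV_ACCERR ? "SEGV_ACCERR" : "other",
            info->si_addr);
    _exit(1);
}

int
main(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_sigaction = segv_handler;
    sa.sa_flags = SA_SIGINFO;
    sigaction(SIGSEGV, &sa, NULL);

    *(volatile int *) 1 = 42;        /* provoke a fault at an unmapped address */
    return 0;
}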
Andres Freund <andres@anarazel.de> writes:
> On 2019-05-10 11:38:57 -0400, Tom Lane wrote:
>> I am wondering if, somehow, the stack depth limit seen by the postmaster
>> sometimes doesn't apply to its children. That would be pretty wacko
>> kernel behavior, especially if it's only intermittently true.
>> But we're running out of other explanations.

> I wonder if this is a SIGSEGV that actually signals an OOM
> situation. Linux, if it can't actually extend the stack on-demand due to
> OOM, sends a SIGSEGV. The signal has that information, but
> unfortunately the buildfarm code doesn't print it. p $_siginfo would
> show us some of that...

> Mark, how tight is the memory on that machine? Does dmesg have any other
> information (often segfaults are logged by the kernel with the code
> IIRC).

It does sort of smell like a resource exhaustion problem, especially if all these buildfarm animals are VMs running on the same underlying platform. But why would that manifest as "you can't have a measly two megabytes of stack" and not as any other sort of OOM symptom?

Mark, if you don't mind modding your local copies of the buildfarm script, I think what Andres is asking for is a pretty trivial addition in PGBuild/Utils.pm's sub get_stack_trace:

    my $cmdfile = "./gdbcmd";
    my $handle;
    open($handle, '>', $cmdfile) || die "opening $cmdfile: $!";
    print $handle "bt\n";
+   print $handle "p $_siginfo\n";
    close($handle);

regards, tom lane
On 5/10/19 3:35 PM, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
>> On 2019-05-10 11:38:57 -0400, Tom Lane wrote:
>>> I am wondering if, somehow, the stack depth limit seen by the postmaster
>>> sometimes doesn't apply to its children. That would be pretty wacko
>>> kernel behavior, especially if it's only intermittently true.
>>> But we're running out of other explanations.
>> I wonder if this is a SIGSEGV that actually signals an OOM
>> situation. Linux, if it can't actually extend the stack on-demand due to
>> OOM, sends a SIGSEGV. The signal has that information, but
>> unfortunately the buildfarm code doesn't print it. p $_siginfo would
>> show us some of that...
>> Mark, how tight is the memory on that machine? Does dmesg have any other
>> information (often segfaults are logged by the kernel with the code
>> IIRC).
> It does sort of smell like a resource exhaustion problem, especially
> if all these buildfarm animals are VMs running on the same underlying
> platform. But why would that manifest as "you can't have a measly two
> megabytes of stack" and not as any other sort of OOM symptom?
>
> Mark, if you don't mind modding your local copies of the buildfarm
> script, I think what Andres is asking for is a pretty trivial addition
> in PGBuild/Utils.pm's sub get_stack_trace:
>
>     my $cmdfile = "./gdbcmd";
>     my $handle;
>     open($handle, '>', $cmdfile) || die "opening $cmdfile: $!";
>     print $handle "bt\n";
> +   print $handle "p $_siginfo\n";
>     close($handle);

I think we'll need to write that as:

    print $handle 'p $_siginfo',"\n";

As you have it written perl will try to interpolate a variable called $_siginfo.

cheers

andrew
On Fri, May 10, 2019 at 05:26:43PM -0400, Andrew Dunstan wrote: > I think we'll need to write that as: > > print $handle 'p $_siginfo',"\n"; > > As you have it written perl will try to interpolate a variable called > $_siginfo. Anything in double quotes with a dollar sign would be interpreted as a variable, and single quotes make that safe. -- Michael
On 5/12/19 4:57 AM, Michael Paquier wrote: > On Fri, May 10, 2019 at 05:26:43PM -0400, Andrew Dunstan wrote: >> I think we'll need to write that as: >> >> print $handle 'p $_siginfo',"\n"; >> >> As you have it written perl will try to interpolate a variable called >> $_siginfo. > Anything in double quotes with a dollar sign would be interpreted as a > variable, and single quotes make that safe. Yes, that's why I did it that way. cheers andrew
On Fri, May 03, 2019 at 11:45:33AM -0700, Andres Freund wrote:
> Hi,
>
> On 2019-05-03 10:08:59 -0700, Mark Wong wrote:
> > Ok, I think I have gdb installed now...
>
> Thanks! Any chance you could turn on force_parallel_mode for the other
> branches it applies to too? Makes it easier to figure out whether
> breakage is related to that, or independent.

Slowly catching up on my collection of ppc64le animals... I still need to upgrade the build farm client (v8) on:

* dhole
* vulpes
* wobbegong
* cuon
* batfish
* devario
* cardinalfish

The following I've enabled force_parallel_mode for HEAD, 11, 10, and 9.6:

* buri
* urocryon
* ayu
* shoveler
* chimaera
* bonito
* takin
* bufflehead
* elasmobranch
* demoiselle
* cavefish

Regards,

Mark
Mark Wong <mark@2ndQuadrant.com> writes: > The following I've enabled force_parallel_mode for HEAD, 11, 10, and > 9.6: Thanks Mark! In theory, the stack trace we now have from shoveler proves that parallel mode has nothing to do with this, because the crash happens during planning (specifically, inlining SQL functions). I wonder though if it's possible that previous force_parallel_mode queries have had some undesirable effect on the process's stack allocation. Just grasping at straws, because it's sure not clear what's going wrong. regards, tom lane
Mark Wong <mark@2ndQuadrant.com> writes: > Slowly catching up on my collection of ppc64le animals... Oh, btw ... you didn't answer my question from before: are these animals all running on a common platform (and if so, what is that), or are they really different hardware? If they're all VMs on one machine then it might be that there's some common-mode effect from the underlying system. (Another thing I notice, now, is that these are all Linux variants; I'd been thinking you had some BSDen in there too, but now I see that none of those are ppc64. Hm.) regards, tom lane
On Fri, May 10, 2019 at 11:27:07AM -0700, Andres Freund wrote:
> Hi,
>
> On 2019-05-10 11:38:57 -0400, Tom Lane wrote:
> > Core was generated by `postgres: debian regression [local] SELECT '.
> > Program terminated with signal SIGSEGV, Segmentation fault.
> > #0  sysmalloc (nb=8208, av=0x3fff916e0d28 <main_arena>) at malloc.c:2748
> > 2748    malloc.c: No such file or directory.
> > #0  sysmalloc (nb=8208, av=0x3fff916e0d28 <main_arena>) at malloc.c:2748
> > #1  0x00003fff915bedc8 in _int_malloc (av=0x3fff916e0d28 <main_arena>, bytes=8192) at malloc.c:3865
> > #2  0x00003fff915c1064 in __GI___libc_malloc (bytes=8192) at malloc.c:2928
> > #3  0x00000000106acfd8 in AllocSetContextCreateInternal (parent=0x1000babdad0, name=0x1085508c "inline_function", minContextSize=<optimized out>, initBlockSize=<optimized out>, maxBlockSize=8388608) at aset.c:477
> > #4  0x00000000103d5e00 in inline_function (funcid=20170, result_type=<optimized out>, result_collid=<optimized out>, input_collid=<optimized out>, funcvariadic=<optimized out>, func_tuple=<optimized out>, context=0x3fffe3da15d0, args=<optimized out>) at clauses.c:4459
> > #5  simplify_function (funcid=<optimized out>, result_type=<optimized out>, result_typmod=<optimized out>, result_collid=<optimized out>, input_collid=<optimized out>, args_p=<optimized out>, funcvariadic=<optimized out>, process_args=<optimized out>, allow_non_const=<optimized out>, context=<optimized out>) at clauses.c:4040
> > #6  0x00000000103d2e74 in eval_const_expressions_mutator (node=0x1000babe968, context=0x3fffe3da15d0) at clauses.c:2474
> > #7  0x00000000103511bc in expression_tree_mutator (node=<optimized out>, mutator=0x103d2b10 <eval_const_expressions_mutator>, context=0x3fffe3da15d0) at nodeFuncs.c:2893
> >
> > So that lets out any theory that somehow we're getting into a weird
> > control path that misses calling check_stack_depth;
> > expression_tree_mutator does so for one, and it was called just nine
> > stack frames down from the crash.
>
> Right. There's plenty places checking it...
>
> > I am wondering if, somehow, the stack depth limit seen by the postmaster
> > sometimes doesn't apply to its children. That would be pretty wacko
> > kernel behavior, especially if it's only intermittently true.
> > But we're running out of other explanations.
>
> I wonder if this is a SIGSEGV that actually signals an OOM
> situation. Linux, if it can't actually extend the stack on-demand due to
> OOM, sends a SIGSEGV. The signal has that information, but
> unfortunately the buildfarm code doesn't print it. p $_siginfo would
> show us some of that...
>
> Mark, how tight is the memory on that machine?

There's about 2GB allocated:

debian@postgresql-debian:~$ cat /proc/meminfo
MemTotal:        2080704 kB
MemFree:         1344768 kB
MemAvailable:    1824192 kB

At the moment it looks like plenty. :) Maybe I should set something up to monitor these things.

> Does dmesg have any other
> information (often segfaults are logged by the kernel with the code
> IIRC).

It's been up for about 49 days:

debian@postgresql-debian:~$ uptime
 14:54:30 up 49 days, 14:59, 3 users, load average: 0.00, 0.34, 1.04

I see one line from dmesg that is related to postgres:

[3939350.616849] postgres[17057]: bad frame in setup_rt_frame: 00003fffe3d9fe00 nip 00003fff915bdba0 lr 00003fff915bde9c

But only that one time in 49 days up. Otherwise I see a half dozen hung_task_timeout_secs messages around jbd2 and dhclient.

Regards,

Mark

--
Mark Wong
2ndQuadrant - PostgreSQL Solutions for the Enterprise
https://www.2ndQuadrant.com/
On Tue, May 14, 2019 at 10:52:07AM -0400, Tom Lane wrote: > Mark Wong <mark@2ndQuadrant.com> writes: > > Slowly catching up on my collection of ppc64le animals... > > Oh, btw ... you didn't answer my question from before: are these animals > all running on a common platform (and if so, what is that), or are they > really different hardware? If they're all VMs on one machine then it > might be that there's some common-mode effect from the underlying system. Sorry, I was almost there. :) These systems are provisioned with OpenStack. Additionally, a couple more (cardinalfish, devario) are using docker under that. > (Another thing I notice, now, is that these are all Linux variants; > I'd been thinking you had some BSDen in there too, but now I see > that none of those are ppc64. Hm.) Right, the BSDen I have are on different hardware. Regards, Mark -- Mark Wong 2ndQuadrant - PostgreSQL Solutions for the Enterprise https://www.2ndQuadrant.com/
On Fri, May 10, 2019 at 05:26:43PM -0400, Andrew Dunstan wrote: > > On 5/10/19 3:35 PM, Tom Lane wrote: > > Andres Freund <andres@anarazel.de> writes: > >> On 2019-05-10 11:38:57 -0400, Tom Lane wrote: > >>> I am wondering if, somehow, the stack depth limit seen by the postmaster > >>> sometimes doesn't apply to its children. That would be pretty wacko > >>> kernel behavior, especially if it's only intermittently true. > >>> But we're running out of other explanations. > >> I wonder if this is a SIGSEGV that actually signals an OOM > >> situation. Linux, if it can't actually extend the stack on-demand due to > >> OOM, sends a SIGSEGV. The signal has that information, but > >> unfortunately the buildfarm code doesn't print it. p $_siginfo would > >> show us some of that... > >> Mark, how tight is the memory on that machine? Does dmesg have any other > >> information (often segfaults are logged by the kernel with the code > >> IIRC). > > It does sort of smell like a resource exhaustion problem, especially > > if all these buildfarm animals are VMs running on the same underlying > > platform. But why would that manifest as "you can't have a measly two > > megabytes of stack" and not as any other sort of OOM symptom? > > > > Mark, if you don't mind modding your local copies of the buildfarm > > script, I think what Andres is asking for is a pretty trivial addition > > in PGBuild/Utils.pm's sub get_stack_trace: > > > > my $cmdfile = "./gdbcmd"; > > my $handle; > > open($handle, '>', $cmdfile) || die "opening $cmdfile: $!"; > > print $handle "bt\n"; > > + print $handle "p $_siginfo\n"; > > close($handle); > > > > > > > I think we'll need to write that as: > > > print $handle 'p $_siginfo',"\n"; Ok, I have this added to everyone now. I think I also have caught up on this thread, but let me know if I missed anything. Regards, Mark -- Mark Wong 2ndQuadrant - PostgreSQL Solutions for the Enterprise https://www.2ndQuadrant.com/
Hi, On 2019-05-14 08:31:37 -0700, Mark Wong wrote: > Ok, I have this added to everyone now. > > I think I also have caught up on this thread, but let me know if I > missed anything. I notice https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=demoiselle&dt=2019-05-19%2014%3A22%3A23 failed recently, but unfortunately does not appear to have gdb installed? Or the buildfarm version is too old? Or ulimits are set strictly on a system wide basis? Greetings, Andres Freund
On Sun, May 19, 2019 at 02:38:26PM -0700, Andres Freund wrote:
> Hi,
>
> On 2019-05-14 08:31:37 -0700, Mark Wong wrote:
> > Ok, I have this added to everyone now.
> >
> > I think I also have caught up on this thread, but let me know if I
> > missed anything.
>
> I notice https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=demoiselle&dt=2019-05-19%2014%3A22%3A23
> failed recently, but unfortunately does not appear to have gdb
> installed? Or the buildfarm version is too old? Or ulimits are set
> strictly on a system wide basis?

It looks like I did have gdb on there:

opensuse@postgresql-opensuse-p9:~> gdb --version
GNU gdb (GDB; openSUSE Leap 15.0) 8.1

I'm on v9 of the build-farm here (have it on my todo list to get everything up to 10.)

I hope nothing is overriding my core size ulimit:

opensuse@postgresql-opensuse-p9:~> ulimit -c
unlimited

This animal is using clang. I wonder if gdb is disagreeing with the clang binaries?

Regards,

Mark

--
Mark Wong
2ndQuadrant - PostgreSQL Solutions for the Enterprise
https://www.2ndQuadrant.com/
On Mon, May 20, 2019 at 12:15:49PM -0700, Mark Wong wrote: > On Sun, May 19, 2019 at 02:38:26PM -0700, Andres Freund wrote: > > Hi, > > > > On 2019-05-14 08:31:37 -0700, Mark Wong wrote: > > > Ok, I have this added to everyone now. > > > > > > I think I also have caught up on this thread, but let me know if I > > > missed anything. > > > > I notice https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=demoiselle&dt=2019-05-19%2014%3A22%3A23 > > failed recently, but unfortunately does not appear to have gdb > > installed? Or the buildfarm version is too old? Or ulimits are set > > strictly on a system wide basis? > > I'm on v9 of the build-farm here (have it on my todo list to get > everything up to 10.) Andrew let me know I need to get on v10. I've upgraded demoiselle, and am trying to work through the rest now... Regards, Mark -- Mark Wong 2ndQuadrant - PostgreSQL Solutions for the Enterprise https://www.2ndQuadrant.com/
On Tue, May 21, 2019 at 9:22 AM Mark Wong <mark@2ndquadrant.com> wrote:
> Andrew let me know I need to get on v10. I've upgraded demoiselle, and
> am trying to work through the rest now...

Here's another crash like that.

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=cavefish&dt=2019-07-13%2003%3A49%3A38

2019-07-13 04:01:23.437 UTC [9365:70] LOG: server process (PID 12951) was terminated by signal 11: Segmentation fault
2019-07-13 04:01:23.437 UTC [9365:71] DETAIL: Failed process was running: select infinite_recurse();

--
Thomas Munro
https://enterprisedb.com
Thomas Munro <thomas.munro@gmail.com> writes:
> Here's another crash like that.
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=cavefish&dt=2019-07-13%2003%3A49%3A38
> 2019-07-13 04:01:23.437 UTC [9365:70] LOG: server process (PID 12951)
> was terminated by signal 11: Segmentation fault
> 2019-07-13 04:01:23.437 UTC [9365:71] DETAIL: Failed process was
> running: select infinite_recurse();

It occurred to me to scrape the buildfarm database for these failures, and what I got was

sysname | branch | snapshot | stage | data | architecture
--------------+---------------+---------------------+-----------------+------------------------------------------------------------------------------------------------------+------------------
demoiselle | HEAD | 2019-04-27 14:55:52 | pg_upgradeCheck | 2019-04-27 15:00:42.736 UTC [1457:66] DETAIL: Failed process was running: select infinite_recurse(); | ppc64le (POWER9)
buri | HEAD | 2019-04-27 23:54:46 | Check | 2019-04-28 00:01:49.794 UTC [3041:66] DETAIL: Failed process was running: select infinite_recurse(); | ppc64le (POWER9)
takin | HEAD | 2019-05-01 08:16:48 | pg_upgradeCheck | 2019-05-01 08:23:27.159 UTC [32303:59] DETAIL: Failed process was running: select infinite_recurse(); | ppc64le
bonito | HEAD | 2019-05-01 23:05:36 | Check | 2019-05-01 23:11:00.145 UTC [13933:66] DETAIL: Failed process was running: select infinite_recurse(); | ppc64le (POWER9)
shoveler | HEAD | 2019-05-10 14:04:34 | Check | 2019-05-10 14:11:26.833 UTC [13456:73] DETAIL: Failed process was running: select infinite_recurse(); | ppc64le (POWER8)
demoiselle | HEAD | 2019-05-19 14:22:23 | pg_upgradeCheck | 2019-05-19 14:26:17.002 UTC [23275:80] DETAIL: Failed process was running: select infinite_recurse(); | ppc64le (POWER9)
vulpes | HEAD | 2019-06-15 09:16:45 | pg_upgradeCheck | 2019-06-15 09:22:22.268 UTC [4885:77] DETAIL: Failed process was running: select infinite_recurse(); | ppc64le
ayu | HEAD | 2019-06-19 22:13:23 | pg_upgradeCheck | 2019-06-19 22:18:16.805 UTC [2708:71] DETAIL: Failed process was running: select infinite_recurse(); | ppc64le (POWER8)
quokka | HEAD | 2019-07-10 14:20:13 | pg_upgradeCheck | 2019-07-10 15:24:06.102 BST [5d25f4fb.2644:5] DETAIL: Failed process was running: select infinite_recurse(); | ppc64
cavefish | HEAD | 2019-07-13 03:49:38 | pg_upgradeCheck | 2019-07-13 04:01:23.437 UTC [9365:71] DETAIL: Failed process was running: select infinite_recurse(); | ppc64le (POWER9)
pintail | REL_12_STABLE | 2019-07-13 19:36:51 | Check | 2019-07-13 19:39:29.013 UTC [31086:5] DETAIL: Failed process was running: select infinite_recurse(); | ppc64le (POWER9)
bonito | HEAD | 2019-07-19 23:13:01 | Check | 2019-07-19 23:16:33.330 UTC [24191:70] DETAIL: Failed process was running: select infinite_recurse(); | ppc64le (POWER9)
takin | HEAD | 2019-07-24 08:24:56 | Check | 2019-07-24 08:28:01.735 UTC [16366:75] DETAIL: Failed process was running: select infinite_recurse(); | ppc64le
quokka | HEAD | 2019-07-31 02:00:07 | pg_upgradeCheck | 2019-07-31 03:04:04.043 BST [5d40f709.776a:5] DETAIL: Failed process was running: select infinite_recurse(); | ppc64
elasmobranch | HEAD | 2019-08-01 03:13:38 | Check | 2019-08-01 03:19:05.394 UTC [22888:62] DETAIL: Failed process was running: select infinite_recurse(); | ppc64le (POWER9)
buri | HEAD | 2019-08-02 00:10:23 | Check | 2019-08-02 00:17:11.075 UTC [28222:73] DETAIL: Failed process was running: select infinite_recurse(); | ppc64le (POWER9)
urocryon | HEAD | 2019-08-02 05:43:46 | Check | 2019-08-02 05:51:51.944 UTC [2724:64] DETAIL: Failed process was running: select infinite_recurse(); | ppc64le
batfish | HEAD | 2019-08-04 19:02:36 | pg_upgradeCheck | 2019-08-04 19:08:11.728 UTC [23899:79] DETAIL: Failed process was running: select infinite_recurse(); | ppc64le
buri | REL_12_STABLE | 2019-08-07 00:03:29 | pg_upgradeCheck | 2019-08-07 00:11:24.500 UTC [1405:5] DETAIL: Failed process was running: select infinite_recurse(); | ppc64le (POWER9)
quokka | REL_12_STABLE | 2019-08-08 02:43:45 | pg_upgradeCheck | 2019-08-08 03:47:38.115 BST [5d4b8d3f.cdd7:5] DETAIL: Failed process was running: select infinite_recurse(); | ppc64
quokka | HEAD | 2019-08-08 14:00:08 | Check | 2019-08-08 15:02:59.770 BST [5d4c2b88.cad9:5] DETAIL: Failed process was running: select infinite_recurse(); | ppc64
mereswine | REL_11_STABLE | 2019-08-11 02:10:12 | InstallCheck-C | 2019-08-11 02:36:10.159 PDT [5004:4] DETAIL: Failed process was running: select infinite_recurse(); | ARMv7
takin | HEAD | 2019-08-11 08:02:48 | Check | 2019-08-11 08:05:57.789 UTC [11500:67] DETAIL: Failed process was running: select infinite_recurse(); | ppc64le
mereswine | REL_12_STABLE | 2019-08-11 09:52:46 | pg_upgradeCheck | 2019-08-11 04:21:16.756 PDT [6804:5] DETAIL: Failed process was running: select infinite_recurse(); | ARMv7
mereswine | HEAD | 2019-08-11 11:29:27 | pg_upgradeCheck | 2019-08-11 07:15:28.454 PDT [9954:76] DETAIL: Failed process was running: select infinite_recurse(); | ARMv7
demoiselle | HEAD | 2019-08-11 14:51:38 | pg_upgradeCheck | 2019-08-11 14:57:29.422 UTC [9436:70] DETAIL: Failed process was running: select infinite_recurse(); | ppc64le (POWER9)
(26 rows)

This is from a scan going back 9 months (to mid-December), so the lack of any matches before late April is pretty notable: it seems highly probable that some change we made during April is related. A cursory scan of commits during April finds only one that seems conceivably related (though perhaps I just lack enough imagination):

Author: Tom Lane <tgl@sss.pgh.pa.us>
Branch: master Release: REL_12_BR [798070ec0] 2019-04-11 18:16:50 -0400

    Re-order some regression test scripts for more parallelism.

    Move the strings, numerology, insert, insert_conflict, select and
    errors tests to be parts of nearby parallel groups, instead of
    executing by themselves.

So that leads to the thought that "the infinite_recurse test is fine if it runs by itself, but it tends to fall over if there are concurrently-running backends". I have absolutely no idea how that would happen on anything that passes for a platform built in this century. Still, it's a place to start, which we hadn't before.

Also notable is that we now have a couple of hits on ARM, not only ppc64. Don't know what to make of that.

regards, tom lane
On Thu, Aug 15, 2019 at 5:49 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> So that leads to the thought that "the infinite_recurse test is fine
> if it runs by itself, but it tends to fall over if there are
> concurrently-running backends". I have absolutely no idea how that
> would happen on anything that passes for a platform built in this
> century. Still, it's a place to start, which we hadn't before.

Hmm. mereswine's recent failure on REL_11_STABLE was running the serial schedule.

I read about 3 ways to get SEGV from stack-related faults: you can exceed RLIMIT_STACK (the total mapping size) and then you'll get SEGV (per man pages), you can access a page that is inside the mapping but is beyond the stack pointer (with some tolerance, exact details vary by arch), and you can fail to allocate a page due to low memory.

The first kind of failure doesn't seem right -- we carefully set max_stack_depth based on RLIMIT_STACK minus some slop, so that theory would require child processes to have different stack limits than the postmaster as you said (perhaps OpenStack, Docker, related tooling or concurrent activity on the host system is capable of changing it?), or a bug in our slop logic. The second kind of failure would imply that we have a bug -- we're accessing something below the stack pointer -- but that doesn't seem right either -- I think various address sanitising tools would have told us about that, and it's hard to believe there is a bug in the powerpc and arm implementation of the stack pointer check (see Linux arch/{powerpc,arm}/mm/fault.c). That leaves the third explanation, except then I'd expect to see other kinds of problems like OOM etc before you get to that stage, and why just here? Confused.

> Also notable is that we now have a couple of hits on ARM, not
> only ppc64. Don't know what to make of that.

Yeah, that is indeed interesting.

--
Thomas Munro
https://enterprisedb.com
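[To make the "slop logic" in Thomas's first point concrete: the idea is to cap the recursion guard's limit at the soft RLIMIT_STACK minus a fixed safety margin. The sketch below uses assumed names and an assumed 512kB margin rather than the actual PostgreSQL source; the point is that for explanation #1 to hold, the child backend's effective RLIMIT_STACK would have to be lower than whatever this kind of computation saw when the limit was established.]

/*
 * Rough shape of the limit computation under discussion (a sketch with
 * assumed names and an assumed 512kB margin, not the actual PostgreSQL
 * source): take the soft RLIMIT_STACK, subtract some slop, and never
 * let the recursion guard's limit exceed what's left.
 */
#include <stdio.h>
#include <sys/resource.h>

#define STACK_SLOP_BYTES (512 * 1024)    /* assumed margin for this sketch */

static long
usable_stack_bytes(void)
{
    struct rlimit rlim;

    if (getrlimit(RLIMIT_STACK, &rlim) != 0 || rlim.rlim_cur == RLIM_INFINITY)
        return -1;                       /* unknown: caller should fall back to a conservative default */

    if (rlim.rlim_cur <= STACK_SLOP_BYTES)
        return 0;

    return (long) (rlim.rlim_cur - STACK_SLOP_BYTES);
}

int
main(void)
{
    long limit = usable_stack_bytes();

    if (limit < 0)
        printf("RLIMIT_STACK unknown or unlimited; use a conservative default\n");
    else
        printf("stack-depth guard would be capped at %ld bytes\n", limit);
    return 0;
}

[Explanation #3, by contrast, needs no limit mismatch at all: the limit can be perfectly correct and the kernel can still fail to fault in the next stack page when memory is tight, which is the scenario the later messages in the thread circle back to.]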
Hello hackers,

15.08.2019 10:17, Thomas Munro wrote:
> On Thu, Aug 15, 2019 at 5:49 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> So that leads to the thought that "the infinite_recurse test is fine
>> if it runs by itself, but it tends to fall over if there are
>> concurrently-running backends". I have absolutely no idea how that
>> would happen on anything that passes for a platform built in this
>> century. Still, it's a place to start, which we hadn't before.
> Hmm. mereswine's recent failure on REL_11_STABLE was running the
> serial schedule.
>
> I read about 3 ways to get SEGV from stack-related faults: you can
> exceed RLIMIT_STACK (the total mapping size) and then you'll get SEGV
> (per man pages), you can access a page that is inside the mapping but
> is beyond the stack pointer (with some tolerance, exact details vary
> by arch), and you can fail to allocate a page due to low memory.
>
> The first kind of failure doesn't seem right -- we carefully set
> max_stack_depth based on RLIMIT_STACK minus some slop, so that theory
> would require child processes to have different stack limits than the
> postmaster as you said (perhaps OpenStack, Docker, related tooling or
> concurrent activity on the host system is capable of changing it?), or
> a bug in our slop logic. The second kind of failure would imply that
> we have a bug -- we're accessing something below the stack pointer --
> but that doesn't seem right either -- I think various address
> sanitising tools would have told us about that, and it's hard to
> believe there is a bug in the powerpc and arm implementation of the
> stack pointer check (see Linux arch/{powerpc,arm}/mm/fault.c). That
> leaves the third explanation, except then I'd expect to see other
> kinds of problems like OOM etc before you get to that stage, and why
> just here? Confused.
>
>> Also notable is that we now have a couple of hits on ARM, not
>> only ppc64. Don't know what to make of that.
> Yeah, that is indeed interesting.

Excuse me for reviving this ancient thread, but the aforementioned mereswine animal has failed again recently [1]:

002_pg_upgrade_old_node.log contains:
2024-06-26 02:49:06.742 PDT [29121:4] LOG: server process (PID 30908) was terminated by signal 9: Killed
2024-06-26 02:49:06.742 PDT [29121:5] DETAIL: Failed process was running: select infinite_recurse();

I believe this time it's caused by an OOM condition, and I think this issue occurs on armv7 mereswine because 1) armv7 uses the stack very efficiently (thanks to 32-bitness and maybe also the Link Register) and 2) such old machines are usually tight on memory.

I've analyzed buildfarm logs and found from the check stage of that failed run:

wget [2] -O log
grep 'SQL function "infinite_recurse" statement 1' log | wc -l
5818

(that is, the nesting depth is 5818 levels for a successful run of the test)

For comparison:
mereswine on HEAD [3], [4] shows 5691 levels;
alimoche (aarch64) on HEAD [5] — 3535;
lapwing (i686) on HEAD [6] — 5034;
alligator (x86_64) on HEAD [7] — 3965.

So it seems to me that unlike [9] this failure may be explained by reaching an OOM condition.

I have an armv7 device with 2GB RAM that doesn't pass `make check`, nor even `TESTS=infinite_recurse make -s check-tests`, from time to time, due to:

2024-06-28 12:40:49.947 UTC postmaster[20019] LOG: server process (PID 20078) was terminated by signal 11: Segmentation fault
2024-06-28 12:40:49.947 UTC postmaster[20019] DETAIL: Failed process was running: select infinite_recurse();
...

Using host libthread_db library "/lib/arm-linux-gnueabihf/libthread_db.so.1".
Core was generated by `postgres: android regression [local] SELECT '.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  downcase_identifier (ident=0xa006d837 "infinite_recurse", len=16, warn=true, truncate=truncate@entry=true) at scansup.c:52
52              result = palloc(len + 1);
(gdb) p $sp
$1 = (void *) 0xbe9b0020

It looks more like [9], but I think the OOM effect is OS/kernel dependent.

The test passes reliably with lower max_stack_depth values, though, so I've analyzed how much memory the backend consumes (total size and the size of its largest segment) depending on the value:

1500kB
adfe1000  220260K rw---   [ anon ]
total     419452K
---
1600kB
ac7e5000  234748K rw---   [ anon ]
total     434040K
---
1700kB
acf61000  249488K rw---   [ anon ]
total     448880K
---
default value (2048kB)
aac65000  300528K rw---   [ anon ]
total     501424K

(Roughly, increasing max_stack_depth by 100kB increases the backend's memory consumption by 15MB during the test.)

So I think reducing max_stack_depth for mereswine, say to 1000kB, should prevent such failures in the future.

[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mereswine&dt=2024-06-26%2002%3A10%3A45
[2] https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=mereswine&dt=2024-06-26%2002%3A10%3A45&stg=check&raw=1
[3] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mereswine&dt=2024-06-26%2016%3A48%3A07
[4] https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=mereswine&dt=2024-06-26%2016%3A48%3A07&stg=check&raw=1
[5] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=alimoche&dt=2024-06-27%2021%3A55%3A06
[6] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=lapwing&dt=2024-06-28%2004%3A12%3A16
[7] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=alligator&dt=2024-06-28%2005%3A23%3A19
[8] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=ayu&dt=2024-03-29%2013%3A08%3A06
[9] https://www.postgresql.org/message-id/95461160-1214-4ac4-d65b-086182797b1d%40gmail.com

Best regards,
Alexander