Re: Why is infinite_recurse test suddenly failing? - Mailing list pgsql-hackers

From Alexander Lakhin
Subject Re: Why is infinite_recurse test suddenly failing?
Date
Msg-id 3dd3b9bd-62c8-1846-d9e1-a6ff18740aff@gmail.com
In response to Re: Why is infinite_recurse test suddenly failing?  (Thomas Munro <thomas.munro@gmail.com>)
List pgsql-hackers
Hello hackers,

15.08.2019 10:17, Thomas Munro wrote:
> On Thu, Aug 15, 2019 at 5:49 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> So that leads to the thought that "the infinite_recurse test is fine
>> if it runs by itself, but it tends to fall over if there are
>> concurrently-running backends".  I have absolutely no idea how that
>> would happen on anything that passes for a platform built in this
>> century.  Still, it's a place to start, which we hadn't before.
> Hmm.  mereswine's recent failure on REL_11_STABLE was running the
> serial schedule.
>
> I read about 3 ways to get SEGV from stack-related faults: you can
> exceed RLIMIT_STACK (the total mapping size) and then you'll get SEGV
> (per man pages), you can access a page that is inside the mapping but
> is beyond the stack pointer (with some tolerance, exact details vary
> by arch), and you can fail to allocate a page due to low memory.
>
> The first kind of failure doesn't seem right -- we carefully set
> max_stack_size based on RLIMIT_STACK minus some slop, so that theory
> would require child processes to have different stack limits than the
> postmaster as you said (perhaps OpenStack, Docker, related tooling or
> concurrent activity on the host system is capable of changing it?), or
> a bug in our slop logic.  The second kind of failure would imply that
> we have a bug -- we're accessing something below the stack pointer --
> but that doesn't seem right either -- I think various address
> sanitising tools would have told us about that, and it's hard to
> believe there is a bug in the powerpc and arm implementation of the
> stack pointer check (see Linux arch/{powerpc,arm}/mm/fault.c).  That
> leaves the third explanation, except then I'd expect to see other
> kinds of problems like OOM etc before you get to that stage, and why
> just here?  Confused.
>
>> Also notable is that we now have a couple of hits on ARM, not
>> only ppc64.  Don't know what to make of that.
> Yeah, that is indeed interesting.

Excuse me for reviving this ancient thread, but the aforementioned mereswine
animal has failed again recently [1]:
002_pg_upgrade_old_node.log contains:
2024-06-26 02:49:06.742 PDT [29121:4] LOG:  server process (PID 30908) was terminated by signal 9: Killed
2024-06-26 02:49:06.742 PDT [29121:5] DETAIL:  Failed process was running: select infinite_recurse();

I believe this time it's caused by an OOM condition (signal 9 is what the
kernel's OOM killer sends), and I think the issue occurs on armv7 mereswine
because 1) armv7 uses the stack very efficiently (thanks to 32-bitness and
maybe also the Link Register) and 2) such old machines are usually tight on
memory.

I've analyzed the buildfarm logs; for the check stage of that failed run:
wget [2] -O log
grep 'SQL function "infinite_recurse" statement 1' log | wc -l
5818
(that is, the nesting depth is 5818 levels for a successful run of the test)
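
(For anyone who wants to repeat the count without grep, here is a minimal
Python sketch of the same idea. The function name and the sample log text are
made up for illustration; only the marker string comes from the command above.)

```python
# Marker taken from the grep command above; everything else is illustrative.
MARKER = 'SQL function "infinite_recurse" statement 1'


def count_recursion_levels(log_text: str, marker: str = MARKER) -> int:
    """Count the lines that contain the marker, like `grep ... | wc -l`."""
    return sum(1 for line in log_text.splitlines() if marker in line)


# Hypothetical three-line excerpt, NOT copied from a real buildfarm log.
sample_log = "\n".join([
    "ERROR:  stack depth limit exceeded",
    'CONTEXT:  SQL function "infinite_recurse" statement 1',
    'SQL function "infinite_recurse" statement 1',
])

levels = count_recursion_levels(sample_log)
print(levels)
```

To run it against a downloaded stage log, read the file into a string and pass
it to count_recursion_levels().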

For comparison, mereswine on HEAD [3], [4] shows 5691 levels;
alimoche (aarch64) on HEAD [5]: 3535;
lapwing (i686) on HEAD [6]: 5034;
alligator (x86_64) on HEAD [7]: 3965.

So it seems to me that unlike [9] this failure may be explained by reaching
an OOM condition.

I have an armv7 device with 2GB RAM that from time to time fails `make check`,
and even `TESTS=infinite_recurse make -s check-tests`, due to:
2024-06-28 12:40:49.947 UTC postmaster[20019] LOG:  server process (PID 20078) was terminated by signal 11: Segmentation fault
2024-06-28 12:40:49.947 UTC postmaster[20019] DETAIL:  Failed process was running: select infinite_recurse();
...
Using host libthread_db library "/lib/arm-linux-gnueabihf/libthread_db.so.1".
Core was generated by `postgres: android regression [local] SELECT                                   '.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  downcase_identifier (ident=0xa006d837 "infinite_recurse", len=16, warn=true, truncate=truncate@entry=true)
     at scansup.c:52
52              result = palloc(len + 1);
(gdb) p $sp
$1 = (void *) 0xbe9b0020

It looks more like [9], but I think the OOM effect is OS/kernel dependent.

The test passes reliably with lower max_stack_depth values, though, so I've
analyzed how much memory the backend consumes (total size and the size of
its largest segment) depending on the value:
1500kB
adfe1000 220260K rw---   [ anon ]
  total   419452K
---
1600kB
ac7e5000 234748K rw---   [ anon ]
  total   434040K
---
1700kB
acf61000 249488K rw---   [ anon ]
  total   448880K
---
default value (2048kB)
aac65000 300528K rw---   [ anon ]
  total   501424K

(roughly, increasing max_stack_depth by 100kB increases the backend memory
consumption by 15MB during the test)
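
(The arithmetic behind that estimate can be sketched in Python as below. The
totals are the figures listed above, with the 1600kB total read as 434040K;
the function names are made up, and extrapolating linearly below 1500kB is
only an assumption, since nothing was measured there.)

```python
# (max_stack_depth in kB, backend total in K) from the measurements above.
# The 1600kB total is read as 434040K, assuming the unit was just dropped.
measurements = [(1500, 419452), (1600, 434040), (1700, 448880), (2048, 501424)]


def growth_per_100kb(points):
    """Average growth of the total (in K) per 100kB of max_stack_depth,
    computed between the lowest and the highest measured setting."""
    (d0, t0), (d1, t1) = points[0], points[-1]
    return (t1 - t0) / (d1 - d0) * 100


def estimate_total_kb(points, depth_kb):
    """Rough linear extrapolation from the lowest measured point.
    ASSUMPTION: memory use stays linear outside the measured range."""
    d0, t0 = points[0]
    return t0 + (depth_kb - d0) * growth_per_100kb(points) / 100


slope = growth_per_100kb(measurements)
print(f"growth per 100kB: {slope:.0f}K")
print(f"rough estimate at 1000kB: {estimate_total_kb(measurements, 1000):.0f}K")
```

The slope comes out just under 15000K per 100kB, which is where the "15MB"
figure comes from.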

So I think reducing max_stack_depth for mereswine, say to 1000kB, should
prevent such failures in the future.

[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mereswine&dt=2024-06-26%2002%3A10%3A45
[2] https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=mereswine&dt=2024-06-26%2002%3A10%3A45&stg=check&raw=1
[3] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mereswine&dt=2024-06-26%2016%3A48%3A07
[4] https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=mereswine&dt=2024-06-26%2016%3A48%3A07&stg=check&raw=1
[5] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=alimoche&dt=2024-06-27%2021%3A55%3A06
[6] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=lapwing&dt=2024-06-28%2004%3A12%3A16
[7] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=alligator&dt=2024-06-28%2005%3A23%3A19
[8] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=ayu&dt=2024-03-29%2013%3A08%3A06
[9] https://www.postgresql.org/message-id/95461160-1214-4ac4-d65b-086182797b1d%40gmail.com

Best regards,
Alexander


