Re: Why is infinite_recurse test suddenly failing? - Mailing list pgsql-hackers
From | Alexander Lakhin
---|---
Subject | Re: Why is infinite_recurse test suddenly failing?
Date |
Msg-id | 3dd3b9bd-62c8-1846-d9e1-a6ff18740aff@gmail.com
In response to | Re: Why is infinite_recurse test suddenly failing? (Thomas Munro <thomas.munro@gmail.com>)
List | pgsql-hackers
Hello hackers,

15.08.2019 10:17, Thomas Munro wrote:
> On Thu, Aug 15, 2019 at 5:49 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> So that leads to the thought that "the infinite_recurse test is fine
>> if it runs by itself, but it tends to fall over if there are
>> concurrently-running backends". I have absolutely no idea how that
>> would happen on anything that passes for a platform built in this
>> century. Still, it's a place to start, which we hadn't before.
> Hmm. mereswine's recent failure on REL_11_STABLE was running the
> serial schedule.
>
> I read about 3 ways to get SEGV from stack-related faults: you can
> exceed RLIMIT_STACK (the total mapping size) and then you'll get SEGV
> (per man pages), you can access a page that is inside the mapping but
> is beyond the stack pointer (with some tolerance, exact details vary
> by arch), and you can fail to allocate a page due to low memory.
>
> The first kind of failure doesn't seem right -- we carefully set
> max_stack_size based on RLIMIT_STACK minus some slop, so that theory
> would require child processes to have different stack limits than the
> postmaster as you said (perhaps OpenStack, Docker, related tooling or
> concurrent activity on the host system is capable of changing it?), or
> a bug in our slop logic. The second kind of failure would imply that
> we have a bug -- we're accessing something below the stack pointer --
> but that doesn't seem right either -- I think various address
> sanitising tools would have told us about that, and it's hard to
> believe there is a bug in the powerpc and arm implementation of the
> stack pointer check (see Linux arch/{powerpc,arm}/mm/fault.c). That
> leaves the third explanation, except then I'd expect to see other
> kinds of problems like OOM etc before you get to that stage, and why
> just here? Confused.
>
>> Also notable is that we now have a couple of hits on ARM, not
>> only ppc64. Don't know what to make of that.
> Yeah, that is indeed interesting.

Excuse me for reviving this ancient thread, but the aforementioned
mereswine animal has failed again recently [1].
002_pg_upgrade_old_node.log contains:
2024-06-26 02:49:06.742 PDT [29121:4] LOG:  server process (PID 30908) was terminated by signal 9: Killed
2024-06-26 02:49:06.742 PDT [29121:5] DETAIL:  Failed process was running: select infinite_recurse();

I believe this time the failure is caused by an OOM condition, and I think
it occurs on the armv7 animal mereswine because 1) armv7 uses the stack
very efficiently (thanks to its 32-bitness and maybe also to the Link
Register) and 2) such old machines are usually tight on memory.

I've analyzed buildfarm logs and found the following for the check stage
of that failed run:
wget [2] -O log
grep 'SQL function "infinite_recurse" statement 1' log | wc -l
5818
(that is, the nesting depth is 5818 levels for a successful run of the test)

For comparison, mereswine on HEAD [3], [4] shows 5691 levels;
alimoche (aarch64) on HEAD [5] — 3535;
lapwing (i686) on HEAD [6] — 5034;
alligator (x86_64) on HEAD [7] — 3965.

So it seems to me that, unlike [9], this failure may be explained by
reaching an OOM condition.

I have an armv7 device with 2GB RAM that doesn't pass `make check`, nor
even `TESTS=infinite_recurse make -s check-tests`, from time to time,
due to:
2024-06-28 12:40:49.947 UTC postmaster[20019] LOG:  server process (PID 20078) was terminated by signal 11: Segmentation fault
2024-06-28 12:40:49.947 UTC postmaster[20019] DETAIL:  Failed process was running: select infinite_recurse();
...
Using host libthread_db library "/lib/arm-linux-gnueabihf/libthread_db.so.1".
Core was generated by `postgres: android regression [local] SELECT '.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  downcase_identifier (ident=0xa006d837 "infinite_recurse", len=16, warn=true,
    truncate=truncate@entry=true) at scansup.c:52
52              result = palloc(len + 1);
(gdb) p $sp
$1 = (void *) 0xbe9b0020
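(For reference, the function under test is the recursive SQL-language
function from the errors regression test; I'm sketching it from memory
here, so the exact text may differ. Each nesting level parses and executes
the function body anew, which presumably is why memory consumption grows
together with the recursion depth:)

-- recursive SQL-language function exercised by the regression tests
-- (sketched from memory; see src/test/regress/sql/errors.sql for the original)
create function infinite_recurse() returns int as
  'select infinite_recurse()' language sql;
-- normally this errors out with "stack depth limit exceeded" once the
-- nesting approaches max_stack_depth
select infinite_recurse();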
It looks more like [9], but I think the OOM effect is OS/kernel dependent.
The test passes reliably with lower max_stack_depth values, though, so I've
analyzed how much memory the backend consumes (the total size and the size
of its largest segment) depending on that value:

1500kB
adfe1000  220260K rw---   [ anon ]
 total   419452K
---
1600kB
ac7e5000  234748K rw---   [ anon ]
 total   434040K
---
1700kB
acf61000  249488K rw---   [ anon ]
 total   448880K
---
default value (2048kB)
aac65000  300528K rw---   [ anon ]
 total   501424K

(Roughly, increasing max_stack_depth by 100kB increases the backend's
memory consumption during the test by 15MB.)

So I think reducing max_stack_depth for mereswine, say to 1000kB, should
prevent such failures in the future.
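On the buildfarm side that would presumably mean adding
"max_stack_depth = 1000kB" via the extra_config setting in the animal's
build-farm.conf. Just as a sketch, the effect of the lower limit can be
sanity-checked locally like this (in a superuser session, since
max_stack_depth is a superuser-only setting):

-- with max_stack_depth lowered, the test should still end with the expected
-- "stack depth limit exceeded" error, only at a shallower nesting depth and
-- with correspondingly less memory consumed
set max_stack_depth = '1000kB';
select infinite_recurse();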
[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mereswine&dt=2024-06-26%2002%3A10%3A45
[2] https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=mereswine&dt=2024-06-26%2002%3A10%3A45&stg=check&raw=1
[3] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mereswine&dt=2024-06-26%2016%3A48%3A07
[4] https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=mereswine&dt=2024-06-26%2016%3A48%3A07&stg=check&raw=1
[5] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=alimoche&dt=2024-06-27%2021%3A55%3A06
[6] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=lapwing&dt=2024-06-28%2004%3A12%3A16
[7] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=alligator&dt=2024-06-28%2005%3A23%3A19
[8] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=ayu&dt=2024-03-29%2013%3A08%3A06
[9] https://www.postgresql.org/message-id/95461160-1214-4ac4-d65b-086182797b1d%40gmail.com

Best regards,
Alexander