Thread: IA64 versus effective stack limit
Sergey was kind enough to lend me use of buildfarm member dugong (IA64, Debian Etch) so I could poke into why its behavior in the recursion-related regression tests was so odd. I had previously tried and failed to reproduce the behavior on a Red Hat IA64 test machine (running RHEL, of course), so I was feeling a bit baffled. Here's what I found out:

1. Debian Etch has the make-resets-the-stack-rlimit bug that I reported about yesterday, whereas the RHEL version I was testing had the fix for that. So that's why I couldn't reproduce max_stack_depth getting set to 100kB.

2. IA64 is a very weird architecture: it has two separate hardware stacks. One is reserved for saving registers, of which IA64 has got a lot, and the other "normal" stack holds everything else. The method we use in check_stack_depth (ie, measure the difference in addresses of local variables) effectively measures the depth of the normal stack. I don't know of any simple way to find out the depth of the register stack. You can get gdb to tell you about both stacks, though.

I found out that with PG HEAD, the recursion distance for the infinite_recurse() regression test is 160 bytes of normal stack and 928 bytes of register stack per fmgr_sql call level. This is with gcc (I got identical numbers on dugong and the RHEL machine). But if you build PG with icc, as the buildfarm critter is doing, that bloats to 3232 bytes of normal stack and 2832 bytes of register stack. For comparison, my x86_64 Fedora 13 box uses 704 bytes of stack per recursion level.

I don't know why icc is so much worse than gcc on this measure of stack depth consumption, but clearly the combination of that and the 100kB max_stack_depth explains why dugong is failing to do very many levels of recursion before erroring out. Fixing get_stack_depth_rlimit as I proposed yesterday should give it a reasonable stack depth.

However, we're not out of the woods yet. Because check_stack_depth is only checking the normal stack depth, and the two stacks don't grow at the same rate, it's possible for a crash to occur due to running out of register stack space. We haven't seen that happen on dugong because, as shown above, with icc the register stack grows more slowly than the normal stack (at least for the specific functions we care about here). But with gcc, the same code eats register stack a lot faster than normal stack --- and in fact I observed a crash in the infinite_recurse() test when building with gcc and testing in a manually-started postmaster.

The manually-started postmaster was under "ulimit -s" of 8MB, which apparently Debian interprets as "8MB for normal stack and another 8MB for register stack". Even though check_stack_depth was trying to constrain the normal stack to just 2MB, the register stack grew 5.8 times faster and so blew through 8MB before check_stack_depth thought there was a problem. Raising "ulimit -s" allowed it to work. (Curiously, I did *not* see the same type of crash on the RHEL machine. I surmise that Red Hat has tweaked the kernel to allow the register stack to grow more than the normal stack, but I haven't tried to verify that.)

So this means we have a problem. To some extent it's new in HEAD: before the changes I made last week to not keep a local FunctionCallInfoData in ExecMakeFunctionResult, there would have been at least another 900 bytes of normal stack per recursion level, so even with gcc the register stack would grow slower than normal stack in this test, and you wouldn't have seen any crash in the regression tests.
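[To make the address-difference method concrete, here is a minimal sketch of that technique. It is illustrative only, not PostgreSQL's actual source: the names, the limit value, and the error handling are stand-ins, and on IA64 it would of course measure only the normal stack, which is precisely the gap discussed here.]

    #include <stdio.h>
    #include <stdlib.h>

    static char *stack_base_ptr;        /* reference point, set at startup */
    static long max_stack_depth = 2 * 1024 * 1024;  /* bytes; illustrative */

    /* Call once near the top of main() to record a stack reference point. */
    void
    set_stack_base(void)
    {
        char    stack_base_loc;

        stack_base_ptr = &stack_base_loc;
    }

    /*
     * Compare the address of a fresh local variable against the saved
     * base; the absolute difference approximates normal-stack depth no
     * matter which direction the stack grows.
     */
    void
    check_stack_depth(void)
    {
        char    stack_top_loc;
        long    depth = stack_base_ptr - &stack_top_loc;

        if (depth < 0)
            depth = -depth;
        if (depth > max_stack_depth)
        {
            fprintf(stderr, "stack depth limit exceeded\n");
            exit(1);
        }
    }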
But I'm sure there are lots of other potentially recursive routines in PG where register stack could grow faster than normal stack, so we shouldn't suppose that this fmgr_sql recursion is the only trouble spot.

As I said above, I don't know of any good way to measure register stack depth directly. It's probably possible to find out by asking the kernel or something like that, but we surely do not want to introduce a kernel call into check_stack_depth(). So a good solution for this is hard to see. The best idea I have at the moment is to reduce the reported stack limit by some arbitrary factor, ie do something like

    #ifdef __IA64__
        val /= 8;
    #endif

in get_stack_depth_rlimit(). Anyone have a better idea?

BTW, this also suggests to me that it'd be a real good idea to have a buildfarm critter for IA64+gcc --- the differences between gcc and icc are clearly pretty significant on this hardware.

regards, tom lane
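[A hedged sketch of how the derating might sit inside get_stack_depth_rlimit(). The surrounding getrlimit() logic is simplified and paraphrased rather than quoted from the real function; only the #ifdef'd division is the actual proposal, and the factor of 8 is, as stated above, arbitrary.]

    #include <limits.h>
    #include <sys/time.h>
    #include <sys/resource.h>

    /*
     * Return the kernel's stack rlimit in bytes: -1 if it can't be
     * determined, LONG_MAX if unlimited. (Simplified sketch.)
     */
    long
    get_stack_depth_rlimit(void)
    {
        struct rlimit rlim;
        long    val;

        if (getrlimit(RLIMIT_STACK, &rlim) < 0)
            return -1;              /* cannot tell */
        if (rlim.rlim_cur == RLIM_INFINITY)
            return LONG_MAX;        /* treat as unlimited */
        val = (long) rlim.rlim_cur;

    #ifdef __IA64__
        /*
         * check_stack_depth() can't see the register stack, so leave
         * headroom for it by derating the limit (factor is arbitrary).
         */
        val /= 8;
    #endif

        return val;
    }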
On Sat, Nov 6, 2010 at 5:34 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> As I said above, I don't know of any good way to measure register stack
> depth directly. It's probably possible to find out by asking the kernel
> or something like that, but we surely do not want to introduce a kernel
> call into check_stack_depth().

It seems more likely it would be some kind of asm than a trap. This might be wishful thinking, but is it too much to hope that glibc already exposes it through some function?

It looks like the relevant registers are ar.bsp and ar.bspstore. Just taking the difference apparently gives you the amount of memory used in the current backing store. However, some of the comments I'm reading seem to imply that the OS can allocate discontiguous backing store partitions; presumably if the backing store pointer reaches an unmapped address there has to be some way to trap to the OS to allocate more, and maybe then it has a chance to tweak the bsp address?

This was quite interesting (especially the "The Register Stack Engine" section of the second one):
http://msdn.microsoft.com/en-us/magazine/cc301708.aspx
http://msdn.microsoft.com/en-us/magazine/cc301711.aspx

Also I found the following:

(lists some registers)
http://www.cs.clemson.edu/~mark/subroutines/itanium.html

(helper functions in glibc asm includes that calculate bspstore-bsp to count the number of registers used)
http://www.koders.com/c/fidE15CABBBA63E7C24928D7F7C9A95653D101451D2.aspx?s=queue

Also I found http://www.nongnu.org/libunwind/man/libunwind(3).html which I found cool, though not really relevant. The ia64 implementation fiddles with the RSE registers as well, of course.

-- greg
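[If glibc doesn't already expose this, reading the registers directly with gcc inline asm might look roughly like the sketch below. This is unverified and ia64/gcc-specific: the bsp - bspstore difference covers only the dirty registers not yet flushed to the backing store, and the byte count ignores the NaT collection words the RSE interleaves, so any register count derived from it is approximate.]

    /* Unverified sketch: read the RSE application registers and take
     * their difference, in the spirit of the glibc helpers above. */
    static inline unsigned long
    read_ar_bsp(void)
    {
        unsigned long val;

        __asm__ __volatile__(";;\n mov %0=ar.bsp\n" : "=r"(val));
        return val;
    }

    static inline unsigned long
    read_ar_bspstore(void)
    {
        unsigned long val;

        __asm__ __volatile__(";;\n mov %0=ar.bspstore\n" : "=r"(val));
        return val;
    }

    /* Bytes of backing store between the flush point and the current
     * frame's base; divide by 8 for an approximate register count. */
    static inline unsigned long
    dirty_backing_store_bytes(void)
    {
        return read_ar_bsp() - read_ar_bspstore();
    }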
Greg Stark <gsstark@mit.edu> writes:
> On Sat, Nov 6, 2010 at 5:34 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> As I said above, I don't know of any good way to measure register stack
>> depth directly. It's probably possible to find out by asking the kernel
>> or something like that, but we surely do not want to introduce a kernel
>> call into check_stack_depth().

> It seems more likely it would be some kind of asm than a trap. This
> might be wishful thinking but is it too much to hope that glibc
> already exposes it through some function?

Yeah, I suppose some asm might be a possible solution, but I was a bit discouraged after reading some Intel documentation that said that the register-stack top wasn't exposed in the architectural model. You apparently can only find out what's been spilled to memory. (But perhaps that's close enough, for the purposes here?)

regards, tom lane
Greg Stark <gsstark@mit.edu> writes:
> It seems more likely it would be some kind of asm than a trap.

I seem to be getting plausible results from this bit of crockery:

    #include <asm/ia64regs.h>

    static __inline__ void *
    get_bsp(void)
    {
        void   *ret;

    #ifndef __INTEL_COMPILER
        /* the ";;" is an instruction-group stop, apparently needed
         * before reading ar.bsp */
        __asm__ __volatile__(
            ";;\n mov %0=ar.bsp\n"
            : "=r"(ret));
    #else
        ret = (void *) __getReg(_IA64_REG_AR_BSP);
    #endif
        return ret;
    }

I'll clean this up and commit, assuming it actually fixes the problem.

regards, tom lane
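[Assuming get_bsp() behaves as hoped, the general shape of the fix would be to capture a register-stack base pointer at process start, alongside the existing normal-stack base, and bound its growth in the depth check. The sketch below shows that shape only; the names, the limit, and the error path are illustrative, not necessarily what was committed, and get_bsp() is assumed made non-static for linkage here.]

    #include <stdio.h>
    #include <stdlib.h>

    extern void *get_bsp(void);     /* the snippet above, made non-static */

    static char *register_stack_base_ptr;
    static long max_stack_depth = 2 * 1024 * 1024;  /* bytes; illustrative */

    /* Record the register-stack reference point at process start. */
    void
    set_register_stack_base(void)
    {
        register_stack_base_ptr = (char *) get_bsp();
    }

    /* The ia64 register stack grows upward, so base <= current bsp. */
    void
    check_register_stack_depth(void)
    {
        long    depth = (char *) get_bsp() - register_stack_base_ptr;

        if (depth > max_stack_depth)
        {
            fprintf(stderr, "register stack depth limit exceeded\n");
            exit(1);
        }
    }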
I wrote:
> I don't know why icc is so much worse than gcc on this measure of
> stack depth consumption, but clearly the combination of that and
> the 100kB max_stack_depth explains why dugong is failing to do
> very many levels of recursion before erroring out.

I figured out why icc looked so much worse here: I had accidentally built with optimization disabled. Selecting -O2 causes its numbers to come a lot closer to gcc's. In particular, it flips around from using more normal stack than register stack to using more register stack than normal. (This might be the case for gcc as well; I did not test an unoptimized gcc build.)

This means that, at least for icc, *an optimized build is unsafe* without code to check for register stack growth. It turns out that buildfarm member dugong has been building without optimization all along, which is why we'd not noticed the issue.

I think it'd be a good idea for dugong to turn on optimization so it's testing something closer to a production build. However, at this moment only HEAD is likely to pass regression tests with that turned on. We'd have to back-patch the just-committed code for checking register stack growth before the back branches would survive that.

I'm normally hesitant to back-patch code that might create portability issues, but in this case perhaps it's a good idea. Comments?

regards, tom lane
On Sat, Nov 6, 2010 at 8:01 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I wrote:
>> I don't know why icc is so much worse than gcc on this measure of
>> stack depth consumption, but clearly the combination of that and
>> the 100kB max_stack_depth explains why dugong is failing to do
>> very many levels of recursion before erroring out.
>
> I figured out why icc looked so much worse here: I had accidentally
> built with optimization disabled. Selecting -O2 causes its numbers
> to come a lot closer to gcc's. In particular, it flips around from
> using more normal stack than register stack to using more register
> stack than normal. (This might be the case for gcc as well; I did
> not test an unoptimized gcc build.)
>
> This means that, at least for icc, *an optimized build is unsafe*
> without code to check for register stack growth. It turns out that
> buildfarm member dugong has been building without optimization all
> along, which is why we'd not noticed the issue.
>
> I think it'd be a good idea for dugong to turn on optimization
> so it's testing something closer to a production build. However,
> at this moment only HEAD is likely to pass regression tests with
> that turned on. We'd have to back-patch the just-committed code
> for checking register stack growth before the back branches would
> survive that.
>
> I'm normally hesitant to back-patch code that might create portability
> issues, but in this case perhaps it's a good idea. Comments?

Yeah, I think it might be a good idea. Crashing is bad.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company