Thread: IA64 versus effective stack limit
Sergey was kind enough to lend me use of buildfarm member dugong (IA64, Debian Etch) so I could poke into why its behavior in the recursion-related regression tests was so odd. I had previously tried and failed to reproduce the behavior on a Red Hat IA64 test machine (running RHEL, of course), so I was feeling a bit baffled. Here's what I found out:

1. Debian Etch has the make-resets-the-stack-rlimit bug that I reported about yesterday, whereas the RHEL version I was testing had the fix for that. So that's why I couldn't reproduce max_stack_depth getting set to 100kB.

2. IA64 is a very weird architecture: it has two separate hardware stacks. One is reserved for saving registers, of which IA64 has got a lot, and the other "normal" stack holds everything else. The method we use in check_stack_depth (ie, measure the difference in addresses of local variables) effectively measures the depth of the normal stack. I don't know of any simple way to find out the depth of the register stack. You can get gdb to tell you about both stacks, though.

I found out that with PG HEAD, the recursion distance for the infinite_recurse() regression test is 160 bytes of normal stack and 928 bytes of register stack per fmgr_sql call level. This is with gcc (I got identical numbers on dugong and the RHEL machine). But if you build PG with icc, as the buildfarm critter is doing, that bloats to 3232 bytes of normal stack and 2832 bytes of register stack. For comparison, my x86_64 Fedora 13 box uses 704 bytes of stack per recursion level.

I don't know why icc is so much worse than gcc on this measure of stack depth consumption, but clearly the combination of that and the 100kB max_stack_depth explains why dugong is failing to do very many levels of recursion before erroring out. Fixing get_stack_depth_rlimit as I proposed yesterday should give it a reasonable stack depth.

However, we're not out of the woods yet. Because check_stack_depth is only checking the normal stack depth, and the two stacks don't grow at the same rate, it's possible for a crash to occur due to running out of register stack space. We haven't seen that happen on dugong because, as shown above, with icc the register stack grows more slowly than the normal stack (at least for the specific functions we care about here). But with gcc, the same code eats register stack a lot faster than normal stack --- and in fact I observed a crash in the infinite_recurse() test when building with gcc and testing in a manually-started postmaster.

The manually-started postmaster was under "ulimit -s" of 8MB, which apparently Debian interprets as "8MB for normal stack and another 8MB for register stack". Even though check_stack_depth was trying to constrain the normal stack to just 2MB, the register stack grew 5.8 times faster and so blew through 8MB before check_stack_depth thought there was a problem. Raising "ulimit -s" allowed it to work. (Curiously, I did *not* see the same type of crash on the RHEL machine. I surmise that Red Hat has tweaked the kernel to allow the register stack to grow more than the normal stack, but I haven't tried to verify that.)

So this means we have a problem. To some extent it's new in HEAD: before the changes I made last week to not keep a local FunctionCallInfoData in ExecMakeFunctionResult, there would have been at least another 900 bytes of normal stack per recursion level, so even with gcc the register stack would grow slower than normal stack in this test, and you wouldn't have seen any crash in the regression tests.
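[To make the address-difference method concrete, here is a minimal sketch of that technique. It is illustrative only, not PostgreSQL's actual source: the names, the limit value, and the error handling are stand-ins, and on IA64 it would of course measure only the normal stack, which is precisely the gap discussed here.]

    #include <stdio.h>
    #include <stdlib.h>

    static char *stack_base_ptr;        /* reference point, set at startup */
    static long max_stack_depth = 2 * 1024 * 1024;  /* bytes; illustrative */

    /* Call once near the top of main() to record a stack reference point. */
    void
    set_stack_base(void)
    {
        char    stack_base_loc;

        stack_base_ptr = &stack_base_loc;
    }

    /*
     * Compare the address of a fresh local variable against the saved
     * base; the absolute difference approximates normal-stack depth no
     * matter which direction the stack grows.
     */
    void
    check_stack_depth(void)
    {
        char    stack_top_loc;
        long    depth = stack_base_ptr - &stack_top_loc;

        if (depth < 0)
            depth = -depth;
        if (depth > max_stack_depth)
        {
            fprintf(stderr, "stack depth limit exceeded\n");
            exit(1);
        }
    }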
But I'm sure there are lots of other potentially recursive routines in PG where register stack could grow faster than normal stack, so we shouldn't suppose that this fmgr_sql recursion is the only trouble spot.

As I said above, I don't know of any good way to measure register stack depth directly. It's probably possible to find out by asking the kernel or something like that, but we surely do not want to introduce a kernel call into check_stack_depth(). So a good solution for this is hard to see. The best idea I have at the moment is to reduce the reported stack limit by some arbitrary factor, ie do something like

    #ifdef __IA64__
        val /= 8;
    #endif

in get_stack_depth_rlimit(). Anyone have a better idea?

BTW, this also suggests to me that it'd be a real good idea to have a buildfarm critter for IA64+gcc --- the differences between gcc and icc are clearly pretty significant on this hardware.

regards, tom lane
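[A hedged sketch of how the derating might sit inside get_stack_depth_rlimit(). The surrounding getrlimit() logic is simplified and paraphrased rather than quoted from the real function; only the #ifdef'd division is the actual proposal, and the factor of 8 is, as stated above, arbitrary.]

    #include <limits.h>
    #include <sys/time.h>
    #include <sys/resource.h>

    /*
     * Return the kernel's stack rlimit in bytes: -1 if it can't be
     * determined, LONG_MAX if unlimited. (Simplified sketch.)
     */
    long
    get_stack_depth_rlimit(void)
    {
        struct rlimit rlim;
        long    val;

        if (getrlimit(RLIMIT_STACK, &rlim) < 0)
            return -1;              /* cannot tell */
        if (rlim.rlim_cur == RLIM_INFINITY)
            return LONG_MAX;        /* treat as unlimited */
        val = (long) rlim.rlim_cur;

    #ifdef __IA64__
        /*
         * check_stack_depth() can't see the register stack, so leave
         * headroom for it by derating the limit (factor is arbitrary).
         */
        val /= 8;
    #endif

        return val;
    }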
On Sat, Nov 6, 2010 at 5:34 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> As I said above, I don't know of any good way to measure register stack
> depth directly. It's probably possible to find out by asking the kernel
> or something like that, but we surely do not want to introduce a kernel
> call into check_stack_depth().

It seems more likely it would be some kind of asm than a trap. This might be wishful thinking, but is it too much to hope that glibc already exposes it through some function?

It looks like the relevant registers are ar.bsp and ar.bspstore. Just taking the difference apparently gives you the amount of memory used in the current backing store. However, some of the comments I'm reading seem to imply that the OS can allocate discontiguous backing store partitions; presumably if the backing store pointer reaches an unmapped address there has to be some way to trap to the OS to allocate more, and maybe then it has a chance to tweak the bsp address?

This was quite interesting (especially the "The Register Stack Engine" section of the second one):
http://msdn.microsoft.com/en-us/magazine/cc301708.aspx
http://msdn.microsoft.com/en-us/magazine/cc301711.aspx

Also I found the following:

(lists some registers)
http://www.cs.clemson.edu/~mark/subroutines/itanium.html

(helper functions in glibc asm includes that calculate bspstore-bsp to count the number of registers used)
http://www.koders.com/c/fidE15CABBBA63E7C24928D7F7C9A95653D101451D2.aspx?s=queue

Also I found http://www.nongnu.org/libunwind/man/libunwind(3).html which I found cool, though not really relevant. The ia64 implementation fiddles with the RSE registers as well, of course.

-- greg
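[If glibc doesn't already expose this, reading the registers directly with gcc inline asm might look roughly like the sketch below. This is unverified and ia64/gcc-specific: the bsp - bspstore difference covers only the dirty registers not yet flushed to the backing store, and the byte count ignores the NaT collection words the RSE interleaves, so any register count derived from it is approximate.]

    /* Unverified sketch: read the RSE application registers and take
     * their difference, in the spirit of the glibc helpers above. */
    static inline unsigned long
    read_ar_bsp(void)
    {
        unsigned long val;

        __asm__ __volatile__(";;\n mov %0=ar.bsp\n" : "=r"(val));
        return val;
    }

    static inline unsigned long
    read_ar_bspstore(void)
    {
        unsigned long val;

        __asm__ __volatile__(";;\n mov %0=ar.bspstore\n" : "=r"(val));
        return val;
    }

    /* Bytes of backing store between the flush point and the current
     * frame's base; divide by 8 for an approximate register count. */
    static inline unsigned long
    dirty_backing_store_bytes(void)
    {
        return read_ar_bsp() - read_ar_bspstore();
    }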
Greg Stark <gsstark@mit.edu> writes:
> On Sat, Nov 6, 2010 at 5:34 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> As I said above, I don't know of any good way to measure register stack
>> depth directly. It's probably possible to find out by asking the kernel
>> or something like that, but we surely do not want to introduce a kernel
>> call into check_stack_depth().

> It seems more likely it would be some kind of asm than a trap. This
> might be wishful thinking but is it too much to hope that glibc
> already exposes it through some function?

Yeah, I suppose some asm might be a possible solution, but I was a bit discouraged after reading some Intel documentation that said that the register-stack top wasn't exposed in the architectural model. You apparently can only find out what's been spilled to memory. (But perhaps that's close enough, for the purposes here?)

regards, tom lane
Greg Stark <gsstark@mit.edu> writes:
> It seems more likely it would be some kind of asm than a trap.

I seem to be getting plausible results from this bit of crockery:

    #include <asm/ia64regs.h>

    static __inline__ void *
    get_bsp(void)
    {
        void   *ret;

    #ifndef __INTEL_COMPILER
        /* the ";;" is an instruction-group stop, apparently needed
         * before reading ar.bsp */
        __asm__ __volatile__(
            ";;\n mov %0=ar.bsp\n"
            : "=r"(ret));
    #else
        ret = (void *) __getReg(_IA64_REG_AR_BSP);
    #endif
        return ret;
    }

I'll clean this up and commit, assuming it actually fixes the problem.

regards, tom lane
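[Assuming get_bsp() behaves as hoped, the general shape of the fix would be to capture a register-stack base pointer at process start, alongside the existing normal-stack base, and bound its growth in the depth check. The sketch below shows that shape only; the names, the limit, and the error path are illustrative, not necessarily what was committed, and get_bsp() is assumed made non-static for linkage here.]

    #include <stdio.h>
    #include <stdlib.h>

    extern void *get_bsp(void);     /* the snippet above, made non-static */

    static char *register_stack_base_ptr;
    static long max_stack_depth = 2 * 1024 * 1024;  /* bytes; illustrative */

    /* Record the register-stack reference point at process start. */
    void
    set_register_stack_base(void)
    {
        register_stack_base_ptr = (char *) get_bsp();
    }

    /* The ia64 register stack grows upward, so base <= current bsp. */
    void
    check_register_stack_depth(void)
    {
        long    depth = (char *) get_bsp() - register_stack_base_ptr;

        if (depth > max_stack_depth)
        {
            fprintf(stderr, "register stack depth limit exceeded\n");
            exit(1);
        }
    }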
I wrote:
> I don't know why icc is so much worse than gcc on this measure of
> stack depth consumption, but clearly the combination of that and
> the 100kB max_stack_depth explains why dugong is failing to do
> very many levels of recursion before erroring out.

I figured out why icc looked so much worse here: I had accidentally built with optimization disabled. Selecting -O2 causes its numbers to come a lot closer to gcc's. In particular, it flips around from using more normal stack than register stack to using more register stack than normal. (This might be the case for gcc as well; I did not test an unoptimized gcc build.)

This means that, at least for icc, *an optimized build is unsafe* without code to check for register stack growth. It turns out that buildfarm member dugong has been building without optimization all along, which is why we'd not noticed the issue.

I think it'd be a good idea for dugong to turn on optimization so it's testing something closer to a production build. However, at this moment only HEAD is likely to pass regression tests with that turned on. We'd have to back-patch the just-committed code for checking register stack growth before the back branches would survive that.

I'm normally hesitant to back-patch code that might create portability issues, but in this case perhaps it's a good idea. Comments?

regards, tom lane
On Sat, Nov 6, 2010 at 8:01 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I wrote:
>> I don't know why icc is so much worse than gcc on this measure of
>> stack depth consumption, but clearly the combination of that and
>> the 100kB max_stack_depth explains why dugong is failing to do
>> very many levels of recursion before erroring out.
>
> I figured out why icc looked so much worse here: I had accidentally
> built with optimization disabled. Selecting -O2 causes its numbers
> to come a lot closer to gcc's. In particular, it flips around from
> using more normal stack than register stack to using more register
> stack than normal. (This might be the case for gcc as well; I did
> not test an unoptimized gcc build.)
>
> This means that, at least for icc, *an optimized build is unsafe*
> without code to check for register stack growth. It turns out that
> buildfarm member dugong has been building without optimization all
> along, which is why we'd not noticed the issue.
>
> I think it'd be a good idea for dugong to turn on optimization
> so it's testing something closer to a production build. However,
> at this moment only HEAD is likely to pass regression tests with
> that turned on. We'd have to back-patch the just-committed code
> for checking register stack growth before the back branches would
> survive that.
>
> I'm normally hesitant to back-patch code that might create portability
> issues, but in this case perhaps it's a good idea. Comments?

Yeah, I think it might be a good idea. Crashing is bad.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company