Thread: unexpected and reproducible crash in pl/pgsql function

unexpected and reproducible crash in pl/pgsql function

From
"Merlin Moncure"
Date:
Ok, I have a fairly nasty situation.  I am having a production server
that is crashing upon execution of a pl/pgsql function...on code that
has been working flawlessly for weeks.  My production server is running
8.0 on win32 and I was able to 'sort-of' reproduce the behavior on my
development machine here at the office.

What happens:
On the production machine, upon execution of the function (with a very
specific parameter value, all others work ok), all server backends
freeze and completely stop working.  Any attempt to connect to the
server hangs psql in limbo.  In addition, the service fails to shut
down, and the only way to get working again is to kill postmaster.exe
and all instances of postgres.exe.  However, after that everything runs
o.k. until I try to run the function again.  There is nothing useful in
the log.

Following this, I did a dump of the production database and loaded it
into my office machine.  Here, I try to execute the function and I get:

esp=#   select generate_oe_bom(18208);
server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
The connection to the server was lost. Attempting reset: Succeeded.

And server seems to recover from this.  Looking at the event log, I see:
NOTICE:  hello
LOG:  server process (PID 5720) exited with unexpected status 128
LOG:  terminating any other active server processes
WARNING:  terminating connection because of crash of another server process
DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
LOG:  all server processes terminated; reinitializing
LOG:  database system was interrupted at 2005-03-03 14:16:15 Eastern Standard Time
LOG:  checkpoint record is at 6/D7546E28
LOG:  redo record is at 6/D7546E28; undo record is at 0/0; shutdown TRUE
LOG:  next transaction ID: 11208897; next OID: 62532404
LOG:  database system was not properly shut down; automatic recovery in progress
LOG:  redo starts at 6/D7546E68
LOG:  unexpected pageaddr 6/CF5BA000 in log file 6, segment 215, offset 6004736
LOG:  redo done at 6/D75B7F90
LOG:  database system is ready

This will repeat if the function is run again.
Only the exact parameter (order# 18208) will cause the crash. Another
order, 18150, runs through ok.  I would expect data corruption to be
the cause of the problem except I was able to reproduce the problem on a
different server following a dump/restore.  Unfortunately, this is
sensitive data.

Attached is the pl/pgsql code.  There is a raise notice 'hello'.  This
gets raised exactly once before the crash.

Merlin

Re: unexpected and reproducible crash in pl/pgsql function

From
"Merlin Moncure"
Date:
I wrote:
> Ok, I have a fairly nasty situation.  I am having a production server
> that is crashing upon execution of a pl/pgsql function...on code that
> has been working flawlessly for weeks.  My production server is running
> 8.0 on win32 and I was able to 'sort-of' reproduce the behavior on my
> development machine here at the office.

Ok, problem was due to recursive pl/pgsql function and a recursion loop
in the data.  I traced this problem to the data: somebody disabled the
recursion check constraint.

I've never had this actually happen before.  It totally nuked the
server.
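
For illustration only (this is not the attached function): a minimal sketch of how a recursive pl/pgsql function can protect itself against a loop in the data, so that it no longer depends on the check constraint being enabled.  The table bom(parent_id, child_id) and the function walk_bom are hypothetical names made up for this example.

```sql
-- Hypothetical schema, assumed for this sketch only.
CREATE TABLE bom (
    parent_id integer NOT NULL,
    child_id  integer NOT NULL
);

-- Walks the tree below p_item and returns the number of items visited
-- (an item reachable by several paths is counted once per path).
-- p_path carries the ancestors of p_item on the current path; if p_item
-- already appears there, the data contains a loop and we raise an error
-- instead of recursing without end.
CREATE FUNCTION walk_bom(p_item integer, p_path integer[]) RETURNS integer AS $$
DECLARE
    r       record;
    n_items integer := 1;   -- counts p_item itself
BEGIN
    IF p_item = ANY (p_path) THEN
        RAISE EXCEPTION 'cycle detected in bom at item %', p_item;
    END IF;

    FOR r IN SELECT child_id FROM bom WHERE parent_id = p_item LOOP
        n_items := n_items + walk_bom(r.child_id, p_path || p_item);
    END LOOP;

    RETURN n_items;
END;
$$ LANGUAGE plpgsql;

-- Start the walk with an empty path.
SELECT walk_bom(18208, '{}'::integer[]);
```

With a guard like this, a loop in the data should surface as an ordinary error that rolls back the statement, rather than as unbounded recursion.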

Merlin

Re: unexpected and reproducible crash in pl/pgsql function

From
Tom Lane
Date:
"Merlin Moncure" <merlin.moncure@rcsonline.com> writes:
> Ok, problem was due to recursive pl/pgsql function and a recursion loop
> in the data.  I traced this problem to the data: somebody disabled the
> recursion check constraint.
> I've never had this actually happen before.  It totally nuked the
> server.

I thought we'd fixed things so that the stack depth on Windows is
actually greater than max_stack_depth?  None of this weirdness could
happen if the stack depth check were kicking in properly.
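
One way to check this, as a sketch: the function name recurse_forever below is hypothetical, and this should only be tried on a scratch server, since a build where the check does not work may crash the backend.

```sql
-- Show the configured limit for this session.
SHOW max_stack_depth;

-- Hypothetical test function: recurses without any termination condition.
CREATE FUNCTION recurse_forever(n integer) RETURNS integer AS $$
BEGIN
    RETURN recurse_forever(n + 1);
END;
$$ LANGUAGE plpgsql;

-- If the stack depth check is kicking in properly, this call should fail
-- with "ERROR:  stack depth limit exceeded" and the session should stay
-- usable.  A backend crash here means the real stack is smaller than
-- max_stack_depth assumes.
SELECT recurse_forever(1);
```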

            regards, tom lane

Re: unexpected and reproducible crash in pl/pgsql function

From
"Merlin Moncure"
Date:
> "Merlin Moncure" <merlin.moncure@rcsonline.com> writes:
> > Ok, problem was due to recursive pl/pgsql function and a recursion loop
> > in the data.  I traced this problem to the data: somebody disabled the
> > recursion check constraint.
> > I've never had this actually happen before.  It totally nuked the
> > server.
>
> I thought we'd fixed things so that the stack depth on Windows is
> actually greater than max_stack_depth?  None of this weirdness could
> happen if the stack depth check were kicking in properly.

I thought so too.  I'll play with it a bit and see what I come up with.

Merlin