Thread: backend reset of database

backend reset of database

From
Geoffrey
Date:
We have built into the backend of the database a third party application
called pcmiler that produces mileage between two locations.  This has
been working in the past, but after upgrading to RHEL 4, Postgresql
7.4.16 and pcmiler 20 (from pcmiler 17), the server process is
terminated with a signal 11.  I know that's a lot of variables that
changed, but one required the other, thus we can't easily backout
anything (outside of possibly moving down to 7.4.13 of Postgres).

The log of the server process contains:

LOG:  server process (PID 32441) was terminated by signal 11
LOG:  terminating any other active server processes
WARNING:  terminating connection because of crash of another server process
DETAIL:  The postmaster has commanded this server process to roll back
the current transaction and exit, because another server process exited
abnormally and possibly corrupted shared memory.
HINT:  In a moment you should be able to reconnect to the database and
repeat your command.


A backtrace against this process produces:

Program received signal SIGSEGV, Segmentation fault.
0x0814acc9 in FileAccess (file=168481968) at fd.c:717
717             if (FileIsNotOpen(file))
(gdb) bt
#0  0x0814acc9 in FileAccess (file=168481968) at fd.c:717
#1  0x0814b2e7 in FileRead (file=168481968, buffer=0xbff816ce "", amount=2)
     at fd.c:972
#2  0x00203ecc in ?? ()
#3  0x0a0ad4b0 in ?? ()
#4  0xbff816ce in ?? ()
#5  0x00000002 in ?? ()
#6  0x00000001 in ?? ()
#7  0x0a04fdc0 in ?? ()
#8  0xbff83110 in ?? ()
#9  0x00000001 in ?? ()
#10 0x00000002 in ?? ()
#11 0xbff83110 in ?? ()
#12 0x00000000 in ?? ()

I'm assuming that the portion of the backtrace from frame 2-12 is likely
produced from the pcmiler binaries as we do not have source and they
don't contain debugging symbols.

Any insights would be greatly appreciated.

--
Until later, Geoffrey

Those who would give up essential Liberty, to purchase a little
temporary Safety, deserve neither Liberty nor Safety.
  - Benjamin Franklin

Re: backend reset of database

From
Tom Lane
Date:
Geoffrey <esoteric@3times25.net> writes:
> A backtrace against this process produces:

> Program received signal SIGSEGV, Segmentation fault.
> 0x0814acc9 in FileAccess (file=168481968) at fd.c:717
> 717             if (FileIsNotOpen(file))
> (gdb) bt
> #0  0x0814acc9 in FileAccess (file=168481968) at fd.c:717
> #1  0x0814b2e7 in FileRead (file=168481968, buffer=0xbff816ce "", amount=2)
>      at fd.c:972
> #2  0x00203ecc in ?? ()

> I'm assuming that the portion of the backtrace from frame 2-12 is likely
> produced from the pcmiler binaries as we do not have source and they
> don't contain debugging symbols.

Is pcmiler a Postgres-specific backend extension?  It seems fairly
unlikely that it would be calling FileRead() if not.  Do you have any
other Postgres-specific libraries loaded into the backend?

The immediate issue is that FileRead() is being passed a bogus file
number, which might suggest an unexpected change in the contents of
a struct or something like that, but I see no record of any such changes
in the CVS log between 7.4.13 and 7.4.16.

            regards, tom lane

Re: backend reset of database

From
Geoffrey
Date:
Tom Lane wrote:
> Geoffrey <esoteric@3times25.net> writes:
>> A backtrace against this process produces:
>
>> Program received signal SIGSEGV, Segmentation fault.
>> 0x0814acc9 in FileAccess (file=168481968) at fd.c:717
>> 717             if (FileIsNotOpen(file))
>> (gdb) bt
>> #0  0x0814acc9 in FileAccess (file=168481968) at fd.c:717
>> #1  0x0814b2e7 in FileRead (file=168481968, buffer=0xbff816ce "", amount=2)
>>      at fd.c:972
>> #2  0x00203ecc in ?? ()
>
>> I'm assuming that the portion of the backtrace from frame 2-12 is likely
>> produced from the pcmiler binaries as we do not have source and they
>> don't contain debugging symbols.
>
> Is pcmiler a Postgres-specific backend extension?

No.  It's a standalone application that provides an api to build calls
into there product from other applications.  What we've done is created
functions that are called from the database that use their api.  for
example:

CREATE OR REPLACE FUNCTION pcm_getmiles_s(text,text,integer)
     RETURNS float8 AS '/esc/pgrnd/prog/libpcmiler'
     LANGUAGE 'c' WITH (isStrict);

To produce the above back trace I executed the following from a psql window:

select pcm_getmiles_s('sparta,nc', 'buffalo,ny',0);

> It seems fairly
> unlikely that it would be calling FileRead() if not.

I find that unlikely as well.

> Do you have any
> other Postgres-specific libraries loaded into the backend?

I don't believe so.

> The immediate issue is that FileRead() is being passed a bogus file
> number, which might suggest an unexpected change in the contents of
> a struct or something like that, but I see no record of any such changes
> in the CVS log between 7.4.13 and 7.4.16.

This is what threw me as well.  It definitely looks like something
internal to Postgresql.

--
Until later, Geoffrey

Those who would give up essential Liberty, to purchase a little
temporary Safety, deserve neither Liberty nor Safety.
  - Benjamin Franklin

Re: backend reset of database

From
Tom Lane
Date:
Geoffrey <esoteric@3times25.net> writes:
> Tom Lane wrote:
>> Is pcmiler a Postgres-specific backend extension?

> No.  It's a standalone application that provides an api to build calls
> into there product from other applications.  What we've done is created
> functions that are called from the database that use their api.

Did you recompile/relink these glue functions when moving up to the new
version?  Are they built with debugging enabled?

The trace is *really* suspicious given what you say here.  I'm inclined
to think that what is really happening is that something is jumping to
never-never land (via a clobbered function pointer or overwriting a
return address on the stack) and it just happens to end up in FileRead.
Unfortunately that guess doesn't provide much help for debugging it :-(

            regards, tom lane

Re: backend reset of database

From
Tom Lane
Date:
Geoffrey <esoteric@3times25.net> writes:
> (gdb) bt
> #0  0x0814acc9 in FileAccess (file=168481968) at fd.c:717
> #1  0x0814b2e7 in FileRead (file=168481968, buffer=0xbff816ce "", amount=2)
>      at fd.c:972
> #2  0x00203ecc in ?? ()
> #3  0x0a0ad4b0 in ?? ()
> #4  0xbff816ce in ?? ()
> #5  0x00000002 in ?? ()
> #6  0x00000001 in ?? ()
> #7  0x0a04fdc0 in ?? ()
> #8  0xbff83110 in ?? ()
> #9  0x00000001 in ?? ()
> #10 0x00000002 in ?? ()
> #11 0xbff83110 in ?? ()
> #12 0x00000000 in ?? ()

> I'm assuming that the portion of the backtrace from frame 2-12 is likely
> produced from the pcmiler binaries as we do not have source and they
> don't contain debugging symbols.

Actually, on looking closer, most of those return addresses are
obviously silly: control was certainly never passed from addresses 0,
1, or 2, and I suspect a bit of disassembling with gdb would show
that the others aren't immediately after call instructions either.
So it looks like gdb got confused about which words in the stack are
return addresses --- which lends a bit more credibility to the idea
about a stack clobber, but still offers no direct help for debugging.

What I'd be inclined to look for first is local-variable buffers that
might not be big enough for the values that get written into them.
Perhaps the reason the failure occurred now was something as prosaic
as the new installation having longer path names.  In any case you
need to be taking a very hard look at your glue functions, since those
are the weakest link in the chain as far as the extent of portability
testing they've gotten goes (assuming that pcmiler is a reasonably
well-tested library).

            regards, tom lane

Re: backend reset of database

From
Magnus Hagander
Date:
Geoffrey wrote:
> Tom Lane wrote:
>> Geoffrey <esoteric@3times25.net> writes:
>>> A backtrace against this process produces:
>>
>>> Program received signal SIGSEGV, Segmentation fault.
>>> 0x0814acc9 in FileAccess (file=168481968) at fd.c:717
>>> 717             if (FileIsNotOpen(file))
>>> (gdb) bt
>>> #0  0x0814acc9 in FileAccess (file=168481968) at fd.c:717
>>> #1  0x0814b2e7 in FileRead (file=168481968, buffer=0xbff816ce "",
>>> amount=2)
>>>      at fd.c:972
>>> #2  0x00203ecc in ?? ()
>>
>>> I'm assuming that the portion of the backtrace from frame 2-12 is
>>> likely produced from the pcmiler binaries as we do not have source
>>> and they don't contain debugging symbols.
>>
>> Is pcmiler a Postgres-specific backend extension?
>
> No.  It's a standalone application that provides an api to build calls
> into there product from other applications.  What we've done is created
> functions that are called from the database that use their api.  for
> example:
>
> CREATE OR REPLACE FUNCTION pcm_getmiles_s(text,text,integer)
>     RETURNS float8 AS '/esc/pgrnd/prog/libpcmiler'
>     LANGUAGE 'c' WITH (isStrict);

Is this function actually a PostgreSQL callable function? Or is it just
a general C function along the line of
float pcm_getmiles_s(char*, char*, int)
?

If it's a general C function, you'll need to write a wrapper function
that calls it, and returns the data, using the PostgreSQL macros to
extract the parameters.

//Magnus


Re: backend reset of database

From
Tom Lane
Date:
Magnus Hagander <magnus@hagander.net> writes:
> Geoffrey wrote:
>> CREATE OR REPLACE FUNCTION pcm_getmiles_s(text,text,integer)
>> RETURNS float8 AS '/esc/pgrnd/prog/libpcmiler'
>> LANGUAGE 'c' WITH (isStrict);

> Is this function actually a PostgreSQL callable function? Or is it just
> a general C function along the line of
> float pcm_getmiles_s(char*, char*, int)
> ?

The code would never have worked at all if that were the case (since
text* is not like char*).  I suspect some subtler portability issue in
Geoffrey's glue functions.  It seems unlikely that moving from PG 7.4.13
to 7.4.16 would in itself have exposed such a problem, but the update in
operating environment from RHEL3 to RHEL4 might've.

            regards, tom lane

Re: backend reset of database

From
Magnus Hagander
Date:
Tom Lane wrote:
> Magnus Hagander <magnus@hagander.net> writes:
>> Geoffrey wrote:
>>> CREATE OR REPLACE FUNCTION pcm_getmiles_s(text,text,integer)
>>> RETURNS float8 AS '/esc/pgrnd/prog/libpcmiler'
>>> LANGUAGE 'c' WITH (isStrict);
>
>> Is this function actually a PostgreSQL callable function? Or is it just
>> a general C function along the line of
>> float pcm_getmiles_s(char*, char*, int)
>> ?
>
> The code would never have worked at all if that were the case (since
> text* is not like char*).  I suspect some subtler portability issue in
> Geoffrey's glue functions.  It seems unlikely that moving from PG 7.4.13
> to 7.4.16 would in itself have exposed such a problem, but the update in
> operating environment from RHEL3 to RHEL4 might've.

That's what I get for not reading the whole threa dback, I missed the
fact that it had ever worked. Oops :-)

//Magnus

Re: backend reset of database

From
Geoffrey
Date:
Tom Lane wrote:
> Geoffrey <esoteric@3times25.net> writes:
>> Tom Lane wrote:
>>> Is pcmiler a Postgres-specific backend extension?
>
>> No.  It's a standalone application that provides an api to build calls
>> into there product from other applications.  What we've done is created
>> functions that are called from the database that use their api.
>
> Did you recompile/relink these glue functions when moving up to the new
> version?  Are they built with debugging enabled?

What we are building is built with debugging, but the libraries provided
by pcmiler do not have debugging symbols.

> The trace is *really* suspicious given what you say here.  I'm inclined
> to think that what is really happening is that something is jumping to
> never-never land (via a clobbered function pointer or overwriting a
> return address on the stack) and it just happens to end up in FileRead.
> Unfortunately that guess doesn't provide much help for debugging it :-(

I just find it hard to believe that an address is getting clobbered or
some such thing and we just happen to fall into the very beginning of a
function.

I would also have expected that all the parms to the FileRead call would
have been garbage, yet that does not seem to be the case:

#1  0x0814b2e7 in FileRead (file=168481968, buffer=0xbff816ce "", amount=2)

One other note regarding this issue.  We have built standalone programs
that execute from the shell to call the same routines from the third
party api and they work.

The point at which the actual failure occurs on the code is the point at
which the process attempts to connect to the pcmiler server process.

--
Until later, Geoffrey

Those who would give up essential Liberty, to purchase a little
temporary Safety, deserve neither Liberty nor Safety.
  - Benjamin Franklin

Re: backend reset of database

From
Geoffrey
Date:
Tom Lane wrote:
> Magnus Hagander <magnus@hagander.net> writes:
>> Geoffrey wrote:
>>> CREATE OR REPLACE FUNCTION pcm_getmiles_s(text,text,integer)
>>> RETURNS float8 AS '/esc/pgrnd/prog/libpcmiler'
>>> LANGUAGE 'c' WITH (isStrict);
>
>> Is this function actually a PostgreSQL callable function? Or is it just
>> a general C function along the line of
>> float pcm_getmiles_s(char*, char*, int)
>> ?
>
> The code would never have worked at all if that were the case (since
> text* is not like char*).  I suspect some subtler portability issue in
> Geoffrey's glue functions.  It seems unlikely that moving from PG 7.4.13
> to 7.4.16 would in itself have exposed such a problem, but the update in
> operating environment from RHEL3 to RHEL4 might've.

This is an interesting point.  I might try to see if I can duplicate the
problem on a RH3 box.

We'll take a good hard look at our code to see if we have some sloppy
pointers or string length issues.

--
Until later, Geoffrey

Those who would give up essential Liberty, to purchase a little
temporary Safety, deserve neither Liberty nor Safety.
  - Benjamin Franklin

Re: backend reset of database

From
Tom Lane
Date:
Geoffrey <esoteric@3times25.net> writes:
> Tom Lane wrote:
>> The trace is *really* suspicious given what you say here.  I'm inclined
>> to think that what is really happening is that something is jumping to
>> never-never land (via a clobbered function pointer or overwriting a
>> return address on the stack) and it just happens to end up in FileRead.
>> Unfortunately that guess doesn't provide much help for debugging it :-(

> I just find it hard to believe that an address is getting clobbered or
> some such thing and we just happen to fall into the very beginning of a
> function.

Well, this trace doesn't prove any such thing.  If it was a wild jump,
we can surmise that it landed someplace in the first few instructions of
FileRead (before the call to FileAccess), but there's no proof here that
it landed on-the-nose at the first instruction.

> I would also have expected that all the parms to the FileRead call would
> have been garbage, yet that does not seem to be the case:

> #1  0x0814b2e7 in FileRead (file=168481968, buffer=0xbff816ce "", amount=2)

That's actually one of the things I find suspicious: AFAICS all the
calls to FileRead in the backend (and there aren't many) use a constant
BLCKSZ value for "amount".  So I doubt the 2 is an intentionally passed
parameter, it's more likely whatever happened to be in the relevant word
of the stack.

            regards, tom lane

Re: backend reset of database

From
Geoffrey
Date:
Tom Lane wrote:
> Geoffrey <esoteric@3times25.net> writes:
>> Tom Lane wrote:
>>> The trace is *really* suspicious given what you say here.  I'm inclined
>>> to think that what is really happening is that something is jumping to
>>> never-never land (via a clobbered function pointer or overwriting a
>>> return address on the stack) and it just happens to end up in FileRead.
>>> Unfortunately that guess doesn't provide much help for debugging it :-(
>
>> I just find it hard to believe that an address is getting clobbered or
>> some such thing and we just happen to fall into the very beginning of a
>> function.
>
> Well, this trace doesn't prove any such thing.  If it was a wild jump,
> we can surmise that it landed someplace in the first few instructions of
> FileRead (before the call to FileAccess), but there's no proof here that
> it landed on-the-nose at the first instruction.

So you're saying that the debugger will simply point to that function if
the address it 'jumps to' is somewhere in address space of this
function?  I don't know the intricacies of the debugger to understand
how all that works.  My expectation is that a random address is unlikely
to make any sense to the debugger.

>> I would also have expected that all the parms to the FileRead call would
>> have been garbage, yet that does not seem to be the case:
>
>> #1  0x0814b2e7 in FileRead (file=168481968, buffer=0xbff816ce "", amount=2)
>
> That's actually one of the things I find suspicious: AFAICS all the
> calls to FileRead in the backend (and there aren't many) use a constant
> BLCKSZ value for "amount".  So I doubt the 2 is an intentionally passed
> parameter, it's more likely whatever happened to be in the relevant word
> of the stack.

Thanks, that makes sense.  Now I've just got to figure out how it's
'getting there' to start with. :)

--
Until later, Geoffrey

Those who would give up essential Liberty, to purchase a little
temporary Safety, deserve neither Liberty nor Safety.
  - Benjamin Franklin

Re: backend reset of database

From
Tom Lane
Date:
Geoffrey <esoteric@3times25.net> writes:
> Tom Lane wrote:
>> Well, this trace doesn't prove any such thing.  If it was a wild jump,
>> we can surmise that it landed someplace in the first few instructions of
>> FileRead (before the call to FileAccess), but there's no proof here that
>> it landed on-the-nose at the first instruction.

> So you're saying that the debugger will simply point to that function if
> the address it 'jumps to' is somewhere in address space of this
> function?  I don't know the intricacies of the debugger to understand
> how all that works.  My expectation is that a random address is unlikely
> to make any sense to the debugger.

It is a bit surprising that it didn't jump to a nonexistent address and
die immediately, but if you're on a little-endian machine then maybe
this isn't quite as improbable as it looks.  A stack clobber that writes
just a few more bytes than the intended buffer can hold might overwrite
just the first couple bytes of a return address, which are the LSBs on a
little-endian machine, leading to a return address somewhere within 64K
of where it should have been, which most likely is within the code area
rather than off in an unmapped range.  The fact that the code area is
only a small part of the address space doesn't change the odds if that's
what happened.

            regards, tom lane

Re: backend reset of database

From
Geoffrey
Date:
I've done some further research on this problem.  I'm not a big user of
strace, so I'm looking at getting some insight here.  Basically, there
are two different versions of the pcmiler libraries.  One set works, the
other does not.  The following is the strace output from both processes.
  The first is the one that works, the second is the one that does not
work.  Anyone see anything here that might identify the problem?  Thanks:

working strace:

send(3, "Q\0\0\0008select pcm_getmiles_s(\'spar"..., 57, 0) = 57
rt_sigaction(SIGPIPE, {SIG_DFL}, {SIG_IGN}, 8) = 0
poll([{fd=3, events=POLLIN|POLLERR, revents=POLLIN}], 1, -1) = 1
recv(3, "T\0\0\0\'\0\1pcm_getmiles_s\0\0\0\0\0\0\0\0\0\2\275"..., 16384,
0) = 74ioctl(0, SNDCTL_TMR_TIMEBASE or TCGETS, {c_iflags=0x2d02,
c_oflags=0x5, c_cflags=0x4bf, c_lflags=0x8a3b, c_line=0,
c_cc="\x03\x1c\x7f\x15\x04\x00\x01\xff\x11\x13\x1a\xff\x12\x0f\x17\x16\xff\x00\x00"})
= 0
ioctl(1, SNDCTL_TMR_TIMEBASE or TCGETS, {c_iflags=0x2d02, c_oflags=0x5,
c_cflags=0x4bf, c_lflags=0x8a3b, c_line=0,
c_cc="\x03\x1c\x7f\x15\x04\x00\x01\xff\x11\x13\x1a\xff\x12\x0f\x17\x16\xff\x00\x00"})
= 0
ioctl(1, TIOCGWINSZ, {ws_row=24, ws_col=80, ws_xpixel=0, ws_ypixel=0}) = 0

broken strace:

send(3, "Q\0\0\0008select pcm_getmiles_s(\'spar"..., 57, 0) = 57
rt_sigaction(SIGPIPE, {SIG_DFL}, {SIG_IGN}, 8) = 0
poll([{fd=3, events=POLLIN|POLLERR, revents=POLLIN|POLLHUP}], 1, -1) = 1
recv(3, "", 16384, 0)                   = 0
time(NULL)                              = 1176403599
poll([{fd=3, events=POLLIN|POLLERR, revents=POLLIN|POLLHUP}], 1, 0) = 1
recv(3, "", 16384, 0)                   = 0
close(3)                                = 0
write(2, "server closed the connection une"..., 137server closed the
connection unexpectedly
         This probably means the server terminated abnormally
         before or while processing the request.
) = 137
write(2, "The connection to the server was"..., 57The connection to the
server was lost. Attempting reset: ) = 57

--
Until later, Geoffrey

Those who would give up essential Liberty, to purchase a little
temporary Safety, deserve neither Liberty nor Safety.
  - Benjamin Franklin

Re: backend reset of database

From
Geoffrey
Date:
Geoffrey wrote:

I forgot to mention the one difference I did see in the strace output.
I don't understand why there would be difference in the call parms to
poll().  Then again, I'm not altogether sure what code is calling poll
at this point.  I'm assuming it's from the database engine. I've noted
these lines below:

> working strace:
>
> poll([{fd=3, events=POLLIN|POLLERR, revents=POLLIN}], 1, -1) = 1

> broken strace:

> poll([{fd=3, events=POLLIN|POLLERR, revents=POLLIN|POLLHUP}], 1, -1) = 1

--
Until later, Geoffrey

Those who would give up essential Liberty, to purchase a little
temporary Safety, deserve neither Liberty nor Safety.
  - Benjamin Franklin

Re: backend reset of database

From
Tom Lane
Date:
Geoffrey <esoteric@3times25.net> writes:
> I've done some further research on this problem.  I'm not a big user of
> strace, so I'm looking at getting some insight here.  Basically, there
> are two different versions of the pcmiler libraries.  One set works, the
> other does not.  The following is the strace output from both processes.

You strace'd the wrong process, ie, the client.

            regards, tom lane

Re: backend reset of database

From
Geoffrey
Date:
Tom Lane wrote:
> Geoffrey <esoteric@3times25.net> writes:
>> I've done some further research on this problem.  I'm not a big user of
>> strace, so I'm looking at getting some insight here.  Basically, there
>> are two different versions of the pcmiler libraries.  One set works, the
>> other does not.  The following is the strace output from both processes.
>
> You strace'd the wrong process, ie, the client.

Okay, so how stupid can I be?  (you could have sent it as private email) :)

Now that I have my head on my shoulders where it belongs, I'll give this
another spin...

--
Until later, Geoffrey

Those who would give up essential Liberty, to purchase a little
temporary Safety, deserve neither Liberty nor Safety.
  - Benjamin Franklin

Re: backend reset of database

From
Geoffrey
Date:
Tom Lane wrote:
> Geoffrey <esoteric@3times25.net> writes:
>> I've done some further research on this problem.  I'm not a big user of
>> strace, so I'm looking at getting some insight here.  Basically, there
>> are two different versions of the pcmiler libraries.  One set works, the
>> other does not.  The following is the strace output from both processes.
>
> You strace'd the wrong process, ie, the client.


Well, when I strace the proper process, it appears to point back at the
3rd party application, thus I'm following up with them.

--
Until later, Geoffrey

Those who would give up essential Liberty, to purchase a little
temporary Safety, deserve neither Liberty nor Safety.
  - Benjamin Franklin