Thread: backend reset of database
We have built into the backend of the database a third party application called pcmiler that produces mileage between two locations. This has been working in the past, but after upgrading to RHEL 4, Postgresql 7.4.16 and pcmiler 20 (from pcmiler 17), the server process is terminated with a signal 11. I know that's a lot of variables that changed, but one required the other, thus we can't easily backout anything (outside of possibly moving down to 7.4.13 of Postgres). The log of the server process contains: LOG: server process (PID 32441) was terminated by signal 11 LOG: terminating any other active server processes WARNING: terminating connection because of crash of another server process DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory. HINT: In a moment you should be able to reconnect to the database and repeat your command. A backtrace against this process produces: Program received signal SIGSEGV, Segmentation fault. 0x0814acc9 in FileAccess (file=168481968) at fd.c:717 717 if (FileIsNotOpen(file)) (gdb) bt #0 0x0814acc9 in FileAccess (file=168481968) at fd.c:717 #1 0x0814b2e7 in FileRead (file=168481968, buffer=0xbff816ce "", amount=2) at fd.c:972 #2 0x00203ecc in ?? () #3 0x0a0ad4b0 in ?? () #4 0xbff816ce in ?? () #5 0x00000002 in ?? () #6 0x00000001 in ?? () #7 0x0a04fdc0 in ?? () #8 0xbff83110 in ?? () #9 0x00000001 in ?? () #10 0x00000002 in ?? () #11 0xbff83110 in ?? () #12 0x00000000 in ?? () I'm assuming that the portion of the backtrace from frame 2-12 is likely produced from the pcmiler binaries as we do not have source and they don't contain debugging symbols. Any insights would be greatly appreciated. -- Until later, Geoffrey Those who would give up essential Liberty, to purchase a little temporary Safety, deserve neither Liberty nor Safety. - Benjamin Franklin
Geoffrey <esoteric@3times25.net> writes: > A backtrace against this process produces: > Program received signal SIGSEGV, Segmentation fault. > 0x0814acc9 in FileAccess (file=168481968) at fd.c:717 > 717 if (FileIsNotOpen(file)) > (gdb) bt > #0 0x0814acc9 in FileAccess (file=168481968) at fd.c:717 > #1 0x0814b2e7 in FileRead (file=168481968, buffer=0xbff816ce "", amount=2) > at fd.c:972 > #2 0x00203ecc in ?? () > I'm assuming that the portion of the backtrace from frame 2-12 is likely > produced from the pcmiler binaries as we do not have source and they > don't contain debugging symbols. Is pcmiler a Postgres-specific backend extension? It seems fairly unlikely that it would be calling FileRead() if not. Do you have any other Postgres-specific libraries loaded into the backend? The immediate issue is that FileRead() is being passed a bogus file number, which might suggest an unexpected change in the contents of a struct or something like that, but I see no record of any such changes in the CVS log between 7.4.13 and 7.4.16. regards, tom lane
Tom Lane wrote: > Geoffrey <esoteric@3times25.net> writes: >> A backtrace against this process produces: > >> Program received signal SIGSEGV, Segmentation fault. >> 0x0814acc9 in FileAccess (file=168481968) at fd.c:717 >> 717 if (FileIsNotOpen(file)) >> (gdb) bt >> #0 0x0814acc9 in FileAccess (file=168481968) at fd.c:717 >> #1 0x0814b2e7 in FileRead (file=168481968, buffer=0xbff816ce "", amount=2) >> at fd.c:972 >> #2 0x00203ecc in ?? () > >> I'm assuming that the portion of the backtrace from frame 2-12 is likely >> produced from the pcmiler binaries as we do not have source and they >> don't contain debugging symbols. > > Is pcmiler a Postgres-specific backend extension? No. It's a standalone application that provides an api to build calls into there product from other applications. What we've done is created functions that are called from the database that use their api. for example: CREATE OR REPLACE FUNCTION pcm_getmiles_s(text,text,integer) RETURNS float8 AS '/esc/pgrnd/prog/libpcmiler' LANGUAGE 'c' WITH (isStrict); To produce the above back trace I executed the following from a psql window: select pcm_getmiles_s('sparta,nc', 'buffalo,ny',0); > It seems fairly > unlikely that it would be calling FileRead() if not. I find that unlikely as well. > Do you have any > other Postgres-specific libraries loaded into the backend? I don't believe so. > The immediate issue is that FileRead() is being passed a bogus file > number, which might suggest an unexpected change in the contents of > a struct or something like that, but I see no record of any such changes > in the CVS log between 7.4.13 and 7.4.16. This is what threw me as well. It definitely looks like something internal to Postgresql. -- Until later, Geoffrey Those who would give up essential Liberty, to purchase a little temporary Safety, deserve neither Liberty nor Safety. - Benjamin Franklin
Geoffrey <esoteric@3times25.net> writes: > Tom Lane wrote: >> Is pcmiler a Postgres-specific backend extension? > No. It's a standalone application that provides an api to build calls > into there product from other applications. What we've done is created > functions that are called from the database that use their api. Did you recompile/relink these glue functions when moving up to the new version? Are they built with debugging enabled? The trace is *really* suspicious given what you say here. I'm inclined to think that what is really happening is that something is jumping to never-never land (via a clobbered function pointer or overwriting a return address on the stack) and it just happens to end up in FileRead. Unfortunately that guess doesn't provide much help for debugging it :-( regards, tom lane
Geoffrey <esoteric@3times25.net> writes: > (gdb) bt > #0 0x0814acc9 in FileAccess (file=168481968) at fd.c:717 > #1 0x0814b2e7 in FileRead (file=168481968, buffer=0xbff816ce "", amount=2) > at fd.c:972 > #2 0x00203ecc in ?? () > #3 0x0a0ad4b0 in ?? () > #4 0xbff816ce in ?? () > #5 0x00000002 in ?? () > #6 0x00000001 in ?? () > #7 0x0a04fdc0 in ?? () > #8 0xbff83110 in ?? () > #9 0x00000001 in ?? () > #10 0x00000002 in ?? () > #11 0xbff83110 in ?? () > #12 0x00000000 in ?? () > I'm assuming that the portion of the backtrace from frame 2-12 is likely > produced from the pcmiler binaries as we do not have source and they > don't contain debugging symbols. Actually, on looking closer, most of those return addresses are obviously silly: control was certainly never passed from addresses 0, 1, or 2, and I suspect a bit of disassembling with gdb would show that the others aren't immediately after call instructions either. So it looks like gdb got confused about which words in the stack are return addresses --- which lends a bit more credibility to the idea about a stack clobber, but still offers no direct help for debugging. What I'd be inclined to look for first is local-variable buffers that might not be big enough for the values that get written into them. Perhaps the reason the failure occurred now was something as prosaic as the new installation having longer path names. In any case you need to be taking a very hard look at your glue functions, since those are the weakest link in the chain as far as the extent of portability testing they've gotten goes (assuming that pcmiler is a reasonably well-tested library). regards, tom lane
Geoffrey wrote: > Tom Lane wrote: >> Geoffrey <esoteric@3times25.net> writes: >>> A backtrace against this process produces: >> >>> Program received signal SIGSEGV, Segmentation fault. >>> 0x0814acc9 in FileAccess (file=168481968) at fd.c:717 >>> 717 if (FileIsNotOpen(file)) >>> (gdb) bt >>> #0 0x0814acc9 in FileAccess (file=168481968) at fd.c:717 >>> #1 0x0814b2e7 in FileRead (file=168481968, buffer=0xbff816ce "", >>> amount=2) >>> at fd.c:972 >>> #2 0x00203ecc in ?? () >> >>> I'm assuming that the portion of the backtrace from frame 2-12 is >>> likely produced from the pcmiler binaries as we do not have source >>> and they don't contain debugging symbols. >> >> Is pcmiler a Postgres-specific backend extension? > > No. It's a standalone application that provides an api to build calls > into there product from other applications. What we've done is created > functions that are called from the database that use their api. for > example: > > CREATE OR REPLACE FUNCTION pcm_getmiles_s(text,text,integer) > RETURNS float8 AS '/esc/pgrnd/prog/libpcmiler' > LANGUAGE 'c' WITH (isStrict); Is this function actually a PostgreSQL callable function? Or is it just a general C function along the line of float pcm_getmiles_s(char*, char*, int) ? If it's a general C function, you'll need to write a wrapper function that calls it, and returns the data, using the PostgreSQL macros to extract the parameters. //Magnus
Magnus Hagander <magnus@hagander.net> writes: > Geoffrey wrote: >> CREATE OR REPLACE FUNCTION pcm_getmiles_s(text,text,integer) >> RETURNS float8 AS '/esc/pgrnd/prog/libpcmiler' >> LANGUAGE 'c' WITH (isStrict); > Is this function actually a PostgreSQL callable function? Or is it just > a general C function along the line of > float pcm_getmiles_s(char*, char*, int) > ? The code would never have worked at all if that were the case (since text* is not like char*). I suspect some subtler portability issue in Geoffrey's glue functions. It seems unlikely that moving from PG 7.4.13 to 7.4.16 would in itself have exposed such a problem, but the update in operating environment from RHEL3 to RHEL4 might've. regards, tom lane
Tom Lane wrote: > Magnus Hagander <magnus@hagander.net> writes: >> Geoffrey wrote: >>> CREATE OR REPLACE FUNCTION pcm_getmiles_s(text,text,integer) >>> RETURNS float8 AS '/esc/pgrnd/prog/libpcmiler' >>> LANGUAGE 'c' WITH (isStrict); > >> Is this function actually a PostgreSQL callable function? Or is it just >> a general C function along the line of >> float pcm_getmiles_s(char*, char*, int) >> ? > > The code would never have worked at all if that were the case (since > text* is not like char*). I suspect some subtler portability issue in > Geoffrey's glue functions. It seems unlikely that moving from PG 7.4.13 > to 7.4.16 would in itself have exposed such a problem, but the update in > operating environment from RHEL3 to RHEL4 might've. That's what I get for not reading the whole threa dback, I missed the fact that it had ever worked. Oops :-) //Magnus
Tom Lane wrote: > Geoffrey <esoteric@3times25.net> writes: >> Tom Lane wrote: >>> Is pcmiler a Postgres-specific backend extension? > >> No. It's a standalone application that provides an api to build calls >> into there product from other applications. What we've done is created >> functions that are called from the database that use their api. > > Did you recompile/relink these glue functions when moving up to the new > version? Are they built with debugging enabled? What we are building is built with debugging, but the libraries provided by pcmiler do not have debugging symbols. > The trace is *really* suspicious given what you say here. I'm inclined > to think that what is really happening is that something is jumping to > never-never land (via a clobbered function pointer or overwriting a > return address on the stack) and it just happens to end up in FileRead. > Unfortunately that guess doesn't provide much help for debugging it :-( I just find it hard to believe that an address is getting clobbered or some such thing and we just happen to fall into the very beginning of a function. I would also have expected that all the parms to the FileRead call would have been garbage, yet that does not seem to be the case: #1 0x0814b2e7 in FileRead (file=168481968, buffer=0xbff816ce "", amount=2) One other note regarding this issue. We have built standalone programs that execute from the shell to call the same routines from the third party api and they work. The point at which the actual failure occurs on the code is the point at which the process attempts to connect to the pcmiler server process. -- Until later, Geoffrey Those who would give up essential Liberty, to purchase a little temporary Safety, deserve neither Liberty nor Safety. - Benjamin Franklin
Tom Lane wrote: > Magnus Hagander <magnus@hagander.net> writes: >> Geoffrey wrote: >>> CREATE OR REPLACE FUNCTION pcm_getmiles_s(text,text,integer) >>> RETURNS float8 AS '/esc/pgrnd/prog/libpcmiler' >>> LANGUAGE 'c' WITH (isStrict); > >> Is this function actually a PostgreSQL callable function? Or is it just >> a general C function along the line of >> float pcm_getmiles_s(char*, char*, int) >> ? > > The code would never have worked at all if that were the case (since > text* is not like char*). I suspect some subtler portability issue in > Geoffrey's glue functions. It seems unlikely that moving from PG 7.4.13 > to 7.4.16 would in itself have exposed such a problem, but the update in > operating environment from RHEL3 to RHEL4 might've. This is an interesting point. I might try to see if I can duplicate the problem on a RH3 box. We'll take a good hard look at our code to see if we have some sloppy pointers or string length issues. -- Until later, Geoffrey Those who would give up essential Liberty, to purchase a little temporary Safety, deserve neither Liberty nor Safety. - Benjamin Franklin
Geoffrey <esoteric@3times25.net> writes: > Tom Lane wrote: >> The trace is *really* suspicious given what you say here. I'm inclined >> to think that what is really happening is that something is jumping to >> never-never land (via a clobbered function pointer or overwriting a >> return address on the stack) and it just happens to end up in FileRead. >> Unfortunately that guess doesn't provide much help for debugging it :-( > I just find it hard to believe that an address is getting clobbered or > some such thing and we just happen to fall into the very beginning of a > function. Well, this trace doesn't prove any such thing. If it was a wild jump, we can surmise that it landed someplace in the first few instructions of FileRead (before the call to FileAccess), but there's no proof here that it landed on-the-nose at the first instruction. > I would also have expected that all the parms to the FileRead call would > have been garbage, yet that does not seem to be the case: > #1 0x0814b2e7 in FileRead (file=168481968, buffer=0xbff816ce "", amount=2) That's actually one of the things I find suspicious: AFAICS all the calls to FileRead in the backend (and there aren't many) use a constant BLCKSZ value for "amount". So I doubt the 2 is an intentionally passed parameter, it's more likely whatever happened to be in the relevant word of the stack. regards, tom lane
Tom Lane wrote: > Geoffrey <esoteric@3times25.net> writes: >> Tom Lane wrote: >>> The trace is *really* suspicious given what you say here. I'm inclined >>> to think that what is really happening is that something is jumping to >>> never-never land (via a clobbered function pointer or overwriting a >>> return address on the stack) and it just happens to end up in FileRead. >>> Unfortunately that guess doesn't provide much help for debugging it :-( > >> I just find it hard to believe that an address is getting clobbered or >> some such thing and we just happen to fall into the very beginning of a >> function. > > Well, this trace doesn't prove any such thing. If it was a wild jump, > we can surmise that it landed someplace in the first few instructions of > FileRead (before the call to FileAccess), but there's no proof here that > it landed on-the-nose at the first instruction. So you're saying that the debugger will simply point to that function if the address it 'jumps to' is somewhere in address space of this function? I don't know the intricacies of the debugger to understand how all that works. My expectation is that a random address is unlikely to make any sense to the debugger. >> I would also have expected that all the parms to the FileRead call would >> have been garbage, yet that does not seem to be the case: > >> #1 0x0814b2e7 in FileRead (file=168481968, buffer=0xbff816ce "", amount=2) > > That's actually one of the things I find suspicious: AFAICS all the > calls to FileRead in the backend (and there aren't many) use a constant > BLCKSZ value for "amount". So I doubt the 2 is an intentionally passed > parameter, it's more likely whatever happened to be in the relevant word > of the stack. Thanks, that makes sense. Now I've just got to figure out how it's 'getting there' to start with. :) -- Until later, Geoffrey Those who would give up essential Liberty, to purchase a little temporary Safety, deserve neither Liberty nor Safety. - Benjamin Franklin
Geoffrey <esoteric@3times25.net> writes: > Tom Lane wrote: >> Well, this trace doesn't prove any such thing. If it was a wild jump, >> we can surmise that it landed someplace in the first few instructions of >> FileRead (before the call to FileAccess), but there's no proof here that >> it landed on-the-nose at the first instruction. > So you're saying that the debugger will simply point to that function if > the address it 'jumps to' is somewhere in address space of this > function? I don't know the intricacies of the debugger to understand > how all that works. My expectation is that a random address is unlikely > to make any sense to the debugger. It is a bit surprising that it didn't jump to a nonexistent address and die immediately, but if you're on a little-endian machine then maybe this isn't quite as improbable as it looks. A stack clobber that writes just a few more bytes than the intended buffer can hold might overwrite just the first couple bytes of a return address, which are the LSBs on a little-endian machine, leading to a return address somewhere within 64K of where it should have been, which most likely is within the code area rather than off in an unmapped range. The fact that the code area is only a small part of the address space doesn't change the odds if that's what happened. regards, tom lane
I've done some further research on this problem. I'm not a big user of strace, so I'm looking at getting some insight here. Basically, there are two different versions of the pcmiler libraries. One set works, the other does not. The following is the strace output from both processes. The first is the one that works, the second is the one that does not work. Anyone see anything here that might identify the problem? Thanks: working strace: send(3, "Q\0\0\0008select pcm_getmiles_s(\'spar"..., 57, 0) = 57 rt_sigaction(SIGPIPE, {SIG_DFL}, {SIG_IGN}, 8) = 0 poll([{fd=3, events=POLLIN|POLLERR, revents=POLLIN}], 1, -1) = 1 recv(3, "T\0\0\0\'\0\1pcm_getmiles_s\0\0\0\0\0\0\0\0\0\2\275"..., 16384, 0) = 74ioctl(0, SNDCTL_TMR_TIMEBASE or TCGETS, {c_iflags=0x2d02, c_oflags=0x5, c_cflags=0x4bf, c_lflags=0x8a3b, c_line=0, c_cc="\x03\x1c\x7f\x15\x04\x00\x01\xff\x11\x13\x1a\xff\x12\x0f\x17\x16\xff\x00\x00"}) = 0 ioctl(1, SNDCTL_TMR_TIMEBASE or TCGETS, {c_iflags=0x2d02, c_oflags=0x5, c_cflags=0x4bf, c_lflags=0x8a3b, c_line=0, c_cc="\x03\x1c\x7f\x15\x04\x00\x01\xff\x11\x13\x1a\xff\x12\x0f\x17\x16\xff\x00\x00"}) = 0 ioctl(1, TIOCGWINSZ, {ws_row=24, ws_col=80, ws_xpixel=0, ws_ypixel=0}) = 0 broken strace: send(3, "Q\0\0\0008select pcm_getmiles_s(\'spar"..., 57, 0) = 57 rt_sigaction(SIGPIPE, {SIG_DFL}, {SIG_IGN}, 8) = 0 poll([{fd=3, events=POLLIN|POLLERR, revents=POLLIN|POLLHUP}], 1, -1) = 1 recv(3, "", 16384, 0) = 0 time(NULL) = 1176403599 poll([{fd=3, events=POLLIN|POLLERR, revents=POLLIN|POLLHUP}], 1, 0) = 1 recv(3, "", 16384, 0) = 0 close(3) = 0 write(2, "server closed the connection une"..., 137server closed the connection unexpectedly This probably means the server terminated abnormally before or while processing the request. ) = 137 write(2, "The connection to the server was"..., 57The connection to the server was lost. Attempting reset: ) = 57 -- Until later, Geoffrey Those who would give up essential Liberty, to purchase a little temporary Safety, deserve neither Liberty nor Safety. - Benjamin Franklin
Geoffrey wrote: I forgot to mention the one difference I did see in the strace output. I don't understand why there would be difference in the call parms to poll(). Then again, I'm not altogether sure what code is calling poll at this point. I'm assuming it's from the database engine. I've noted these lines below: > working strace: > > poll([{fd=3, events=POLLIN|POLLERR, revents=POLLIN}], 1, -1) = 1 > broken strace: > poll([{fd=3, events=POLLIN|POLLERR, revents=POLLIN|POLLHUP}], 1, -1) = 1 -- Until later, Geoffrey Those who would give up essential Liberty, to purchase a little temporary Safety, deserve neither Liberty nor Safety. - Benjamin Franklin
Geoffrey <esoteric@3times25.net> writes: > I've done some further research on this problem. I'm not a big user of > strace, so I'm looking at getting some insight here. Basically, there > are two different versions of the pcmiler libraries. One set works, the > other does not. The following is the strace output from both processes. You strace'd the wrong process, ie, the client. regards, tom lane
Tom Lane wrote: > Geoffrey <esoteric@3times25.net> writes: >> I've done some further research on this problem. I'm not a big user of >> strace, so I'm looking at getting some insight here. Basically, there >> are two different versions of the pcmiler libraries. One set works, the >> other does not. The following is the strace output from both processes. > > You strace'd the wrong process, ie, the client. Okay, so how stupid can I be? (you could have sent it as private email) :) Now that I have my head on my shoulders where it belongs, I'll give this another spin... -- Until later, Geoffrey Those who would give up essential Liberty, to purchase a little temporary Safety, deserve neither Liberty nor Safety. - Benjamin Franklin
Tom Lane wrote: > Geoffrey <esoteric@3times25.net> writes: >> I've done some further research on this problem. I'm not a big user of >> strace, so I'm looking at getting some insight here. Basically, there >> are two different versions of the pcmiler libraries. One set works, the >> other does not. The following is the strace output from both processes. > > You strace'd the wrong process, ie, the client. Well, when I strace the proper process, it appears to point back at the 3rd party application, thus I'm following up with them. -- Until later, Geoffrey Those who would give up essential Liberty, to purchase a little temporary Safety, deserve neither Liberty nor Safety. - Benjamin Franklin