Thread: Re: postmaster dies (was Re: Very disappointing performance)
secret <secret@kearneydev.com> writes:
>>>> PostgreSQL is also crashing 1-2 times a day on me, although I have a
>>>> handy perl script to keep it alive now <grin>...

> basically the server randomly dies with a:
> ERROR: postmaster: StreamConnection: accept: Invalid argument
> pmdie 3
> (then signals all children to drop dead)

Hmm. That shouldn't happen, especially not randomly; if the accept
works the first time then it should work forever after, since the
arguments being passed in never change.

The error is coming from StreamConnection() in
pgsql/src/backend/libpq/pqcomm.c. Could you maybe add some debugging
code to the routine to see what the server_fd and port arguments are
when accept() fails? I think just changing the first elog() to

    elog(ERROR,
         "postmaster: StreamConnection: accept: %m\nserver_fd = %d, port = %p",
         server_fd, port);

would do for starters. This would let us eliminate the possibility that
the routine is getting passed bad arguments.

An alternative possibility is to run the postmaster under truss so you
can see what arguments are passed to the kernel on every kernel call,
but that'd generate a pretty verbose logfile.

			regards, tom lane
Tom Lane wrote:
> Could you maybe add some debugging
> code to the routine to see what the server_fd and port arguments are
> when accept() fails?

Done. I'll install the new binaries at the end of the day when no one is
using the database and give you a copy of the logs when it dies again.

Thank you for the help on this, it's very much appreciated.

David Secret
MIS Director
Kearney Development Co., Inc.
Tom Lane wrote:
> I think just changing the first elog() to
>
> elog(ERROR,
> "postmaster: StreamConnection: accept: %m\nserver_fd = %d, port = %p",
> server_fd, port);
>
> would do for starters. This would let us eliminate the possibility that
> the routine is getting passed bad arguments.

query: SELECT "material_id" ,"name" ,"short_name" ,"legacy" FROM "material" ORDER BY "legacy" DESC,"name"
ProcessQuery
! system usage stats:
!       0.017961 elapsed 0.020000 user 0.000000 system sec
!       [0.050000 user 0.020000 sys total]
!       0/0 [0/0] filesystem blocks in/out
!       6/24 [127/201] page faults/reclaims, 0 [0] swaps
!       0 [0] signals rcvd, 0/0 [0/0] messages rcvd/sent
!       0/0 [0/0] voluntary/involuntary context switches
! postgres usage stats:
!       Shared blocks: 0 read, 0 written, buffer hit rate = 100.00%
!       Local blocks: 0 read, 0 written, buffer hit rate = 0.00%
!       Direct blocks: 0 read, 0 written
CommitTransactionCommand
ERROR: postmaster: StreamConnection: accept: Invalid argument
server_fd = 3, port = 0x816aa70
pmdie 3
SignalChildren: sending signal 15 to process 16943
SignalChildren: sending signal 15 to process 16942
SignalChildren: sending signal 15 to process 16941

There we go, it crashed this morning...(interestingly it went all of
yesterday without crashing)... Does this shed some light? If not what
would you like me to do next? I have 700M+ to keep a log file, as long as
it doesn't generate that much in a day we should be okay with a very
verbose log. Just tell me what code mods or runtime options to use...

David Secret
MIS Director
Kearney Development Co., Inc.
secret <secret@kearneydev.com> writes:
> ERROR: postmaster: StreamConnection: accept: Invalid argument
> server_fd = 3, port = 0x816aa70

> There we go, it crashed this morning...(interestingly it went all of
> yesterday without crashing)... Does this shed some light?

Not much ... it shows pretty much what we expected, ie, nothing
obviously wrong.

What I would suggest doing next is running the postmaster under 'truss'
or some similar utility that can generate a logfile of all the kernel
calls made by the postmaster. I can't give you any details on how to do
that --- perhaps some other reader can help? What we're looking for is
anything that might have changed the state of file descriptor 3 shortly
before the crash.

BTW, some tips on debugging this. Maybe these are obvious, maybe not:

1. This accept call is not associated with normal query processing, but
   with receiving connection requests from new clients. Almost certainly
   the bug is not triggered by processing queries but by connection
   attempts. You probably could make the crash happen sooner by starting
   and stopping clients in a steady stream (not that you want a crash
   sooner on your real system, of course, but for debugging it'd be nice
   not to have to wait for long).

2. You might want to build a playpen system that you can stress into
   crashing without taking out your live server. The easiest way to do
   that is just to duplicate your installation on another machine, but if
   no other machine is handy (or if you suspect a platform-dependent bug,
   which I do here) the best bet is to build a debugging version of
   Postgres that has nonstandard values for the installation directory
   and server's port address. For example I usually build trial versions
   with

       ./configure --with-pgport=5440 --prefix=/users/postgres/testversion

   (plus any options you normally use, of course). I think it might also
   be possible to set these values while running initdb and starting the
   test postmaster, without having to recompile; but I don't know the
   exact incantations to use to do it that way.

			regards, tom lane
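Tip 1 above could be sketched as a small loop that starts and stops clients in a steady stream, so that the postmaster sees one fresh connection attempt per iteration. This is only a sketch: the helper name `churn_clients` is made up for illustration, and the psql invocation in the closing comment assumes a psql that accepts `-p` and `-c`, a reachable `template1` database, and a playpen postmaster on port 5440 as in the configure example.

```shell
#!/bin/sh
# churn_clients COUNT CMD [ARGS...]
# Hypothetical helper: run CMD COUNT times in a row, so that each run
# is a separate client process that connects, does its work, and exits.
# Prints the number of runs that exited with a non-zero status.
churn_clients() {
    count=$1
    shift
    failures=0
    i=0
    while [ "$i" -lt "$count" ]; do
        "$@" >/dev/null 2>&1 || failures=$((failures + 1))
        i=$((i + 1))
    done
    echo "$failures"
}

# Against the playpen postmaster (assumed to listen on port 5440):
#   churn_clients 1000 psql -p 5440 -c 'SELECT 1' template1
```

A sudden jump in the failure count is a hint that the postmaster has stopped accepting connections, which is the moment to go look at the trace.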
Tom Lane wrote:
> What I would suggest doing next is running the postmaster under 'truss'
> or some similar utility that can generate a logfile of all the kernel
> calls made by the postmaster. I can't give you any details on how to do
> that --- perhaps some other reader can help? What we're looking for is
> anything that might have changed the state of file descriptor 3 shortly
> before the crash.

Would strace work instead of truss? I have strace... Will you be able to
interpret the strace files & determine the problem do you think?

You've been the only one to respond on this, so I'm a tad worried about
being left out in the cold on this one... I'd be glad to pay for support if
there is a place I can do that, heck I pay for support on other software
products, why not PostgreSQL?

Please let me know. I'll begin an strace tonight...

David
> Would strace work instead of truss? I have strace... Will you be able to
> interpret the strace files & determine the problem do you think?
>
> You've been the only one to respond on this, so I'm a tad worried about
> being left out in the cold on this one... I'd be glad to pay for support if
> there is a place I can do that, heck I pay for support on other software
> products, why not PostgreSQL?
>
> Please let me know. I'll begin an strace tonight...

I can't imagine he has enough disk space for truss/ktrace output for a
full day of backend activity, does he?

--
  Bruce Momjian                        |  http://www.op.net/~candle
  maillist@candle.pha.pa.us            |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
Bruce Momjian wrote:
> I can't imagine he has enough disk space for truss/ktrace output for a
> full day of backend activity, does he?

Ur, I'll postpone this to Thursday, when I can monitor the disk space very
carefully, how much space are we talking about here? 1G? 2G? 3G? 10G?

Maybe I can temporarily install a hard disk just for that purpose.... There
are only a few users on the database, it really isn't *THAT* active.

--David
> Ur, I'll postpone this to Thursday, when I can monitor the disk space very
> carefully, how much space are we talking about here? 1G? 2G? 3G? 10G?
>
> Maybe I can temporarily install a hard disk just for that purpose.... There
> are only a few users on the database, it really isn't *THAT* active.

Hard to say. I would turn it on for 15 minutes and see. ktrace can
generate a 1MB file in a minute.

--
  Bruce Momjian                        |  http://www.op.net/~candle
  maillist@candle.pha.pa.us            |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
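As a rough worked example of the figure above: at about 1 MB per minute, a full day of tracing would come to roughly 1440 MB, about twice the 700M mentioned earlier in the thread. The sketch below scales a short sample up to 24 hours. `estimate_daily_mb` is a hypothetical helper, and it assumes the trace grows at a steady rate, which a bursty workload will not.

```shell
#!/bin/sh
# estimate_daily_mb SAMPLE_MB SAMPLE_MINUTES
# Hypothetical helper: scale the size of a short trace sample up to a
# full day, assuming a constant growth rate (integer arithmetic only).
estimate_daily_mb() {
    sample_mb=$1
    sample_minutes=$2
    echo $(( sample_mb * 60 * 24 / sample_minutes ))
}

# About 1 MB per minute means a 15-minute sample of about 15 MB.
estimate_daily_mb 15 15    # prints 1440
```

Measuring the real sample (for instance by checking the trace file size after 15 minutes) and feeding it in gives a better number than the rule of thumb for a lightly used database.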
Bruce Momjian <maillist@candle.pha.pa.us> writes:
> I can't imagine he has enough disk space for truss/ktrace output for a
> full day of backend activity, does he?

That's why I was encouraging him to set up a playpen and actively
work at crashing it, rather than waiting around to see whether it'd
happen before his disk fills up ;-)

			regards, tom lane
Tom Lane wrote:
> Bruce Momjian <maillist@candle.pha.pa.us> writes:
> > I can't imagine he has enough disk space for truss/ktrace output for a
> > full day of backend activity, does he?
>
> That's why I was encouraging him to set up a playpen and actively
> work at crashing it, rather than waiting around to see whether it'd
> happen before his disk fills up ;-)
>
> regards, tom lane

I've built a simple program to record the last N lines (currently
5000... Suggestions?) of input... What I'd like to do is pipe STDOUT and
STDERR to this program, but "|" doesn't do this. Do you all have a
suggestion on how to do this? If I can, then I can get you the system trace
and hopefully get this crash bug fixed.
On Tue, 23 Mar 1999, secret wrote:

> I've built a simple program to record the last N lines (currently
> 5000... Suggestions?) of input... What I'd like to do is pipe STDOUT and
> STDERR to this program, but "|" doesn't do this. Do you all have a
> suggestion on how to do this? If I can, then I can get you the system trace
> and hopefully get this crash bug fixed.

strace ... 2>&1 | tail -5000

Note that tail is a standard *nix program.

Taral
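To see why the `2>&1` matters: strace writes its trace to standard error by default, while `|` connects only standard output, so without the redirection nothing reaches tail. A small sketch with a stand-in for strace (the function `noisy` is invented purely for the demonstration):

```shell
#!/bin/sh
# Stand-in for a program that, like strace, writes its output to stderr.
noisy() {
    i=1
    while [ "$i" -le 20 ]; do
        echo "trace line $i" >&2
        i=$((i + 1))
    done
}

# Without 2>&1 the pipe sees nothing: stderr bypasses it (discarded here).
# tr strips the padding that some wc implementations add.
without=$(noisy 2>/dev/null | tail -n 5 | wc -l | tr -d ' ')

# With 2>&1, stderr is merged into stdout before the pipe, so tail
# receives all 20 lines and keeps the last 5.
with=$(noisy 2>&1 | tail -n 5 | wc -l | tr -d ' ')

echo "without 2>&1: $without lines, with 2>&1: $with lines"
```

Two things to keep in mind when using this for real: `tail -n 5000` is the POSIX spelling of `tail -5000`, and tail prints its result only when its input ends, so the saved lines appear once the traced process has exited.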