Thread: Re: postmaster dies (was Re: Very disappointing performance)

Re: postmaster dies (was Re: Very disappointing performance)

From: Tom Lane
secret <secret@kearneydev.com> writes:
>>>> PostgreSQL is also crashing 1-2 times a day on me, although I have a
>>>> handy perl script to keep it alive now <grin>...

> basically the server randomly dies with a:
> ERROR:  postmaster: StreamConnection: accept: Invalid argument
> pmdie 3
> (then signals all children to drop dead)

Hmm.  That shouldn't happen, especially not randomly; if the accept
works the first time then it should work forever after, since the
arguments being passed in never change.

The error is coming from StreamConnection() in
pgsql/src/backend/libpq/pqcomm.c.  Could you maybe add some debugging
code to the routine to see what the server_fd and port arguments are
when accept() fails?  I think just changing the first elog() to

elog(ERROR,
     "postmaster: StreamConnection: accept: %m\nserver_fd = %d, port = %p",
     server_fd, port);

would do for starters.  This would let us eliminate the possibility that
the routine is getting passed bad arguments.

An alternative possibility is to run the postmaster under truss so you
can see what arguments are passed to the kernel on every kernel call,
but that'd generate a pretty verbose logfile.
        regards, tom lane


Re: postmaster dies (was Re: Very disappointing performance)

From: secret
Tom Lane wrote:

> secret <secret@kearneydev.com> writes:
> >>>> PostgreSQL is also crashing 1-2 times a day on me, although I have a
> >>>> handy perl script to keep it alive now <grin>...
>
> > basically the server randomly dies with a:
> > ERROR:  postmaster: StreamConnection: accept: Invalid argument
> > pmdie 3
> > (then signals all children to drop dead)
>
> Hmm.  That shouldn't happen, especially not randomly; if the accept
> works the first time then it should work forever after, since the
> arguments being passed in never change.
>
> The error is coming from StreamConnection() in
> pgsql/src/backend/libpq/pqcomm.c.  Could you maybe add some debugging
> code to the routine to see what the server_fd and port arguments are
> when accept() fails?  I think just changing the first elog() to
>
> elog(ERROR,
>      "postmaster: StreamConnection: accept: %m\nserver_fd = %d, port = %p",
>      server_fd, port);
>
> would do for starters.  This would let us eliminate the possibility that
> the routine is getting passed bad arguments.
>
> An alternative possibility is to run the postmaster under truss so you
> can see what arguments are passed to the kernel on every kernel call,
> but that'd generate a pretty verbose logfile.
>
>                         regards, tom lane
   Done.  I'll install the new binaries at the end of the day when no one is
using the database and give you a copy of the logs when it dies again.  Thank
you for the help on this, it's very much appreciated.

David Secret
MIS Director
Kearney Development Co., Inc.




Re: postmaster dies (was Re: Very disappointing performance)

From: secret
Tom Lane wrote:

> secret <secret@kearneydev.com> writes:
> >>>> PostgreSQL is also crashing 1-2 times a day on me, although I have a
> >>>> handy perl script to keep it alive now <grin>...
>
> > basically the server randomly dies with a:
> > ERROR:  postmaster: StreamConnection: accept: Invalid argument
> > pmdie 3
> > (then signals all children to drop dead)
>
> Hmm.  That shouldn't happen, especially not randomly; if the accept
> works the first time then it should work forever after, since the
> arguments being passed in never change.
>
> The error is coming from StreamConnection() in
> pgsql/src/backend/libpq/pqcomm.c.  Could you maybe add some debugging
> code to the routine to see what the server_fd and port arguments are
> when accept() fails?  I think just changing the first elog() to
>
> elog(ERROR,
>      "postmaster: StreamConnection: accept: %m\nserver_fd = %d, port = %p",
>      server_fd, port);
>
> would do for starters.  This would let us eliminate the possibility that
> the routine is getting passed bad arguments.
>
> An alternative possibility is to run the postmaster under truss so you
> can see what arguments are passed to the kernel on every kernel call,
> but that'd generate a pretty verbose logfile.
>
>                         regards, tom lane

query: SELECT "material_id" ,"name" ,"short_name" ,"legacy" FROM "material" ORDER BY "legacy" DESC,"name"
ProcessQuery
! system usage stats:
!       0.017961 elapsed 0.020000 user 0.000000 system sec
!       [0.050000 user 0.020000 sys total]
!       0/0 [0/0] filesystem blocks in/out
!       6/24 [127/201] page faults/reclaims, 0 [0] swaps
!       0 [0] signals rcvd, 0/0 [0/0] messages rcvd/sent
!       0/0 [0/0] voluntary/involuntary context switches
! postgres usage stats:
!       Shared blocks:          0 read,          0 written, buffer hit rate = 100.00%
!       Local  blocks:          0 read,          0 written, buffer hit rate = 0.00%
!       Direct blocks:          0 read,          0 written
CommitTransactionCommand
ERROR:  postmaster: StreamConnection: accept: Invalid argument
server_fd = 3, port = 0x816aa70
pmdie 3
SignalChildren: sending signal 15 to process 16943
SignalChildren: sending signal 15 to process 16942
SignalChildren: sending signal 15 to process 16941
   There we go, it crashed this morning (interestingly, it went all of
yesterday without crashing).  Does this shed some light?  If not, what would
you like me to do next?  I have 700M+ free for a log file; as long as it doesn't
generate that much in a day, we should be okay with a very verbose log.
   Just tell me what code mods or runtime options to use...

David Secret
MIS Director
Kearney Development Co., Inc.




Re: postmaster dies (was Re: Very disappointing performance)

From: Tom Lane
secret <secret@kearneydev.com> writes:
> ERROR:  postmaster: StreamConnection: accept: Invalid argument
> server_fd = 3, port = 0x816aa70

>     There we go, it crashed this morning...(interestingly it went all of
> yesterday without crashing)... Does this shed some light?

Not much ... it shows pretty much what we expected, ie, nothing
obviously wrong.

What I would suggest doing next is running the postmaster under 'truss'
or some similar utility that can generate a logfile of all the kernel
calls made by the postmaster.  I can't give you any details on how to do
that --- perhaps some other reader can help?  What we're looking for is
anything that might have changed the state of file descriptor 3 shortly
before the crash.

BTW, some tips on debugging this.  Maybe these are obvious, maybe not:

1. This accept call is not associated with normal query processing, but
with receiving connection requests from new clients.  Almost certainly
the bug is not triggered by processing queries but by connection
attempts.  You probably could make the crash happen sooner by starting
and stopping clients in a steady stream (not that you want a crash
sooner on your real system, of course, but for debugging it'd be nice
not to have to wait for long).

2. You might want to build a playpen system that you can stress into
crashing without taking out your live server.  The easiest way to do
that is just to duplicate your installation on another machine, but if
no other machine is handy (or if you suspect a platform-dependent bug,
which I do here) the best bet is to build a debugging version of
Postgres that has nonstandard values for the installation directory
and server's port address.  For example I usually build trial versions
with

./configure --with-pgport=5440 --prefix=/users/postgres/testversion

(plus any options you normally use, of course).  I think it might also
be possible to set these values while running initdb and starting the
test postmaster, without having to recompile; but I don't know the
exact incantations to use to do it that way.
        regards, tom lane
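
Tip 1 above (a steady stream of clients starting and stopping) can be sketched as a small loop. This is only a minimal sketch, assuming a POSIX shell. The function name `hammer_connections` is hypothetical, and the psql invocation in the comment assumes the playpen postmaster listens on port 5440 and that a template1 database exists.

```shell
# Hypothetical helper: run a client command COUNT times in a row and stop at
# the first failure, reporting how many attempts succeeded before it.
# Usage: hammer_connections COUNT COMMAND [ARGS...]
hammer_connections() {
    count=$1
    shift
    i=0
    while [ "$i" -lt "$count" ]; do
        # Each iteration starts one client, lets it run, and waits for it to exit.
        if ! "$@" >/dev/null 2>&1; then
            echo "attempt $((i + 1)) failed after $i successes"
            return 1
        fi
        i=$((i + 1))
    done
    echo "all $count attempts succeeded"
}

# Example against a playpen postmaster (assumed port and database name):
#   hammer_connections 10000 psql -p 5440 -c 'SELECT 1' template1
```

A failed attempt only says that one client could not finish; whether the postmaster itself died still has to be checked in its log.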


Re: postmaster dies (was Re: Very disappointing performance)

From: secret
Tom Lane wrote:

> secret <secret@kearneydev.com> writes:
> > ERROR:  postmaster: StreamConnection: accept: Invalid argument
> > server_fd = 3, port = 0x816aa70
>
> >     There we go, it crashed this morning...(interestingly it went all of
> > yesterday without crashing)... Does this shed some light?
>
> Not much ... it shows pretty much what we expected, ie, nothing
> obviously wrong.
>
> What I would suggest doing next is running the postmaster under 'truss'
> or some similar utility that can generate a logfile of all the kernel
> calls made by the postmaster.  I can't give you any details on how to do
> that --- perhaps some other reader can help?  What we're looking for is
> anything that might have changed the state of file descriptor 3 shortly
> before the crash.
>
> BTW, some tips on debugging this.  Maybe these are obvious, maybe not:
>
> 1. This accept call is not associated with normal query processing, but
> with receiving connection requests from new clients.  Almost certainly
> the bug is not triggered by processing queries but by connection
> attempts.  You probably could make the crash happen sooner by starting
> and stopping clients in a steady stream (not that you want a crash
> sooner on your real system, of course, but for debugging it'd be nice
> not to have to wait for long).
>
> 2. You might want to build a playpen system that you can stress into
> crashing without taking out your live server.  The easiest way to do
> that is just to duplicate your installation on another machine, but if
> no other machine is handy (or if you suspect a platform-dependent bug,
> which I do here) the best bet is to build a debugging version of
> Postgres that has nonstandard values for the installation directory
> and server's port address.  For example I usually build trial versions
> with
>
> ./configure --with-pgport=5440 --prefix=/users/postgres/testversion
>
> (plus any options you normally use, of course).  I think it might also
> be possible to set these values while running initdb and starting the
> test postmaster, without having to recompile; but I don't know the
> exact incantations to use to do it that way.
>
>                         regards, tom lane
   Would strace work instead of truss?  I have strace... Do you think you will
be able to interpret the strace files and determine the problem?
   You've been the only one to respond on this, so I'm a tad worried about
being left out in the cold on this one... I'd be glad to pay for support if
there is a place I can do that.  Heck, I pay for support on other software
products, why not PostgreSQL?
   Please let me know.  I'll begin an strace tonight...

David
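
A minimal sketch of how such a trace might be collected and narrowed down, assuming Linux strace (its `-f`, `-tt`, `-o` and `-p` options; truss and ktrace take different options). The helper name `fd_calls`, the log path and `$POSTMASTER_PID` are hypothetical. The filter keeps only system calls whose first argument is the descriptor of interest, here 3, the server_fd value reported in the crash log.

```shell
# Hypothetical helper: print the system calls in an strace log whose first
# argument is the given file descriptor (e.g. accept(3, ...), close(3)).
# Usage: fd_calls FD LOGFILE
fd_calls() {
    fd=$1
    log=$2
    # Match "name(FD," or "name(FD)"; the pid and timestamp that strace
    # writes with -f and -tt may precede the call name on each line.
    grep -E "[a-z_0-9]+\\(${fd}[,)]" "$log"
}

# Capturing the trace (assumption: attach to an already running postmaster):
#   strace -f -tt -o /tmp/postmaster.trace -p "$POSTMASTER_PID"
# After the crash, look at the last calls that touched descriptor 3:
#   fd_calls 3 /tmp/postmaster.trace | tail -n 50
```

Calls that take the descriptor in a later argument position (dup2, for instance) are not caught by this filter, so the unfiltered log is still worth keeping.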




Re: [HACKERS] Re: postmaster dies (was Re: Very disappointing performance)

From: Bruce Momjian
>     Would strace work instead of truss?  I have strace... Will you be able to
> interpret the strace files & determine the problem do you think?
> 
>     You've been the only one to respond on this, so I'm a tad worried about
> being left out in the cold on this one... I'd be glad to pay for support if
> there is a place I can do that, heck I pay for support on other software
> products, why not PostgreSQL?
> 
>     Please let me know.  I'll begin an strace tonight...

I can't imagine he has enough disk space for truss/ktrace output for a
full day of backend activity, does he?

--
  Bruce Momjian                        |  http://www.op.net/~candle
  maillist@candle.pha.pa.us            |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
 


Bruce Momjian wrote:

> >     Would strace work instead of truss?  I have strace... Will you be able to
> > interpret the strace files & determine the problem do you think?
> >
> >     You've been the only one to respond on this, so I'm a tad worried about
> > being left out in the cold on this one... I'd be glad to pay for support if
> > there is a place I can do that, heck I pay for support on other software
> > products, why not PostgreSQL?
> >
> >     Please let me know.  I'll begin an strace tonight...
>
> I can't imagine he has enough disk space for truss/ktrace output for a
> full day of backend activity, does he?
>
> --
>   Bruce Momjian                        |  http://www.op.net/~candle
>   maillist@candle.pha.pa.us            |  (610) 853-3000
>   +  If your life is a hard drive,     |  830 Blythe Avenue
>   +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
   Ur, I'll postpone this to Thursday, when I can monitor the disk space very
carefully.  How much space are we talking about here?  1G? 2G? 3G? 10G?
   Maybe I can temporarily install a hard disk just for that purpose....  There
are only a few users on the database, it really isn't *THAT* active.

--David




Re: [HACKERS] Re: postmaster dies (was Re: Very disappointing performance)

From: Bruce Momjian
> 
>     Ur, I'll postpone this to Thursday, when I can monitor the disk space very
> carefully, how much space are we talking about here?  1G? 2G? 3G? 10G?
> 
>     Maybe I can temporarily install a hard disk just for that purpose....  There
> are only a few users on the database, it really isn't *THAT* active.

Hard to say.  I would turn it on for 15 minutes and see.  ktrace can
generate 1MB of output in a minute.

--
  Bruce Momjian                        |  http://www.op.net/~candle
  maillist@candle.pha.pa.us            |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
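
The "run it for 15 minutes and see" advice amounts to a simple projection. A minimal sketch, assuming a POSIX shell with 64-bit arithmetic; the function name `trace_mb_per_day` is hypothetical, and the 1 MB per minute input is just the rough ktrace figure quoted above, not a measurement from the system in question.

```shell
# Hypothetical helper: project a day's trace volume (in MB) from a short sample.
# Usage: trace_mb_per_day SAMPLE_BYTES SAMPLE_MINUTES
trace_mb_per_day() {
    bytes=$1
    minutes=$2
    # bytes * 1440 minutes per day / sample minutes, then bytes -> MB (1048576).
    echo $(( bytes * 1440 / minutes / 1048576 ))
}

# A 15-minute sample at 1 MB per minute is 15 MB (15728640 bytes):
trace_mb_per_day 15728640 15    # prints 1440
```

At that rate a full day comes to roughly 1.4 GB, about twice the 700M mentioned earlier in the thread.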
 


Bruce Momjian <maillist@candle.pha.pa.us> writes:
> I can't imagine he has enough disk space for truss/ktrace output for a
> full day of backend activity, does he?

That's why I was encouraging him to set up a playpen and actively
work at crashing it, rather than waiting around to see whether it'd
happen before his disk fills up ;-)
        regards, tom lane


Tom Lane wrote:

> Bruce Momjian <maillist@candle.pha.pa.us> writes:
> > I can't imagine he has enough disk space for truss/ktrace output for a
> > full day of backend activity, does he?
>
> That's why I was encouraging him to set up a playpen and actively
> work at crashing it, rather than waiting around to see whether it'd
> happen before his disk fills up ;-)
>
>                         regards, tom lane
   I've built a simple program to record the last N lines (currently
5000... Suggestions?) of input.  What I'd like to do is pipe STDOUT and
STDERR to this program, but "|" alone doesn't do this.  Do you all have a
suggestion on how to do this?  If I can, then I can get you the system trace
and hopefully get this crash bug fixed.





On Tue, 23 Mar 1999, secret wrote:
>    I've built a simple program to record the last N lines (currently
>5000... Suggestions?) of input.  What I'd like to do is pipe STDOUT and
>STDERR to this program, but "|" alone doesn't do this.  Do you all have a
>suggestion on how to do this?  If I can, then I can get you the system trace
>and hopefully get this crash bug fixed.

strace ... 2>&1 | tail -5000

Note that tail is a standard *nix program.

Taral
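
The reason a bare "|" is not enough is that a pipe carries stdout only, while strace writes its trace to stderr by default. A minimal sketch of the difference, assuming a POSIX shell; `noisy` is a stand-in for the traced command and `last_lines` is a hypothetical wrapper around the one-liner above.

```shell
# Stand-in for a traced command: writes one line to each output stream.
noisy() {
    echo "line on stdout"
    echo "line on stderr" >&2
}

# Hypothetical helper: keep only the last N lines of a command's combined
# stdout and stderr. Usage: last_lines N COMMAND [ARGS...]
last_lines() {
    n=$1
    shift
    # 2>&1 sends stderr to the same pipe that stdout already feeds.
    "$@" 2>&1 | tail -n "$n"
}

noisy | wc -l                    # stderr bypasses the pipe, so wc counts 1 line
last_lines 5000 noisy | wc -l    # both streams reach the pipe, so wc counts 2
```

Note that tail prints nothing until its input ends, so the saved lines only appear once the traced process has exited.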