Thread: Postgres Crash

Postgres Crash

From
Samuel Stearns
Date:

Howdy,

Environment:

Solaris 10
Postgres 8.3.12

Postgres crashed and left 26 postmaster processes active in its wake.  I killed the children and restarted Postgres successfully.  Messages from the log:

Dec 10 11:52:15 udrv postgres[771]: [ID 748848 local0.info] [6-1] host=,user=,db= LOG:  setsockopt(TCP_NODELAY) failed: Invalid argument
Dec 10 11:52:20 udrv postgres[2183]: [ID 748848 local0.error] [1-1] host=,user=,db= FATAL:  pre-existing shared memory block (key 5432001, ID 0) is still in use
Dec 10 11:52:20 udrv postgres[2183]: [ID 748848 local0.error] [1-2] host=,user=,db= HINT:  If you're sure there are no old server processes still running, remove the shared memory block with the
Dec 10 11:52:20 udrv postgres[2183]: [ID 748848 local0.error] [1-3]  command "ipcclean", "ipcrm", or just delete the file "postmaster.pid".

The ‘FATAL’ and ‘HINT’ lines then repeat.

Any ideas what occurred here?

Thanks,

Sam

Re: Postgres Crash

From
Tom Lane
Date:
Samuel Stearns <SStearns@internode.com.au> writes:
> Environment:

> Solaris 10
> Postgres 8.3.12

> Postgres crashed and left 26 postmaster processes active in its wake.  Killed the children and re-started postgres successfully.  Messages from the log:

> Dec 10 11:52:15 udrv postgres[771]: [ID 748848 local0.info] [6-1] host=,user=,db= LOG:  setsockopt(TCP_NODELAY) failed: Invalid argument
> Dec 10 11:52:20 udrv postgres[2183]: [ID 748848 local0.error] [1-1] host=,user=,db= FATAL:  pre-existing shared memory block (key 5432001, ID 0) is still in use
> Dec 10 11:52:20 udrv postgres[2183]: [ID 748848 local0.error] [1-2] host=,user=,db= HINT:  If you're sure there are no old server processes still running, remove the shared memory block with the
> Dec 10 11:52:20 udrv postgres[2183]: [ID 748848 local0.error] [1-3]  command "ipcclean", "ipcrm", or just delete the file "postmaster.pid".

> With the 'FATAL' and 'HINT' lines repeating.

> Any ideas what occurred here?

Nope.  The log entries above are from the restart attempt, and give no
information about the crash.  If you have any log entries from before
that, or have a core file that would yield a backtrace, maybe we could
draw some conclusions from that info.

            regards, tom lane

Re: Postgres Crash

From
Shoaib Mir
Date:
On Fri, Dec 10, 2010 at 2:17 PM, Samuel Stearns <SStearns@internode.com.au> wrote:

> Howdy,
>
> Environment:
>
> Solaris 10
> Postgres 8.3.12
>
> Postgres crashed and left 26 postmaster processes active in its wake.  Killed the children and re-started postgres successfully.  Messages from the log:
>
> Dec 10 11:52:15 udrv postgres[771]: [ID 748848 local0.info] [6-1] host=,user=,db= LOG:  setsockopt(TCP_NODELAY) failed: Invalid argument

Did you try deleting the postmaster.pid file and then restarting?

--
Shoaib Mir
http://shoaibmir.wordpress.com/

Re: Postgres Crash

From
Samuel Stearns
Date:

Thanks Tom and Shoaib,

Shoaib, I did not delete postmaster.pid.  I killed the children and re-started successfully.

Tom, there are no useful messages in the log prior to the crash.  I do have a 47M core dump.  What should I do with that?

Sam

 

From: Shoaib Mir [mailto:shoaibmir@gmail.com]
Sent: Friday, 10 December 2010 2:00 PM
To: Samuel Stearns
Cc: pgsql-admin@postgresql.org
Subject: Re: [ADMIN] Postgres Crash

 


Re: Postgres Crash

From
Shoaib Mir
Date:
On Fri, Dec 10, 2010 at 2:33 PM, Samuel Stearns <SStearns@internode.com.au> wrote:

> Thanks Tom and Shoaib,
>
> Shoaib, I did not delete postmaster.pid.  I killed the children and re-started successfully.

So is the database server all good and working fine now?

--
Shoaib Mir
http://shoaibmir.wordpress.com/

Re: Postgres Crash

From
Samuel Stearns
Date:

Yes.

 

From: pgsql-admin-owner@postgresql.org [mailto:pgsql-admin-owner@postgresql.org] On Behalf Of Shoaib Mir
Sent: Friday, 10 December 2010 2:06 PM
To: Samuel Stearns
Cc: pgsql-admin@postgresql.org
Subject: Re: [ADMIN] Postgres Crash

 


Re: Postgres Crash

From
Tom Lane
Date:
Samuel Stearns <SStearns@internode.com.au> writes:
> Tom, no useful messages in the log prior.  I do have a 47M core dump.  What should I do with that?

If you use gdb, try

    $ gdb /path/to/postmaster /path/to/corefile
    gdb> bt
    ... useful info here ...
    gdb> quit

I think the preferred debugger on Solaris might not be gdb, but if so
you'll need to consult its docs to find out how to get a stack trace.

            regards, tom lane

Re: Postgres Crash

From
Samuel Stearns
Date:
Thanks Tom,

We don't have gdb.  We have mdb and pstack.  From the core:

[root@udrv] # mdb /opt/postgres/8.3-community/bin/postmaster /root/core
Loading modules: [ libc.so.1 ld.so.1 ]
> ::status
debugging core file of postmaster (32-bit) from udrv
file: /opt/postgres/8.3-community/bin/postmaster
initial argv: /opt/postgres/8.3-community/bin/postmaster -F
threading model: multi-threaded
status: process terminated by SIGSEGV (Segmentation Fault)
> ::regs
%cs = 0x003b            %eax = 0x083b7fe0
%ds = 0x0043            %ebx = 0x00000000
%ss = 0x0043            %ecx = 0x00000000
%es = 0x0043            %edx = 0x00000000
%fs = 0x0000            %esi = 0x00000001
%gs = 0x01c3            %edi = 0x00000005

 %eip = 0x081a8562 ConnCreate+0xb6
 %ebp = 0x08047c88
%kesp = 0x00000000

%eflags = 0x00010206
  id=0 vip=0 vif=0 ac=0 vm=0 rf=1 nt=0 iopl=0x0
  status=<of,df,IF,tf,sf,zf,af,PF,cf>

   %esp = 0x08047c78
%trapno = 0xe
   %err = 0x6

 [root@udrv] # pstack /root/core
core '/root/core' of 771:       /opt/postgres/8.3-community/bin/postmaster -F
 081a8562 ConnCreate (5) + b6
 081a791b ServerLoop (8047e68, 83b7930, 2, fead58be, 8047e68, 83c28b8) + db
 081a73f1 PostmasterMain (2, 83b7930) + ab5
 08164e3a main     (2, 8047e44, 8047e50) + 17a
 080891fa _start   (2, 8047ed0, 8047efb, 0, 8047efe, 8047f2e) + 7a
>

Sam

-----Original Message-----
From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
Sent: Friday, 10 December 2010 2:17 PM
To: Samuel Stearns
Cc: Shoaib Mir; pgsql-admin@postgresql.org
Subject: Re: [ADMIN] Postgres Crash


Re: Postgres Crash

From
Samuel Stearns
Date:
Tom,

Could it possibly be this?:


http://postgresql.1045698.n5.nabble.com/BUG-5731-postmaster-sometimes-dumps-core-when-handling-local-connections-td3239029.html

Sam


Re: Postgres Crash

From
Tom Lane
Date:
Samuel Stearns <SStearns@internode.com.au> writes:
>  [root@udrv] # pstack /root/core
> core '/root/core' of 771:       /opt/postgres/8.3-community/bin/postmaster -F
>  081a8562 ConnCreate (5) + b6
>  081a791b ServerLoop (8047e68, 83b7930, 2, fead58be, 8047e68, 83c28b8) + db
>  081a73f1 PostmasterMain (2, 83b7930) + ab5
>  08164e3a main     (2, 8047e44, 8047e50) + 17a
>  080891fa _start   (2, 8047ed0, 8047efb, 0, 8047efe, 8047f2e) + 7a

Hmmm ... does your build have GSS enabled (configure --with-gssapi)?
If so I think you ran into this recently-discovered issue:
http://archives.postgresql.org/pgsql-committers/2010-10/msg00253.php

I had originally thought that your log message about
setsockopt(TCP_NODELAY) failed: Invalid argument
was post-crash, but if it was pre-crash it supports that theory,
because that error would in fact lead to the core dump in ConnCreate
if you had ENABLE_GSS on.

In any case that log message is pretty odd: it's not at all clear how
the setsockopt call could have failed.  Failure to establish a socket
should bail out earlier.

            regards, tom lane

Re: Postgres Crash

From
Tom Lane
Date:
Samuel Stearns <SStearns@internode.com.au> writes:
> Could it possibly be this?:

> http://postgresql.1045698.n5.nabble.com/BUG-5731-postmaster-sometimes-dumps-core-when-handling-local-connections-td3239029.html

Yeah, I'd just been off digging through the code to arrive at that same
theory.  Did you build with GSSAPI support?

            regards, tom lane

Re: Postgres Crash

From
Samuel Stearns
Date:
It's not our build - it's the one downloaded from the Postgres website, at
http://www.postgresql.org/ftp/binary/v8.3.12/solaris/solaris10/i386/

Sam

-----Original Message-----
From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
Sent: Friday, 10 December 2010 2:53 PM
To: Samuel Stearns
Cc: Shoaib Mir; pgsql-admin@postgresql.org
Subject: Re: [ADMIN] Postgres Crash


Re: Postgres Crash

From
Tom Lane
Date:
Samuel Stearns <SStearns@internode.com.au> writes:
> It's not our build - it's the one downloaded from the Postgres website, at
> http://www.postgresql.org/ftp/binary/v8.3.12/solaris/solaris10/i386/

pg_config --configure would tell you how it was built.

            regards, tom lane

Re: Postgres Crash

From
Samuel Stearns
Date:
Tom,

Yes, with gssapi.

Sam

-----Original Message-----
From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
Sent: Friday, 10 December 2010 3:02 PM
To: Samuel Stearns
Cc: Shoaib Mir; pgsql-admin@postgresql.org
Subject: Re: [ADMIN] Postgres Crash


Re: Postgres Crash

From
Tom Lane
Date:
Samuel Stearns <SStearns@internode.com.au> writes:
> Yes, with gssapi.

Well, then we have our smoking gun, but it's still not clear *why*
the setsockopt() call failed.

            regards, tom lane

Re: Postgres Crash

From
Samuel Stearns
Date:
Tom,

So you are in agreement that the fix is:

This simple patch seem to fix the problem

--- src/backend/postmaster/postmaster.c.orig    2010-10-27 19:07:42.000000000 +0400
+++ src/backend/postmaster/postmaster.c 2010-10-27 19:08:25.000000000 +0400
@@ -1917,7 +1917,7 @@
                if (port->sock >= 0)
                        StreamClose(port->sock);
                ConnFree(port);
-               port = NULL;
+               return NULL;
        }
        else
        {

--

From that previous link?

Sam
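
For readers following along, here is a minimal C sketch of why the one-line change in that patch matters. It is not the PostgreSQL source: `Port`, `accept_connection`, `conn_create` and `conn_free` are simplified, hypothetical stand-ins, and the "extra state" allocation only plays the role of whatever later code dereferences the pointer. The idea is that setting the pointer to NULL on the error path lets execution fall through to code that dereferences it, while returning immediately avoids that.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical, simplified stand-in for the postmaster's per-connection state. */
typedef struct Port
{
    int   sock;   /* client socket, -1 if none */
    void *gss;    /* extra state, allocated only after a successful accept */
} Port;

/* Hypothetical stand-in for the accept step; returns 0 on success, -1 on failure. */
static int accept_connection(Port *port, int simulate_failure)
{
    if (simulate_failure)
        return -1;
    port->sock = 5;   /* pretend descriptor, for illustration only */
    return 0;
}

/* Release a Port and anything hanging off it; safe to call with NULL. */
void conn_free(Port *port)
{
    if (port != NULL)
    {
        free(port->gss);
        free(port);
    }
}

/* Patched shape: leave immediately on the error path. */
Port *conn_create(int simulate_failure)
{
    Port *port = calloc(1, sizeof(Port));

    if (port == NULL)
        return NULL;
    port->sock = -1;

    if (accept_connection(port, simulate_failure) != 0)
    {
        conn_free(port);
        return NULL;   /* "port = NULL;" here would fall through to the code below */
    }

    /* Only reachable with a live port, so the dereference below is safe. */
    assert(port != NULL);
    port->gss = calloc(1, 16);
    if (port->gss == NULL)
    {
        conn_free(port);
        return NULL;
    }
    return port;
}
```

In this sketch, `conn_create(1)` simulates the failed accept and hands back NULL without touching the freed structure, while `conn_create(0)` returns a fully set-up `Port` that the caller later releases with `conn_free`.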

-----Original Message-----
From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
Sent: Friday, 10 December 2010 3:22 PM
To: Samuel Stearns
Cc: Shoaib Mir; pgsql-admin@postgresql.org
Subject: Re: [ADMIN] Postgres Crash


Re: Postgres Crash

From
Samuel Stearns
Date:
Tom,

I'm getting info from our sysadmins that we can't re-compile because we don't have the Sun compiler.  Is this fixed in
a later release of Postgres?

Sam


Re: Postgres Crash

From
Tom Lane
Date:
Samuel Stearns <SStearns@internode.com.au> writes:
> I'm getting info from our sysadmins that we can't re-compile because we don't have the sun compiler.  Is this fixed
> in a later release of postgres?

The fix will be in next week's releases.

            regards, tom lane

Re: Postgres Crash

From
Samuel Stearns
Date:
So will that be an 8.3.13?

Sam

-----Original Message-----
From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
Sent: Friday, 10 December 2010 3:32 PM
To: Samuel Stearns
Cc: Shoaib Mir; pgsql-admin@postgresql.org
Subject: Re: [ADMIN] Postgres Crash


Re: Postgres Crash

From
Samuel Stearns
Date:
Thanks for all the help with this, Tom.

How do I go about finding the release with the fix applied?

Sam


Re: Postgres Crash

From
agfk
Date:
Tom Lane-2 wrote:
>
> Samuel Stearns <SStearns@internode.com.au> writes:
>> Yes, with gssapi.
>
> Well, then we have our smoking gun, but it's still not clear *why*
> the setsockopt() call failed.
>
>             regards, tom lane
>
I know why it failed.   I just had two independent database servers (8.4.4
and 9.0.0 on Solaris) crash AT THE SAME TIME with the exact same error.

Reason?   nmap.    Our system administrator ran nmap on the subnet in
question to scan for hosts with open TCP ports, which caused the
setsockopt() call to fail.

I know the bug has already been fixed, but it's good to know anyway.

--Al
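
As a rough illustration of the point above, the C sketch below treats a failed setsockopt() as an ordinary per-connection error rather than something that can never happen. One plausible reading, which this thread does not confirm, is that a port scanner opens a connection and drops it straight away, so the peer is already gone by the time the server sets socket options. `set_nodelay` is a hypothetical helper name, not a PostgreSQL function.

```c
#include <errno.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>

/*
 * Hypothetical helper: try to disable Nagle's algorithm on a client socket.
 * Returns 0 on success, -1 on failure; it never aborts the process.
 */
int set_nodelay(int fd)
{
    int on = 1;

    if (setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &on, sizeof(on)) < 0)
    {
        /* Report the failure and let the caller drop this one connection. */
        fprintf(stderr, "setsockopt(TCP_NODELAY) failed: %s\n", strerror(errno));
        return -1;
    }
    return 0;
}
```

A caller would close the descriptor and carry on serving other clients when `set_nodelay` returns -1, so a single odd connection cannot take the whole server down.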


Re: Postgres Crash

From
digant
Date:
Hi

Thanks for the info.
Where was it fixed, in Postgres or in nmap? Do you also know the version?

br,
Antonio

