Thread: Postgres Crash
Howdy,
Environment:
Solaris 10
Postgres 8.3.12
Postgres crashed and left 26 postmaster processes active in its wake. I killed the children and restarted Postgres successfully. Messages from the log:
Dec 10 11:52:15 udrv postgres[771]: [ID 748848 local0.info] [6-1] host=,user=,db= LOG: setsockopt(TCP_NODELAY) failed: Invalid argument
Dec 10 11:52:20 udrv postgres[2183]: [ID 748848 local0.error] [1-1] host=,user=,db= FATAL: pre-existing shared memory block (key 5432001, ID 0) is still in use
Dec 10 11:52:20 udrv postgres[2183]: [ID 748848 local0.error] [1-2] host=,user=,db= HINT: If you're sure there are no old server processes still running, remove the shared memory block with the
Dec 10 11:52:20 udrv postgres[2183]: [ID 748848 local0.error] [1-3] command "ipcclean", "ipcrm", or just delete the file "postmaster.pid".
With the ‘FATAL’ and ‘HINT’ lines repeating.
Any ideas what occurred here?
Thanks,
Sam
Samuel Stearns <SStearns@internode.com.au> writes:
> Any ideas what occurred here?

Nope. The log entries above are from the restart attempt, and give no information about the crash. If you have any log entries from before that, or have a core file that would yield a backtrace, maybe we could draw some conclusions from that info.

regards, tom lane
Did you try deleting the postmaster.pid file and then restarting?
--
Shoaib Mir
http://shoaibmir.wordpress.com/
Thanks Tom and Shoaib,
Shoaib, I did not delete postmaster.pid. I killed the children and re-started successfully.
Tom, no useful messages in the log prior. I do have a 47M core dump. What should I do with that?
Sam
From: Shoaib Mir [mailto:shoaibmir@gmail.com]
Sent: Friday, 10 December 2010 2:00 PM
To: Samuel Stearns
Cc: pgsql-admin@postgresql.org
Subject: Re: [ADMIN] Postgres Crash
> Thanks Tom and Shoaib,
> Shoaib, I did not delete postmaster.pid. I killed the children and re-started successfully.

So is the database server all good and working fine now?
--
Shoaib Mir
http://shoaibmir.wordpress.com/
Yes.
Samuel Stearns <SStearns@internode.com.au> writes:
> Tom, no useful messages in the log prior. I do have a 47M core dump. What should I do with that?

If you use gdb, try

    $ gdb /path/to/postmaster /path/to/corefile
    gdb> bt
    ... useful info here ...
    gdb> quit

I think the preferred debugger on Solaris might not be gdb, but if so you'll need to consult its docs to find out how to get a stack trace.

regards, tom lane
Thanks Tom,

We don't have gdb. We have mdb and pstack. From the core:

[root@udrv] # mdb /opt/postgres/8.3-community/bin/postmaster /root/core
Loading modules: [ libc.so.1 ld.so.1 ]
> ::status
debugging core file of postmaster (32-bit) from udrv
file: /opt/postgres/8.3-community/bin/postmaster
initial argv: /opt/postgres/8.3-community/bin/postmaster -F
threading model: multi-threaded
status: process terminated by SIGSEGV (Segmentation Fault)
> ::regs
%cs = 0x003b            %eax = 0x083b7fe0
%ds = 0x0043            %ebx = 0x00000000
%ss = 0x0043            %ecx = 0x00000000
%es = 0x0043            %edx = 0x00000000
%fs = 0x0000            %esi = 0x00000001
%gs = 0x01c3            %edi = 0x00000005

%eip = 0x081a8562 ConnCreate+0xb6
%ebp = 0x08047c88
%kesp = 0x00000000

%eflags = 0x00010206
  id=0 vip=0 vif=0 ac=0 vm=0 rf=1 nt=0 iopl=0x0
  status=<of,df,IF,tf,sf,zf,af,PF,cf>

%esp = 0x08047c78
%trapno = 0xe
%err = 0x6

[root@udrv] # pstack /root/core
core '/root/core' of 771: /opt/postgres/8.3-community/bin/postmaster -F
 081a8562 ConnCreate (5) + b6
 081a791b ServerLoop (8047e68, 83b7930, 2, fead58be, 8047e68, 83c28b8) + db
 081a73f1 PostmasterMain (2, 83b7930) + ab5
 08164e3a main (2, 8047e44, 8047e50) + 17a
 080891fa _start (2, 8047ed0, 8047efb, 0, 8047efe, 8047f2e) + 7a

Sam
Tom,

Could it possibly be this?

http://postgresql.1045698.n5.nabble.com/BUG-5731-postmaster-sometimes-dumps-core-when-handling-local-connections-td3239029.html

Sam
Samuel Stearns <SStearns@internode.com.au> writes:
> [root@udrv] # pstack /root/core
> core '/root/core' of 771: /opt/postgres/8.3-community/bin/postmaster -F
> 081a8562 ConnCreate (5) + b6
> 081a791b ServerLoop (8047e68, 83b7930, 2, fead58be, 8047e68, 83c28b8) + db
> 081a73f1 PostmasterMain (2, 83b7930) + ab5
> 08164e3a main (2, 8047e44, 8047e50) + 17a
> 080891fa _start (2, 8047ed0, 8047efb, 0, 8047efe, 8047f2e) + 7a

Hmmm ... does your build have GSS enabled (configure --with-gssapi)? If so I think you ran into this recently-discovered issue:
http://archives.postgresql.org/pgsql-committers/2010-10/msg00253.php

I had originally thought that your log message about
    setsockopt(TCP_NODELAY) failed: Invalid argument
was post-crash, but if it was pre-crash it supports that theory, because that error would in fact lead to the core dump in ConnCreate if you had ENABLE_GSS on. In any case that log message is pretty odd: it's not at all clear how the setsockopt call could have failed. Failure to establish a socket should bail out earlier.

regards, tom lane
Samuel Stearns <SStearns@internode.com.au> writes:
> Could it possibly be this?:
> http://postgresql.1045698.n5.nabble.com/BUG-5731-postmaster-sometimes-dumps-core-when-handling-local-connections-td3239029.html

Yeah, I'd just been off digging through the code to arrive at that same theory. Did you build with GSSAPI support?

regards, tom lane
It's not our build. It's the one downloaded from the Postgres homepage:

http://www.postgresql.org/ftp/binary/v8.3.12/solaris/solaris10/i386/

Sam
Samuel Stearns <SStearns@internode.com.au> writes:
> It's not our build. It's the one downloaded from the Postgres homepage:
> http://www.postgresql.org/ftp/binary/v8.3.12/solaris/solaris10/i386/

pg_config --configure would tell you how it was built.

regards, tom lane
Tom,

Yes, with gssapi.

Sam
Samuel Stearns <SStearns@internode.com.au> writes:
> Yes, with gssapi.

Well, then we have our smoking gun, but it's still not clear *why* the setsockopt() call failed.

regards, tom lane
Tom,

So you are in agreement that the fix is the following, from that previous link?

> This simple patch seems to fix the problem

--- src/backend/postmaster/postmaster.c.orig 2010-10-27 19:07:42.000000000 +0400
+++ src/backend/postmaster/postmaster.c 2010-10-27 19:08:25.000000000 +0400
@@ -1917,7 +1917,7 @@
 		if (port->sock >= 0)
 			StreamClose(port->sock);
 		ConnFree(port);
-		port = NULL;
+		return NULL;
 	}
 	else
 	{

Sam
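For readers following along, the one-line change in that patch can be illustrated with a small self-contained C sketch. This is not the PostgreSQL source: `Conn`, `setup_socket`, `conn_create`, and `conn_free` are hypothetical names, and the failure is simulated with a flag. It only shows why returning immediately after a failed connection setup avoids using a pointer that has just been freed and set to NULL, which appears to be the failure mode discussed in this thread.

```c
#include <stdlib.h>

/* Hypothetical stand-in for the postmaster's per-connection state. */
typedef struct Conn {
    int   sock;  /* -1 until the connection is set up */
    void *extra; /* state that is only allocated after a successful setup */
} Conn;

/* Hypothetical stand-in for accepting and configuring a connection.
 * Returns 0 on success, -1 on failure (simulated through the flag). */
static int setup_socket(Conn *c, int simulate_failure) {
    if (simulate_failure)
        return -1;
    c->sock = 3; /* arbitrary descriptor number, for illustration only */
    return 0;
}

/* Shape of the fixed code path: clean up and return NULL at once. */
Conn *conn_create(int simulate_failure) {
    Conn *c = calloc(1, sizeof(*c));
    if (c == NULL)
        return NULL;
    c->sock = -1;

    if (setup_socket(c, simulate_failure) != 0) {
        free(c);
        /* The pre-patch shape was "c = NULL;" here, after which control
         * fell through to the code below. */
        return NULL;
    }

    /* Everything from here on assumes c is valid. With "c = NULL" above
     * instead of "return NULL", this assignment would dereference a
     * null pointer and the process would die with SIGSEGV. */
    c->extra = calloc(1, 16);
    if (c->extra == NULL) {
        free(c);
        return NULL;
    }
    return c;
}

/* Release a connection created by conn_create(); NULL is accepted. */
void conn_free(Conn *c) {
    if (c != NULL) {
        free(c->extra);
        free(c);
    }
}
```

Calling `conn_create(1)` exercises the failure branch and yields NULL; `conn_create(0)` yields a usable object that should be released with `conn_free`.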
Tom,

I'm getting info from our sysadmins that we can't re-compile because we don't have the Sun compiler. Is this fixed in a later release of Postgres?

Sam
Samuel Stearns <SStearns@internode.com.au> writes:
> I'm getting info from our sysadmins that we can't re-compile because we don't have the Sun compiler. Is this fixed in a later release of Postgres?

The fix will be in next week's releases.

regards, tom lane
So will that be an 8.3.13?

Sam
Thanks for all the help with this, Tom. How do I go about finding the release with the fix applied?

Sam
Tom Lane-2 wrote:
> Samuel Stearns <SStearns@internode.com.au> writes:
>> Yes, with gssapi.
>
> Well, then we have our smoking gun, but it's still not clear *why*
> the setsockopt() call failed.
>
> regards, tom lane

I know why it failed. I just had two independent database servers (8.4.4 and 9.0.0 on Solaris) crash AT THE SAME TIME with the exact same error. Reason? nmap. Our system administrator ran nmap on the subnet in question to scan for hosts with open TCP ports, which caused the setsockopt() call to fail.

I know the bug has already been fixed, but it's good to know anyway.

--Al
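Al's report suggests that a port scan can leave a server holding a connection on which setsockopt() fails. A minimal sketch of handling that defensively, assuming a POSIX sockets API: `set_nodelay` is a hypothetical helper, not a PostgreSQL function, and the log text simply mirrors the message quoted earlier in the thread. The point is only that the caller gets a clear failure result and can drop the connection instead of carrying on with it.

```c
#include <errno.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>

/* Hypothetical helper: enable TCP_NODELAY on fd.
 * Returns 0 on success and -1 on failure, leaving errno set to the
 * value reported by setsockopt() so the caller can inspect it. */
int set_nodelay(int fd) {
    int on = 1;

    if (setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &on, sizeof(on)) < 0) {
        int saved = errno; /* fprintf may change errno, so keep a copy */
        fprintf(stderr, "setsockopt(TCP_NODELAY) failed: %s\n",
                strerror(saved));
        errno = saved;
        return -1;
    }
    return 0;
}
```

A caller would typically close the descriptor and skip the connection when this returns -1, rather than treating the failure as impossible.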
Hi,

Thanks for the info. Where was it fixed, in Postgres or in nmap? Do you also know the version?

br,
Antonio