Thread: 8.5alpha3 hot standby crash report (DatabasePath related?)
Howdy folks, Doing some testing with 8.5alpha3, my standby crashed this morning whilst doing some testing. Seems I have a core file, so thought I would send a report. Version: PostgreSQL 8.5alpha3 on i386-pc-solaris2.10, compiled by GCC gcc (GCC) 3.4.3 (csl-sol210-3_4-branch+sol_rpath), 32-bit Error in the log was simply this: LOG: restored log file "000000010000000300000032" from archive LOG: startup process (PID 385) was terminated by signal 11 LOG: terminating any other active server processes info from core file was: postgres@postgresdev[/export/home/postgres/pgwork/data/slave]dbx /export/home/postgres/pgwork/inst/slave/bin/postgres core For information about new features see `help changes' To remove this message, put `dbxenv suppress_startup_message 7.6' in your .dbxrc Reading postgres core file header read successfully Reading ld.so.1 Reading libnsl.so.1 Reading librt.so.1 Reading libsocket.so.1 Reading libm.so.2 Reading libc.so.1 Reading libaio.so.1 Reading libmd.so.1 program terminated by signal SEGV (no mapping at the fault address) 0xfedb578c: strlen+0x000c: movl (%eax),%edx Current function is RelationCacheInitFileInvalidate4237 snprintf(initfilename, sizeof(initfilename), "%s/%s", (dbx) where [1] strlen(0x0), at 0xfedb578c [2] _ndoprnt(0x8417d81, 0x8042e7c, 0x8042e50, 0x0), at 0xfee0d976 [3] snprintf(0x8042e90,0x400, 0x8417d7f, 0x0, 0x8417d30, 0x2f676f6c, 0x4f434552, 0x59524556), at 0xfee10734 =>[4] RelationCacheInitFileInvalidate(beforeSend = '\0'), line 4237 in "relcache.c" [5] xact_redo_commit(xlrec = 0x84ed434, xid = 0, lsn = RECORD), line 4414 in "xact.c" [6] StartupXLOG(), line 5834 in "xlog.c" [7] StartupProcessMain(), line 8359 in "xlog.c" [8] AuxiliaryProcessMain(argc= 2, argv = 0x80475e0), line 408 in "bootstrap.c" [9] StartChildProcess(type = 134510104), line 4218 in "postmaster.c" [10] PostmasterMain(argc = 3, argv = 0x84c6e00),line 1061 in "postmaster.c" [11] main(argc = 3, argv = 0x84c6e00), line 188 in "main.c" (dbx) print initfilename initfilename = "XLOG" (dbx) list4237 snprintf(initfilename, sizeof(initfilename), "%s/%s",4238 DatabasePath,RELCACHE_INIT_FILENAME);4239 4240 if (beforeSend)4241 {4242 /* no interlockneeded here */4243 unlink(initfilename);4244 }4245 else4246 { (dbx) print DatabasePath DatabasePath = (nil) Theo mentioned that he expects the above would work under Linux, but that we need DatabasePath to not be null in the above. FWIW, the SQL I was running on the master did involve two create database commands back to back. LMK if you have any questions. -- Robert Treat Conjecture: http://www.xzilla.net Consulting: http://www.omniti.com
On Thu, 2010-01-07 at 14:41 -0500, Robert Treat wrote: > Doing some testing with 8.5alpha3, my standby crashed this morning whilst > doing some testing. Seems I have a core file, so thought I would send a report. Thanks for the report. -- Simon Riggs www.2ndQuadrant.com
Robert Treat <xzilla@users.sourceforge.net> writes: > Doing some testing with 8.5alpha3, my standby crashed this morning whilst > doing some testing. Seems I have a core file, so thought I would send a report. Hmm. Looks like it's crashing because DatabasePath is not set yet. We could probably kluge around that by having this function return without doing anything if DatabasePath is null, but I wonder if this doesn't indicate that things are being done in the wrong order. If DatabasePath isn't set, then there's a whole lot of other infrastructure that's not good either: > Current function is RelationCacheInitFileInvalidate > 4237 snprintf(initfilename, sizeof(initfilename), "%s/%s", > 4238 DatabasePath, RELCACHE_INIT_FILENAME); > (dbx) where > [1] strlen(0x0), at 0xfedb578c > [2] _ndoprnt(0x8417d81, 0x8042e7c, 0x8042e50, 0x0), at 0xfee0d976 > [3] snprintf(0x8042e90, 0x400, 0x8417d7f, 0x0, 0x8417d30, 0x2f676f6c, > 0x4f434552, 0x59524556), at 0xfee10734 > =>[4] RelationCacheInitFileInvalidate(beforeSend = '\0'), line 4237 in > "relcache.c" > [5] xact_redo_commit(xlrec = 0x84ed434, xid = 0, lsn = RECORD), line 4414 in > "xact.c" > [6] StartupXLOG(), line 5834 in "xlog.c" > [7] StartupProcessMain(), line 8359 in "xlog.c" > [8] AuxiliaryProcessMain(argc = 2, argv = 0x80475e0), line 408 in > "bootstrap.c" > [9] StartChildProcess(type = 134510104), line 4218 in "postmaster.c" > [10] PostmasterMain(argc = 3, argv = 0x84c6e00), line 1061 in "postmaster.c" > [11] main(argc = 3, argv = 0x84c6e00), line 188 in "main.c" regards, tom lane
On Thu, 2010-01-07 at 14:41 -0500, Robert Treat wrote: > Doing some testing with 8.5alpha3, my standby crashed this morning whilst > doing some testing. Seems I have a core file, so thought I would send a report. > > Version: PostgreSQL 8.5alpha3 on i386-pc-solaris2.10, compiled by GCC gcc > (GCC) 3.4.3 (csl-sol210-3_4-branch+sol_rpath), 32-bit > > Error in the log was simply this: > LOG: restored log file "000000010000000300000032" from archive > LOG: startup process (PID 385) was terminated by signal 11 > LOG: terminating any other active server processes Please send some more of the prior log, so we can tell the initialization state at that point. If you can reproduce it, that would help also. Thanks. -- Simon Riggs www.2ndQuadrant.com
On Thu, 2010-01-07 at 16:05 -0500, Tom Lane wrote: > Robert Treat <xzilla@users.sourceforge.net> writes: > > Doing some testing with 8.5alpha3, my standby crashed this morning whilst > > doing some testing. Seems I have a core file, so thought I would send a report. > > Hmm. Looks like it's crashing because DatabasePath is not set yet. > We could probably kluge around that by having this function return > without doing anything if DatabasePath is null, but I wonder if this > doesn't indicate that things are being done in the wrong order. > If DatabasePath isn't set, then there's a whole lot of other > infrastructure that's not good either: OK, I think I see this now. Will fix. -- Simon Riggs www.2ndQuadrant.com
On Jan 7, 2010, at 4:15 PM, Simon Riggs wrote: > On Thu, 2010-01-07 at 14:41 -0500, Robert Treat wrote: > >> Doing some testing with 8.5alpha3, my standby crashed this morning >> whilst >> doing some testing. Seems I have a core file, so thought I would >> send a report. >> >> Version: PostgreSQL 8.5alpha3 on i386-pc-solaris2.10, compiled by >> GCC gcc >> (GCC) 3.4.3 (csl-sol210-3_4-branch+sol_rpath), 32-bit >> >> Error in the log was simply this: >> LOG: restored log file "000000010000000300000032" from archive >> LOG: startup process (PID 385) was terminated by signal 11 >> LOG: terminating any other active server processes > > Please send some more of the prior log, so we can tell the > initialization state at that point. > Not much more to send really. It's just "restored log file" lines going back to the startup (as expected) > If you can reproduce it, that would help also. Thanks. > Not sure about reproducing, but if I restart the database, it crashes again during log replay (stack trace info looks the same). LOG: database system was interrupted while in recovery at log time 2010-01-07 12:01:57 EST HINT: If this has occurred more than once some data might be corrupted and you might need to choose an earlier recovery target. LOG: starting archive recovery LOG: restore_command = '/export/home/postgres/pgwork/inst/slave/bin/ pg_standby /export/home/postgres/pgwork/data/.hot_standby %f %p %r' cp: cannot access /export/home/postgres/pgwork/data/.hot_standby/ 00000001.history cp: cannot access /export/home/postgres/pgwork/data/.hot_standby/ 00000001.history cp: cannot access /export/home/postgres/pgwork/data/.hot_standby/ 00000001.history cp: cannot access /export/home/postgres/pgwork/data/.hot_standby/ 00000001.history LOG: restored log file "00000001000000030000002A" from archive LOG: automatic recovery in progress LOG: initializing recovery connections LOG: redo starts at 3/2A000020, consistency will be reached at 3/2D000120 LOG: restored log file "00000001000000030000002B" from archive LOG: restored log file "00000001000000030000002C" from archive LOG: restored log file "00000001000000030000002D" from archive LOG: consistent recovery state reached at 3/2D000120 LOG: database system is ready to accept read only connections LOG: restored log file "00000001000000030000002E" from archive LOG: restored log file "00000001000000030000002F" from archive LOG: restored log file "000000010000000300000030" from archive LOG: restored log file "000000010000000300000031" from archive LOG: restored log file "000000010000000300000032" from archive LOG: startup process (PID 4397) was terminated by signal 11 LOG: terminating any other active server processes -- Robert Treat http://www.omniti.com/
On Thu, 2010-01-07 at 16:56 -0500, Robert Treat wrote: > Not much more to send really. No problem. I can see what causes it, nothing more required, thanks. What I don't fully understand yet is why the error hasn't shown itself before, because it appears such a basic error. -- Simon Riggs www.2ndQuadrant.com
On Jan 7, 2010, at 5:18 PM, Simon Riggs wrote: > On Thu, 2010-01-07 at 16:56 -0500, Robert Treat wrote: > >> Not much more to send really. > > No problem. I can see what causes it, nothing more required, thanks. > What I don't fully understand yet is why the error hasn't shown itself > before, because it appears such a basic error. > Is anyone doing regular testing on non-linux? (to the extent that this is a problem due to Solaris ideas about snprintf). Robert Treat http://www.omniti.com/
On Thu, 2010-01-07 at 18:02 -0500, Robert Treat wrote: > On Jan 7, 2010, at 5:18 PM, Simon Riggs wrote: > > > On Thu, 2010-01-07 at 16:56 -0500, Robert Treat wrote: > > > >> Not much more to send really. > > > > No problem. I can see what causes it, nothing more required, thanks. > > What I don't fully understand yet is why the error hasn't shown itself > > before, because it appears such a basic error. > > > > Is anyone doing regular testing on non-linux? (to the extent that > this is a problem due to Solaris ideas about snprintf). OK, understand now. The Solaris test failed in a place where it would have been good if Linux had done so; this has revealed that the relcache init file deletion required in some places was actually failing. The path was never set correctly in the first place, it appears. I've updated the TODO to reflect the new bug report. I've almost finished hacking a solution. I'll test and commit tomorrow, since it's a fairly standalone problem, if fiddly. -- Simon Riggs www.2ndQuadrant.com
On Fri, 2010-01-08 at 01:26 +0000, Simon Riggs wrote: > I'll test and commit tomorrow, since it's a fairly standalone problem Fix attached, thanks for testing. Works for me and I don't expect it to fail on Solaris, since the root cause of the failure has been addressed and a correctly designed test executed. I will wait a day for your confirmation and/or other's comments. -- Simon Riggs www.2ndQuadrant.com
Attachment
On Jan 8, 2010, at 7:36 AM, Simon Riggs wrote: > On Fri, 2010-01-08 at 01:26 +0000, Simon Riggs wrote: >> I'll test and commit tomorrow, since it's a fairly standalone problem > > Fix attached, thanks for testing. > > Works for me and I don't expect it to fail on Solaris, since the root > cause of the failure has been addressed and a correctly designed test > executed. > > I will wait a day for your confirmation and/or other's comments. > Looks good from my end, thanks Simon. Robert Treat http://www.omniti.com