Thread: 8.5alpha3 hot standby crash report (DatabasePath related?)

8.5alpha3 hot standby crash report (DatabasePath related?)

From
Robert Treat
Date:
Howdy folks, 

Doing some testing with 8.5alpha3, my standby crashed this morning whilst 
doing some testing. Seems I have a core file, so thought I would send a report.

Version: PostgreSQL 8.5alpha3 on i386-pc-solaris2.10, compiled by GCC gcc 
(GCC) 3.4.3 (csl-sol210-3_4-branch+sol_rpath), 32-bit

Error in the log was simply this:
LOG:  restored log file "000000010000000300000032" from archive
LOG:  startup process (PID 385) was terminated by signal 11
LOG:  terminating any other active server processes

info from core file was:

postgres@postgresdev[/export/home/postgres/pgwork/data/slave]dbx 
/export/home/postgres/pgwork/inst/slave/bin/postgres core 
For information about new features see `help changes'
To remove this message, put `dbxenv suppress_startup_message 7.6' in your 
.dbxrc
Reading postgres
core file header read successfully
Reading ld.so.1
Reading libnsl.so.1
Reading librt.so.1
Reading libsocket.so.1
Reading libm.so.2
Reading libc.so.1
Reading libaio.so.1
Reading libmd.so.1
program terminated by signal SEGV (no mapping at the fault address)
0xfedb578c: strlen+0x000c:      movl     (%eax),%edx
Current function is RelationCacheInitFileInvalidate4237           snprintf(initfilename, sizeof(initfilename),
"%s/%s",
(dbx) where [1] strlen(0x0), at 0xfedb578c  [2] _ndoprnt(0x8417d81, 0x8042e7c, 0x8042e50, 0x0), at 0xfee0d976  [3]
snprintf(0x8042e90,0x400, 0x8417d7f, 0x0, 0x8417d30, 0x2f676f6c, 
 
0x4f434552, 0x59524556), at 0xfee10734 
=>[4] RelationCacheInitFileInvalidate(beforeSend = '\0'), line 4237 in 
"relcache.c" [5] xact_redo_commit(xlrec = 0x84ed434, xid = 0, lsn = RECORD), line 4414 in 
"xact.c" [6] StartupXLOG(), line 5834 in "xlog.c" [7] StartupProcessMain(), line 8359 in "xlog.c" [8]
AuxiliaryProcessMain(argc= 2, argv = 0x80475e0), line 408 in 
 
"bootstrap.c" [9] StartChildProcess(type = 134510104), line 4218 in "postmaster.c" [10] PostmasterMain(argc = 3, argv =
0x84c6e00),line 1061 in "postmaster.c" [11] main(argc = 3, argv = 0x84c6e00), line 188 in "main.c"
 
(dbx) print initfilename
initfilename = "XLOG"
(dbx) list4237           snprintf(initfilename, sizeof(initfilename), "%s/%s",4238
DatabasePath,RELCACHE_INIT_FILENAME);4239   4240           if (beforeSend)4241           {4242                   /* no
interlockneeded here */4243                   unlink(initfilename);4244           }4245           else4246           {
 
(dbx) print DatabasePath
DatabasePath = (nil)

Theo mentioned that he expects the above would work under Linux, but that we 
need DatabasePath to not be null in the above. 

FWIW, the SQL I was running on the master did involve two create database 
commands back to back.  

LMK if you have any questions. 

-- 
Robert Treat
Conjecture: http://www.xzilla.net
Consulting: http://www.omniti.com


Re: 8.5alpha3 hot standby crash report (DatabasePath related?)

From
Simon Riggs
Date:
On Thu, 2010-01-07 at 14:41 -0500, Robert Treat wrote:

> Doing some testing with 8.5alpha3, my standby crashed this morning whilst 
> doing some testing. Seems I have a core file, so thought I would send a report.

Thanks for the report.

-- Simon Riggs           www.2ndQuadrant.com



Re: 8.5alpha3 hot standby crash report (DatabasePath related?)

From
Tom Lane
Date:
Robert Treat <xzilla@users.sourceforge.net> writes:
> Doing some testing with 8.5alpha3, my standby crashed this morning whilst 
> doing some testing. Seems I have a core file, so thought I would send a report.

Hmm.  Looks like it's crashing because DatabasePath is not set yet.
We could probably kluge around that by having this function return
without doing anything if DatabasePath is null, but I wonder if this
doesn't indicate that things are being done in the wrong order.
If DatabasePath isn't set, then there's a whole lot of other
infrastructure that's not good either:

> Current function is RelationCacheInitFileInvalidate
>  4237           snprintf(initfilename, sizeof(initfilename), "%s/%s",
>  4238                            DatabasePath, RELCACHE_INIT_FILENAME);
> (dbx) where
>   [1] strlen(0x0), at 0xfedb578c 
>   [2] _ndoprnt(0x8417d81, 0x8042e7c, 0x8042e50, 0x0), at 0xfee0d976 
>   [3] snprintf(0x8042e90, 0x400, 0x8417d7f, 0x0, 0x8417d30, 0x2f676f6c, 
> 0x4f434552, 0x59524556), at 0xfee10734 
> =>[4] RelationCacheInitFileInvalidate(beforeSend = '\0'), line 4237 in 
> "relcache.c"
>   [5] xact_redo_commit(xlrec = 0x84ed434, xid = 0, lsn = RECORD), line 4414 in 
> "xact.c"
>   [6] StartupXLOG(), line 5834 in "xlog.c"
>   [7] StartupProcessMain(), line 8359 in "xlog.c"
>   [8] AuxiliaryProcessMain(argc = 2, argv = 0x80475e0), line 408 in 
> "bootstrap.c"
>   [9] StartChildProcess(type = 134510104), line 4218 in "postmaster.c"
>   [10] PostmasterMain(argc = 3, argv = 0x84c6e00), line 1061 in "postmaster.c"
>   [11] main(argc = 3, argv = 0x84c6e00), line 188 in "main.c"
        regards, tom lane


Re: 8.5alpha3 hot standby crash report (DatabasePath related?)

From
Simon Riggs
Date:
On Thu, 2010-01-07 at 14:41 -0500, Robert Treat wrote:

> Doing some testing with 8.5alpha3, my standby crashed this morning whilst 
> doing some testing. Seems I have a core file, so thought I would send a report.
> 
> Version: PostgreSQL 8.5alpha3 on i386-pc-solaris2.10, compiled by GCC gcc 
> (GCC) 3.4.3 (csl-sol210-3_4-branch+sol_rpath), 32-bit
> 
> Error in the log was simply this:
> LOG:  restored log file "000000010000000300000032" from archive
> LOG:  startup process (PID 385) was terminated by signal 11
> LOG:  terminating any other active server processes

Please send some more of the prior log, so we can tell the
initialization state at that point.

If you can reproduce it, that would help also. Thanks.

-- Simon Riggs           www.2ndQuadrant.com



Re: 8.5alpha3 hot standby crash report (DatabasePath related?)

From
Simon Riggs
Date:
On Thu, 2010-01-07 at 16:05 -0500, Tom Lane wrote:
> Robert Treat <xzilla@users.sourceforge.net> writes:
> > Doing some testing with 8.5alpha3, my standby crashed this morning whilst 
> > doing some testing. Seems I have a core file, so thought I would send a report.
> 
> Hmm.  Looks like it's crashing because DatabasePath is not set yet.
> We could probably kluge around that by having this function return
> without doing anything if DatabasePath is null, but I wonder if this
> doesn't indicate that things are being done in the wrong order.
> If DatabasePath isn't set, then there's a whole lot of other
> infrastructure that's not good either:

OK, I think I see this now. Will fix.

-- Simon Riggs           www.2ndQuadrant.com



Re: 8.5alpha3 hot standby crash report (DatabasePath related?)

From
Robert Treat
Date:
On Jan 7, 2010, at 4:15 PM, Simon Riggs wrote:

> On Thu, 2010-01-07 at 14:41 -0500, Robert Treat wrote:
>
>> Doing some testing with 8.5alpha3, my standby crashed this morning  
>> whilst
>> doing some testing. Seems I have a core file, so thought I would  
>> send a report.
>>
>> Version: PostgreSQL 8.5alpha3 on i386-pc-solaris2.10, compiled by  
>> GCC gcc
>> (GCC) 3.4.3 (csl-sol210-3_4-branch+sol_rpath), 32-bit
>>
>> Error in the log was simply this:
>> LOG:  restored log file "000000010000000300000032" from archive
>> LOG:  startup process (PID 385) was terminated by signal 11
>> LOG:  terminating any other active server processes
>
> Please send some more of the prior log, so we can tell the
> initialization state at that point.
>

Not much more to send really. It's just "restored log file" lines  
going back to the startup (as expected)

> If you can reproduce it, that would help also. Thanks.
>

Not sure about reproducing, but if I restart the database, it crashes  
again during log replay (stack trace info looks the same).

LOG:  database system was interrupted while in recovery at log time  
2010-01-07 12:01:57 EST
HINT:  If this has occurred more than once some data might be  
corrupted and you might need to choose an earlier recovery target.
LOG:  starting archive recovery
LOG:  restore_command = '/export/home/postgres/pgwork/inst/slave/bin/ 
pg_standby /export/home/postgres/pgwork/data/.hot_standby %f %p %r'
cp: cannot access /export/home/postgres/pgwork/data/.hot_standby/ 
00000001.history
cp: cannot access /export/home/postgres/pgwork/data/.hot_standby/ 
00000001.history
cp: cannot access /export/home/postgres/pgwork/data/.hot_standby/ 
00000001.history
cp: cannot access /export/home/postgres/pgwork/data/.hot_standby/ 
00000001.history
LOG:  restored log file "00000001000000030000002A" from archive
LOG:  automatic recovery in progress
LOG:  initializing recovery connections
LOG:  redo starts at 3/2A000020, consistency will be reached at  
3/2D000120
LOG:  restored log file "00000001000000030000002B" from archive
LOG:  restored log file "00000001000000030000002C" from archive
LOG:  restored log file "00000001000000030000002D" from archive
LOG:  consistent recovery state reached at 3/2D000120
LOG:  database system is ready to accept read only connections
LOG:  restored log file "00000001000000030000002E" from archive
LOG:  restored log file "00000001000000030000002F" from archive
LOG:  restored log file "000000010000000300000030" from archive
LOG:  restored log file "000000010000000300000031" from archive
LOG:  restored log file "000000010000000300000032" from archive
LOG:  startup process (PID 4397) was terminated by signal 11
LOG:  terminating any other active server processes

--
Robert Treat
http://www.omniti.com/


Re: 8.5alpha3 hot standby crash report (DatabasePath related?)

From
Simon Riggs
Date:
On Thu, 2010-01-07 at 16:56 -0500, Robert Treat wrote:

> Not much more to send really. 

No problem. I can see what causes it, nothing more required, thanks.
What I don't fully understand yet is why the error hasn't shown itself
before, because it appears such a basic error.

-- Simon Riggs           www.2ndQuadrant.com



Re: 8.5alpha3 hot standby crash report (DatabasePath related?)

From
Robert Treat
Date:
On Jan 7, 2010, at 5:18 PM, Simon Riggs wrote:

> On Thu, 2010-01-07 at 16:56 -0500, Robert Treat wrote:
>
>> Not much more to send really.
>
> No problem. I can see what causes it, nothing more required, thanks.
> What I don't fully understand yet is why the error hasn't shown itself
> before, because it appears such a basic error.
>

Is anyone doing regular testing on non-linux?  (to the extent that  
this is a problem due to Solaris ideas about snprintf).


Robert Treat
http://www.omniti.com/


Re: 8.5alpha3 hot standby crash report (DatabasePath related?)

From
Simon Riggs
Date:
On Thu, 2010-01-07 at 18:02 -0500, Robert Treat wrote:
> On Jan 7, 2010, at 5:18 PM, Simon Riggs wrote:
> 
> > On Thu, 2010-01-07 at 16:56 -0500, Robert Treat wrote:
> >
> >> Not much more to send really.
> >
> > No problem. I can see what causes it, nothing more required, thanks.
> > What I don't fully understand yet is why the error hasn't shown itself
> > before, because it appears such a basic error.
> >
> 
> Is anyone doing regular testing on non-linux?  (to the extent that  
> this is a problem due to Solaris ideas about snprintf).

OK, understand now. The Solaris test failed in a place where it would
have been good if Linux had done so; this has revealed that the relcache
init file deletion required in some places was actually failing. The
path was never set correctly in the first place, it appears. I've
updated the TODO to reflect the new bug report.

I've almost finished hacking a solution. I'll test and commit tomorrow,
since it's a fairly standalone problem, if fiddly.

-- Simon Riggs           www.2ndQuadrant.com



Re: 8.5alpha3 hot standby crash report (DatabasePath related?)

From
Simon Riggs
Date:
On Fri, 2010-01-08 at 01:26 +0000, Simon Riggs wrote:
> I'll test and commit tomorrow, since it's a fairly standalone problem

Fix attached, thanks for testing.

Works for me and I don't expect it to fail on Solaris, since the root
cause of the failure has been addressed and a correctly designed test
executed.

I will wait a day for your confirmation and/or other's comments.

--
 Simon Riggs           www.2ndQuadrant.com

Attachment

Re: 8.5alpha3 hot standby crash report (DatabasePath related?)

From
Robert Treat
Date:
On Jan 8, 2010, at 7:36 AM, Simon Riggs wrote:

> On Fri, 2010-01-08 at 01:26 +0000, Simon Riggs wrote:
>> I'll test and commit tomorrow, since it's a fairly standalone problem
>
> Fix attached, thanks for testing.
>
> Works for me and I don't expect it to fail on Solaris, since the root
> cause of the failure has been addressed and a correctly designed test
> executed.
>
> I will wait a day for your confirmation and/or other's comments.
>

Looks good from my end, thanks Simon.


Robert Treat
http://www.omniti.com