Re: BUG #7500: hot-standby replica crash after an initial rsync - Mailing list pgsql-bugs

From Andres Freund
Subject Re: BUG #7500: hot-standby replica crash after an initial rsync
Date
Msg-id 201208221133.35698.andres@2ndquadrant.com
Whole thread Raw
In response to Re: BUG #7500: hot-standby replica crash after an initial rsync  (Maxim Boguk <maxim.boguk@gmail.com>)
Responses Re: BUG #7500: hot-standby replica crash after an initial rsync  (Maxim Boguk <maxim.boguk@gmail.com>)
List pgsql-bugs
Hi Maxim,

On Wednesday, August 22, 2012 06:18:14 AM Maxim Boguk wrote:
> On Wed, Aug 22, 2012 at 1:50 PM, Maxim Boguk <maxim.boguk@gmail.com> wrote:
> > On Wed, Aug 22, 2012 at 6:08 AM, Andres Freund
<andres@2ndquadrant.com>wrote:
> >> On Tuesday, August 21, 2012 03:30:44 PM Maxim Boguk wrote:
> >> > Hi Andres,
> >>
> >> I would add something akin to
> >>
> >> elog(WARNING, "pid of startup is: %d, sleeping for 10s", getpid());
> >> sleep(10);
> >
> > Hi Andres,
> >
> > There is it:
> >
> > gdb backtrace:
> >
> > #0  0x0000000800bd1ab0 in exit () from /lib/libc.so.7
> > #1  0x00000000005d77f6 in proc_exit (code=1) at ipc.c:137
> > #2  0x0000000000697948 in errfinish (dummy=<optimized out>) at elog.c:501
> > #3  0x0000000000488956 in SlruReportIOError (ctl=<optimized out>,
> > pageno=<optimized out>, xid=361788644) at slru.c:857
> > #4  0x000000000048956a in SimpleLruReadPage (ctl=0x95bf80, pageno=176654,
> > write_ok=1 '\001', xid=361788644) at slru.c:427
> > #5  0x0000000000489f7f in RecordNewMultiXact (multi=361788644,
> > offset=822580965, nxids=2, xids=0x800e5302c) at multixact.c:764
> > #6  0x000000000048a126 in multixact_redo (lsn=..., record=0x800e53000) at
> > multixact.c:2017
> > #7  0x0000000000484bea in StartupXLOG () at xlog.c:6277
> > #8  0x00000000004870a8 in StartupProcessMain () at xlog.c:9335
> > #9  0x000000000049cab2 in AuxiliaryProcessMain (argc=2,
> > argv=0x7fffffffe450) at bootstrap.c:412
> > #10 0x00000000005ba926 in StartChildProcess (type=StartupProcess) at
> > postmaster.c:4442
> > #11 0x00000000005bccb2 in PostmasterMain (argc=5, argv=0x7fffffffea38) at
> > postmaster.c:1093
> > #12 0x0000000000566f3b in main (argc=5, argv=0x7fffffffea38) at
> > main.c:188
> >
> > relevant database log:
> > 2012-08-22 07:47:21.330 MSK 18155 @ from  [vxid:1/0 txid:0] []LOG:
> > 00000: redo starts at 925/2C0004C8
> > 2012-08-22 07:47:21.330 MSK 18155 @ from  [vxid:1/0 txid:0] []LOCATION:
> > StartupXLOG, xlog.c:6202
> > 2012-08-22 07:47:21.384 MSK 18155 @ from  [vxid:1/0 txid:0] []FATAL:
> > XX000: could not access status of transaction 361788644
> > 2012-08-22 07:47:21.384 MSK 18155 @ from  [vxid:1/0 txid:0] []DETAIL:
> > Could not read from file "pg_multixact/offsets/1590" at offset 114688:
> > Unknown error: 0.
> > 2012-08-22 07:47:21.384 MSK 18155 @ from  [vxid:1/0 txid:0] []CONTEXT:
> > xlog redo create multixact 361788644 offset 822580965: 1768508061
> > 1768508078 2012-08-22 07:47:21.384 MSK 18155 @ from  [vxid:1/0 txid:0]
> > []LOCATION: SlruReportIOError, slru.c:840
> >
> >
> > Kind Regards,
> > Maksym
>
> And finally gdb backtrace on build without optimization
> ( ./configure --prefix=/usr/local/pgsql/tmp/postgresql/ --enable-debug
> --enable-cassert CFLAGS='-O0' ):
>
> #0  0x0000000800d42ab0 in exit () from /lib/libc.so.7
> #1  0x00000000006aafc6 in proc_exit (code=1) at ipc.c:137
> #2  0x00000000007bcbb1 in errfinish (dummy=0) at elog.c:501
> #3  0x00000000004b7065 in SlruReportIOError (ctl=0xad9740, pageno=176654,
> xid=361788644) at slru.c:836
> #4  0x00000000004b6353 in SimpleLruReadPage (ctl=0xad9740, pageno=176654,
> write_ok=1 '\001', xid=361788644) at slru.c:427
> #5  0x00000000004b85dd in RecordNewMultiXact (multi=361788644,
> offset=822580965, nxids=2, xids=0x800f5302c) at multixact.c:764
> #6  0x00000000004b9dd1 in multixact_redo (lsn=..., record=0x800f53000) at
> multixact.c:2017
> #7  0x00000000004aef0c in StartupXLOG () at xlog.c:6277
> #8  0x00000000004b446d in StartupProcessMain () at xlog.c:9335
> #9  0x00000000004d9e36 in AuxiliaryProcessMain (argc=2,
> argv=0x7fffffffe8b0) at bootstrap.c:412
> #10 0x0000000000683398 in StartChildProcess (type=StartupProcess) at
> postmaster.c:4442
> #11 0x000000000067ef60 in PostmasterMain (argc=5, argv=0x7fffffffea38) at
> postmaster.c:1093
> #12 0x00000000005fbf5a in main (argc=5, argv=0x7fffffffea38) at main.c:188
Thanks! That already helps.

> It seems failing around function SlruPhysicalReadPage at block
>
>         errno = 0;
>         if (read(fd, shared->page_buffer[slotno], BLCKSZ) != BLCKSZ)
>         {
>                 slru_errcause = SLRU_READ_FAILED;
>                 slru_errno = errno;
>                 close(fd);
>                 return false;
>         }
>
> It seems that it read less than BLCKSZ bytes. And it is not an error
> condition for read call
> (from man 2 read:
> Note that reading less than size bytes is not an error.
> ).
If there are enough bytes available and youre not reading from a pipe or
similar it shouldn't happen though (unless signals happen, those change errno
though). But you already cleared up why its happening:

> Actual size of pg_multixact/offsets/1590 file is:
> [pgsql@db1 ~/tmp/postgresql-9.0.8]$ ls -la
> /db/data_back/pg_multixact/offsets/1590
> -rw-------  1 pgsql  pgsql  81920 Aug 21 07:46
> /db/data_back/pg_multixact/offsets/1590
>
> Now it seems that the (lseek(fd, (off_t) offset, SEEK_SET) < 0) call moving
> seek position to outside of an actual file size.
> And after that, read call actually could read only 0 bytes from it.
That does sound plausible, yes.

> Does that mean that file was damaged during rsync?
Not necessarily. When did you initially set up that cluster? Normally the file
should get zeroed out before its being used. If the cluster was copied
improperly (i.e. no pg_start/stop backup or such) it could easily happen. But
I would guess that you did it properly?

Could you show pg_controldata output for primary/replica? To see how new that
multixact offset is.

> What next I should examine?
If you could give me a DEBUG3 log from startup till it crashes it would be
helpful. Please #define MULTIXACT_DEBUG ontop of multixact.c before doing so.
Thats probably going to be to big to go through the mailinglist.

If youre feeling adventurous you could use the xlogdump project
(https://github.com/snaga/xlogdump) to see what xlog records happen from the
time it starts up till it crashes. I don't immediately see a way a
XLOG_MULTIXACT_CREATE_ID can be generated without a preceding
XLOG_MULTIXACT_ZERO_OFF_PAGE zeroing the page.
But most of that should be already be visible in the log with the above define.

Did you have any crashes, -m immediate shutdowns, or similar on the primary or
replica?

Greetings,

Andres
--
Andres Freund        http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

pgsql-bugs by date:

Previous
From: Maxim Boguk
Date:
Subject: Re: BUG #7500: hot-standby replica crash after an initial rsync
Next
From: Sachin Srivastava
Date:
Subject: Re: postgresql-9.1.4 macosx-10.6.8 (hostssl) psql -h hostname crashes