Re: BUG #7500: hot-standby replica crash after an initial rsync - Mailing list pgsql-bugs

From Maxim Boguk
Subject Re: BUG #7500: hot-standby replica crash after an initial rsync
Date
Msg-id CAK-MWwTfoG_HcUhFk2KVm1R8XPWEq1AEw3JWuodWNq76GUBm8A@mail.gmail.com
Whole thread Raw
In response to Re: BUG #7500: hot-standby replica crash after an initial rsync  (Maxim Boguk <maxim.boguk@gmail.com>)
Responses Re: BUG #7500: hot-standby replica crash after an initial rsync  (Andres Freund <andres@2ndquadrant.com>)
List pgsql-bugs
On Wed, Aug 22, 2012 at 1:50 PM, Maxim Boguk <maxim.boguk@gmail.com> wrote:

>
>
> On Wed, Aug 22, 2012 at 6:08 AM, Andres Freund <andres@2ndquadrant.com>wrote:
>
>> On Tuesday, August 21, 2012 03:30:44 PM Maxim Boguk wrote:
>> > Hi Andres,
>>
>> I would add something akin to
>>
>> elog(WARNING, "pid of startup is: %d, sleeping for 10s", getpid());
>> sleep(10);
>>
>
> Hi Andres,
>
> There is it:
>
> gdb backtrace:
>
> #0  0x0000000800bd1ab0 in exit () from /lib/libc.so.7
> #1  0x00000000005d77f6 in proc_exit (code=1) at ipc.c:137
> #2  0x0000000000697948 in errfinish (dummy=<optimized out>) at elog.c:501
> #3  0x0000000000488956 in SlruReportIOError (ctl=<optimized out>,
> pageno=<optimized out>, xid=361788644) at slru.c:857
> #4  0x000000000048956a in SimpleLruReadPage (ctl=0x95bf80, pageno=176654,
> write_ok=1 '\001', xid=361788644) at slru.c:427
> #5  0x0000000000489f7f in RecordNewMultiXact (multi=361788644,
> offset=822580965, nxids=2, xids=0x800e5302c) at multixact.c:764
> #6  0x000000000048a126 in multixact_redo (lsn=..., record=0x800e53000) at
> multixact.c:2017
> #7  0x0000000000484bea in StartupXLOG () at xlog.c:6277
> #8  0x00000000004870a8 in StartupProcessMain () at xlog.c:9335
> #9  0x000000000049cab2 in AuxiliaryProcessMain (argc=2,
> argv=0x7fffffffe450) at bootstrap.c:412
> #10 0x00000000005ba926 in StartChildProcess (type=StartupProcess) at
> postmaster.c:4442
> #11 0x00000000005bccb2 in PostmasterMain (argc=5, argv=0x7fffffffea38) at
> postmaster.c:1093
> #12 0x0000000000566f3b in main (argc=5, argv=0x7fffffffea38) at main.c:188
>
> relevant database log:
> 2012-08-22 07:47:21.330 MSK 18155 @ from  [vxid:1/0 txid:0] []LOG:  00000:
> redo starts at 925/2C0004C8
> 2012-08-22 07:47:21.330 MSK 18155 @ from  [vxid:1/0 txid:0] []LOCATION:
> StartupXLOG, xlog.c:6202
> 2012-08-22 07:47:21.384 MSK 18155 @ from  [vxid:1/0 txid:0] []FATAL:
> XX000: could not access status of transaction 361788644
> 2012-08-22 07:47:21.384 MSK 18155 @ from  [vxid:1/0 txid:0] []DETAIL:
> Could not read from file "pg_multixact/offsets/1590" at offset 114688:
> Unknown error: 0.
> 2012-08-22 07:47:21.384 MSK 18155 @ from  [vxid:1/0 txid:0] []CONTEXT:
> xlog redo create multixact 361788644 offset 822580965: 1768508061 1768508078
> 2012-08-22 07:47:21.384 MSK 18155 @ from  [vxid:1/0 txid:0] []LOCATION:
> SlruReportIOError, slru.c:840
>
>
> Kind Regards,
> Maksym
>

And finally gdb backtrace on build without optimization
( ./configure --prefix=/usr/local/pgsql/tmp/postgresql/ --enable-debug
--enable-cassert CFLAGS='-O0' ):

#0  0x0000000800d42ab0 in exit () from /lib/libc.so.7
#1  0x00000000006aafc6 in proc_exit (code=1) at ipc.c:137
#2  0x00000000007bcbb1 in errfinish (dummy=0) at elog.c:501
#3  0x00000000004b7065 in SlruReportIOError (ctl=0xad9740, pageno=176654,
xid=361788644) at slru.c:836
#4  0x00000000004b6353 in SimpleLruReadPage (ctl=0xad9740, pageno=176654,
write_ok=1 '\001', xid=361788644) at slru.c:427
#5  0x00000000004b85dd in RecordNewMultiXact (multi=361788644,
offset=822580965, nxids=2, xids=0x800f5302c) at multixact.c:764
#6  0x00000000004b9dd1 in multixact_redo (lsn=..., record=0x800f53000) at
multixact.c:2017
#7  0x00000000004aef0c in StartupXLOG () at xlog.c:6277
#8  0x00000000004b446d in StartupProcessMain () at xlog.c:9335
#9  0x00000000004d9e36 in AuxiliaryProcessMain (argc=2,
argv=0x7fffffffe8b0) at bootstrap.c:412
#10 0x0000000000683398 in StartChildProcess (type=StartupProcess) at
postmaster.c:4442
#11 0x000000000067ef60 in PostmasterMain (argc=5, argv=0x7fffffffea38) at
postmaster.c:1093
#12 0x00000000005fbf5a in main (argc=5, argv=0x7fffffffea38) at main.c:188

It seems failing around function SlruPhysicalReadPage at block

        errno = 0;
        if (read(fd, shared->page_buffer[slotno], BLCKSZ) != BLCKSZ)
        {
                slru_errcause = SLRU_READ_FAILED;
                slru_errno = errno;
                close(fd);
                return false;
        }

It seems that it read less than BLCKSZ bytes. And it is not an error
condition for read call
(from man 2 read:
Note that reading less than size bytes is not an error.
).

Actual size of pg_multixact/offsets/1590 file is:
[pgsql@db1 ~/tmp/postgresql-9.0.8]$ ls -la
/db/data_back/pg_multixact/offsets/1590
-rw-------  1 pgsql  pgsql  81920 Aug 21 07:46
/db/data_back/pg_multixact/offsets/1590

Now it seems that the (lseek(fd, (off_t) offset, SEEK_SET) < 0) call moving
seek position to outside of an actual file size.
And after that, read call actually could read only 0 bytes from it.

Does that mean that file was damaged during rsync?


Kind Regards,
Maksym

What next I should examine?

Kind Regards,
Maksym

pgsql-bugs by date:

Previous
From: Maxim Boguk
Date:
Subject: Re: BUG #7500: hot-standby replica crash after an initial rsync
Next
From: Andres Freund
Date:
Subject: Re: BUG #7500: hot-standby replica crash after an initial rsync