Re: Assertion failure at standby promotion - Mailing list pgsql-hackers

From Heikki Linnakangas
Subject Re: Assertion failure at standby promotion
Date
Msg-id 5187812D.6090100@vmware.com
Whole thread Raw
In response to Assertion failure at standby promotion  (Fujii Masao <masao.fujii@gmail.com>)
Responses Re: Assertion failure at standby promotion  (Heikki Linnakangas <hlinnakangas@vmware.com>)
List pgsql-hackers
On 03.05.2013 18:17, Fujii Masao wrote:
> Hi,
>
> I got the following assertion failure when I promoted the standby.
>
> 2013-05-04 00:12:31 JST sby1 LOG:  received promote request
> 2013-05-04 00:12:31 JST sby1 FATAL:  terminating walreceiver process
> due to administrator command
> 2013-05-04 00:12:31 JST sby1 LOG:  redo done at 0/6FFE038
> 2013-05-04 00:12:31 JST sby1 LOG:  last completed transaction was at
> log time 2013-05-04 00:12:25.878909+09
> 2013-05-04 00:12:31 JST sby1 LOG:  selected new timeline ID: 2
> 2013-05-04 00:12:31 JST sby1 LOG:  archive recovery complete
> 2013-05-04 00:12:31 JST sby1 LOG:  checkpoint starting:
> TRAP: FailedAssertion("!(sentPtr<= sendTimeLineValidUpto)", File:
> "walsender.c", Line: 1465)
> 2013-05-04 00:12:31 JST sby1 LOG:  autovacuum launcher started
>
> The way to reproduce this is:
>
> 1. Create one master A, one standby B, and one cascade standby C.
> 2. Run pgbench -i -s 10
> 3. Promote the standby B before pgbench -i finishes

I was able to reproduce this. The assertion checks that if the system is 
promoted at WAL location X, we must not have already sent WAL at > X to 
the client. As the code stands, that assumption is wrong; the walsender 
will merrily stream WAL that hasn't been replayed yet, and the system 
can be promoted before replaying all the WAL that has been streamed to a 
cascading standby. The comment in GetStandbyFlushRecPtr(), which is the 
function that determined how far the WAL may be streamed to a cascading 
standby, says this:

>     /*
>      * We can safely send what's already been replayed. Also, if walreceiver
>      * is streaming WAL from the same timeline, we can send anything that
>      * it has streamed, but hasn't been replayed yet.
>      */

There seems to be two bugs here:

1. This used to work in 9.2, because the startup process would always 
replay all the WAL present in pg_xlog before promoting (the WAL present 
in pg_xlog was streamed from master). But the refactorings in xlog.c in 
9.3devel broke that, so that the startup process can promote earlier.

2. Even after fixing the logic in xlog.c, there is still a corner-case 
where the startup process can promote before all the WAL that has been 
received from walreceiver has been received. That happens if the WAL 
streaming is terminated at a page boundary, rather than at a record 
boundary. For example, we might have received WAL up to the page 
boundary at 0/5BFA000, but the last *complete* record that we have 
received ends at 0/5BF9BD8.

To fix the second issue, I think two things need to happen. First, we 
need to suppress the check in walsender. Second, we need to teach the 
WAL replay to back off when that happens. At the moment, the replay in 
the cascading standby gets stuck, trying to fetch the next page 
containing rest of the partial WAL record. Instead, it should throw away 
the partial record it has, and resync at the end of the last replayed 
record. I think 9.2 has the same bug, BTW. Without support for timeline 
switches over streaming replication, it was just more difficult to hit.

- Heikki



pgsql-hackers by date:

Previous
From: soroosh sardari
Date:
Subject: Re: Meaning of keyword category list in src/backend/parser/gram.y
Next
From: Heikki Linnakangas
Date:
Subject: Re: Assertion failure at standby promotion