Thread: Inconsistent DB data in Streaming Replication

Inconsistent DB data in Streaming Replication

From
Samrat Revagade
Date:
<div dir="ltr"><p class="">Hello,<p class="">We have been trying to figure out possible solutions to the following
problemin streaming replication Consider following scenario:<p class="">If master receives commit command, it writes
andflushes commit WAL records to the disk, It also writes and flushes data page related to this transaction.<p
class="">Themaster then sends WAL records to standby up to the commit WAL record. But before sending these records if
failoverhappens then,  old master is ahead of  standby which is now the new master in terms of DB data leading to
inconsistentdata . <p class=""> <p class="">One solution to avoid this situation is have the master send WAL records to
standbyand wait for ACK from standby committing WAL files to disk and only after that commit data page related to this
transactionon master.<p class="">The main drawback would be increased wait time for the client due to extra round trip
tostandby before master sends ACK to client. Are there any other issues with this approach?<p class=""><br /><p
class="">Thankyou,<p class=""><p class="">Samrat</div> 

Re: Inconsistent DB data in Streaming Replication

From
Shaun Thomas
Date:
On 04/08/2013 05:34 AM, Samrat Revagade wrote:

> One solution to avoid this situation is to have the master send WAL
> records to the standby and wait for an ACK from the standby confirming
> the WAL flush to disk, and only after that write the data pages related
> to this transaction on the master.

Isn't this basically what synchronous replication does in PG 9.1+?

-- 
Shaun Thomas
OptionsHouse | 141 W. Jackson Blvd. | Suite 500 | Chicago IL, 60604
312-676-8870
sthomas@optionshouse.com




Re: Inconsistent DB data in Streaming Replication

From
Tom Lane
Date:
Samrat Revagade <revagade.samrat@gmail.com> writes:
> We have been trying to figure out possible solutions to the following
> problem in streaming replication. Consider the following scenario:

> When the master receives a commit command, it writes and flushes the
> commit WAL records to disk. It also writes and flushes the data pages
> related to this transaction.

> The master then sends WAL records to the standby, up to the commit WAL
> record. But if failover happens before these records are sent, the old
> master is ahead of the standby (which is now the new master) in terms of
> DB data, leading to inconsistent data.

I don't exactly see the problem ... unless you're imagining that master
and slave share the same data storage or something like that.  That's
not going to work for a ton of reasons besides this one.
        regards, tom lane



Re: Inconsistent DB data in Streaming Replication

From
Ants Aasma
Date:
On Mon, Apr 8, 2013 at 6:50 PM, Shaun Thomas <sthomas@optionshouse.com> wrote:
> On 04/08/2013 05:34 AM, Samrat Revagade wrote:
>
>> One solution to avoid this situation is have the master send WAL
>> records to standby and wait for ACK from standby committing WAL files
>> to disk and only after that commit data page related to this
>> transaction on master.
>
>
> Isn't this basically what synchronous replication does in PG 9.1+?

Not exactly. Sync-rep ensures that commit success is not sent to the
client before a synchronous replica acks the commit record. What
Samrat is proposing here is that WAL is not flushed to the OS before
it is acked by a synchronous replica so recovery won't go past the
timeline change made in failover, making it necessary to take a new
base backup to resync with the new master. I seem to remember this
being discussed when sync rep was committed. I don't recall if the
idea was discarded only on performance grounds or whether there were
other issues too.

Thinking about it now, the requirement is that after a crash and
failover to a sync replica we should be able to reuse the data directory
to replicate from the new master without consistency issues. We should
be able to achieve that by ensuring that we don't write out pages until
we have received an ack from the sync replica, and that we check for
possible timeline switches before recovering local WAL. For the first
part, it seems to me that it should be enough to rework the updating of
XLogCtl->LogwrtResult.Flush so that it accounts for the sync replica.
For the second part, I think Heikki's work on enabling timeline switches
over streaming connections already ensures this (I haven't checked it
in detail), but if not, it shouldn't be too hard to add.
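
To make the first part concrete, a very rough sketch of the idea -
SyncRepGetAckedLSN() is an invented name here, and the real interlock
would have to live somewhere in xlog.c/syncrep.c:

    /*
     * Hypothetical sketch only, not working PostgreSQL code: the flush
     * position that the rest of the system acts on never advances past
     * what the synchronous standby has acknowledged.
     */
    XLogRecPtr  localFlush = LogwrtResult.Flush;    /* flushed to local disk */
    XLogRecPtr  standbyAck = SyncRepGetAckedLSN();  /* invented helper */

    SpinLockAcquire(&XLogCtl->info_lck);
    XLogCtl->LogwrtResult.Flush = Min(localFlush, standbyAck);
    SpinLockRelease(&XLogCtl->info_lck);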

Regards,
Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de



Re: Inconsistent DB data in Streaming Replication

From
Andres Freund
Date:
On 2013-04-08 19:26:33 +0300, Ants Aasma wrote:
> On Mon, Apr 8, 2013 at 6:50 PM, Shaun Thomas <sthomas@optionshouse.com> wrote:
> > On 04/08/2013 05:34 AM, Samrat Revagade wrote:
> >
> >> One solution to avoid this situation is to have the master send WAL
> >> records to the standby and wait for an ACK from the standby confirming
> >> the WAL flush to disk, and only after that write the data pages related
> >> to this transaction on the master.
> >
> >
> > Isn't this basically what synchronous replication does in PG 9.1+?
> 
> Not exactly. Sync-rep ensures that commit success is not sent to the
> client before a synchronous replica acks the commit record. What
> Samrat is proposing here is that WAL is not flushed to the OS before
> it is acked by a synchronous replica so recovery won't go past the
> timeline change made in failover, making it necessary to take a new
> base backup to resync with the new master. I seem to remember this
> being discussed when sync rep was committed. I don't recall if the
> idea was discarded only on performance grounds or whether there were
> other issues too.

That's not going to work for a fair number of reasons:
* WAL is streamed *from disk*, not from memory
* what if the local node crashes/restarts immediately? Then the standby is
  farther ahead than the master.
* the performance implications of never writing data before flushing it are
  pretty severe
* ...

So this doesn't seem to solve anything.

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: Inconsistent DB data in Streaming Replication

From
Ants Aasma
Date:
On Mon, Apr 8, 2013 at 7:38 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2013-04-08 19:26:33 +0300, Ants Aasma wrote:
>> Not exactly. Sync-rep ensures that commit success is not sent to the
>> client before a synchronous replica acks the commit record. What
>> Samrat is proposing here is that WAL is not flushed to the OS before
>> it is acked by a synchronous replica so recovery won't go past the
>> timeline change made in failover, making it necessary to take a new
>> base backup to resync with the new master. I seem to remember this
>> being discussed when sync rep was committed. I don't recall if the
>> idea was discarded only on performance grounds or whether there were
>> other issues too.
>
> That's not going to work for a fair number of reasons:
> * WAL is streamed *from disk*, not from memory

Yeah, this one alone makes the do-not-flush-before-replicating
approach impractical.

> * what if the local node crashes/restarts immediately? Then the standby
>   is farther ahead than the master.
> * the performance implications of never writing data before flushing it
>   are pretty severe
> * ...
>
> So this doesn't seem to solve anything.

Yeah, delaying WAL writes until replication is successful seems
impractical, but I don't see why we couldn't optionally take into
account walsender write pointers when considering if we can write out
a page. Sure there will be some performance hit for waiting to
replicate WAL, but on the other hand having to rsync a huge database
isn't too good for performance either.

Regards,
Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de



Re: Inconsistent DB data in Streaming Replication

From
Fujii Masao
Date:
On Mon, Apr 8, 2013 at 7:34 PM, Samrat Revagade
<revagade.samrat@gmail.com> wrote:
>
> Hello,
>
> We have been trying to figure out possible solutions to the following
> problem in streaming replication. Consider the following scenario:
>
> When the master receives a commit command, it writes and flushes the
> commit WAL records to disk. It also writes and flushes the data pages
> related to this transaction.
>
> The master then sends WAL records to the standby, up to the commit WAL
> record. But if failover happens before these records are sent, the old
> master is ahead of the standby (which is now the new master) in terms of
> DB data, leading to inconsistent data.

Why do you think that the inconsistent data after failover is a problem?
Because it's one of the reasons why a fresh base backup is required when
starting the old master as a new standby? If yes, I agree with you. I've
often heard complaints about the backup when restarting a new standby.
That's a really big problem.

The timeline mismatch after failover was one of the reasons why a backup
is required. But, thanks to Heikki's recent work, that's solved: the
timeline mismatch is automatically resolved when starting replication in
9.3. So the remaining problem is the inconsistent database.

> One solution to avoid this situation is to have the master send WAL
> records to the standby and wait for an ACK from the standby confirming
> the WAL flush to disk, and only after that write the data pages related
> to this transaction on the master.

You mean to make the master delay the data page write until the WAL has
been not only flushed to disk but also replicated to the standby?

> The main drawback would be increased wait time for the client due to the
> extra round trip to the standby before the master sends the ACK to the
> client. Are there any other issues with this approach?

I think that you can introduce a GUC specifying whether this extra check
is required, to avoid a backup at failback.
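
For illustration only, such a switch could take the shape of a boolean
entry in guc.c along these lines - the name "wal_replicate_before_write"
and its variable are invented here, not an actual patch:

    /* Hypothetical guc.c entry, invented for illustration. */
    {
        {"wal_replicate_before_write", PGC_SIGHUP, WAL_SETTINGS,
         gettext_noop("Waits for WAL to be replicated to the synchronous "
                      "standby before related data pages may be written."),
         NULL
        },
        &wal_replicate_before_write,
        false,
        NULL, NULL, NULL
    },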

Regards,

--
Fujii Masao



Re: Inconsistent DB data in Streaming Replication

From
Samrat Revagade
Date:
<div dir="ltr"><div style="border-width:1.5pt 1pt;border-color:rgb(211,211,211);border-style:solid;padding:12pt 0in
12pt24pt"><p class="" style="margin-bottom:0.0001pt;border:none;padding:0in"><p class=""
style="margin-bottom:0.0001pt"><spanstyle="font-size:10pt;font-family:Arial,sans-serif">>What Samrat is proposing
hereis that WAL is not flushed to the OS before</span><p class="" style="margin-bottom:0.0001pt"><span
style="font-size:10pt;font-family:Arial,sans-serif">>itis acked by a synchronous replica so recovery won't go past
the</span><pclass="" style="margin-bottom:0.0001pt"><span
style="font-size:10pt;font-family:Arial,sans-serif">>timelinechange made in failover, making it necessary to take a
new</span><pclass="" style="margin-bottom:0.0001pt"><span style="font-size:10pt;font-family:Arial,sans-serif">>base
backupto resync with the new master.</span><p><p class="" style="margin-bottom:0.0001pt;border:none;padding:0in"><p
class=""style="margin-bottom:0.0001pt"><span style="font-size:10pt;font-family:Arial,sans-serif">Actually we are
proposingthat the data page on the master is not committed till master receives ACK from the standby. The WAL files can
beflushed to the disk on both the master and standby, before standby generates ACK to master. The end objective is the
sameof avoiding to take base backup of old master to resync with new master.</span><p class=""
style="margin-bottom:0.0001pt"><pclass="" style="margin-bottom:0.0001pt"><span
style="font-size:10pt;font-family:Arial,sans-serif">>Whydo you think that the inconsistent data after failover
happensis</span><span style="font-size:10pt;font-family:Arial,sans-serif"><br /><span style="background-repeat:initial
initial">>problem?Because </span></span><p class="" style="margin-bottom:0.0001pt"><span
style="font-size:10pt;font-family:Arial,sans-serif">>it'sone of the reasons why a fresh base backup is required
when</span><spanstyle="font-size:10pt;font-family:Arial,sans-serif"><br /><span style="background-repeat:initial
initial">>startingold master as</span><br /><span style="background-repeat:initial initial">>new standby? If yes,
Iagree with you. I've often heard the complaints</span><br /><span style="background-repeat:initial initial">>about
abackup</span><br /><span style="background-repeat:initial initial">>when restarting new standby. That's really big
problem.</span></span><pclass="" style="margin-bottom:0.0001pt"><span
style="font-size:10pt;font-family:Arial,sans-serif"> </span><span
style="font-family:Arial,sans-serif;font-size:10pt">Yes,taking backup is</span><span
style="font-family:Arial,sans-serif;font-size:10pt"> </span><span
style="font-family:Arial,sans-serif;font-size:10pt">majorproblem when the database size is more than several TB. It
wouldtake very long time to ship backup data over the slow WAN network. </span><p class=""
style="margin-bottom:12pt"><spanstyle="font-size:10pt;font-family:Arial,sans-serif;color:rgb(80,0,80)">>> One
solutionto avoid this situation is have the master send WAL records to standby and wait for ACK from standby committing
WALfiles to disk and only after that commit data page related to this transaction on master.</span><p class=""
style="margin-bottom:0.0001pt"><spanstyle="font-size:10pt;font-family:Arial,sans-serif">>You mean to make the master
waitthe data page write until WAL has been not only</span><span style="font-size:10pt;font-family:Arial,sans-serif"><br
/><spanstyle="background-repeat:initial initial">>flushed to disk but also replicated to the
standby?</span></span><pclass="" style="margin-bottom:0.0001pt"><span
style="font-size:10pt;font-family:Arial,sans-serif"> </span><span
style="font-size:10pt;font-family:Arial,sans-serif">Yes.Master should not write the data page before corresponding WAL
recordshave been replicated to the standby. The WAL records have been flushed to disk on both master and
standby.</span><pclass="" style="margin-bottom:12pt"><span
style="font-size:10pt;font-family:Arial,sans-serif;color:rgb(80,0,80)">>>The main drawback would be increased
waittime for the client due to extra round trip to standby before master sends ACK to client. Are there any other
issueswith this approach?</span><p class=""><span
style="font-size:10pt;line-height:115%;font-family:Arial,sans-serif">>Ithink that you can introduce GUC specifying
whetherthis extra check</span><span style="font-size:10pt;line-height:115%;font-family:Arial,sans-serif"><br /><span
style="background-repeat:initialinitial">>is required to avoid a backup when failback</span></span><p class=""><p
class=""style="margin-bottom:0.0001pt"><span style="font-family:Arial,sans-serif;font-size:10pt;line-height:115%">That
wouldbe better idea. We can disable it whenever taking a fresh backup is not a problem. </span><span
style="font-family:Arial,sans-serif;font-size:10pt;line-height:115%"> </span><span
style="font-family:Arial,sans-serif;font-size:10pt;line-height:115%"> </span><span
style="font-family:Arial,sans-serif;font-size:10pt;line-height:115%"> </span><span
style="font-family:Arial,sans-serif;font-size:10pt;line-height:115%"> </span><span
style="font-size:10pt;font-family:Arial,sans-serif"> </span><pclass="" style="margin-bottom:0.0001pt"><span
style="font-size:10pt;font-family:Arial,sans-serif"><br/></span><p class="" style="margin-bottom:0.0001pt"><span
style="font-size:10pt;font-family:Arial,sans-serif">Regards,</span><pclass="" style="margin-bottom:0.0001pt"><span
style="font-size:10pt;font-family:Arial,sans-serif">Samrat </span><span
style="font-size:10pt;font-family:Arial,sans-serif"> </span><p><p></div></div><divclass="gmail_extra"><br /><br /><div
class="gmail_quote">OnMon, Apr 8, 2013 at 10:40 PM, Fujii Masao <span dir="ltr"><<a
href="mailto:masao.fujii@gmail.com"target="_blank">masao.fujii@gmail.com</a>></span> wrote:<br /><blockquote
class="gmail_quote"style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div class="im">On Mon, Apr 8,
2013at 7:34 PM, Samrat Revagade<br /> <<a href="mailto:revagade.samrat@gmail.com">revagade.samrat@gmail.com</a>>
wrote:<br/> ><br /> > Hello,<br /> ><br /> > We have been trying to figure out possible solutions to the
followingproblem in streaming replication Consider following scenario:<br /> ><br /> > If master receives commit
command,it writes and flushes commit WAL records to the disk, It also writes and flushes data page related to this
transaction.<br/> ><br /> > The master then sends WAL records to standby up to the commit WAL record. But before
sendingthese records if failover happens then,  old master is ahead of  standby which is now the new master in terms of
DBdata leading to inconsistent data .<br /><br /></div>Why do you think that the inconsistent data after failover
happensis<br /> problem? Because<br /> it's one of the reasons why a fresh base backup is required when<br /> starting
oldmaster as<br /> new standby? If yes, I agree with you. I've often heard the complaints<br /> about a backup<br />
whenrestarting new standby. That's really big problem.<br /><br /> The timeline mismatch after failover was one of the
reasonswhy a<br /> backup is required.<br /> But, thanks to Heikki's recent work, that's solved, i.e., the timeline<br
/>mismatch would be<br /> automatically resolved when starting replication in 9.3. So, the<br /> remaining problem is
an<br/> inconsistent database.<br /><div class="im"><br /> > One solution to avoid this situation is have the master
sendWAL records to standby and wait for ACK from standby committing WAL files to disk and only after that commit data
pagerelated to this transaction on master.<br /><br /></div>You mean to make the master wait the data page write until
WALhas been not only<br /> flushed to disk but also replicated to the standby?<br /><div class="im"><br /> > The
maindrawback would be increased wait time for the client due to extra round trip to standby before master sends ACK to
client.Are there any other issues with this approach?<br /><br /></div>I think that you can introduce GUC specifying
whetherthis extra check<br /> is required to<br /> avoid a backup when failback.<br /><br /> Regards,<br /><br /> --<br
/>Fujii Masao<br /></blockquote></div><br /></div> 

Re: Inconsistent DB data in Streaming Replication

From
Ants Aasma
Date:
On Tue, Apr 9, 2013 at 9:42 AM, Samrat Revagade
<revagade.samrat@gmail.com> wrote:
>>What Samrat is proposing here is that WAL is not flushed to the OS before
>>it is acked by a synchronous replica so recovery won't go past the
>>timeline change made in failover, making it necessary to take a new
>>base backup to resync with the new master.
>
> Actually, we are proposing that the data page on the master is not written
> until the master receives the ACK from the standby. The WAL files can be
> flushed to disk on both the master and the standby before the standby
> sends the ACK to the master. The end objective is the same: avoiding a
> base backup of the old master to resync with the new master.

Sorry for misreading your e-mail. It seems like we are on the same
page here. I too have found this an annoying limitation when using
replication in an unreliable environment.

> Yes, taking a backup is a major problem when the database size is more
> than several TB. It would take a very long time to ship the backup data
> over a slow WAN.

For a WAN environment rsync can be a good enough answer, since only a
tiny fraction of the pages will actually be transferred. This assumes a
smallish database and low bandwidth. For larger databases, avoiding the
need to read through the whole database for differences is an obvious
win.

Regards,
Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de



Re: Inconsistent DB data in Streaming Replication

From
Hannu Krosing
Date:
On 04/08/2013 12:34 PM, Samrat Revagade wrote:
>
> Hello,
>
> We have been trying to figure out possible solutions to the following
> problem in streaming replication. Consider the following scenario:
>
> When the master receives a commit command, it writes and flushes the
> commit WAL records to disk. It also writes and flushes the data pages
> related to this transaction.
>
No data page flushing takes place. All data page writing is delayed to
the bgwriter and/or checkpoints.
>
> The master then sends WAL records to the standby, up to the commit WAL
> record. But if failover happens before these records are sent, the old
> master is ahead of the standby (which is now the new master) in terms of
> DB data, leading to inconsistent data.
>
The master sends WAL records to the standby continuously, not "upon
commit WAL record".

In case of syncrep the master just waits for confirmation from the
standby before returning to the client on commit.
>
> One solution to avoid this situation is to have the master send WAL
> records to the standby and wait for an ACK from the standby confirming
> the WAL flush to disk, and only after that write the data pages related
> to this transaction on the master.
>
Not just at commit: you must stop any *writing* of the WAL records,
effectively killing any parallelism.
>
> The main drawback would be increased wait time for the client due to the
> extra round trip to the standby before the master sends the ACK to the
> client. Are there any other issues with this approach?
>
The main issue is that it will make *all* backends dependent on each sync
commit, essentially serialising all backends' commits, with the
serialisation *including* the latency of the round trip to the client.

With current sync streaming the other backends can continue to write WAL;
with the proposed approach you cannot write any records after the one
waiting for an ACK from the standby.
>
>
> Thank you,
>
> Samrat
>




Re: Inconsistent DB data in Streaming Replication

From
Samrat Revagade
Date:
<div dir="ltr"><p class="">>>it's one of the reasons why a fresh base backup is required when starting old master
asnew standby? >>If yes, I agree with you. I've often heard the complaints about a backup when restarting new
standby.>>That's really big problem.<p class="">I think Fujii Masao is on the same page.<p class=""> <p
class="">>Incase of syncrep the master just waits for confirmation from standby before returning to client on
>commit.<pclass="">>Not just commit, you must stop any *writing* of the wal records effectively killing any
parallelism.<br/> > Min issue is that it will make *all* backends dependant on each sync commit, essentially
serialisingall >backends commits, with the serialisation *including* the latency of roundtrip to client. With
current>sync streaming the other backends can continue to write wal, with proposed approach you cannot >write any
recordsafter the one waiting an ACK from standby.<p class=""> <p class="">Let me rephrase the proposal in a more
accuratemanner:<p class="">Consider following scenario:<p class=""> <p class="">(1) A client sends the "COMMIT" command
tothe master server.<p class=""><p class="">(2) The master writes WAL record to disk<p class="">(3) The master writes
thedata page related to this transaction.  i.e. via checkpoint or bgwriter.<p class="">(4) The master sends WAL records
continuouslyto the standby, up to the commit WAL record.<p class="">(5) The standby receives WAL records, writes them
tothe disk, and then replies the ACK.<p class="">(6) The master returns a success indication to a client after it
receivesACK.<p class=""> <p class="">If failover happens between (3) and (4), WAL and DB data in old master are ahead
ofthem in new master. After failover, new master continues running new transactions independently from old master. Then
WALrecord and DB data would become inconsistent between those two servers. To resolve these inconsistencies, the backup
ofnew master needs to be taken onto new standby.<p class=""><br /><p class="">But taking backup is not feasible in case
oflarger database size with several TB over a slow WAN.<br /><p class=""><p class="">So to avoid this type of
inconsistencywithout taking fresh backup we are thinking to do following thing:<p class=""> <br /><p class="">>>
Ithink that you can introduce GUC specifying whether this extra check is required to avoid a backup >>when
failback.<pclass="">Approach:<p class="">Introduce new GUC option specifying whether to prevent PostgreSQL from writing
DBdata before corresponding WAL records have been replicated to the standby. That is, if this GUC option is enabled,
PostgreSQLwaits for corresponding WAL records to be not only written to the disk but also replicated to the standby
beforewriting DB data.<p class=""><br /><p class="">So the process becomes as follows:<p class="">(1) A client sends
the"COMMIT" command to the master server.<p class="">(2) The master writes the commit WAL record to the disk.<p
class="">(3)The master sends WAL records continuously to standby up to the commit WAL record.<p class="">(4) The
standbyreceives WAL records, writes them to disk, and then replies the ACK.<p class="">(5) <b>The master then forces a
writeof the data page related to this transaction. </b><p class="">(6) The master returns a success indication to a
clientafter it receives ACK.<p class=""> <p class="">While master is waiting to force a write (point 5) for this data
page,streaming replication continuous. Also other data page writes are not dependent on this particular page write. So
thecommit of data pages are not serialized.<p class="" style="style"><br /><p class="" style="style">Regards,<p
class=""style="style">Samrat<p class=""><br /></div> 

Re: Inconsistent DB data in Streaming Replication

From
Samrat Revagade
Date:
<div dir="ltr"><br />>(5) <b>The master then forces a write of the data page related to this transaction.</b><p
class=""><b>Sorry,this is incorrect. Whenever the master writes the data page it checks that the WAL record is written
instandby till that LSN. </b><p class=""><b><br /></b><p class=""><p class="">>While master is waiting to force a
write(point 5) for this data page, streaming replication continuous. <p class="">>Also other data page writes are
notdependent on this particular page write. So the commit of data >pages are not serialized.<p class=""><b>Sorry,
thisis incorrect. Streaming replication continuous, master is not waiting, whenever the master writes the data page it
checksthat the WAL record is written in standby till that LSN.</b><p class=""><b><br /></b><p class=""
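
In rough C terms, the check could look like the following self-contained
sketch (all names here are invented for illustration; the real check
would hook into the existing WAL-before-data rule in the buffer manager):

    #include <stdbool.h>
    #include <stdint.h>

    typedef uint64_t XLogRecPtr;        /* as in PostgreSQL 9.3 */

    /* Invented stand-ins for state the server already tracks. */
    static XLogRecPtr local_flush_lsn;  /* WAL flushed to the master's disk */
    static XLogRecPtr standby_ack_lsn;  /* WAL acked by the sync standby */
    static bool wal_replicate_before_write = true;  /* the proposed GUC */

    /* May a dirty page whose newest WAL record is at page_lsn be written? */
    static bool
    page_safe_to_write(XLogRecPtr page_lsn)
    {
        /* Existing WAL-before-data rule: WAL must be on local disk first. */
        if (page_lsn > local_flush_lsn)
            return false;
        /* Proposed extra rule: WAL must also be on the standby. */
        if (wal_replicate_before_write && page_lsn > standby_ack_lsn)
            return false;
        return true;
    }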
style="style"><b>Regards,</b><pclass="" style="style"><b>Samrat</b><p class="" style="style"><b><br /></b><p></div> 

Re: Inconsistent DB data in Streaming Replication

From
Amit Kapila
Date:
On Wednesday, April 10, 2013 3:42 PM Samrat Revagade wrote:

>> (5) *The master then forces a write of the data pages related to this
>> transaction.*
> Sorry, this is incorrect. Whenever the master writes a data page, it
> checks that the WAL record up to that page's LSN has been written on the
> standby.

>> While the master is waiting to force a write (point 5) for a data page,
>> streaming replication continues.
>> Also, other data page writes are not dependent on this particular page
>> write, so data page writes are not serialized.

> Sorry, this is incorrect as well. Streaming replication continues and the
> master is not waiting; whenever the master writes a data page, it checks
> that the WAL record up to that page's LSN has been written on the standby.

I am not sure it will resolve the problem completely, as the old master
can have some extra WAL beyond the new master for the same timeline. I
don't remember exactly whether the timeline switch feature takes care of
this extra WAL; Heikki can confirm this point.
Also, I think this can serialize the flush of data pages in
checkpoint/bgwriter, which is currently not the case.

With Regards,
Amit Kapila.







Re: Inconsistent DB data in Streaming Replication

From
Tom Lane
Date:
Amit Kapila <amit.kapila@huawei.com> writes:
> On Wednesday, April 10, 2013 3:42 PM Samrat Revagade wrote:
>> Sorry, this is incorrect as well. Streaming replication continues and the
>> master is not waiting; whenever the master writes a data page, it checks
>> that the WAL record up to that page's LSN has been written on the standby.

> I am not sure it will resolve the problem completely, as the old master
> can have some extra WAL beyond the new master for the same timeline. I
> don't remember exactly whether the timeline switch feature takes care of
> this extra WAL; Heikki can confirm this point.
> Also, I think this can serialize the flush of data pages in
> checkpoint/bgwriter, which is currently not the case.

Yeah.  TBH this entire discussion seems to be "let's cripple performance
in the normal case so that we can skip doing an rsync when resurrecting
a crashed, failed-over master".  This is not merely optimizing for the
wrong thing, it's positively hazardous.  After a fail-over, you should
be wondering whether it's safe to resurrect the old master at all, not
about how fast you can bring it back up without validating its data.
IOW, I wouldn't consider skipping the rsync even if I had a feature
like this.
        regards, tom lane



Re: Inconsistent DB data in Streaming Replication

From
Andres Freund
Date:
On 2013-04-10 10:10:31 -0400, Tom Lane wrote:
> Amit Kapila <amit.kapila@huawei.com> writes:
> > On Wednesday, April 10, 2013 3:42 PM Samrat Revagade wrote:
> >> Sorry, this is incorrect. Streaming replication continuous, master is not
> >> waiting, whenever the master writes the data page it checks that the WAL
> >> record is written in standby till that LSN.
> 
> > I am not sure it will resolve the problem completely as your old-master can
> > have some WAL extra then new-master for same timeline. I don't remember
> > exactly will timeline switch feature
> > take care of this extra WAL, Heikki can confirm this point?
> > Also I think this can serialize flush of data pages in checkpoint/bgwriter
> > which is currently not the case.
> 
> Yeah.  TBH this entire discussion seems to be "let's cripple performance
> in the normal case so that we can skip doing an rsync when resurrecting
> a crashed, failed-over master".  This is not merely optimizing for the
> wrong thing, it's positively hazardous.  After a fail-over, you should
> be wondering whether it's safe to resurrect the old master at all, not
> about how fast you can bring it back up without validating its data.
> IOW, I wouldn't consider skipping the rsync even if I had a feature
> like this.

Agreed. Especially as in situations where you fail over in a planned
way, e.g. for a hardware upgrade, you can avoid the need to resync with
a little bit of care. So it's mostly in catastrophic situations that this
becomes a problem, and in those you really should resync - and it's a
good idea not to use a normal rsync but rsync --checksum or similar.

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: Inconsistent DB data in Streaming Replication

From
Shaun Thomas
Date:
On 04/10/2013 09:10 AM, Tom Lane wrote:

> IOW, I wouldn't consider skipping the rsync even if I had a feature
> like this.

Totally. Out in the field, we consider the "old" database corrupt the 
moment we fail over. There is literally no way to verify the safety of 
any data along the broken chain, given race conditions and multiple 
potential failure points.

The only potential use case for this that I can see, would be for system 
maintenance and a controlled failover. I agree: that's a major PITA when 
doing DR testing, but I personally don't think this is the way to fix 
that particular edge case.

Maybe checksums will fix this in the long run... I don't know. DRBD has 
a handy block-level verify function for things like this, and it can 
re-sync master/slave data by comparing the commit log across the servers 
if you tell it one node should be considered incorrect.

The thing is... we have clogs, and we have WAL. If we can assume 
bidirectional communication and verification (checksum comparison?) of 
both of those components, the database *should* be able to re-sync itself.

Even if that were possible given the internals, I can't see anyone 
jumping on this before 9.4 or 9.5 unless someone sponsors the feature.

Automatic re-sync would (within available WALs) be an awesome feature, 
though...

-- 
Shaun Thomas
OptionsHouse | 141 W. Jackson Blvd. | Suite 500 | Chicago IL, 60604
312-676-8870
sthomas@optionshouse.com




Re: Inconsistent DB data in Streaming Replication

From
Fujii Masao
Date:
On Wed, Apr 10, 2013 at 11:26 PM, Shaun Thomas <sthomas@optionshouse.com> wrote:
> On 04/10/2013 09:10 AM, Tom Lane wrote:
>
>> IOW, I wouldn't consider skipping the rsync even if I had a feature
>> like this.
>
>
> Totally. Out in the field, we consider the "old" database corrupt the moment
> we fail over.

Strange. If this is really true, shared disk failover solution is
fundamentally broken because the standby needs to start up with the
shared "corrupted" database at the failover. Also, we cannot trust crash
recovery at all if we adopt that logic. I think that there are cases
where we can replay WAL and reuse the old database even after PostgreSQL
crashes.

Regards,

--
Fujii Masao



Re: Inconsistent DB data in Streaming Replication

From
Shaun Thomas
Date:
On 04/10/2013 11:40 AM, Fujii Masao wrote:

> Strange. If this is really true, shared disk failover solution is
> fundamentally broken because the standby needs to start up with the
> shared "corrupted" database at the failover.

How so? Shared disk doesn't use replication. The point I was trying to 
make is that replication requires synchronization between two disparate 
servers, and verifying they have exactly the same data is a non-trivial 
exercise. Even a single transaction after a failover (effectively) 
negates the old server because there's no easy "catch up" mechanism yet.

Even if this isn't necessarily true, it's the safest approach IMO.

-- 
Shaun Thomas
OptionsHouse | 141 W. Jackson Blvd. | Suite 500 | Chicago IL, 60604
312-676-8870
sthomas@optionshouse.com




Re: Inconsistent DB data in Streaming Replication

From
Fujii Masao
Date:
On Wed, Apr 10, 2013 at 11:16 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2013-04-10 10:10:31 -0400, Tom Lane wrote:
>> Amit Kapila <amit.kapila@huawei.com> writes:
>> > On Wednesday, April 10, 2013 3:42 PM Samrat Revagade wrote:
>> >> Sorry, this is incorrect. Streaming replication continuous, master is not
>> >> waiting, whenever the master writes the data page it checks that the WAL
>> >> record is written in standby till that LSN.
>>
>> > I am not sure it will resolve the problem completely as your old-master can
>> > have some WAL extra then new-master for same timeline. I don't remember
>> > exactly will timeline switch feature
>> > take care of this extra WAL, Heikki can confirm this point?
>> > Also I think this can serialize flush of data pages in checkpoint/bgwriter
>> > which is currently not the case.
>>
>> Yeah.  TBH this entire discussion seems to be "let's cripple performance
>> in the normal case so that we can skip doing an rsync when resurrecting
>> a crashed, failed-over master".  This is not merely optimizing for the
>> wrong thing, it's positively hazardous.  After a fail-over, you should
>> be wondering whether it's safe to resurrect the old master at all, not
>> about how fast you can bring it back up without validating its data.
>> IOW, I wouldn't consider skipping the rsync even if I had a feature
>> like this.
>
> Agreed. Especially as in situations where you fail over in a planned
> way, e.g. for a hardware upgrade, you can avoid the need to resync with
> a little bit of care.

That's really worth documenting.

> So it's mostly in catastrophic situations that this becomes a problem,
> and in those you really should resync - and it's a good idea not to use
> a normal rsync but rsync --checksum or similar.

If the database is very large, rsync --checksum takes very long. And I'm
concerned that most of the data pages in the master will have different
checksums from those in the standby because of commit hint bits. I'm not
sure how rsync --checksum can speed up the backup after failover.

Regards,

-- 
Fujii Masao



Re: Inconsistent DB data in Streaming Replication

From
Fujii Masao
Date:
On Thu, Apr 11, 2013 at 1:44 AM, Shaun Thomas <sthomas@optionshouse.com> wrote:
> On 04/10/2013 11:40 AM, Fujii Masao wrote:
>
>> Strange. If this is really true, shared disk failover solution is
>> fundamentally broken because the standby needs to start up with the
>> shared "corrupted" database at the failover.
>
>
> How so? Shared disk doesn't use replication. The point I was trying to make
> is that replication requires synchronization between two disparate servers,
> and verifying they have exactly the same data is a non-trivial exercise.
> Even a single transaction after a failover (effectively) negates the old
> server because there's no easy "catch up" mechanism yet.

Hmm... ISTM what Samrat is proposing can resolve the problem. That is,
if we can assume that any data page whose WAL has not been replicated to
the standby is never written on the master, the new standby (i.e., the
old master) can safely catch up with the new master (i.e., the old
standby). In this approach, of course, the new standby might have some
WAL records which the new master doesn't have, so before starting up the
new standby we need to remove all the WAL files on the new standby and
retrieve the WAL files from the new master. But what's the problem with
this approach?

Regards,

-- 
Fujii Masao



Re: Inconsistent DB data in Streaming Replication

From
Ants Aasma
Date:
On Wed, Apr 10, 2013 at 7:44 PM, Shaun Thomas <sthomas@optionshouse.com> wrote:
> On 04/10/2013 11:40 AM, Fujii Masao wrote:
>
>> Strange. If this is really true, shared disk failover solution is
>> fundamentally broken because the standby needs to start up with the
>> shared "corrupted" database at the failover.
>
>
> How so? Shared disk doesn't use replication. The point I was trying to make
> is that replication requires synchronization between two disparate servers,
> and verifying they have exactly the same data is a non-trivial exercise.
> Even a single transaction after a failover (effectively) negates the old
> server because there's no easy "catch up" mechanism yet.
>
> Even if this isn't necessarily true, it's the safest approach IMO.

We already rely on WAL-before-data to ensure correct recovery. What is
proposed here is to slightly redefine it to require WAL to be
replicated before it is considered to be flushed. This ensures that no
data page on disk differs from the WAL that the slave has. The
machinery to do this is already mostly there, we already wait for WAL
flushes and we know the write location on the slave. The second
requirement is that we never start up as master and we don't trust any
local WAL. This is actually how pacemaker clusters work, you would
only need to amend the RA to wipe the WAL and configure postgresql
with restart_after_crash = false.

It would be very helpful in restoring HA capability after failover if
we wouldn't have to read through the whole database after a VM goes
down and is migrated with the shared disk onto a new host.

Regards,
Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de



Re: Inconsistent DB data in Streaming Replication

From
Tom Lane
Date:
Ants Aasma <ants@cybertec.at> writes:
> We already rely on WAL-before-data to ensure correct recovery. What is
> proposed here is to slightly redefine it to require WAL to be
> replicated before it is considered to be flushed. This ensures that no
> data page on disk differs from the WAL that the slave has. The
> machinery to do this is already mostly there, we already wait for WAL
> flushes and we know the write location on the slave. The second
> requirement is that we never start up as master and we don't trust any
> local WAL. This is actually how pacemaker clusters work, you would
> only need to amend the RA to wipe the WAL and configure postgresql
> with restart_after_crash = false.

> It would be very helpful in restoring HA capability after failover if
> we wouldn't have to read through the whole database after a VM goes
> down and is migrated with the shared disk onto a new host.

The problem with this is it's making an idealistic assumption that a
crashed master didn't do anything wrong or lose/corrupt any data during
its crash.  As soon as you realize that's an unsafe assumption, the
whole thing becomes worthless to you.

If the idea had zero implementation cost, I would say "sure, let people
play with it until they find out (probably the hard way) that it's a bad
idea".  But it's going to introduce, at the very least, additional
complexity into a portion of the system that is critical and plenty
complicated enough already.  That being the case, I don't want it there
at all, not even as an option.
        regards, tom lane



Re: Inconsistent DB data in Streaming Replication

From
Boszormenyi Zoltan
Date:
On 2013-04-10 18:46, Fujii Masao wrote:
> On Wed, Apr 10, 2013 at 11:16 PM, Andres Freund <andres@2ndquadrant.com> wrote:
>> On 2013-04-10 10:10:31 -0400, Tom Lane wrote:
>>> Amit Kapila <amit.kapila@huawei.com> writes:
>>>> On Wednesday, April 10, 2013 3:42 PM Samrat Revagade wrote:
>>>>> Sorry, this is incorrect. Streaming replication continuous, master is not
>>>>> waiting, whenever the master writes the data page it checks that the WAL
>>>>> record is written in standby till that LSN.
>>>> I am not sure it will resolve the problem completely as your old-master can
>>>> have some WAL extra then new-master for same timeline. I don't remember
>>>> exactly will timeline switch feature
>>>> take care of this extra WAL, Heikki can confirm this point?
>>>> Also I think this can serialize flush of data pages in checkpoint/bgwriter
>>>> which is currently not the case.
>>> Yeah.  TBH this entire discussion seems to be "let's cripple performance
>>> in the normal case so that we can skip doing an rsync when resurrecting
>>> a crashed, failed-over master".  This is not merely optimizing for the
>>> wrong thing, it's positively hazardous.  After a fail-over, you should
>>> be wondering whether it's safe to resurrect the old master at all, not
>>> about how fast you can bring it back up without validating its data.
>>> IOW, I wouldn't consider skipping the rsync even if I had a feature
>>> like this.
>> Agreed. Especially as in situations where you fall over in a planned
>> way, e.g. for a hardware upgrade, you can avoid the need to resync with
>> a littlebit of care.
> It's really worth documenting that way.
>
>> So its mostly in catastrophic situations this
>> becomes a problem and in those you really should resync - and its a good
>> idea not to use a normal rsync but a rsync --checksum or similar.
> If the database is very large, rsync --checksum takes very long. And I'm
> concerned that most of the data pages in the master will have different
> checksums from those in the standby because of commit hint bits. I'm not
> sure how rsync --checksum can speed up the backup after failover.

"rsync --checksum" alone may not but "rsync --inplace" may speed up backup a lot.

>
> Regards,
>


--
----------------------------------
Zoltán Böszörményi
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt, Austria
Web: http://www.postgresql-support.de     http://www.postgresql.at/




Re: Inconsistent DB data in Streaming Replication

From
Andres Freund
Date:
On 2013-04-10 20:39:25 +0200, Boszormenyi Zoltan wrote:
> On 2013-04-10 18:46, Fujii Masao wrote:
> >On Wed, Apr 10, 2013 at 11:16 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> >>On 2013-04-10 10:10:31 -0400, Tom Lane wrote:
> >>>Amit Kapila <amit.kapila@huawei.com> writes:
> >>>>On Wednesday, April 10, 2013 3:42 PM Samrat Revagade wrote:
> >>>>>Sorry, this is incorrect. Streaming replication continuous, master is not
> >>>>>waiting, whenever the master writes the data page it checks that the WAL
> >>>>>record is written in standby till that LSN.
> >>>>I am not sure it will resolve the problem completely as your old-master can
> >>>>have some WAL extra then new-master for same timeline. I don't remember
> >>>>exactly will timeline switch feature
> >>>>take care of this extra WAL, Heikki can confirm this point?
> >>>>Also I think this can serialize flush of data pages in checkpoint/bgwriter
> >>>>which is currently not the case.
> >>>Yeah.  TBH this entire discussion seems to be "let's cripple performance
> >>>in the normal case so that we can skip doing an rsync when resurrecting
> >>>a crashed, failed-over master".  This is not merely optimizing for the
> >>>wrong thing, it's positively hazardous.  After a fail-over, you should
> >>>be wondering whether it's safe to resurrect the old master at all, not
> >>>about how fast you can bring it back up without validating its data.
> >>>IOW, I wouldn't consider skipping the rsync even if I had a feature
> >>>like this.
> >>Agreed. Especially as in situations where you fall over in a planned
> >>way, e.g. for a hardware upgrade, you can avoid the need to resync with
> >>a littlebit of care.
> >It's really worth documenting that way.
> >
> >>So its mostly in catastrophic situations this
> >>becomes a problem and in those you really should resync - and its a good
> >>idea not to use a normal rsync but a rsync --checksum or similar.
> >If database is very large, rsync --checksum takes very long. And I'm concerned
> >that most of data pages in master has the different checksum from those in the
> >standby because of commit hint bit. I'm not sure how rsync --checksum can
> >speed up the backup after failover.

It's not about speed, it's about correctness.

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: Inconsistent DB data in Streaming Replication

From
Amit Kapila
Date:
On Wednesday, April 10, 2013 10:31 PM Fujii Masao wrote:
> On Thu, Apr 11, 2013 at 1:44 AM, Shaun Thomas
> <sthomas@optionshouse.com> wrote:
> > On 04/10/2013 11:40 AM, Fujii Masao wrote:
> >
> >> Strange. If this is really true, shared disk failover solution is
> >> fundamentally broken because the standby needs to start up with the
> >> shared "corrupted" database at the failover.
> >
> > How so? Shared disk doesn't use replication. The point I was trying to
> > make is that replication requires synchronization between two disparate
> > servers, and verifying they have exactly the same data is a non-trivial
> > exercise. Even a single transaction after a failover (effectively)
> > negates the old server because there's no easy "catch up" mechanism yet.
>
> Hmm... ISTM what Samrat is proposing can resolve the problem. That is,
> if we can assume that any data page whose WAL has not been replicated to
> the standby is never written on the master, the new standby (i.e., the
> old master) can safely catch up with the new master (i.e., the old
> standby). In this approach, of course, the new standby might have some
> WAL records which the new master doesn't have, so before starting up the
> new standby we need to remove all the WAL files on the new standby and
> retrieve the WAL files from the new master. But what's the problem with
> this approach?

Consider the case where the old master crashed while flushing a data
page; now you would need the full-page image from the new master. It
might so happen that on the new master a checkpoint has purged (reused)
the log files from that timeline, in which case it will be difficult to
get the full-page image. The user can refer to the WAL archive for that,
but I think it will not be straightforward.

One more point: what will the new behavior be when there are two
transactions, one with synchronous_commit = off and the other with it
on?

With Regards,
Amit Kapila.




Re: Inconsistent DB data in Streaming Replication

From
Ants Aasma
Date:
On Thu, Apr 11, 2013 at 10:09 AM, Amit Kapila <amit.kapila@huawei.com> wrote:
> Consider the case where the old master crashed while flushing a data
> page; now you would need the full-page image from the new master. It
> might so happen that on the new master a checkpoint has purged (reused)
> the log files from that timeline, in which case it will be difficult to
> get the full-page image. The user can refer to the WAL archive for that,
> but I think it will not be straightforward.

Using a WAL archive is standard procedure when you do not wish to take
new base backups all the time. This already works now: when archive
recovery is set up, the WAL will be automatically fetched from the
archive.

> One more point: what will the new behavior be when there are two
> transactions, one with synchronous_commit = off and the other with it
> on?

Exactly the same as now - waiting for commit record replication when
reporting commit success to the client and waiting for WAL replication
before writing a data page are orthogonal features.

Regards,
Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de



Re: Inconsistent DB data in Streaming Replication

From
Sameer Thakur
Date:
Hello,
> The only potential use case for this that I can see, would be for system
> maintenance and a controlled failover. I agree: that's a major PITA when
> doing DR testing, but I personally don't think this is the way to fix
> that particular edge case.

This is the use case we are trying to address (I work with Samrat). We
were wondering why this may not be a good fix?

regards
Sameer



Re: Inconsistent DB data in Streaming Replication

From
Hannu Krosing
Date:
On 04/11/2013 01:26 PM, Sameer Thakur wrote:
> Hello,
> > The only potential use case for this that I can see, would be for
> > system maintenance and a controlled failover. I agree: that's a major
> > PITA when doing DR testing, but I personally don't think this is the
> > way to fix that particular edge case.
>
> This is the use case we are trying to address (I work with Samrat).

You just shut down the old master and let the standby catch
up (takes a few microseconds ;) ) before you promote it.

After this you can start up the former master with recovery.conf
and it will follow nicely.

> We were wondering why this may not be a good fix?

The proposed fix - halting all writes of data pages to disk and
to WAL files while waiting ACK from standby - will tremendously
slow down all parallel work on master.

And it does just turn around "master is ahead of slave" problem
into "slave is ahead of master" problem :)

Regards
Hannu



Re: Inconsistent DB data in Streaming Replication

From
Ants Aasma
Date:
On Thu, Apr 11, 2013 at 4:25 PM, Hannu Krosing <hannu@2ndquadrant.com> wrote:
> The proposed fix - halting all writes of data pages to disk and
> to WAL files while waiting ACK from standby - will tremendously
> slow down all parallel work on master.

This is not what is being proposed. The proposed fix halts writes of
only those data pages that were modified within the window of WAL that
is not yet ACKed by the slave - that is, pages that were recently
modified and that the clocksweep or checkpoint has decided to evict.
This only affects the checkpointer, bgwriter and backends doing
allocation. Furthermore, for the backend clocksweep case it would be
reasonable to just pick another buffer to evict, as sketched below. The
slowdown for most actual cases will be negligible.
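
Roughly like this, say (a sketch with invented helper names, not a
patch):

    /*
     * Victim-buffer search that skips dirty buffers whose WAL the
     * synchronous standby has not yet acknowledged, rather than
     * stalling to wait for the ACK.
     */
    for (int tries = 0; tries < NBuffers; tries++)
    {
        candidate = clocksweep_next_candidate();    /* invented helper */

        if (!buffer_is_dirty(candidate))
            return candidate;               /* clean: evict freely */
        if (buffer_lsn(candidate) <= standby_ack_lsn)
            return candidate;               /* WAL replicated: writable */
        /* otherwise skip it and keep sweeping */
    }
    return InvalidBuffer;   /* worst case: fall back to waiting for an ACK */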

> And it does just turn around "master is ahead of slave" problem
> into "slave is ahead of master" problem :)

The issue is not being ahead or behind. The issue is ensuring WAL
durability in the face of failovers before modifying data pages. This
is sufficient to guarantee no forks in the WAL stream from the point
of view of data files and with that the capability to always recover
by replaying WAL. There can still be forks from the point of view of
async commits, with most recent commits disappearing on failover, but
this is in no way different from what we have now.

I don't share the view that the disk image is extremely likely to be
corrupt after a crash. If that were the case then we should recommend
that people don't use crash recovery at all and always restore from a
backup. For errors like power supply failure, uncorrectable ECC
errors, etc. we can be pretty sure that the server was not writing
garbage into the storage system before failing. Having to do a day-long
rsync run plus recovery to catch up on all the changes made during the
resync, just to restore high-availability safety, is in those
circumstances in many cases the larger risk.

Regards,
Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de



Re: Inconsistent DB data in Streaming Replication

From
Tom Lane
Date:
Ants Aasma <ants@cybertec.at> writes:
> On Thu, Apr 11, 2013 at 4:25 PM, Hannu Krosing <hannu@2ndquadrant.com> wrote:
>> The proposed fix - halting all writes of data pages to disk and
>> to WAL files while waiting ACK from standby - will tremendously
>> slow down all parallel work on master.

> This is not what is being proposed.

The claim that it won't slow things down is lacking in credible
evidence.  Moreover, whether there's any performance hit or not, it will
certainly add complexity (and risk of bugs) in a place where we can ill
afford more.  And the bottom line remains: this is optimizing for the
wrong thing.
        regards, tom lane



Re: Inconsistent DB data in Streaming Replication

From
Hannu Krosing
Date:
On 04/11/2013 03:52 PM, Ants Aasma wrote:
> On Thu, Apr 11, 2013 at 4:25 PM, Hannu Krosing <hannu@2ndquadrant.com> wrote:
>> The proposed fix - halting all writes of data pages to disk and
>> to WAL files while waiting ACK from standby - will tremendously
>> slow down all parallel work on master.
> This is not what is being proposed. The proposed fix halts writes of
> only data pages that are modified within the window of WAL that is not
> yet ACKed by the slave. This means pages that were recently modified
> and where the clocksweep or checkpoint has decided to evict them. This
> only affects the checkpointer, bgwriter and backends doing allocation.
> Furthermore, for the backend clocksweep case it would be reasonable to
> just pick another buffer to evict. The slowdown for most actual cases
> will be negligible.
You also need to hold back all WAL writes, including the ones by
parallel async and locally-synced transactions. Which means that
you have to make all locally-synced transactions wait on the
syncrep transactions committed before them.
After getting the ACK from the slave you then have a backlog of stuff
to write locally, which then also needs to be sent to the slave. Basically
this turns a nice smooth WAL write-and-stream pipeline into a
chunky wait-and-write-and-wait-and-stream-and-wait :P
This may not be a problem under light write loads, which are
probably the most common use case for postgres, but it
will harm top performance and also force people to get much
better (and more expensive) hardware than would otherwise
be needed.
>
>> And it does just turn around "master is ahead of slave" problem
>> into "slave is ahead of master" problem :)
> The issue is not being ahead or behind. The issue is ensuring WAL
> durability in the face of failovers before modifying data pages. This
> is sufficient to guarantee no forks in the WAL stream from the point
> of view of data files and with that the capability to always recover
> by replaying WAL.
How would this handle the case Tom pointed out, namely a short
power cycle on the master?

Instead of just continuing after booting up again, the master now
has to figure out if it had any slaves and then try to query them
(for how long?) to see if they had replayed any WAL the master does
not know of.

Suddenly the pure existence of streaming replica slaves has become
a problem for the master!

This will especially complicate the case of multiple slaves, each
having received WAL up to a slightly different LSN. And you do want
to have at least 2 slaves if you want both durability
and availability with syncrep.

What if one of the slaves disconnects? How should the master react to this?


Regards
Hannu Krosing



Re: Inconsistent DB data in Streaming Replication

From
Ants Aasma
Date:
On Thu, Apr 11, 2013 at 5:33 PM, Hannu Krosing <hannu@2ndquadrant.com> wrote:
> On 04/11/2013 03:52 PM, Ants Aasma wrote:
>>
>> On Thu, Apr 11, 2013 at 4:25 PM, Hannu Krosing <hannu@2ndquadrant.com>
>> wrote:
>>>
>>> The proposed fix - halting all writes of data pages to disk and
>>> to WAL files while waiting ACK from standby - will tremendously
>>> slow down all parallel work on master.
>>
>> This is not what is being proposed. The proposed fix halts writes of
>> only data pages that are modified within the window of WAL that is not
>> yet ACKed by the slave. This means pages that were recently modified
>> and where the clocksweep or checkpoint has decided to evict them. This
>> only affects the checkpointer, bgwriter and backends doing allocation.
>> Furthermore, for the backend clocksweep case it would be reasonable to
>> just pick another buffer to evict. The slowdown for most actual cases
>> will be negligible.
>
> You also need to hold back all WAL writes, including the ones by
> parallel async and locally-synced transactions. Which means that
> you have to make all locally-synced transactions wait on the
> syncrep transactions committed before them.
> After getting the ACK from the slave you then have a backlog of stuff
> to write locally, which then also needs to be sent to the slave. Basically
> this turns a nice smooth WAL write-and-stream pipeline into a
> chunky wait-and-write-and-wait-and-stream-and-wait :P
> This may not be a problem under light write loads, which are
> probably the most common use case for postgres, but it
> will harm top performance and also force people to get much
> better (and more expensive) hardware than would otherwise
> be needed.

Why would you need to hold back WAL writes? WAL is written on the master
first and then streamed to the slave as it is done now. You would only need
to hold back dirty-page evictions having a recent enough LSN to not yet
be replicated. This holding back is already done to wait for local WAL
flushes, see bufmgr.c:1976 and bufmgr.c:669. When a page gets dirtied
its usage count gets bumped, so it will not be considered for
eviction for at least one clocksweep cycle. In normal circumstances
that will be enough time to get an ACK from the slave. When WAL is
generated at a higher rate than can be replicated this will not be
true. In that case backends that need to bring in new pages will have
to wait for WAL to be replicated before they can continue. That will
hopefully include the backends that are doing the dirtying, throttling
the WAL generation rate. This would definitely be optional behavior,
not something turned on by default.
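
For reference, the rule being referred to looks schematically like this
(condensed from bufmgr.c's FlushBuffer(); variable handling abbreviated).
The proposal would only change what XLogFlush() counts as safely flushed:

    /* Schematic of the existing WAL-before-data rule: WAL must be
     * flushed at least up to the page's last-change LSN before the
     * data page itself may be written. */
    recptr = BufferGetLSN(buf);     /* LSN of last WAL record touching page */
    XLogFlush(recptr);              /* today: waits for the local WAL flush */
    smgrwrite(reln, forkNum, blockNum, bufToWrite, false);  /* then write */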

>>
>>> And it does just turn around "master is ahead of slave" problem
>>> into "slave is ahead of master" problem :)
>>
>> The issue is not being ahead or behind. The issue is ensuring WAL
>> durability in the face of failovers before modifying data pages. This
>> is sufficient to guarantee no forks in the WAL stream from the point
>> of view of data files and with that the capability to always recover
>> by replaying WAL.
>
> How would this handle the case Tom pointed out, namely a short
> power cycle on the master?
>
> Instead of just continuing after booting up again, the master now
> has to figure out if it had any slaves and then try to query them
> (for how long?) to see if they had replayed any WAL the master does
> not know of.

If the master is restarted and there is no failover to the slave, then
nothing strange would happen, master does recovery, comes up and
starts streaming to the slave again. If there is a failover, then
whatever is managing the failover needs to ensure that the master does
not come up again on its own before it is reconfigured as a slave.
This is what HA cluster managers do.

> Suddenly the pure existence of streaming replica slaves has become
> a problem for the master!
>
> This will especially complicate the case of multiple slaves, each
> having received WAL up to a slightly different LSN. And you do want
> to have at least 2 slaves if you want both durability
> and availability with syncrep.
>
> What if one of the slaves disconnects? How should the master react to this?

Again, WAL replication will be the same as it is now. Availability
considerations, including what to do when slaves go away, are the same
as for current sync replication. Only required change is that we can
configure the master to hold out on writing any data pages that
contain changes that might go missing in the case of a failover.

Whether the additional complexity is worth the feature is a matter of
opinion. As we have no patch yet I can't say that I know what all the
implications are, but at first glance the complexity seems rather
compartmentalized. This would only amend what the concept of a WAL
flush considers safely flushed.

Regards,
Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de



Re: Inconsistent DB data in Streaming Replication

From
Fujii Masao
Date:
On Thu, Apr 11, 2013 at 2:42 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Ants Aasma <ants@cybertec.at> writes:
>> We already rely on WAL-before-data to ensure correct recovery. What is
>> proposed here is to slightly redefine it to require WAL to be
>> replicated before it is considered to be flushed. This ensures that no
>> data page on disk differs from the WAL that the slave has. The
>> machinery to do this is already mostly there, we already wait for WAL
>> flushes and we know the write location on the slave. The second
>> requirement is that we never start up as master and we don't trust any
>> local WAL. This is actually how pacemaker clusters work, you would
>> only need to amend the RA to wipe the WAL and configure postgresql
>> with restart_after_crash = false.
>
>> It would be very helpful in restoring HA capability after failover if
>> we wouldn't have to read through the whole database after a VM goes
>> down and is migrated with the shared disk onto a new host.
>
> The problem with this is it's making an idealistic assumption that a
> crashed master didn't do anything wrong or lose/corrupt any data during
> its crash.  As soon as you realize that's an unsafe assumption, the
> whole thing becomes worthless to you.

Crash recovery relies on the same assumption. If that assumption is
really unsafe, we should stop supporting crash recovery. But I don't
think it's always unsafe.

Regards,

-- 
Fujii Masao



Re: Inconsistent DB data in Streaming Replication

From
Fujii Masao
Date:
On Thu, Apr 11, 2013 at 10:25 PM, Hannu Krosing <hannu@2ndquadrant.com> wrote:
>
> You just shut down the old master and let the standby catch
> up (takes a few microseconds ;) ) before you promote it.
>
> After this you can start up the former master with recovery.conf
> and it will follow nicely.

No. When you shut down the old master, it might not have been
able to send all the WAL records to the standby. I have observed
this situation several times. So in your approach, new standby
might fail to catch up with the master nicely.

Regards,

--
Fujii Masao



Re: Inconsistent DB data in Streaming Replication

From
Fujii Masao
Date:
On Fri, Apr 12, 2013 at 12:09 AM, Ants Aasma <ants@cybertec.at> wrote:
> On Thu, Apr 11, 2013 at 5:33 PM, Hannu Krosing <hannu@2ndquadrant.com> wrote:
>> On 04/11/2013 03:52 PM, Ants Aasma wrote:
>>>
>>> On Thu, Apr 11, 2013 at 4:25 PM, Hannu Krosing <hannu@2ndquadrant.com>
>>> wrote:
>>>>
>>>> The proposed fix - halting all writes of data pages to disk and
>>>> to WAL files while waiting ACK from standby - will tremendously
>>>> slow down all parallel work on master.
>>>
>>> This is not what is being proposed. The proposed fix halts writes of
>>> only data pages that are modified within the window of WAL that is not
>>> yet ACKed by the slave. This means pages that were recently modified
>>> and where the clocksweep or checkpoint has decided to evict them. This
>>> only affects the checkpointer, bgwriter and backends doing allocation.
>>> Furthermore, for the backend clocksweep case it would be reasonable to
>>> just pick another buffer to evict. The slowdown for most actual cases
>>> will be negligible.
>>
>> You also need to hold back all WAL writes, including the ones by
>> parallel async and locally-synced transactions. Which means that
>> you have to make all locally-synced transactions wait on the
>> syncrep transactions committed before them.
>> After getting the ACK from the slave you then have a backlog of stuff
>> to write locally, which then also needs to be sent to the slave. Basically
>> this turns a nice smooth WAL write-and-stream pipeline into a
>> chunky wait-and-write-and-wait-and-stream-and-wait :P
>> This may not be a problem under light write loads, which are
>> probably the most common use case for postgres, but it
>> will harm top performance and also force people to get much
>> better (and more expensive) hardware than would otherwise
>> be needed.
>
> Why would you need to hold back WAL writes? WAL is written on the master
> first and then streamed to the slave as it is done now. You would only need
> to hold back dirty-page evictions having a recent enough LSN to not yet
> be replicated. This holding back is already done to wait for local WAL
> flushes, see bufmgr.c:1976 and bufmgr.c:669. When a page gets dirtied
> its usage count gets bumped, so it will not be considered for
> eviction for at least one clocksweep cycle. In normal circumstances
> that will be enough time to get an ACK from the slave. When WAL is
> generated at a higher rate than can be replicated this will not be
> true. In that case backends that need to bring in new pages will have
> to wait for WAL to be replicated before they can continue. That will
> hopefully include the backends that are doing the dirtying, throttling
> the WAL generation rate. This would definitely be optional behavior,
> not something turned on by default.
>
>>>
>>>> And it does just turn around "master is ahead of slave" problem
>>>> into "slave is ahead of master" problem :)
>>>
>>> The issue is not being ahead or behind. The issue is ensuring WAL
>>> durability in the face of failovers before modifying data pages. This
>>> is sufficient to guarantee no forks in the WAL stream from the point
>>> of view of data files and with that the capability to always recover
>>> by replaying WAL.
>>
>> How would this handle the case Tom pointed out, namely a short
>> power cycle on the master?
>>
>> Instead of just continuing after booting up again, the master now
>> has to figure out if it had any slaves and then try to query them
>> (for how long?) to see if they had replayed any WAL the master does
>> not know of.
>
> If the master is restarted and there is no failover to the slave, then
> nothing strange would happen, master does recovery, comes up and
> starts streaming to the slave again. If there is a failover, then
> whatever is managing the failover needs to ensure that the master does
> not come up again on its own before it is reconfigured as a slave.
> This is what HA cluster managers do.
>
>> Suddenly the pure existence of streaming replica slaves has become
>> a problem for the master!
>>
>> This will especially complicate the case of multiple slaves, each
>> having received WAL up to a slightly different LSN. And you do want
>> to have at least 2 slaves if you want both durability
>> and availability with syncrep.
>>
>> What if one of the slaves disconnects? How should the master react to this?
>
> Again, WAL replication will be the same as it is now. Availability
> considerations, including what to do when slaves go away, are the same
> as for current sync replication. Only required change is that we can
> configure the master to hold out on writing any data pages that
> contain changes that might go missing in the case of a failover.
>
> Whether the additional complexity is worth the feature is a matter of
> opinion. As we have no patch yet I can't say that I know what all the
> implications are, but at first glance the complexity seems rather
> compartmentalized. This would only amend what the concept of a WAL
> flush considers safely flushed.

I really share the same view!

Regards,

-- 
Fujii Masao



Re: Inconsistent DB data in Streaming Replication

From
Pavan Deolasee
Date:



On Thu, Apr 11, 2013 at 8:39 PM, Ants Aasma <ants@cybertec.at> wrote:
On Thu, Apr 11, 2013 at 5:33 PM, Hannu Krosing <hannu@2ndquadrant.com> wrote:
> On 04/11/2013 03:52 PM, Ants Aasma wrote:
>>
>> On Thu, Apr 11, 2013 at 4:25 PM, Hannu Krosing <hannu@2ndquadrant.com>
>> wrote:
>>>
>>> The proposed fix - halting all writes of data pages to disk and
>>> to WAL files while waiting ACK from standby - will tremendously
>>> slow down all parallel work on master.
>>
>> This is not what is being proposed. The proposed fix halts writes of
>> only data pages that are modified within the window of WAL that is not
>> yet ACKed by the slave. This means pages that were recently modified
>> and where the clocksweep or checkpoint has decided to evict them. This
>> only affects the checkpointer, bgwriter and backends doing allocation.
>> Furthermore, for the backend clocksweep case it would be reasonable to
>> just pick another buffer to evict. The slowdown for most actual cases
>> will be negligible.
>
> You also need to hold back all WAL writes, including the ones by
> parallel async and locally-synced transactions. Which means that
> you have to make all locally-synced transactions wait on the
> syncrep transactions committed before them.
> After getting the ACK from the slave you then have a backlog of stuff
> to write locally, which then also needs to be sent to the slave. Basically
> this turns a nice smooth WAL write-and-stream pipeline into a
> chunky wait-and-write-and-wait-and-stream-and-wait :P
> This may not be a problem under light write loads, which are
> probably the most common use case for postgres, but it
> will harm top performance and also force people to get much
> better (and more expensive) hardware than would otherwise
> be needed.

Why would you need to hold back WAL writes? WAL is written on the master
first and then streamed to the slave as it is done now. You would only need
to hold back dirty-page evictions having a recent enough LSN to not yet
be replicated. This holding back is already done to wait for local WAL
flushes, see bufmgr.c:1976 and bufmgr.c:669. When a page gets dirtied
its usage count gets bumped, so it will not be considered for
eviction for at least one clocksweep cycle. In normal circumstances
that will be enough time to get an ACK from the slave. When WAL is
generated at a higher rate than can be replicated this will not be
true. In that case backends that need to bring in new pages will have
to wait for WAL to be replicated before they can continue. That will
hopefully include the backends that are doing the dirtying, throttling
the WAL generation rate. This would definitely be optional behavior,
not something turned on by default.


I agree. I don't think the proposed change would cause much of a performance bottleneck, since the proposal is to hold back writing of dirty pages until the WAL is replicated successfully to the standby. The heap pages are mostly written by the background processes, often much later than the WAL for the change is written. So in all likelihood, there will be no wait involved. Of course, this will not be true for very frequently updated pages that must be written at a checkpoint.

But I wonder if the problem is really limited to the heap pages? Even for something like a CLOG page, we will need to ensure that the WAL records are replicated before the page is written to disk. The same is true for relation truncation. In fact, every place where the master needs to call XLogFlush() probably needs to be examined to decide whether the subsequent action has a chance of leaving the database corrupt, and to ensure that the WAL is replicated before proceeding with the change.

Tom has a very valid concern from the additional-code-complexity point of view, though I disagree that it's always a good idea to start with a fresh rsync. If we can avoid that with the right checks, I don't see why we should not improve the downtime for the master. It's very likely that the standby is not as good a server as the master, and the user would want to quickly switch back to the master for performance reasons. To reduce complexity, can we do this as some sort of plugin for XLogFlush(), which gets to know that XLogFlush() has flushed up to the given LSN and the event that caused the function to be called? We can then leave the handling of the event to the implementer. This will also avoid any penalty for those who are happy with the current mechanism and do not want any complex HA setups.
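
For illustration, such a hook might look roughly like the following,
using the usual PostgreSQL hook idiom; the hook name, event argument
and call site are invented here, not an existing API:

    /* Invented sketch, not an existing API. After WAL up to 'lsn' is
     * locally durable, an HA plugin could additionally wait for a
     * standby ACK; a NULL hook keeps today's behavior and costs. */
    typedef void (*XLogFlushDone_hook_type) (XLogRecPtr lsn, int event);

    XLogFlushDone_hook_type XLogFlushDone_hook = NULL;

    void
    XLogFlush(XLogRecPtr record)
    {
        /* ... existing local flush logic ... */

        if (XLogFlushDone_hook)
            (*XLogFlushDone_hook) (record, XLOG_FLUSH_EVENT);   /* invented */
    }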

Thanks,
Pavan

Re: Inconsistent DB data in Streaming Replication

From
Hannu Krosing
Date:
On 04/11/2013 07:29 PM, Fujii Masao wrote:
> On Thu, Apr 11, 2013 at 10:25 PM, Hannu Krosing <hannu@2ndquadrant.com> wrote:
>> You just shut down the old master and let the standby catch
>> up (takes a few microseconds ;) ) before you promote it.
>>
>> After this you can start up the former master with recovery.conf
>> and it will follow nicely.
> No. When you shut down the old master, it might not have been
> able to send all the WAL records to the standby.
In what cases (other than a standby lagging too much or
not listening at all) have you observed this?
> I have observed
> this situation several times. So in your approach, new standby
> might fail to catch up with the master nicely.
The page http://wiki.postgresql.org/wiki/Streaming_Replication claims this:

* Graceful shutdown

When smart/fast shutdown is requested, the primary waits to exit
until XLOG records have been sent to the standby, up to the
shutdown checkpoint record.

Maybe you were requesting an immediate shutdown?

Regards
Hannu Krosing




Re: Inconsistent DB data in Streaming Replication

From
Andres Freund
Date:
On 2013-04-12 02:29:01 +0900, Fujii Masao wrote:
> On Thu, Apr 11, 2013 at 10:25 PM, Hannu Krosing <hannu@2ndquadrant.com> wrote:
> >
> > You just shut down the old master and let the standby catch
> > up (takes a few microseconds ;) ) before you promote it.
> >
> > After this you can start up the former master with recovery.conf
> > and it will follow nicely.
>
> No. When you shut down the old master, it might not have been
> able to send all the WAL records to the standby. I have observed
> this situation several times. So in your approach, new standby
> might fail to catch up with the master nicely.

It seems most of this thread is focusing on the wrong thing then. If we
really are only talking about planned failover then we need to solve
*that* not some ominous "don't flush data too early" which has
noticeable performance and implementation complexity problems.

I guess you're observing that not everything is replicated because you're
doing an immediate shutdown - probably because performing the shutdown
checkpoint would take too long. This seems solvable by implementing a
recovery connection command which initiates a shutdown that just
disables future WAL inserts and returns the last LSN that has been
written. Then you can fail over as soon as that LSN has been reached
and can make the previous master follow from there on without problems.

You could even teach the standby not to increment the timeline in that
case since that's safe.

The biggest issue seems to be how to implement this without another
spinlock acquisition for every XLogInsert(), but that seems possible.
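
As a sketch of the shape this could take (the function, flag and their
semantics below are invented for illustration):

    /* Invented sketch: a replication-connection command for planned
     * failover. It blocks further WAL inserts and reports the final
     * write position; the standby is promoted once it has received
     * WAL up to that LSN, and the old master can then follow it. */
    XLogRecPtr
    DisableWalInsertsAndGetEndOfWal(void)
    {
        SpinLockAcquire(&XLogCtl->info_lck);
        XLogCtl->insertsDisabled = true;    /* invented flag, checked by
                                             * XLogInsert() before inserting */
        SpinLockRelease(&XLogCtl->info_lck);

        return GetXLogWriteRecPtr();        /* last WAL position written */
    }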

Greetings,

Andres Freund

--
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: Inconsistent DB data in Streaming Replication

From
Andres Freund
Date:
On 2013-04-12 11:18:01 +0530, Pavan Deolasee wrote:
> On Thu, Apr 11, 2013 at 8:39 PM, Ants Aasma <ants@cybertec.at> wrote:
> 
> > On Thu, Apr 11, 2013 at 5:33 PM, Hannu Krosing <hannu@2ndquadrant.com>
> > wrote:
> > > On 04/11/2013 03:52 PM, Ants Aasma wrote:
> > >>
> > >> On Thu, Apr 11, 2013 at 4:25 PM, Hannu Krosing <hannu@2ndquadrant.com>
> > >> wrote:
> > >>>
> > >>> The proposed fix - halting all writes of data pages to disk and
> > >>> to WAL files while waiting ACK from standby - will tremendously
> > >>> slow down all parallel work on master.
> > >>
> > >> This is not what is being proposed. The proposed fix halts writes of
> > >> only data pages that are modified within the window of WAL that is not
> > >> yet ACKed by the slave. This means pages that were recently modified
> > >> and where the clocksweep or checkpoint has decided to evict them. This
> > >> only affects the checkpointer, bgwriter and backends doing allocation.
> > >> Furthermore, for the backend clocksweep case it would be reasonable to
> > >> just pick another buffer to evict. The slowdown for most actual cases
> > >> will be negligible.
> > >
> > > You also need to hold back all WAL writes, including the ones by
> > > parallel async and locally-synced transactions. Which means that
> > > you have to make all locally-synced transactions wait on the
> > > syncrep transactions committed before them.
> > > After getting the ACK from the slave you then have a backlog of stuff
> > > to write locally, which then also needs to be sent to the slave. Basically
> > > this turns a nice smooth WAL write-and-stream pipeline into a
> > > chunky wait-and-write-and-wait-and-stream-and-wait :P
> > > This may not be a problem under light write loads, which are
> > > probably the most common use case for postgres, but it
> > > will harm top performance and also force people to get much
> > > better (and more expensive) hardware than would otherwise
> > > be needed.
> >
> > Why would you need to hold back WAL writes? WAL is written on the master
> > first and then streamed to the slave as it is done now. You would only need
> > to hold back dirty-page evictions having a recent enough LSN to not yet
> > be replicated. This holding back is already done to wait for local WAL
> > flushes, see bufmgr.c:1976 and bufmgr.c:669. When a page gets dirtied
> > its usage count gets bumped, so it will not be considered for
> > eviction for at least one clocksweep cycle. In normal circumstances
> > that will be enough time to get an ACK from the slave. When WAL is
> > generated at a higher rate than can be replicated this will not be
> > true. In that case backends that need to bring in new pages will have
> > to wait for WAL to be replicated before they can continue. That will
> > hopefully include the backends that are doing the dirtying, throttling
> > the WAL generation rate. This would definitely be optional behavior,
> > not something turned on by default.
> >
> >
> I agree. I don't think the proposed change would cause much of a performance
> bottleneck, since the proposal is to hold back writing of dirty pages until
> the WAL is replicated successfully to the standby. The heap pages are
> mostly written by the background processes, often much later than the WAL
> for the change is written. So in all likelihood, there will be no wait
> involved. Of course, this will not be true for very frequently updated
> pages that must be written at a checkpoint.

I don't think that holds true at all. If you look at pg_stat_bgwriter in
any remotely busy cluster with a hot data set larger than shared_buffers
you'll notice that a large percentage of writes will have been done by
backends themselves.
Yes, we need to improve on this, and we are talking about it right now
in another thread, but until that's solved this argumentation seems to
fall flat on its face.

Greetings,

Andres Freund

--
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: Inconsistent DB data in Streaming Replication

From
Pavan Deolasee
Date:



On Fri, Apr 12, 2013 at 4:29 PM, Andres Freund <andres@2ndquadrant.com> wrote:


I don't think that holds true at all. If you look at pg_stat_bgwriter in
any remotely busy cluster with a hot data set larger than shared_buffers
you'll notice that a large percentage of writes will have been done by
backends themselves.

Even if what you are saying is true, which I am sure it is, the pages that the backend is evicting must not have been recently used according to the LRU algorithm, which means that the WAL pertaining to the last change to the page has in all likelihood already been replicated, unless replication is lagging far behind. Of course, if the standby is not able to keep pace with the master in a realistic manner then we have a problem with the approach.

Thanks,
Pavan

--
Pavan Deolasee
http://www.linkedin.com/in/pavandeolasee

Re: Inconsistent DB data in Streaming Replication

From
Andres Freund
Date:
On 2013-04-12 16:58:44 +0530, Pavan Deolasee wrote:
> On Fri, Apr 12, 2013 at 4:29 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> 
> >
> >
> > I don't think that holds true at all. If you look at pg_stat_bgwriter in
> > any remotely busy cluster with a hot data set larger than shared_buffers
> > you'll notice that a large percentage of writes will have been done by
> > backends themselves.
> >
> 
> Even if what you are saying is true, which I am sure it is, the pages that
> the backend is evicting must not have been recently used according to the
> LRU algorithm, which means that the WAL pertaining to the last change to
> the page has in all likelihood already been replicated, unless replication
> is lagging far behind. Of course, if the standby is not able to keep pace
> with the master in a realistic manner then we have a problem with the
> approach.

It frequently takes time only in the sub-second to few-second range
for usage counts to reach zero.

Greetings,

Andres Freund

--
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: Inconsistent DB data in Streaming Replication

From
Fujii Masao
Date:
On Fri, Apr 12, 2013 at 5:53 PM, Hannu Krosing <hannu@2ndquadrant.com> wrote:
> On 04/11/2013 07:29 PM, Fujii Masao wrote:
>>
>> On Thu, Apr 11, 2013 at 10:25 PM, Hannu Krosing <hannu@2ndquadrant.com>
>> wrote:
>>>
>>> You just shut down the old master and let the standby catch
>>> up (takes a few microseconds ;) ) before you promote it.
>>>
>>> After this you can start up the former master with recovery.conf
>>> and it will follow nicely.
>>
>> No. When you shut down the old master, it might not have been
>> able to send all the WAL records to the standby.
>
> In what cases (other than a standby lagging too much or
> not listening at all) have you observed this?
>
>> I have observed
>> this situation several times. So in your approach, new standby
>> might fail to catch up with the master nicely.
>
> The page http://wiki.postgresql.org/wiki/Streaming_Replication claims this:
>
> * Graceful shutdown
>
> When smart/fast shutdown is requested, the primary waits to exit
> until XLOG records have been sent to the standby, up to the
> shutdown checkpoint record.
>
> Maybe you were requesting an immediate shutdown?

No. I did fast shutdown.

It's true that the master waits for the checkpoint record to be replicated to
the standby when a fast shutdown is performed. But the standby cannot always
successfully receive all the WAL records that the master sent.

To ensure that all WAL records have been replicated to the standby at fast
shutdown, we should make the walsender wait for the standby to write the
checkpoint record and send back the ACK.
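
Schematically, the walsender's shutdown path could wait roughly like
this (illustrative only; checkpointEndPtr is an invented variable for
the end of the shutdown checkpoint record):

    /* Sketch: before closing the replication connection at fast
     * shutdown, wait until the standby reports it has flushed WAL
     * past the shutdown checkpoint record. */
    while (MyWalSnd->flush < checkpointEndPtr)  /* invented variable */
    {
        WaitLatch(&MyWalSnd->latch, WL_LATCH_SET | WL_TIMEOUT, 1000L);
        ResetLatch(&MyWalSnd->latch);
        /* walreceiver replies update MyWalSnd->flush as they arrive */
    }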

Regards,

-- 
Fujii Masao



Re: Inconsistent DB data in Streaming Replication

From
Fujii Masao
Date:
On Fri, Apr 12, 2013 at 7:57 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2013-04-12 02:29:01 +0900, Fujii Masao wrote:
>> On Thu, Apr 11, 2013 at 10:25 PM, Hannu Krosing <hannu@2ndquadrant.com> wrote:
>> >
>> > You just shut down the old master and let the standby catch
>> > up (takes a few microseconds ;) ) before you promote it.
>> >
>> > After this you can start up the former master with recovery.conf
>> > and it will follow nicely.
>>
>> No. When you shut down the old master, it might not have been
>> able to send all the WAL records to the standby. I have observed
>> this situation several times. So in your approach, new standby
>> might fail to catch up with the master nicely.
>
> It seems most of this thread is focusing on the wrong thing then. If we
> really are only talking about planned failover then we need to solve
> *that* not some ominous "don't flush data too early" which has
> noticeable performance and implementation complexity problems.

At least I'd like to talk about not only planned failover but also
unplanned failover.

> I guess you're observing that not everything is replicated because you're
> doing an immediate shutdown

No. I did fast shutdown.

At fast shutdown, after walsender sends the checkpoint record and
closes the replication connection, walreceiver can detect the close
of connection before receiving all WAL records. This means that,
even if walsender sends all WAL records, walreceiver cannot always
receive all of them.

> You could even teach the standby not to increment the timeline in that
> case since that's safe.

I don't think this is required, thanks to Heikki's recent great efforts
on timelines.

Regards,

-- 
Fujii Masao



Re: Inconsistent DB data in Streaming Replication

From
Hannu Krosing
Date:
On 04/14/2013 05:56 PM, Fujii Masao wrote:
> On Fri, Apr 12, 2013 at 7:57 PM, Andres Freund <andres@2ndquadrant.com> wrote:
>> On 2013-04-12 02:29:01 +0900, Fujii Masao wrote:
>>> On Thu, Apr 11, 2013 at 10:25 PM, Hannu Krosing <hannu@2ndquadrant.com> wrote:
>>>> You just shut down the old master and let the standby catch
>>>> up (takes a few microseconds ;) ) before you promote it.
>>>>
>>>> After this you can start up the former master with recovery.conf
>>>> and it will follow nicely.
>>> No. When you shut down the old master, it might not have been
>>> able to send all the WAL records to the standby. I have observed
>>> this situation several times. So in your approach, new standby
>>> might fail to catch up with the master nicely.
>> It seems most of this thread is focusing on the wrong thing then. If we
>> really are only talking about planned failover then we need to solve
>> *that* not some ominous "don't flush data too early" which has
>> noticeable performance and implementation complexity problems.
> At least I'd like to talk about not only planned failover but also normal
> failover.
>
>> I guess you're observing that not everything is replicated because you're
>> doing an immediate shutdown
> No. I did fast shutdown.
>
> At fast shutdown, after walsender sends the checkpoint record and
> closes the replication connection, walreceiver can detect the close
> of connection before receiving all WAL records. This means that,
> even if walsender sends all WAL records, walreceiver cannot always
> receive all of them.
Seems very much like a bug, or at least a missing mode -
synchronous shutdown - where the master will wait for an ACK from the
standby(s) before closing the client connection.
>
>> You could even teach the standby not to increment the timeline in that
>> case since that's safe.
> I don't think this is required, thanks to Heikki's recent great efforts
> on timelines.
>
> Regards,
>




Re: Inconsistent DB data in Streaming Replication

From
Florian Pflug
Date:
On Apr14, 2013, at 17:56 , Fujii Masao <masao.fujii@gmail.com> wrote:
> At fast shutdown, after walsender sends the checkpoint record and
> closes the replication connection, walreceiver can detect the close
> of connection before receiving all WAL records. This means that,
> even if walsender sends all WAL records, walreceiver cannot always
> receive all of them.

That sounds like a bug in walreceiver to me.

The following code in walreceiver's main loop looks suspicious:
  /*
   * Process the received data, and any subsequent data we
   * can read without blocking.
   */
  for (;;)
  {
    if (len > 0)
    {
      /* Something was received from master, so reset timeout */
      ...
      XLogWalRcvProcessMsg(buf[0], &buf[1], len - 1);
    }
    else if (len == 0)
      break;
    else if (len < 0)
    {
      ereport(LOG,
          (errmsg("replication terminated by primary server"),
           errdetail("End of WAL reached on timeline %u at %X/%X",
                 startpointTLI,
                 (uint32) (LogstreamResult.Write >> 32),
                 (uint32) LogstreamResult.Write)));
      ...
    }
    len = walrcv_receive(0, &buf);
  }

  /* Let the master know that we received some data. */
  XLogWalRcvSendReply(false, false);

  /*
   * If we've written some records, flush them to disk and
   * let the startup process and primary server know about
   * them.
   */
  XLogWalRcvFlush(false);
 

The loop at the top looks fine - it specifically avoids throwing
an error on EOF. But the code then proceeds to XLogWalRcvSendReply()
which doesn't seem to have the same smarts - it simply does
  if (PQputCopyData(streamConn, buffer, nbytes) <= 0 ||
      PQflush(streamConn))
      ereport(ERROR,
              (errmsg("could not send data to WAL stream: %s",
                      PQerrorMessage(streamConn))));
 

Unless I'm missing something, that certainly seems to explain
how a standby can lag behind even after a controlled shutdown of
the master.

best regards,
Florian Pflug




Re: Inconsistent DB data in Streaming Replication

From
Amit Kapila
Date:
On Monday, April 15, 2013 1:02 PM Florian Pflug wrote:
> On Apr14, 2013, at 17:56 , Fujii Masao <masao.fujii@gmail.com> wrote:
> > At fast shutdown, after walsender sends the checkpoint record and
> > closes the replication connection, walreceiver can detect the close
> > of connection before receiving all WAL records. This means that,
> > even if walsender sends all WAL records, walreceiver cannot always
> > receive all of them.
> 
> That sounds like a bug in walreceiver to me.
> 
> The following code in walreceiver's main loop looks suspicious:
> 
>   /*
>    * Process the received data, and any subsequent data we
>    * can read without blocking.
>    */
>   for (;;)
>   {
>     if (len > 0)
>     {
>       /* Something was received from master, so reset timeout */
>       ...
>       XLogWalRcvProcessMsg(buf[0], &buf[1], len - 1);
>     }
>     else if (len == 0)
>       break;
>     else if (len < 0)
>     {
>       ereport(LOG,
>           (errmsg("replication terminated by primary server"),
>            errdetail("End of WAL reached on timeline %u at %X/%X",
>                  startpointTLI,
>                  (uint32) (LogstreamResult.Write >> 32),
>                  (uint32) LogstreamResult.Write)));
>       ...
>     }
>     len = walrcv_receive(0, &buf);
>   }
> 
>   /* Let the master know that we received some data. */
>   XLogWalRcvSendReply(false, false);
> 
>   /*
>    * If we've written some records, flush them to disk and
>    * let the startup process and primary server know about
>    * them.
>    */
>   XLogWalRcvFlush(false);
> 
> The loop at the top looks fine - it specifically avoids throwing
> an error on EOF. But the code then proceeds to XLogWalRcvSendReply()
> which doesn't seem to have the same smarts - it simply does
> 
>   if (PQputCopyData(streamConn, buffer, nbytes) <= 0 ||
>       PQflush(streamConn))
>       ereport(ERROR,
>               (errmsg("could not send data to WAL stream: %s",
>                       PQerrorMessage(streamConn))));
> 
> Unless I'm missing something, that certainly seems to explain
> how a standby can lag behind even after a controlled shutdown of
> the master.

Do you mean to say that, since an error has occurred, it would not be able to
flush the received WAL, which could result in loss of WAL?
I think even if an error occurs, it will call flush in WalRcvDie() before
terminating the walreceiver.

With Regards,
Amit Kapila.




Re: Inconsistent DB data in Streaming Replication

From
Florian Pflug
Date:
On Apr17, 2013, at 12:22 , Amit Kapila <amit.kapila@huawei.com> wrote:
> Do you mean to say that, since an error has occurred, it would not be able to
> flush the received WAL, which could result in loss of WAL?
> I think even if an error occurs, it will call flush in WalRcvDie() before
> terminating the walreceiver.

Hm, true, but for that to prevent the problem the inner processing
loop needs to always read up to EOF before it exits and we attempt
to send a reply. Which I don't think it necessarily does. Assume,
that the master sends a chunk of data, waits a bit, and finally
sends the shutdown record and exits. The slave might then receive
the first chunk, and it might trigger sending a reply. At the time
the reply is sent, the master has already sent the shutdown record
and closed the connection, and we'll thus fail to reply and abort.
Since the shutdown record has never been read from the socket,
XLogWalRcvFlush won't flush it, and the slave ends up behind the
master.

Also, since XLogWalRcvProcessMsg responds to keep-alive messages,
we might also error out of the inner processing loop if the server
closes the socket after sending a keepalive but before we attempt
to respond.

Fixing this on the receive side alone seems quite messy and fragile.
So instead, I think we should let the master send a shutdown message
after it has sent everything it wants to send, and wait for the client
to acknowledge it before shutting down the socket.

If the client fails to respond, we could log a fat WARNING.

best regards,
Florian Pflug




Re: Inconsistent DB data in Streaming Replication

From
Amit Kapila
Date:
On Wednesday, April 17, 2013 4:19 PM Florian Pflug wrote:
> On Apr17, 2013, at 12:22 , Amit Kapila <amit.kapila@huawei.com> wrote:
> > Do you mean to say that, since an error has occurred, it would not be
> > able to flush the received WAL, which could result in loss of WAL?
> > I think even if an error occurs, it will call flush in WalRcvDie()
> > before terminating the walreceiver.
> 
> Hm, true, but for that to prevent the problem the inner processing
> loop needs to always read up to EOF before it exits and we attempt
> to send a reply. Which I don't think it necessarily does. Assume,
> that the master sends a chunk of data, waits a bit, and finally
> sends the shutdown record and exits. The slave might then receive
> the first chunk, and it might trigger sending a reply. At the time
> the reply is sent, the master has already sent the shutdown record
> and closed the connection, and we'll thus fail to reply and abort.
> Since the shutdown record has never been read from the socket,
> XLogWalRcvFlush won't flush it, and the slave ends up behind the
> master.
> 
> Also, since XLogWalRcvProcessMsg responds to keep-alive messages,
> we might also error out of the inner processing loop if the server
> closes the socket after sending a keepalive but before we attempt
> to respond.
> 
> Fixing this on the receive side alone seems quite messy and fragile.
> So instead, I think we should let the master send a shutdown message
> after it has sent everything it wants to send, and wait for the client
> to acknowledge it before shutting down the socket.
> 
> If the client fails to respond, we could log a fat WARNING.

Your explanation seems okay, but before discussing the exact solution,
I think it would be better to first reproduce the actual problem and
then discuss this solution.

With Regards,
Amit Kapila.




Re: Inconsistent DB data in Streaming Replication

From
Martijn van Oosterhout
Date:
On Wed, Apr 17, 2013 at 12:49:10PM +0200, Florian Pflug wrote:
> Fixing this on the receive side alone seems quite messy and fragile.
> So instead, I think we should let the master send a shutdown message
> after it has sent everything it wants to send, and wait for the client
> to acknowledge it before shutting down the socket.
>
> If the client fails to respond, we could log a fat WARNING.

ISTM the master should half close the socket, using shutdown(). That
way the client receives an EOF and can still then send its reply to the
master.  Then when the master receives that it can close() completely.
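
In plain socket terms the idea looks like this (a self-contained
sketch, not PostgreSQL code):

    #include <sys/socket.h>
    #include <unistd.h>

    /* Master side of the half-close: stop sending, but keep reading
     * until the standby has sent its final reply and closed its end. */
    static void
    master_half_close(int sock)
    {
        char    buf[8192];

        shutdown(sock, SHUT_WR);    /* standby now sees EOF on its reads */

        /* drain the standby's remaining replies until it closes */
        while (read(sock, buf, sizeof(buf)) > 0)
            ;                       /* process flush-position replies */

        close(sock);                /* fully close after the standby's EOF */
    }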

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> He who writes carelessly confesses thereby at the very outset that he does
> not attach much importance to his own thoughts.  -- Arthur Schopenhauer

Re: Inconsistent DB data in Streaming Replication

From
Florian Pflug
Date:
On Apr19, 2013, at 14:46 , Martijn van Oosterhout <kleptog@svana.org> wrote:
> On Wed, Apr 17, 2013 at 12:49:10PM +0200, Florian Pflug wrote:
>> Fixing this on the receive side alone seems quite messy and fragile.
>> So instead, I think we should let the master send a shutdown message
>> after it has sent everything it wants to send, and wait for the client
>> to acknowledge it before shutting down the socket.
>> 
>> If the client fails to respond, we could log a fat WARNING.
> 
> ISTM the master should half close the socket, using shutdown(). That
> way the client receives an EOF and can still then send its reply to the
> master.  Then when the master receives that it can close() completely.

Hm, there may be arbitrarily many reply requests within the unread
data in the socket's buffer, so waiting for just one reply won't work.
Also, to distinguish a slave that crashes while the master shuts down
from one that has received and flushed all WAL, the slave should flush
all WAL and send a final reply before closing the socket.

So the master would, upon shutting down, close only its writing end
of the connection, and continue to receive replies until it sees EOF.
After all slaves have gone, the master would emit a WARNING for every
slave whose last logged flush position is earlier than the master's
idea of end-of-wal.

The slave would, upon seeing EOF, flush all its WAL, send a final
reply, and close() the socket.
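
On the walreceiver side this would amount to roughly the following
(a sketch reusing the functions quoted earlier in the thread; the
forced final reply is an assumption, not an existing code path):

    /* Sketch of the slave's EOF handling under this protocol: flush
     * everything received, report the final flush position, then
     * close the connection and exit. */
    if (len == 0)                       /* master closed its write end */
    {
        XLogWalRcvFlush(false);         /* flush all received WAL */
        XLogWalRcvSendReply(true, false);   /* forced, final reply */
        walrcv_disconnect();            /* close our end of the socket */
        proc_exit(0);
    }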

I'm not sure that relying on TCP's half-close feature has much benefit
over using a home-grown shutdown message, though. Anyway, the basic
shutdown protocol would be the same regardless of what exactly we use
to signal a shutdown.

BTW, I assume we'd only do this for smart shutdowns. 

best regards,
Florian Pflug




Re: Inconsistent DB data in Streaming Replication

From
Fujii Masao
Date:
On Wed, Apr 17, 2013 at 10:11 PM, Amit Kapila <amit.kapila@huawei.com> wrote:
> On Wednesday, April 17, 2013 4:19 PM Florian Pflug wrote:
>> On Apr17, 2013, at 12:22 , Amit Kapila <amit.kapila@huawei.com> wrote:
>> > Do you mean to say that, since an error has occurred, it would not be
>> > able to flush the received WAL, which could result in loss of WAL?
>> > I think even if an error occurs, it will call flush in WalRcvDie()
>> > before terminating the walreceiver.
>>
>> Hm, true, but for that to prevent the problem the inner processing
>> loop needs to always read up to EOF before it exits and we attempt
>> to send a reply. Which I don't think it necessarily does. Assume,
>> that the master sends a chunk of data, waits a bit, and finally
>> sends the shutdown record and exits. The slave might then receive
>> the first chunk, and it might trigger sending a reply. At the time
>> the reply is sent, the master has already sent the shutdown record
>> and closed the connection, and we'll thus fail to reply and abort.
>> Since the shutdown record has never been read from the socket,
>> XLogWalRcvFlush won't flush it, and the slave ends up behind the
>> master.
>>
>> Also, since XLogWalRcvProcessMsg responds to keep-alive messages,
>> we might also error out of the inner processing loop if the server
>> closes the socket after sending a keepalive but before we attempt
>> to respond.
>>
>> Fixing this on the receive side alone seems quite messy and fragile.
>> So instead, I think we should let the master send a shutdown message
>> after it has sent everything it wants to send, and wait for the client
>> to acknowledge it before shutting down the socket.
>>
>> If the client fails to respond, we could log a fat WARNING.
>
> Your explanation seems okay, but before discussing the exact solution,
> I think it would be better to first reproduce the actual problem and
> then discuss this solution.

I got this problem several times when I enabled WAL archiving and shut down
the master.

Regards,

-- 
Fujii Masao



Re: Inconsistent DB data in Streaming Replication

From
Fujii Masao
Date:
On Wed, Apr 17, 2013 at 7:49 PM, Florian Pflug <fgp@phlo.org> wrote:
> On Apr17, 2013, at 12:22 , Amit Kapila <amit.kapila@huawei.com> wrote:
>> Do you mean to say that, since an error has occurred, it would not be able to
>> flush the received WAL, which could result in loss of WAL?
>> I think even if an error occurs, it will call flush in WalRcvDie() before
>> terminating the walreceiver.
>
> Hm, true, but for that to prevent the problem the inner processing
> loop needs to always read up to EOF before it exits and we attempt
> to send a reply. Which I don't think it necessarily does. Assume,
> that the master sends a chunk of data, waits a bit, and finally
> sends the shutdown record and exits. The slave might then receive
> the first chunk, and it might trigger sending a reply. At the time
> the reply is sent, the master has already sent the shutdown record
> and closed the connection, and we'll thus fail to reply and abort.
> Since the shutdown record has never been read from the socket,
> XLogWalRcvFlush won't flush it, and the slave ends up behind the
> master.
>
> Also, since XLogWalRcvProcessMsg responds to keep-alive messages,
> we might also error out of the inner processing loop if the server
> closes the socket after sending a keepalive but before we attempt
> to respond.
>
> Fixing this on the receive side alone seems quite messy and fragile.
> So instead, I think we should let the master send a shutdown message
> after it has sent everything it wants to send, and wait for the client
> to acknowledge it before shutting down the socket.

Agreed. I've tried to fix this problem on only the walreceiver side, but
that failed. I agree that we should change walsender so that it waits
for the reply from the standby before closing the connection.

Regards,

-- 
Fujii Masao