Re: Inconsistent DB data in Streaming Replication - Mailing list pgsql-hackers

From Fujii Masao
Subject Re: Inconsistent DB data in Streaming Replication
Date
Msg-id CAHGQGwHbQLXmt3Ci0bA_cxG=VvOhSE1HSUSVu_QofZ2fFHb-_Q@mail.gmail.com
In response to Re: Inconsistent DB data in Streaming Replication  (Ants Aasma <ants@cybertec.at>)
List pgsql-hackers
On Fri, Apr 12, 2013 at 12:09 AM, Ants Aasma <ants@cybertec.at> wrote:
> On Thu, Apr 11, 2013 at 5:33 PM, Hannu Krosing <hannu@2ndquadrant.com> wrote:
>> On 04/11/2013 03:52 PM, Ants Aasma wrote:
>>>
>>> On Thu, Apr 11, 2013 at 4:25 PM, Hannu Krosing <hannu@2ndquadrant.com>
>>> wrote:
>>>>
>>>> The proposed fix - halting all writes of data pages to disk and
>>>> to WAL files while waiting for an ACK from the standby - will tremendously
>>>> slow down all parallel work on the master.
>>>
>>> This is not what is being proposed. The proposed fix halts writes of
>>> only those data pages that were modified within the window of WAL not
>>> yet ACKed by the slave. This means pages that were recently modified
>>> and that the clocksweep or checkpoint has decided to evict. This
>>> only affects the checkpointer, bgwriter and backends doing allocation.
>>> Furthermore, for the backend clocksweep case it would be reasonable to
>>> just pick another buffer to evict. The slowdown for most actual cases
>>> will be negligible.
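
To make the clocksweep point concrete, here is a rough, untested sketch of
the kind of check a backend's victim selection could apply. BufferGetLSN()
exists in the bufmgr internals, but GetStandbyAckLSN() and the
hold_unacked_page_writes GUC are invented names for illustration only:

/*
 * Sketch only: skip eviction candidates whose newest WAL has not yet
 * been acknowledged by the synchronous standby, and let the clocksweep
 * pick another victim instead.
 */
static bool
BufferSafeToEvict(BufferDesc *buf)
{
    XLogRecPtr  page_lsn = BufferGetLSN(buf);    /* newest WAL touching this page */
    XLogRecPtr  acked    = GetStandbyAckLSN();   /* hypothetical: last LSN ACKed by the standby */

    if (!hold_unacked_page_writes)               /* hypothetical GUC */
        return true;                             /* feature off: behave exactly as today */

    return page_lsn <= acked;                    /* only evict if the standby already has this WAL */
}
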
>>
>> You also need to hold back all WAL writes, including the ones by
>> parallel async and locally-synced transactions. Which means that
>> you have to make all locally synced transactions wait on the
>> syncrep transactions committed before them.
>> After getting the ACK from slave you then have a backlog of stuff
>> to write locally, which then also needs to be sent to slave. Basically
>> this turns a nice smooth WAL write-and-stream pipeline into a
>> chunky wait-and-write-and-wait-and-stream-and-wait :P
>> This may not be a problem under light write loads, which are
>> probably the most common use case for postgres, but it
>> will harm top performance and also force people to get much
>> better (and more expensive) hardware than would otherwise
>> be needed.
>
> Why would you need to hold back WAL writes? WAL is written on master
> first and then streamed to slave as it is done now. You would only need
> to hold back dirty page evictions whose LSN is recent enough not to have
> been replicated yet. This holding back is already done to wait for local WAL
> flushes, see bufmgr.c:1976 and bufmgr.c:669. When a page gets dirtied
> its usage count gets bumped, so it will not be considered for
> eviction for at least one clocksweep cycle. In normal circumstances
> that will be enough time to get an ACK from the slave. When WAL is
> generated at a higher rate than can be replicated this will not be
> true. In that case backends that need to bring in new pages will have
> to wait for WAL to be replicated before they can continue. That will
> hopefully include the backends that are doing the dirtying, throttling
> the WAL generation rate. This would definitely be optional behavior,
> not something turned on by default.
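
The bufmgr.c behavior referenced above (waiting for the local WAL flush
before writing a dirty page) suggests where the extra wait would go. A
minimal sketch, assuming a hypothetical XLogWaitForStandbyAck() and the same
invented hold_unacked_page_writes GUC; XLogFlush() is the existing call:

/*
 * Sketch only, modelled on FlushBuffer() in bufmgr.c: before a dirty page
 * goes to disk, wait for the standby's ACK in addition to the local flush.
 */
static void
FlushBufferWithReplicationBarrier(BufferDesc *buf)
{
    XLogRecPtr  recptr = BufferGetLSN(buf);

    XLogFlush(recptr);                   /* existing rule: WAL must be on local disk first */

    if (hold_unacked_page_writes)        /* hypothetical GUC, off by default */
        XLogWaitForStandbyAck(recptr);   /* hypothetical: block until the standby ACKs recptr */

    /* ... then write the page out via smgrwrite(), as FlushBuffer() does today ... */
}
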
>
>>>
>>>> And it just turns the "master is ahead of slave" problem
>>>> into a "slave is ahead of master" problem :)
>>>
>>> The issue is not being ahead or behind. The issue is ensuring WAL
>>> durability in the face of failovers before modifying data pages. This
>>> is sufficient to guarantee no forks in the WAL stream from the point
>>> of view of data files and with that the capability to always recover
>>> by replaying WAL.
>>
>> How would this handle the case Tom pointed out, namely a short
>> power recycling on the master?
>>
>> Instead of just continuing after booting up again, the master now
>> has to figure out if it had any slaves and then try to query them
>> (for how long?) about whether they have replayed any WAL the master does
>> not know of.
>
> If the master is restarted and there is no failover to the slave, then
> nothing strange would happen, master does recovery, comes up and
> starts streaming to the slave again. If there is a failover, then
> whatever is managing the failover needs to ensure that the master does
> not come up again on its own before it is reconfigured as a slave.
> This is what HA cluster managers do.
>
>> Suddenly the mere existence of streaming replica slaves has become
>> a problem for the master!
>>
>> This will especially complicate the case of multiple slaves, each
>> having received WAL up to a slightly different LSN. And you do want
>> to have at least 2 slaves if you want both durability
>> and availability with syncrep.
>>
>> What if one of the slaves disconnects? How should the master react to this?
>
> Again, WAL replication will be the same as it is now. Availability
> considerations, including what to do when slaves go away, are the same
> as for current sync replication. The only required change is that we can
> configure the master to hold out on writing any data pages that
> contain changes that might go missing in the case of a failover.
>
> Whether the additional complexity is worth the feature is a matter of
> opinion. As we have no patch yet I can't say that I know what all the
> implications are, but at first glance the complexity seems rather
> compartmentalized. This would only amend what the concept of a WAL
> flush considers safely flushed.

I share your view!
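
Redefining "safely flushed" along the lines of your last paragraph could be
as small as the sketch below. GetFlushRecPtr() exists today, while
GetStandbyAckLSN() and hold_unacked_page_writes remain invented names used
only to illustrate the idea:

/*
 * Sketch only: "safely flushed" becomes the minimum of the local flush
 * pointer and the synchronous standby's acknowledged pointer.
 */
static XLogRecPtr
GetSafeFlushPtr(void)
{
    XLogRecPtr  local_flush = GetFlushRecPtr();     /* existing: local WAL flush pointer */
    XLogRecPtr  standby_ack = GetStandbyAckLSN();   /* hypothetical */

    if (!hold_unacked_page_writes)                  /* hypothetical GUC, off by default */
        return local_flush;

    return Min(local_flush, standby_ack);
}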

Regards,

-- 
Fujii Masao


