Re: Inconsistent DB data in Streaming Replication - Mailing list pgsql-hackers
From | Fujii Masao
---|---
Subject | Re: Inconsistent DB data in Streaming Replication
Date |
Msg-id | CAHGQGwHbQLXmt3Ci0bA_cxG=VvOhSE1HSUSVu_QofZ2fFHb-_Q@mail.gmail.com
In response to | Re: Inconsistent DB data in Streaming Replication (Ants Aasma <ants@cybertec.at>)
List | pgsql-hackers
On Fri, Apr 12, 2013 at 12:09 AM, Ants Aasma <ants@cybertec.at> wrote:
> On Thu, Apr 11, 2013 at 5:33 PM, Hannu Krosing <hannu@2ndquadrant.com> wrote:
>> On 04/11/2013 03:52 PM, Ants Aasma wrote:
>>> On Thu, Apr 11, 2013 at 4:25 PM, Hannu Krosing <hannu@2ndquadrant.com> wrote:
>>>>
>>>> The proposed fix - halting all writes of data pages to disk and
>>>> to WAL files while waiting for an ACK from the standby - will
>>>> tremendously slow down all parallel work on the master.
>>>
>>> This is not what is being proposed. The proposed fix halts writes of
>>> only data pages that are modified within the window of WAL that is not
>>> yet ACKed by the slave. This means pages that were recently modified
>>> and where the clocksweep or checkpoint has decided to evict them. This
>>> only affects the checkpointer, bgwriter and backends doing allocation.
>>> Furthermore, for the backend clocksweep case it would be reasonable to
>>> just pick another buffer to evict. The slowdown for most actual cases
>>> will be negligible.
>>
>> You also need to hold back all WAL writes, including the ones by
>> parallel async and locally-synced transactions. Which means that
>> you have to make all locally synced transactions wait on the
>> syncrep transactions committed before them.
>>
>> After getting the ACK from the slave you then have a backlog of stuff
>> to write locally, which then also needs to be sent to the slave.
>> Basically this turns a nice smooth WAL write-and-stream pipeline into
>> a chunky wait-and-write-and-wait-and-stream-and-wait :P
>>
>> This may not be a problem in light write load cases, which is
>> probably the most common usecase for postgres, but it will harm top
>> performance and also force people to get much better (and more
>> expensive) hardware than would otherwise be needed.
>
> Why would you need to hold back WAL writes? WAL is written on the master
> first and then streamed to the slave as it is done now. You would only
> need to hold back dirty page evictions having a recent enough LSN to not
> yet be replicated. This holding back is already done to wait for local
> WAL flushes, see bufmgr.c:1976 and bufmgr.c:669. When a page gets dirtied
> its usage count gets bumped, so it will not be considered for eviction
> for at least one clocksweep cycle. In normal circumstances that will be
> enough time to get an ACK from the slave. When WAL is generated at a
> higher rate than can be replicated this will not be true. In that case
> backends that need to bring in new pages will have to wait for WAL to be
> replicated before they can continue. That will hopefully include the
> backends that are doing the dirtying, throttling the WAL generation
> rate. This would definitely be optional behavior, not something turned
> on by default.
>
>>>> And it does just turn around the "master is ahead of slave" problem
>>>> into a "slave is ahead of master" problem :)
>>>
>>> The issue is not being ahead or behind. The issue is ensuring WAL
>>> durability in the face of failovers before modifying data pages. This
>>> is sufficient to guarantee no forks in the WAL stream from the point
>>> of view of data files and with that the capability to always recover
>>> by replaying WAL.
>>
>> How would this handle the case Tom pointed out, namely a short
>> power recycling on the master?
>>
>> Instead of just continuing after booting up again, the master now
>> has to figure out if it had any slaves and then try to query them
>> (for how long?) to see whether they had replayed any WAL the master
>> does not know of.
>
> If the master is restarted and there is no failover to the slave, then
> nothing strange would happen: the master does recovery, comes up and
> starts streaming to the slave again. If there is a failover, then
> whatever is managing the failover needs to ensure that the master does
> not come up again on its own before it is reconfigured as a slave.
> This is what HA cluster managers do.
>
>> Suddenly the mere existence of streaming replica slaves has become
>> a problem for the master!
>>
>> This will especially complicate the case of multiple slaves, each
>> having received WAL up to a slightly different LSN. And you do want
>> to have at least 2 slaves if you want both durability and availability
>> with syncrep.
>>
>> What if one of the slaves disconnects? How should the master react to
>> this?
>
> Again, WAL replication will be the same as it is now. Availability
> considerations, including what to do when slaves go away, are the same
> as for current sync replication. The only required change is that we
> can configure the master to hold off on writing any data pages that
> contain changes that might go missing in the case of a failover.
>
> Whether the additional complexity is worth the feature is a matter of
> opinion. As we have no patch yet I can't say that I know what all the
> implications are, but at first glance the complexity seems rather
> compartmentalized. This would only amend what the concept of a WAL
> flush considers safely flushed.

I really share the same view as you!

Regards,

--
Fujii Masao
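For readers following the thread: the proposal above gates dirty-page writes on the standby's acknowledged LSN, in addition to the existing gate on the local WAL flush that bufmgr.c already enforces. The standalone sketch below illustrates that rule only; it is not PostgreSQL code from any patch, and the types and the `replicated_upto` variable are hypothetical stand-ins for the local flush pointer and the sync-rep ACK pointer.

```c
/*
 * Minimal sketch (not PostgreSQL source) of the proposed eviction rule:
 * a dirty buffer may be written out only if the WAL covering its page
 * LSN is both flushed locally and acknowledged by the standby.
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef uint64_t LSN;

typedef struct
{
    bool dirty;
    LSN  page_lsn;      /* LSN of the last WAL record that touched the page */
} Buffer;

/* Hypothetical: highest LSN the synchronous standby has acknowledged. */
static LSN replicated_upto = 0;

/* Existing rule: local WAL covering the page must already be flushed. */
static bool
wal_locally_flushed(LSN flushed_upto, const Buffer *buf)
{
    return flushed_upto >= buf->page_lsn;
}

/*
 * Proposed additional rule: the same WAL must also be ACKed by the
 * standby before the page can be written to disk.
 */
static bool
safe_to_write_buffer(LSN flushed_upto, const Buffer *buf)
{
    return wal_locally_flushed(flushed_upto, buf)
        && replicated_upto >= buf->page_lsn;
}

int
main(void)
{
    Buffer buf = { .dirty = true, .page_lsn = 1200 };
    LSN    flushed_upto = 1500;   /* local WAL flushed this far */

    replicated_upto = 1000;       /* standby has ACKed only up to here */
    printf("evict now? %s\n",
           safe_to_write_buffer(flushed_upto, &buf)
               ? "yes" : "no - wait, or pick another victim");

    replicated_upto = 1300;       /* ACK arrives */
    printf("evict now? %s\n",
           safe_to_write_buffer(flushed_upto, &buf) ? "yes" : "no");
    return 0;
}
```

Under such a rule a backend doing a clocksweep could simply skip a buffer that fails the check and pick another victim, while the checkpointer would have to wait for the ACK pointer to advance, which is where the throttling effect described in the thread would come from.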