Re: Inconsistent DB data in Streaming Replication - Mailing list pgsql-hackers

From Pavan Deolasee
Subject Re: Inconsistent DB data in Streaming Replication
Date
Msg-id CABOikdPvCfbdkd+jexwgqUMyKO=aquXkTy=b2pJuEiyjwY-gxw@mail.gmail.com
In response to Re: Inconsistent DB data in Streaming Replication  (Ants Aasma <ants@cybertec.at>)
Responses Re: Inconsistent DB data in Streaming Replication
List pgsql-hackers



On Thu, Apr 11, 2013 at 8:39 PM, Ants Aasma <ants@cybertec.at> wrote:
On Thu, Apr 11, 2013 at 5:33 PM, Hannu Krosing <hannu@2ndquadrant.com> wrote:
> On 04/11/2013 03:52 PM, Ants Aasma wrote:
>>
>> On Thu, Apr 11, 2013 at 4:25 PM, Hannu Krosing <hannu@2ndquadrant.com>
>> wrote:
>>>
>>> The proposed fix - halting all writes of data pages to disk and
>>> to WAL files while waiting for an ACK from the standby - will
>>> tremendously slow down all parallel work on the master.
>>
>> This is not what is being proposed. The proposed fix halts writes
>> of only those data pages that are modified within the window of WAL
>> that is not yet ACKed by the slave. This means pages that were
>> recently modified and that the clocksweep or checkpoint has decided
>> to evict. This only affects the checkpointer, bgwriter and backends
>> doing allocation. Furthermore, for the backend clocksweep case it
>> would be reasonable to just pick another buffer to evict. The
>> slowdown for most actual cases will be negligible.
>
> You also need to hold back all WAL writes, including the ones by
> parallel async and locally-synced transactions. Which means that
> you have to make all locally synced transactions wait on the
> syncrep transactions committed before them.
> After getting the ACK from the slave you then have a backlog of
> stuff to write locally, which then also needs to be sent to the
> slave. Basically this turns a nice smooth WAL write-and-stream
> pipeline into a chunky wait-and-write-and-wait-and-stream-and-wait :P
> This may not be a problem under a light write load, which is
> probably the most common use case for postgres, but it will harm
> top performance and also force people to get much better (and more
> expensive) hardware than would otherwise be needed.

Why would you need to hold back WAL writes? WAL is written on the master
first and then streamed to the slave as it is done now. You would only need
to hold back dirty page evictions having a recent enough LSN to not yet
be replicated. This holding back is already done to wait for local WAL
flushes, see bufmgr.c:1976 and bufmgr.c:669. When a page gets dirtied
its usage count gets bumped, so it will not be considered for
eviction for at least one clocksweep cycle. In normal circumstances
that will be enough time to get an ACK from the slave. When WAL is
generated at a higher rate than can be replicated this will not be
true. In that case backends that need to bring in new pages will have
to wait for WAL to be replicated before they can continue. That will
hopefully include the backends that are doing the dirtying, throttling
the WAL generation rate. This would definitely be optional behavior,
not something turned on by default.


I agree. I don't think the proposed change would cause much of a performance bottleneck, since the proposal is to hold back writing of dirty pages only until the WAL is replicated successfully to the standby. Heap pages are mostly written by the background processes, often much later than the WAL for the change is written, so in all likelihood there will be no wait involved. Of course, this will not be true for very frequently updated pages that must be written at a checkpoint.
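
To illustrate what that check might look like, here is a rough standalone sketch (plain C, not actual bufmgr.c code; SyncStandbyAckLsn and can_write_dirty_buffer are invented names for illustration): a dirty buffer may be written only once the standby has acknowledged WAL up to the buffer's page LSN, much like the existing rule that local WAL must be flushed up to that LSN first.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef uint64_t XLogRecPtr;        /* WAL position, as in the backend */

/* Hypothetical: last WAL position ACKed by the synchronous standby. */
static XLogRecPtr SyncStandbyAckLsn = 0;

/*
 * May this dirty buffer be written out?  Analogous to the existing
 * requirement that the page LSN be flushed to local WAL before the
 * buffer is written (bufmgr.c), extended to the standby's ACK position.
 */
static bool
can_write_dirty_buffer(XLogRecPtr page_lsn)
{
    return page_lsn <= SyncStandbyAckLsn;
}

int
main(void)
{
    SyncStandbyAckLsn = 1000;

    /* Page last touched at WAL position 900: safe to write/evict. */
    printf("page @900:  %s\n",
           can_write_dirty_buffer(900) ? "write" : "wait, or pick another victim");

    /* Page touched at WAL position 1200: hold it back for now. */
    printf("page @1200: %s\n",
           can_write_dirty_buffer(1200) ? "write" : "wait, or pick another victim");
    return 0;
}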

But I wonder if the problem is really limited to heap pages? Even for something like a CLOG page, we will need to ensure that the WAL records are replicated before the page is written to disk. The same is true for relation truncation. In fact, every place where the master needs to call XLogFlush() probably needs to be examined to decide whether the subsequent action could leave the database corrupt, and to ensure that the WAL is replicated before proceeding with the change.
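
As a sketch of what such call sites might do (again only an illustration with made-up names, not an actual patch): instead of a bare XLogFlush(), also wait until the standby has acknowledged the same LSN before performing the on-disk change.

#include <stdint.h>
#include <stdio.h>

typedef uint64_t XLogRecPtr;

/* Stub stand-ins for the backend routines involved (names hypothetical). */
static void
LocalXLogFlush(XLogRecPtr lsn)          /* flush local WAL up to lsn */
{
    printf("local WAL flushed to %llu\n", (unsigned long long) lsn);
}

static void
WaitForStandbyAck(XLogRecPtr lsn)       /* block until the standby ACKs lsn */
{
    printf("standby has ACKed %llu\n", (unsigned long long) lsn);
}

/*
 * What a CLOG page write or a relation truncation could call instead of a
 * bare XLogFlush(): a durable local flush alone is not enough, the WAL
 * must also have reached the standby before the change hits disk.
 */
static void
XLogFlushAndWaitForStandby(XLogRecPtr lsn)
{
    LocalXLogFlush(lsn);
    WaitForStandbyAck(lsn);
}

int
main(void)
{
    XLogFlushAndWaitForStandby(2048);   /* e.g. just before writing a CLOG page */
    return 0;
}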

Tom has a very valid concern about the additional code complexity, though I disagree that it's always a good idea to start with a fresh rsync. If we can avoid that with the right checks, I don't see why we should not reduce the downtime for the master. It's very likely that the standby is not as good a server as the master, and the user would want to switch back to the master quickly for performance reasons. To reduce complexity, can we do this as some sort of plugin hook for XLogFlush(), which gets to know that WAL has been flushed up to the given LSN and the event that caused the function to be called? We can then leave the handling of the event to the implementer. This will also avoid any penalty for those who are happy with the current mechanism and do not want complex HA setups.
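
For what it's worth, a rough sketch of how such a hook could look (everything here is invented for illustration, not an existing API): XLogFlush() would report the flushed LSN and the event that triggered it to a hook, and an HA extension could decide there whether to block until the standby has caught up.

#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

typedef uint64_t XLogRecPtr;

/* Hypothetical event codes describing why XLogFlush() was called. */
typedef enum XLogFlushEvent
{
    XLOG_FLUSH_COMMIT,          /* transaction commit record */
    XLOG_FLUSH_BUFFER_WRITE,    /* dirty data page about to be written */
    XLOG_FLUSH_SLRU_WRITE,      /* CLOG/SLRU page about to be written */
    XLOG_FLUSH_TRUNCATE         /* relation truncation */
} XLogFlushEvent;

/* Hook signature: invoked after WAL has been flushed locally up to lsn. */
typedef void (*xlog_flush_hook_type) (XLogRecPtr lsn, XLogFlushEvent event);

static xlog_flush_hook_type xlog_flush_hook = NULL;

/* What the tail of XLogFlush() might do if such a hook existed. */
static void
CallXLogFlushHook(XLogRecPtr lsn, XLogFlushEvent event)
{
    if (xlog_flush_hook != NULL)
        xlog_flush_hook(lsn, event);
}

/*
 * An HA extension could install something like this: commits already wait
 * for the standby under synchronous replication, so only block for events
 * that could otherwise leave the master's disk ahead of the standby.
 */
static void
ha_flush_hook(XLogRecPtr lsn, XLogFlushEvent event)
{
    if (event != XLOG_FLUSH_COMMIT)
        printf("would wait for standby ACK of %llu (event %d)\n",
               (unsigned long long) lsn, (int) event);
}

int
main(void)
{
    xlog_flush_hook = ha_flush_hook;
    CallXLogFlushHook(4096, XLOG_FLUSH_BUFFER_WRITE);
    return 0;
}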

Thanks,
Pavan
