Re: New sync commit mode remote_write - Mailing list pgsql-hackers

From Robert Haas
Subject Re: New sync commit mode remote_write
Msg-id CA+TgmobS0R0c6236nJXJMCrisCqZEHHhq2-S+G22tHPi6TjBvQ@mail.gmail.com
In response to Re: New sync commit mode remote_write  (Fujii Masao <masao.fujii@gmail.com>)
List pgsql-hackers
On Fri, Apr 20, 2012 at 3:58 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
> On Sat, Apr 21, 2012 at 12:20 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> On Thu, Apr 19, 2012 at 7:50 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>>> On 4/19/12, Jeff Janes <jeff.janes@gmail.com> wrote:
>>>> The work around would be for the master to refuse to automatically
>>>> restart after a crash, insisting on a fail-over instead (or a manual
>>>> forcing of recovery)?
>>>
>>> I suppose that would work, but I think Simon's idea is better: don't
>>> let the slave replay the WAL until either (a) it's promoted or (b) the
>>> master finishes the fsync.   That boils down to adding some more
>>> handshaking to the replication protocol, I think.
>>
>> It would be 8 bytes on every data message sent to the standby.
>
> There seems to be another problem to solve. In current design of streaming
> replication, we cannot send any WAL records before writing them locally.
> Which would mess up the mode which makes a transaction wait for remote
> write but not local one. We should change walsender so that it can send
> WAL records before they are written, e.g., send from wal_buffers?

In theory, writing WAL should be quick, since it's only going out to
the OS cache, and flushing it should be the slow part, since that
involves waiting for the actual disk write to complete.  Some
instrumentation I shoved in here reveals that there actually are some
cases where the write can take a long time, when Linux starts to get
worried about the amount of dirty data in cache and punishes anyone
who tries to write anything, but I'm not sure whether that's common
enough to warrant a change here.
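
For anyone who wants to see the write/flush split on their own machine, here is a minimal sketch of that kind of measurement. It is not the instrumentation referred to above and it is not PostgreSQL code; the helper name time_write_and_flush and the 8192-byte payload are assumptions made for illustration. It times a plain write(), which normally only reaches the OS cache, separately from the fsync() that follows.

```python
import os
import tempfile
import time
from typing import Tuple


def time_write_and_flush(payload: bytes) -> Tuple[float, float]:
    """Return (write_seconds, flush_seconds) for one payload.

    Hypothetical helper for illustration only.
    """
    fd, path = tempfile.mkstemp()
    try:
        start = time.perf_counter()
        os.write(fd, payload)      # normally lands in the OS page cache
        written = time.perf_counter()
        os.fsync(fd)               # waits until the data is on stable storage
        flushed = time.perf_counter()
    finally:
        os.close(fd)
        os.unlink(path)
    return written - start, flushed - written


if __name__ == "__main__":
    write_s, flush_s = time_write_and_flush(b"x" * 8192)
    print(f"write: {write_s * 1e6:.1f} us, flush: {flush_s * 1e6:.1f} us")
```

On an idle system the write is usually much quicker than the flush. The slow-write case described above would only show up under heavy dirty-page pressure, which this sketch does not try to reproduce.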

One thing that does seem to be a problem is using WALWriteLock to
cover both the WAL write and the WAL flush.  Suppose that we're
writing WAL very quickly, so that wal_buffers fills up.  We can't
continue writing WAL until some of what's in the buffer has been
*written*, but the WAL writer process will grab WALWriteLock, write
*and flush* a chunk of WAL, and everybody who wants to insert WAL has
to wait for both the write and the flush.  It's probably possible to
do better here.  Streaming directly from wal_buffers would allow sync
rep to dodge this problem altogether.  But it's a general performance
problem as well, so it would be nice to have a general solution that
improves latency and throughput across the board, if such a solution
is possible.
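
To make the cost of the combined lock concrete, here is a toy model, offered as a sketch under stated assumptions rather than a description of the actual locking code. It assumes a single WAL writer that handles chunks back to back, each needing a write followed by a flush, and it assumes buffer space becomes reusable either when the combined lock is released or, in the hypothetical split case, as soon as the write is done. The function name buffer_free_times and the timings are made up.

```python
from typing import List


def buffer_free_times(chunks: int, write_s: float, flush_s: float,
                      single_lock: bool) -> List[float]:
    """Times at which each chunk's buffer space becomes reusable.

    Toy model, not PostgreSQL code. With single_lock=True, waiters are
    released only after both write and flush finish. With single_lock=False
    (a hypothetical split), they are released once the write is done.
    """
    times = []
    clock = 0.0
    for _ in range(chunks):
        clock += write_s
        if not single_lock:
            times.append(clock)    # space reusable once written
        clock += flush_s
        if single_lock:
            times.append(clock)    # space reusable only after the flush
    return times


if __name__ == "__main__":
    # Made-up timings: 1 unit to write a chunk, 9 units to flush it.
    print(buffer_free_times(2, 1.0, 9.0, single_lock=True))
    print(buffer_free_times(2, 1.0, 9.0, single_lock=False))
```

With those made-up timings, the first blocked inserter gets going after 10 units under the combined lock but after 1 unit if the write alone were enough to release it. The model ignores contention among inserters and any overlap between writing and flushing.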

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

