Thread: New sync commit mode remote_write

New sync commit mode remote_write

From: Magnus Hagander
Date: 19 April 2012, 06:05:25

I admit to not having followed the discussion around the new mode for
synchronous_commit very closely, so my apologies if this has been
discussed and dismissed - I blame failing to find it in the archives
;)

My understanding from looking at the docs is that
synchronous_commit=remote_write will always imply a *local* commit as
well.

Is there any way to set the system up to do a write to the remote,
ensure it's in memory of the remote (remote_write mode, not full sync
to disk), but *not* necessarily to the local disk? Meaning we're ok to
release the transaction when the data is in memory both locally and
remotely but not wait for I/O?

Seems there is a pretty large use case for this, particularly in our
lovely new cloud environments with pathetic I/O performance....

--
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/

Re: New sync commit mode remote_write

From: Robert Haas
Date: 19 April 2012, 06:47:57

On Apr 19, 2012, at 5:05 AM, Magnus Hagander <magnus@hagander.net> wrote:
> I admit to not having followed the discussion around the new mode for
> synchronous_commit very closely, so my apologies if this has been
> discussed and dismissed - I blame failing to find it in the archives
> ;)
>
> My understanding from looking at the docs is that
> synchronous_commit=remote_write will always imply a *local* commit as
> well.
>
> Is there any way to set the system up to do a write to the remote,
> ensure it's in memory of the remote (remote_write mode, not full sync
> to disk), but *not* necessarily to the local disk? Meaning we're ok to
> release the transaction when the data is in memory both locally and
> remotely but not wait for I/O?

If we crash, the slave can end up ahead of the master, and then it's hopelessly corrupted...

Maybe we could engineer around this, but it hasn't been done yet.

...Robert
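
The failure described above can be illustrated with a toy model. This is only a sketch: the `Primary` and `Standby` classes, their fields, and the integer LSN values are made up for illustration and do not correspond to PostgreSQL internals. It assumes a hypothetical mode in which a commit is acknowledged once the standby has the WAL, without waiting for the local flush.

```python
from dataclasses import dataclass


@dataclass
class Primary:
    inserted_lsn: int = 0  # WAL generated so far (in memory)
    flushed_lsn: int = 0   # WAL known to be durable on the local disk

    def crash_and_restart(self) -> int:
        # Anything not flushed is lost; crash recovery ends at flushed_lsn.
        self.inserted_lsn = self.flushed_lsn
        return self.flushed_lsn


@dataclass
class Standby:
    received_lsn: int = 0
    replayed_lsn: int = 0

    def receive_and_replay(self, lsn: int) -> None:
        # Replays everything as soon as it arrives (no gating).
        self.received_lsn = max(self.received_lsn, lsn)
        self.replayed_lsn = self.received_lsn


def standby_is_ahead(primary: Primary, standby: Standby) -> bool:
    return standby.replayed_lsn > primary.flushed_lsn


primary, standby = Primary(), Standby()

# WAL up to 100 is flushed locally and replicated.
primary.inserted_lsn = primary.flushed_lsn = 100
standby.receive_and_replay(100)

# Hypothetical mode: WAL up to 200 reaches the standby, but the
# primary crashes before flushing it to its own disk.
primary.inserted_lsn = 200
standby.receive_and_replay(200)
end_of_recovery = primary.crash_and_restart()

diverged = standby_is_ahead(primary, standby)
```

After the restart the primary ends recovery at 100 while the standby has already replayed up to 200, which is the "slave ahead of the master" situation.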

Re: New sync commit mode remote_write

From: Magnus Hagander
Date: 19 April 2012, 07:44:46

On Thu, Apr 19, 2012 at 12:40, Simon Riggs <simon@2ndquadrant.com> wrote:
> On Thu, Apr 19, 2012 at 10:05 AM, Magnus Hagander <magnus@hagander.net> wrote:
>> I admit to not having followed the discussion around the new mode for
>> synchronous_commit very closely, so my apologies if this has been
>> discussed and dismissed - I blame failing to find it in the archives
>> ;)
>>
>> My understanding from looking at the docs is that
>> synchronous_commit=remote_write will always imply a *local* commit as
>> well.
>>
>> Is there any way to set the system up to do a write to the remote,
>> ensure it's in memory of the remote (remote_write mode, not full sync
>> to disk), but *not* necessarily to the local disk? Meaning we're ok to
>> release the transaction when the data is in memory both locally and
>> remotely but not wait for I/O?
>>
>> Seems there is a pretty large use case for this, particularly in our
>> lovely new cloud environments with pathetic I/O performance....
>
> Yeah, it's on my TODO list.
>
> What we need to do is to send the last written point as part of the
> replication protocol, so the standby can receive it, yet know not to
> apply it yet in case of crash.
>
> I was expecting that to change as a result of efforts to improve
> WALInsertLock, so I didn't want to do something that would be
> immediately invalidated.

Understood. Something to look forward to in 9.3 then :-)

--
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/

Re: New sync commit mode remote_write

From: Simon Riggs
Date: 19 April 2012, 07:46:16

On Thu, Apr 19, 2012 at 10:05 AM, Magnus Hagander <magnus@hagander.net> wrote:
> I admit to not having followed the discussion around the new mode for
> synchronous_commit very closely, so my apologies if this has been
> discussed and dismissed - I blame failing to find it in the archives
> ;)
>
> My understanding from looking at the docs is that
> synchronous_commit=remote_write will always imply a *local* commit as
> well.
>
> Is there any way to set the system up to do a write to the remote,
> ensure it's in memory of the remote (remote_write mode, not full sync
> to disk), but *not* necessarily to the local disk? Meaning we're ok to
> release the transaction when the data is in memory both locally and
> remotely but not wait for I/O?
>
> Seems there is a pretty large use case for this, particularly in our
> lovely new cloud environments with pathetic I/O performance....

Yeah, it's on my TODO list.

What we need to do is to send the last written point as part of the
replication protocol, so the standby can receive it, yet know not to
apply it yet in case of crash.

I was expecting that to change as a result of efforts to improve
WALInsertLock, so I didn't want to do something that would be
immediately invalidated.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
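
One way to picture the proposal above is a standby that tracks two positions: how much WAL it has received, and how much the master reports as flushed locally. The sketch below is an assumption-laden toy, not the replication protocol: `GatedStandby`, `on_data_message`, and the integer LSNs are hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class GatedStandby:
    received_lsn: int = 0         # WAL received from the master
    primary_flushed_lsn: int = 0  # last flush point reported by the master
    replayed_lsn: int = 0         # WAL actually applied
    promoted: bool = False

    def on_data_message(self, end_lsn: int, primary_flushed_lsn: int) -> None:
        # Each (hypothetical) data message carries the master's flush point.
        self.received_lsn = max(self.received_lsn, end_lsn)
        self.primary_flushed_lsn = max(self.primary_flushed_lsn,
                                       primary_flushed_lsn)
        self._replay()

    def promote(self) -> None:
        # Once promoted, the standby may apply everything it holds.
        self.promoted = True
        self._replay()

    def _replay(self) -> None:
        if self.promoted:
            limit = self.received_lsn
        else:
            limit = min(self.received_lsn, self.primary_flushed_lsn)
        self.replayed_lsn = max(self.replayed_lsn, limit)


standby = GatedStandby()

# WAL up to 200 arrives, but the master has only flushed up to 100.
standby.on_data_message(end_lsn=200, primary_flushed_lsn=100)
held_back = standby.replayed_lsn

# A later message reports that the master's flush caught up.
standby.on_data_message(end_lsn=200, primary_flushed_lsn=200)
after_flush = standby.replayed_lsn

# Alternatively, promotion releases the held-back WAL.
other = GatedStandby()
other.on_data_message(end_lsn=200, primary_flushed_lsn=100)
other.promote()
after_promote = other.replayed_lsn
```

In this model the standby never replays past the master's flush point while it is still a standby, so a master crash cannot leave it ahead.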

Re: New sync commit mode remote_write

From: Jeff Janes
Date: 19 April 2012, 13:39:08

On Thu, Apr 19, 2012 at 2:47 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Apr 19, 2012, at 5:05 AM, Magnus Hagander <magnus@hagander.net> wrote:
>> I admit to not having followed the discussion around the new mode for
>> synchronous_commit very closely, so my apologies if this has been
>> discussed and dismissed - I blame failing to find it in the archives
>> ;)
>>
>> My understanding from looking at the docs is that
>> synchronous_commit=remote_write will always imply a *local* commit as
>> well.
>>
>> Is there any way to set the system up to do a write to the remote,
>> ensure it's in memory of the remote (remote_write mode, not full sync
>> to disk), but *not* necessarily to the local disk? Meaning we're ok to
>> release the transaction when the data is in memory both locally and
>> remotely but not wait for I/O?
>
> If we crash, the slave can end up ahead of the master, and then it's hopelessly corrupted...
>
> Maybe we could engineer around this, but it hasn't been done yet.

The workaround would be for the master to refuse to automatically
restart after a crash, insisting on a fail-over instead (or a manual
forcing of recovery)?

Cheers,

Jeff

Re: New sync commit mode remote_write

From: Robert Haas
Date: 19 April 2012, 15:51:20

On 4/19/12, Jeff Janes <jeff.janes@gmail.com> wrote:
> The workaround would be for the master to refuse to automatically
> restart after a crash, insisting on a fail-over instead (or a manual
> forcing of recovery)?

I suppose that would work, but I think Simon's idea is better: don't
let the slave replay the WAL until either (a) it's promoted or (b) the
master finishes the fsync.   That boils down to adding some more
handshaking to the replication protocol, I think.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: New sync commit mode remote_write

From: Simon Riggs
Date: 20 April 2012, 12:20:50

On Thu, Apr 19, 2012 at 7:50 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On 4/19/12, Jeff Janes <jeff.janes@gmail.com> wrote:
>> The workaround would be for the master to refuse to automatically
>> restart after a crash, insisting on a fail-over instead (or a manual
>> forcing of recovery)?
>
> I suppose that would work, but I think Simon's idea is better: don't
> let the slave replay the WAL until either (a) it's promoted or (b) the
> master finishes the fsync.   That boils down to adding some more
> handshaking to the replication protocol, I think.

It would be 8 bytes on every data message sent to the standby.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
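
The 8 bytes correspond to one extra 64-bit WAL position per message. The sketch below shows the arithmetic with Python's `struct` module. The header layout used here (two 64-bit positions plus a 64-bit timestamp) is an assumption made for illustration, not a statement of the actual wire format, and `pack_extended` is a hypothetical helper.

```python
import struct

# Assumed baseline header: data start, WAL end, send time (all 64-bit).
BASE_HEADER = struct.Struct("!QQq")
# Same header with one extra 64-bit field: the master's flush point.
EXTENDED_HEADER = struct.Struct("!QQqQ")


def pack_extended(data_start: int, wal_end: int, send_time: int,
                  flushed_lsn: int, payload: bytes) -> bytes:
    """Build a hypothetical data message carrying the flush point."""
    header = EXTENDED_HEADER.pack(data_start, wal_end, send_time, flushed_lsn)
    return header + payload


overhead = EXTENDED_HEADER.size - BASE_HEADER.size

msg = pack_extended(0x16B3748, 0x16B3800, 0, 0x16B3000, b"wal-bytes")
fields = EXTENDED_HEADER.unpack_from(msg)
```

Network byte order (`!`) uses standard sizes with no padding, so the difference between the two header sizes is exactly the size of the added field.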

Re: New sync commit mode remote_write

From: Fujii Masao
Date: 20 April 2012, 16:59:15

On Sat, Apr 21, 2012 at 12:20 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On Thu, Apr 19, 2012 at 7:50 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On 4/19/12, Jeff Janes <jeff.janes@gmail.com> wrote:
>>> The workaround would be for the master to refuse to automatically
>>> restart after a crash, insisting on a fail-over instead (or a manual
>>> forcing of recovery)?
>>
>> I suppose that would work, but I think Simon's idea is better: don't
>> let the slave replay the WAL until either (a) it's promoted or (b) the
>> master finishes the fsync.   That boils down to adding some more
>> handshaking to the replication protocol, I think.
>
> It would be 8 bytes on every data message sent to the standby.

There seems to be another problem to solve. In the current design of streaming
replication, we cannot send any WAL records before writing them locally,
which would break a mode that makes a transaction wait for the remote
write but not the local one. Should we change walsender so that it can send
WAL records before they are written, e.g., send from wal_buffers?

Regards,

--
Fujii Masao
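
The constraint above can be stated as a small model. This is a sketch under simplifying assumptions: the function names are hypothetical, LSNs are plain integers, and the standby is assumed to acknowledge everything it is sent immediately.

```python
def sendable_upto(insert_lsn: int, write_lsn: int,
                  send_from_buffers: bool = False) -> int:
    """Highest WAL position the sender is allowed to ship.

    Without sending from the in-memory buffers, only WAL that has
    already been written locally can be sent.
    """
    return insert_lsn if send_from_buffers else write_lsn


def commit_can_return(commit_lsn: int, insert_lsn: int, write_lsn: int,
                      send_from_buffers: bool) -> bool:
    """Would a 'remote write only' commit be released in this model?"""
    remote_lsn = sendable_upto(insert_lsn, write_lsn, send_from_buffers)
    return remote_lsn >= commit_lsn


# The commit record ends at 200, but only WAL up to 100 is written locally.
blocked = commit_can_return(200, insert_lsn=200, write_lsn=100,
                            send_from_buffers=False)
released = commit_can_return(200, insert_lsn=200, write_lsn=100,
                             send_from_buffers=True)
```

With sending limited to locally written WAL, the commit still ends up waiting on the local write, which is exactly what the proposed mode was meant to avoid.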

Re: New sync commit mode remote_write

From: Robert Haas
Date: 24 April 2012, 13:00:38

On Fri, Apr 20, 2012 at 3:58 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
> On Sat, Apr 21, 2012 at 12:20 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> On Thu, Apr 19, 2012 at 7:50 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>>> On 4/19/12, Jeff Janes <jeff.janes@gmail.com> wrote:
>>>> The workaround would be for the master to refuse to automatically
>>>> restart after a crash, insisting on a fail-over instead (or a manual
>>>> forcing of recovery)?
>>>
>>> I suppose that would work, but I think Simon's idea is better: don't
>>> let the slave replay the WAL until either (a) it's promoted or (b) the
>>> master finishes the fsync.   That boils down to adding some more
>>> handshaking to the replication protocol, I think.
>>
>> It would be 8 bytes on every data message sent to the standby.
>
> There seems to be another problem to solve. In the current design of streaming
> replication, we cannot send any WAL records before writing them locally,
> which would break a mode that makes a transaction wait for the remote
> write but not the local one. Should we change walsender so that it can send
> WAL records before they are written, e.g., send from wal_buffers?

In theory, writing WAL should be quick, since it's only going out to
the OS cache, and flushing it should be the slow part, since that
involves waiting for the actual disk write to complete.  Some
instrumentation I shoved in here reveals that there actually are some
cases where the write can take a long time, when Linux starts to get
worried about the amount of dirty data in cache and punishes anyone
who tries to write anything, but I'm not sure whether that's common
enough to warrant a change here.

One thing that does seem to be a problem is using WALWriteLock to
cover both the WAL write and the WAL flush.  Suppose that we're
writing WAL very quickly, so that wal_buffers fills up.  We can't
continue writing WAL until some of what's in the buffer has been
*written*, but the WAL writer process will grab WALWriteLock, write
*and flush* a chunk of WAL, and everybody who wants to insert WAL has
to wait for both the write and the flush.  It's probably possible to
do better, here.  Streaming directly from wal_buffers would allow sync
rep to dodge this problem altogether, but it's a general performance
problem as well so it would be nice to have a general solution that
would improve latency and throughput across the board, if such a
solution is possible.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
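
A back-of-the-envelope model of the WALWriteLock point above: if one lock covers both the write and the flush, a backend that only needs buffer space freed by the write still waits for the flush. The function name and all timing numbers below are made up for illustration, not measurements.

```python
def stall_us(chunks: int, write_us: int, flush_us: int,
             lock_covers_flush: bool) -> int:
    """Total microseconds an inserting backend is blocked while
    `chunks` chunks of WAL are written out, in this simplified model."""
    if lock_covers_flush:
        per_chunk = write_us + flush_us  # waits for write and flush
    else:
        per_chunk = write_us             # waits for the write only
    return chunks * per_chunk


# Hypothetical costs: 100 us to write a chunk to the OS cache,
# 5000 us to flush it to disk.
combined = stall_us(10, write_us=100, flush_us=5000, lock_covers_flush=True)
split = stall_us(10, write_us=100, flush_us=5000, lock_covers_flush=False)
```

With these assumed numbers, the inserter's stall is dominated by the flush time whenever the lock covers both steps.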

Re: New sync commit mode remote_write

From: Jeff Janes
Date: 24 April 2012, 13:22:07

On Thu, Apr 19, 2012 at 11:50 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On 4/19/12, Jeff Janes <jeff.janes@gmail.com> wrote:
>> The workaround would be for the master to refuse to automatically
>> restart after a crash, insisting on a fail-over instead (or a manual
>> forcing of recovery)?
>
> I suppose that would work, but I think Simon's idea is better: don't
> let the slave replay the WAL until either (a) it's promoted or (b) the
> master finishes the fsync.   That boils down to adding some more
> handshaking to the replication protocol, I think.

Alternative c) is that the master automatically recovers from a crash,
but doesn't replay that particular WAL record because it doesn't find
it on disk, so the slave has to be instructed to throw it away.  (Or
perhaps the slave could feed the WAL back to the master, so the master
could replay it?)

Cheers,

Jeff
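
The two variants of alternative c) can be contrasted in a toy model. This sketch treats WAL as a plain list of record labels and uses a hypothetical `reconcile` function. It deliberately ignores what discarding records the standby has already replayed would involve.

```python
def reconcile(primary_wal: list, standby_wal: list, feed_back: bool):
    """Return (primary_wal, standby_wal) after reconciliation.

    feed_back=False: the standby throws away WAL the master never flushed.
    feed_back=True:  the standby hands the missing WAL back to the master.
    """
    if standby_wal[:len(primary_wal)] != primary_wal:
        raise ValueError("histories diverged; this sketch only handles "
                         "a standby that is strictly ahead")
    if feed_back:
        return list(standby_wal), list(standby_wal)
    return list(primary_wal), list(primary_wal)


# After the crash the master only finds r1 and r2 on disk,
# while the standby also received r3.
primary_after_crash = ["r1", "r2"]
standby = ["r1", "r2", "r3"]

discarded = reconcile(primary_after_crash, standby, feed_back=False)
fed_back = reconcile(primary_after_crash, standby, feed_back=True)
```

Discarding loses the transaction in r3 on both nodes, while feeding the WAL back preserves it on both.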

Re: New sync commit mode remote_write

From: Robert Haas
Date: 24 April 2012, 17:51:50

On Tue, Apr 24, 2012 at 12:21 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
> On Thu, Apr 19, 2012 at 11:50 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On 4/19/12, Jeff Janes <jeff.janes@gmail.com> wrote:
>>> The workaround would be for the master to refuse to automatically
>>> restart after a crash, insisting on a fail-over instead (or a manual
>>> forcing of recovery)?
>>
>> I suppose that would work, but I think Simon's idea is better: don't
>> let the slave replay the WAL until either (a) it's promoted or (b) the
>> master finishes the fsync.   That boils down to adding some more
>> handshaking to the replication protocol, I think.
>
> Alternative c) is that the master automatically recovers from a crash,
> but doesn't replay that particular WAL record because it doesn't find
> it on disk, so the slave has to be instructed to throw it away.

Right.  Which kind of stinks.

> (Or
> perhaps the slave could feed the WAL back to the master, so the master
> could replay it?)

Yes, that would be a very nice enhancement, I think.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company