Thread: Proposal for 9.1: WAL streaming from WAL buffers
Hi,

In 9.0, walsender reads WAL always from the disk and sends it to the standby. That is, we cannot send WAL until it has been written (and flushed) to the disk. This degrades the performance of synchronous replication very much since a transaction commit must wait for the WAL write time *plus* the replication time.

The attached patch enables walsender to read data from WAL buffers in addition to the disk. Since we can write and send WAL simultaneously, in synchronous replication, a transaction commit has only to wait for either of them. So the performance would significantly increase.

Now three hackers (Zoltan, Simon and me) are planning to develop the synchronous replication feature. I'm not sure whose patch will be committed in the end. But since the attached patch provides just an infrastructure to optimize SR, it would work fine together with any of them and have a good effect.

I'll add the patch to the next CF. AFAIK the ReviewFest will start Jun 15. During that, if you are interested in the patch, please feel free to review it. You can also get the code change from my git repository:

    git://git.postgresql.org/git/users/fujii/postgres.git
    branch: read-wal-buffers

From here on I describe the details of the change.

At first, walsender reads WAL from the disk. If it has reached the current write location (i.e., there is no unsent WAL on the disk), it attempts to read from the WAL buffers. This buffer reading continues until the WAL to send has been purged from the WAL buffers. IOW, if the WAL buffers are large enough and walsender has been keeping up with WAL insertion, it can read WAL from the buffers indefinitely. If the WAL to send has been purged from the buffers, walsender backs off and tries to read it from the disk. If it finds no WAL to send on the disk, walsender attempts to read WAL from the buffers again. Walsender repeats these operations.

The location of the oldest record in the buffers is saved in shared memory. This location is used to determine whether a particular piece of WAL is still in the buffers or not. To avoid lock contention, walsender reads the WAL buffers and XLogCtl->xlblocks without holding either WALInsertLock or WALWriteLock. Of course, they might be changed by buffer replacement while being read. So after reading them, we check that what we read was valid by comparing the location of the read WAL with the location of the oldest record in the buffers. This logic is similar to what XLogRead() does at the end.

This feature is required to prevent the performance of synchronous replication from dropping significantly. It can also cut the time that a transaction committed on the master takes to become visible on the standby, so it's useful for asynchronous replication as well.

Thought? Comment? Objection?

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
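For readers who want the shape of the read-then-validate idea without opening the patch, here is a minimal sketch against 9.0-era types. The two helper functions and the exact validation rule are assumptions for illustration only, not the patch's actual API.

/*
 * Minimal sketch of "copy from WAL buffers without locks, then validate".
 * XLogRecPtr and XLByteLE are spelled out so the fragment is self-contained;
 * in the real tree they come from access/xlogdefs.h.  The two extern helpers
 * are illustrative assumptions.
 */
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

typedef struct XLogRecPtr
{
    uint32_t    xlogid;         /* log file #, 9.0-era two-part pointer */
    uint32_t    xrecoff;        /* byte offset within the log file */
} XLogRecPtr;

#define XLByteLE(a, b) \
    ((a).xlogid < (b).xlogid || \
     ((a).xlogid == (b).xlogid && (a).xrecoff <= (b).xrecoff))

extern const char *wal_buffer_address(XLogRecPtr ptr);        /* assumed */
extern XLogRecPtr read_shared_oldest_buffered_ptr(void);      /* assumed */

/*
 * Copy nbytes of WAL starting at startptr out of the shared WAL buffers
 * without holding WALInsertLock or WALWriteLock.  Returns false if the
 * range may have been recycled while we copied, in which case the caller
 * falls back to reading the WAL file on disk.
 */
static bool
read_from_wal_buffers(XLogRecPtr startptr, char *dst, size_t nbytes)
{
    XLogRecPtr  oldest;

    memcpy(dst, wal_buffer_address(startptr), nbytes);

    /*
     * Re-check the oldest-buffered-record pointer *after* the copy.  It only
     * ever advances, so if it is still at or before startptr, none of the
     * buffers we copied from can have been replaced underneath us.
     */
    oldest = read_shared_oldest_buffered_ptr();
    return XLByteLE(oldest, startptr);
}

The point of the post-copy check is that no lock is needed on the hot path: a stale copy is simply detected and retried from disk, as the proposal describes.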
On Fri, Jun 11, 2010 at 9:14 AM, Fujii Masao <masao.fujii@gmail.com> wrote: > Thought? Comment? Objection? What happens if the WAL is streamed to the standby and then the master crashes without writing that WAL to disk? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
On Fri, Jun 11, 2010 at 10:22 PM, Robert Haas <robertmhaas@gmail.com> wrote: > On Fri, Jun 11, 2010 at 9:14 AM, Fujii Masao <masao.fujii@gmail.com> wrote: >> Thought? Comment? Objection? > > What happens if the WAL is streamed to the standby and then the master > crashes without writing that WAL to disk? What are you concerned about? I think that the situation would be the same as 9.0 from users' perspective. After failover, the transaction which a client regards as aborted (because of the crash) might be visible or invisible on new master (i.e., original standby). For now, we cannot control that. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
On Fri, Jun 11, 2010 at 9:57 AM, Fujii Masao <masao.fujii@gmail.com> wrote: > On Fri, Jun 11, 2010 at 10:22 PM, Robert Haas <robertmhaas@gmail.com> wrote: >> On Fri, Jun 11, 2010 at 9:14 AM, Fujii Masao <masao.fujii@gmail.com> wrote: >>> Thought? Comment? Objection? >> >> What happens if the WAL is streamed to the standby and then the master >> crashes without writing that WAL to disk? > > What are you concerned about? > > I think that the situation would be the same as 9.0 from users' perspective. > After failover, the transaction which a client regards as aborted (because > of the crash) might be visible or invisible on new master (i.e., original > standby). For now, we cannot control that. I think the failover case might be OK. But if the master crashes and restarts, the slave might be left thinking its xlog position is ahead of the xlog position on the master. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
Fujii Masao <masao.fujii@gmail.com> writes: > In 9.0, walsender reads WAL always from the disk and sends it to the standby. > That is, we cannot send WAL until it has been written (and flushed) to the disk. I believe the above statement to be incorrect: walsender does *not* wait for an fsync to occur. I agree with the idea of trying to read from WAL buffers instead of the file system, but the main reason why is that the current behavior makes FADVISE_DONTNEED for WAL pretty dubious. It'd be a good idea to still (artificially) limit replication to not read ahead of the written-out data. > ... Since we can write and send WAL simultaneously, in synchronous > replication, a transaction commit has only to wait for either of them. So the > performance would significantly increase. That performance claim, frankly, is ludicrous. There is no way that round trip network delay plus write+fsync on the slave is faster than local write+fsync. Furthermore, I would say that you are thinking exactly backwards about the requirements for synchronous replication: what that would mean is that transaction commit waits for *both*, not whichever one finishes first. regards, tom lane
On 06/11/2010 04:31 PM, Tom Lane wrote: > Fujii Masao<masao.fujii@gmail.com> writes: >> In 9.0, walsender reads WAL always from the disk and sends it to the standby. >> That is, we cannot send WAL until it has been written (and flushed) to the disk. > > I believe the above statement to be incorrect: walsender does *not* wait > for an fsync to occur. > > I agree with the idea of trying to read from WAL buffers instead of the > file system, but the main reason why is that the current behavior makes > FADVISE_DONTNEED for WAL pretty dubious. It'd be a good idea to still > (artificially) limit replication to not read ahead of the written-out > data. > >> ... Since we can write and send WAL simultaneously, in synchronous >> replication, a transaction commit has only to wait for either of them. So the >> performance would significantly increase. > > That performance claim, frankly, is ludicrous. There is no way that > round trip network delay plus write+fsync on the slave is faster than > local write+fsync. Furthermore, I would say that you are thinking > exactly backwards about the requirements for synchronous replication: > what that would mean is that transaction commit waits for *both*, > not whichever one finishes first. hmm not sure that is what fujii tried to say - I think his point was that in the original case we would have serialized all the operations (first write+sync on the master, network afterwards and write+sync on the slave) and now we could try parallelizing by sending the wal before we have synced locally. Stefan
Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes: > hmm not sure that is what fujii tried to say - I think his point was > that in the original case we would have serialized all the operations > (first write+sync on the master, network afterwards and write+sync on > the slave) and now we could try parallelizing by sending the wal before > we have synced locally. Well, we're already not waiting for fsync, which is the slowest part. If there's a performance problem, it may be because FADVISE_DONTNEED disables kernel buffering so that we're forced to actually read the data back from disk before sending it on down the wire. regards, tom lane
On 06/11/2010 04:47 PM, Tom Lane wrote:
> Stefan Kaltenbrunner<stefan@kaltenbrunner.cc> writes:
>> hmm not sure that is what fujii tried to say - I think his point was
>> that in the original case we would have serialized all the operations
>> (first write+sync on the master, network afterwards and write+sync on
>> the slave) and now we could try parallelizing by sending the wal before
>> we have synced locally.
>
> Well, we're already not waiting for fsync, which is the slowest part.
> If there's a performance problem, it may be because FADVISE_DONTNEED
> disables kernel buffering so that we're forced to actually read the data
> back from disk before sending it on down the wire.

hmm ok - but assuming sync rep we would end up with something like the following (hypothetically assuming each operation takes 1 time unit):

originally:

write 1
sync 1
network 1
write 1
sync 1

total: 5

whereas in the new case we would basically have the write+sync compete with network+write+sync in parallel (total 3 units) and we would only have to wait for the slower of those two sets of operations instead of the total time of both, or am I missing something?

Stefan
> Well, we're already not waiting for fsync, which is the slowest part. > If there's a performance problem, it may be because FADVISE_DONTNEED > disables kernel buffering so that we're forced to actually read the data > back from disk before sending it on down the wire. Well, that's fairly direct to solve, no? Just disable FADVISE_DONTNEED if walsenders > 0. -- -- Josh Berkus PostgreSQL Experts Inc. http://www.pgexperts.com
On Jun 11, 2010, at 16:31 , Tom Lane wrote:
> Fujii Masao <masao.fujii@gmail.com> writes:
>> In 9.0, walsender reads WAL always from the disk and sends it to the standby.
>> That is, we cannot send WAL until it has been written (and flushed) to the disk.
>
> I believe the above statement to be incorrect: walsender does *not* wait
> for an fsync to occur.

Hm, but then Robert's failure case is real, and streaming replication might break due to an OS-level crash of the master. Or am I missing something?

best regards,
Florian Pflug
> Hm, but then Robert's failure case is real, and streaming replication might break due to an OS-level crash of the master. Or am I missing something?

Well, in the failover case this isn't a problem, it's a benefit: the standby gets a transaction which you would have lost off the master.

However, I can see this as a problem in the event of a server-room powerout with very bad timing where there isn't a failover to the standby:

1) Master goes out
2) "floating" transaction applied to standby.
3) Standby goes out
4) Power back on
5) master comes up
6) standby comes up

It seems like, in that sequence, the standby would have one transaction which the master doesn't have, yet the standby thinks it can continue getting WAL from the master. Or did I miss something which makes this impossible?

--
Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com
On Jun 12, 2010, at 3:10 , Josh Berkus wrote:
>> Hm, but then Robert's failure case is real, and streaming replication might break due to an OS-level crash of the master. Or am I missing something?
>
> 1) Master goes out
> 2) "floating" transaction applied to standby.
> 3) Standby goes out
> 4) Power back on
> 5) master comes up
> 6) standby comes up
>
> It seems like, in that sequence, the standby would have one transaction
> which the master doesn't have, yet the standby thinks it can continue
> getting WAL from the master. Or did I miss something which makes this
> impossible?

I did indeed miss something - with wal_sync_method set to either open_datasync or open_sync, all written WAL is also synced. Since open_datasync is the preferred setting according to http://www.postgresql.org/docs/9.0/static/runtime-config-wal.html#GUC-WAL-SYNC-METHOD, systems supporting open_datasync should be safe.

My Ubuntu 10.04 box running postgres 8.4.4 doesn't support open_datasync though, and hence defaults to fdatasync. Probably because of this fragment in xlogdefs.h:

#if O_DSYNC != BARE_OPEN_SYNC_FLAG
#define OPEN_DATASYNC_FLAG (O_DSYNC | PG_O_DIRECT)
#endif

glibc defines O_DSYNC as an alias for O_SYNC and warrants that with

"Most Linux filesystems don't actually implement the POSIX O_SYNC semantics, which require all metadata updates of a write to be on disk on returning to userspace, but only the O_DSYNC semantics, which require only actual file data and metadata necessary to retrieve it to be on disk by the time the system call returns."

If that is true, I believe we should default to open_sync, not fdatasync if open_datasync isn't available, at least on linux.

best regards,
Florian Pflug
On 12/06/10 01:16, Josh Berkus wrote: > >> Well, we're already not waiting for fsync, which is the slowest part. >> If there's a performance problem, it may be because FADVISE_DONTNEED >> disables kernel buffering so that we're forced to actually read the data >> back from disk before sending it on down the wire. > > Well, that's fairly direct to solve, no? Just disable FADVISE_DONTNEED > if walsenders> 0. We already do that. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Florian Pflug wrote:
> glibc defines O_DSYNC as an alias for O_SYNC and warrants that with
> "Most Linux filesystems don't actually implement the POSIX O_SYNC semantics, which require all metadata updates of a write to be on disk on returning to userspace, but only the O_DSYNC semantics, which require only actual file data and metadata necessary to retrieve it to be on disk by the time the system call returns."
>
> If that is true, I believe we should default to open_sync, not fdatasync if open_datasync isn't available, at least on linux.

It's not true, because Linux O_SYNC semantics are basically that it's never worked reliably on ext3. See http://archives.postgresql.org/pgsql-hackers/2007-10/msg01310.php for an example of how terrible the situation would be if O_SYNC were the default on Linux.

We just got a report that a better O_DSYNC is now properly exposed starting on kernel 2.6.33 + glibc 2.12: http://archives.postgresql.org/message-id/201006041539.03868.cousinmarc@gmail.com and it's possible they may have finally fixed it so it works like it's supposed to. PostgreSQL versions compiled against the right prerequisites will default to O_DSYNC by themselves. Whether or not this is a good thing has yet to be determined.

The last thing we'd want to do at this point is make the old and usually broken O_SYNC behavior suddenly preferred, when the new and possibly fixed O_DSYNC one will be automatically selected when available without any code changes on the database side.

--
Greg Smith  2ndQuadrant US  Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com   www.2ndQuadrant.us
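The "default to O_DSYNC by themselves" behavior Greg mentions comes from a compile-time chain along these lines (a rough paraphrase of src/include/access/xlogdefs.h; the exact macro spellings may differ between versions):

/*
 * Rough paraphrase of the compile-time default for wal_sync_method.  When
 * the platform exposes a usable O_DSYNC (so OPEN_DATASYNC_FLAG gets defined,
 * as in the fragment quoted upthread), open_datasync wins; otherwise the
 * chain falls back to fdatasync and finally plain fsync.
 */
#if defined(OPEN_DATASYNC_FLAG)
#define DEFAULT_SYNC_METHOD     SYNC_METHOD_OPEN_DSYNC      /* open_datasync */
#elif defined(HAVE_FDATASYNC)
#define DEFAULT_SYNC_METHOD     SYNC_METHOD_FDATASYNC       /* fdatasync */
#else
#define DEFAULT_SYNC_METHOD     SYNC_METHOD_FSYNC           /* fsync */
#endif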
On Fri, Jun 11, 2010 at 11:24 PM, Robert Haas <robertmhaas@gmail.com> wrote: > I think the failover case might be OK. But if the master crashes and > restarts, the slave might be left thinking its xlog position is ahead > of the xlog position on the master. Right. Unless we perform a failover in this case, the standby might go down because of inconsistency of WAL after restarting the master. To avoid this problem, walsender must wait for WAL to be not only written but also *fsynced* on the master before sending it as 9.0 does. Though this would degrade the performance, this might be useful for some cases. We should provide the knob to specify whether to allow the standby to go ahead of the master or not? Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
On Fri, Jun 11, 2010 at 11:47 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
>> hmm not sure that is what fujii tried to say - I think his point was
>> that in the original case we would have serialized all the operations
>> (first write+sync on the master, network afterwards and write+sync on
>> the slave) and now we could try parallelizing by sending the wal before
>> we have synced locally.
>
> Well, we're already not waiting for fsync, which is the slowest part.

No, currently walsender waits for fsync.

Walsender tries to send WAL up to xlogctl->LogwrtResult.Write. OTOH, xlogctl->LogwrtResult.Write is updated after XLogWrite() performs fsync. As the result, walsender cannot send WAL not fsynced yet. We should update xlogctl->LogwrtResult.Write before XLogWrite() performs fsync for 9.0?

But that change would cause the problem that Robert pointed out.
http://archives.postgresql.org/pgsql-hackers/2010-06/msg00670.php

> If there's a performance problem, it may be because FADVISE_DONTNEED
> disables kernel buffering so that we're forced to actually read the data
> back from disk before sending it on down the wire.

Currently, if max_wal_senders > 0, POSIX_FADV_DONTNEED is not used for WAL files at all.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
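For reference, the guard Fujii describes looks roughly like this in the 9.0 tree when a WAL segment is closed (paraphrased; XLogIsNeeded() is true when WAL archiving is enabled or max_wal_senders > 0):

/*
 * Paraphrase of the cache hint applied when closing a WAL segment.  The
 * DONTNEED advice is skipped whenever the segment may be re-read locally,
 * e.g. by the archiver or, with max_wal_senders > 0, by a walsender.
 */
#if defined(USE_POSIX_FADVISE) && defined(POSIX_FADV_DONTNEED)
    if (!XLogIsNeeded())
        (void) posix_fadvise(openLogFile, 0, 0, POSIX_FADV_DONTNEED);
#endif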
On Sat, Jun 12, 2010 at 12:15 AM, Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> wrote:
> hmm ok - but assuming sync rep we would end up with something like the
> following (hypothetically assuming each operation takes 1 time unit):
>
> originally:
>
> write 1
> sync 1
> network 1
> write 1
> sync 1
>
> total: 5
>
> whereas in the new case we would basically have the write+sync compete with
> network+write+sync in parallel (total 3 units) and we would only have to wait
> for the slower of those two sets of operations instead of the total time of
> both, or am I missing something?

Yeah, this is what I'd like to say. Thanks!

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
On Mon, Jun 14, 2010 at 4:14 AM, Fujii Masao <masao.fujii@gmail.com> wrote: > On Fri, Jun 11, 2010 at 11:24 PM, Robert Haas <robertmhaas@gmail.com> wrote: >> I think the failover case might be OK. But if the master crashes and >> restarts, the slave might be left thinking its xlog position is ahead >> of the xlog position on the master. > > Right. Unless we perform a failover in this case, the standby might go down > because of inconsistency of WAL after restarting the master. To avoid this > problem, walsender must wait for WAL to be not only written but also *fsynced* > on the master before sending it as 9.0 does. Though this would degrade the > performance, this might be useful for some cases. We should provide the knob > to specify whether to allow the standby to go ahead of the master or not? Maybe. That sounds like a pretty enormous foot-gun to me, considering that we have no way of recovering from the situation where the standby gets ahead of the master. Right now, I believe we're still in the situation where the standby goes into an infinite CPU-chewing, log-spewing loop, but even after we fix that it's not going to be good enough to really handle that case sensibly, which we probably need to do if we want to make this change. Come to think of it, can this happen already? Can the master stream WAL to the standby after it's written but before it's fsync'd? We should get the open item fixed for 9.0 here before we start worrying about 9.1. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
On Mon, 2010-06-14 at 17:39 +0900, Fujii Masao wrote:
> On Fri, Jun 11, 2010 at 11:47 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
>>> hmm not sure that is what fujii tried to say - I think his point was
>>> that in the original case we would have serialized all the operations
>>> (first write+sync on the master, network afterwards and write+sync on
>>> the slave) and now we could try parallelizing by sending the wal before
>>> we have synced locally.
>>
>> Well, we're already not waiting for fsync, which is the slowest part.
>
> No, currently walsender waits for fsync.
>
> Walsender tries to send WAL up to xlogctl->LogwrtResult.Write. OTOH,
> xlogctl->LogwrtResult.Write is updated after XLogWrite() performs fsync.
> As the result, walsender cannot send WAL not fsynced yet. We should
> update xlogctl->LogwrtResult.Write before XLogWrite() performs fsync
> for 9.0?
>
> But that change would cause the problem that Robert pointed out.
> http://archives.postgresql.org/pgsql-hackers/2010-06/msg00670.php

ISTM you just defined some clear objectives for next work.

Copying the data from WAL buffers is mostly irrelevant. The majority of time is lost waiting for fsync. The biggest issue is about how to allow WAL write and WALSender to act concurrently and have backend wait for both.

Sure, copying data from wal_buffers will be faster still, but it will cause you to address some subtle data structure locking operations that we could solve at a later time. And it still gives the problem of how the master resets itself if the standby really is ahead.

--
Simon Riggs  www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Training and Services
On Mon, 2010-06-14 at 17:39 +0900, Fujii Masao wrote:
> No, currently walsender waits for fsync.
> ...
> But that change would cause the problem that Robert pointed out.
> http://archives.postgresql.org/pgsql-hackers/2010-06/msg00670.php

Presumably this means that if synchronous_commit = off on primary that SR in 9.0 will no longer work correctly if the primary crashes?

--
Simon Riggs  www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Training and Services
On Mon, Jun 14, 2010 at 8:10 PM, Robert Haas <robertmhaas@gmail.com> wrote: > Maybe. That sounds like a pretty enormous foot-gun to me, considering > that we have no way of recovering from the situation where the standby > gets ahead of the master. No, we can do that by reconstructing the standby from the backup. And, that situation is not a problem for users including me who prefer to perform a failover when the master goes down. Of course, we can just restart the master in that case, but it's likely to take longer than a failover because there would be a cause of the crash. For example, if the master goes down because of a media crash, the master would never start up unless PITR is performed. So I'm not sure how many users prefer a restart to a failover. > We should get the open item fixed for 9.0 here before we start > worrying about 9.1. Yep, so I was submitting some patches in these days :) Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
On Mon, Jun 14, 2010 at 8:41 AM, Fujii Masao <masao.fujii@gmail.com> wrote: > On Mon, Jun 14, 2010 at 8:10 PM, Robert Haas <robertmhaas@gmail.com> wrote: >> Maybe. That sounds like a pretty enormous foot-gun to me, considering >> that we have no way of recovering from the situation where the standby >> gets ahead of the master. > > No, we can do that by reconstructing the standby from the backup. > > And, that situation is not a problem for users including me who prefer to > perform a failover when the master goes down. You don't get to pick - if a backend crashes on the master, it will restart right away and come up, but the slave will now be hosed... -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
Fujii Masao <masao.fujii@gmail.com> writes:
> On Fri, Jun 11, 2010 at 11:47 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Well, we're already not waiting for fsync, which is the slowest part.
>
> No, currently walsender waits for fsync.

No, you're mistaken.

> Walsender tries to send WAL up to xlogctl->LogwrtResult.Write. OTOH,
> xlogctl->LogwrtResult.Write is updated after XLogWrite() performs fsync.

Wrong. LogwrtResult.Write tracks how far we've written out data, but it is only (known to be) fsync'd as far as LogwrtResult.Flush.

> But that change would cause the problem that Robert pointed out.
> http://archives.postgresql.org/pgsql-hackers/2010-06/msg00670.php

Yes. Possibly walsender should only be allowed to send as far as LogwrtResult.Flush.

			regards, tom lane
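A minimal sketch of the change Tom is suggesting, assuming an accessor (called GetFlushRecPtr() here) that returns the shared LogwrtResult.Flush value; the function and variable names are assumptions for illustration rather than a committed API:

/*
 * Sketch: cap what walsender may ship at the flush pointer instead of the
 * write pointer, so only WAL that is safely on the master's disk ever
 * leaves the master.  Types and macros as in access/xlogdefs.h.
 */
static XLogRecPtr sentPtr;              /* how far we have already sent */

static void
XLogSend_sketch(void)
{
    XLogRecPtr  SendRqstPtr;

    SendRqstPtr = GetFlushRecPtr();     /* previously: the write pointer */

    /* Nothing new that is safe to send yet? */
    if (XLByteLE(SendRqstPtr, sentPtr))
        return;

    /* ... read [sentPtr, SendRqstPtr) from pg_xlog and send it ... */
    sentPtr = SendRqstPtr;
}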
On Mon, Jun 14, 2010 at 10:13 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Mon, Jun 14, 2010 at 8:41 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
>> On Mon, Jun 14, 2010 at 8:10 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>>> Maybe. That sounds like a pretty enormous foot-gun to me, considering
>>> that we have no way of recovering from the situation where the standby
>>> gets ahead of the master.
>>
>> No, we can do that by reconstructing the standby from the backup.
>>
>> And, that situation is not a problem for users including me who prefer to
>> perform a failover when the master goes down.
>
> You don't get to pick - if a backend crashes on the master, it will
> restart right away and come up, but the slave will now be hosed...

You are concerned about the case where postmaster automatically restarts the crash recovery, in particular? Yes, this case is more problematic. If the standby is ahead of the master, the standby might find an invalid record and run into the infinite retry loop, or keep working without noticing the inconsistency between the database and the WAL.

I'm thinking that walreceiver should throw a PANIC when it receives the record which is in the LSN older than the last WAL receive location, except the beginning of streaming (because the standby always requests for streaming from the starting of WAL file at first even if some records have already been received in previous time). Thought?

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
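A sketch of the sanity check being proposed here, assuming walreceiver tracks the end of what it has already received in a variable called receivedUpto and can tell the first message of a (re)connection apart; all names here are illustrative:

/*
 * Sketch of the proposed walreceiver check: once streaming is established,
 * the master must never hand us WAL that starts before what we have already
 * received.  The first message after (re)connecting is excused, because the
 * standby itself asks to restart from the current segment boundary.
 */
static void
check_received_lsn_sketch(XLogRecPtr dataStart, bool first_message)
{
    if (!first_message && XLByteLT(dataStart, receivedUpto))
        ereport(PANIC,
                (errmsg("received WAL starting at %X/%X, "
                        "behind the last received position %X/%X",
                        dataStart.xlogid, dataStart.xrecoff,
                        receivedUpto.xlogid, receivedUpto.xrecoff)));
}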
On Tue, Jun 15, 2010 at 12:02 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Fujii Masao <masao.fujii@gmail.com> writes: >> On Fri, Jun 11, 2010 at 11:47 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>> Well, we're already not waiting for fsync, which is the slowest part. > >> No, currently walsender waits for fsync. > > No, you're mistaken. > >> Walsender tries to send WAL up to xlogctl->LogwrtResult.Write. OTOH, >> xlogctl->LogwrtResult.Write is updated after XLogWrite() performs fsync. > > Wrong. LogwrtResult.Write tracks how far we've written out data, > but it is only (known to be) fsync'd as far as LogwrtResult.Flush. Hmm.. I agree that xlogctl->LogwrtResult.Write indicates the byte position we've written. But in the current XLogWrite() code, it's updated after XLogWrite() calls issue_xlog_fsync(). No? Of course, the backend-local LogwrtResult.Write is updated before issue_xlog_fsync(), but it's not available by walsender. Am I missing something? >> But that change would cause the problem that Robert pointed out. >> http://archives.postgresql.org/pgsql-hackers/2010-06/msg00670.php > > Yes. Possibly walsender should only be allowed to send as far as > LogwrtResult.Flush. Yes, in order to avoid that problem, walsender should wait for WAL to be fsync'd before sending it. But I'm worried that this would slow down the performance on the master significantly because WAL flush and WAL streaming are not performed concurrently and the backend must wait for both in a serial manner. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
On 15/06/10 07:47, Fujii Masao wrote: > On Tue, Jun 15, 2010 at 12:02 AM, Tom Lane<tgl@sss.pgh.pa.us> wrote: >> Fujii Masao<masao.fujii@gmail.com> writes: >>> Walsender tries to send WAL up to xlogctl->LogwrtResult.Write. OTOH, >>> xlogctl->LogwrtResult.Write is updated after XLogWrite() performs fsync. >> >> Wrong. LogwrtResult.Write tracks how far we've written out data, >> but it is only (known to be) fsync'd as far as LogwrtResult.Flush. > > Hmm.. I agree that xlogctl->LogwrtResult.Write indicates the byte position > we've written. But in the current XLogWrite() code, it's updated after > XLogWrite() calls issue_xlog_fsync(). No? issue_xlog_fsync() is only called if the caller requested a flush by advancing WriteRqst.Flush. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
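To make the point concrete, the relevant shape of XLogWrite() is roughly the following (heavily abridged; the shared-memory publication of both pointers happens at the end, and the fsync branch only fires when the caller's request includes a flush):

/*
 * Heavily abridged shape of XLogWrite().  The write pointer advances as
 * pages go out; issue_xlog_fsync() runs only when a flush was requested,
 * and only then does the Flush pointer advance.  A walsender limited to
 * the shared Flush pointer therefore never sees unflushed WAL.
 */
static void
XLogWrite_sketch(XLogwrtRqst WriteRqst)
{
    /* ... write out completed WAL pages up to WriteRqst.Write ... */
    LogwrtResult.Write = WriteRqst.Write;

    if (XLByteLT(LogwrtResult.Flush, WriteRqst.Flush) &&
        XLByteLT(LogwrtResult.Flush, LogwrtResult.Write))
    {
        issue_xlog_fsync();             /* arguments elided in this sketch */
        LogwrtResult.Flush = LogwrtResult.Write;
    }

    /* Publish both pointers for other backends (and walsender) to see. */
    SpinLockAcquire(&xlogctl->info_lck);
    xlogctl->LogwrtResult = LogwrtResult;
    SpinLockRelease(&xlogctl->info_lck);
}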
On Tue, Jun 15, 2010 at 2:16 PM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote:
> On 15/06/10 07:47, Fujii Masao wrote:
>> On Tue, Jun 15, 2010 at 12:02 AM, Tom Lane<tgl@sss.pgh.pa.us> wrote:
>>> Fujii Masao<masao.fujii@gmail.com> writes:
>>>> Walsender tries to send WAL up to xlogctl->LogwrtResult.Write. OTOH,
>>>> xlogctl->LogwrtResult.Write is updated after XLogWrite() performs fsync.
>>>
>>> Wrong. LogwrtResult.Write tracks how far we've written out data,
>>> but it is only (known to be) fsync'd as far as LogwrtResult.Flush.
>>
>> Hmm.. I agree that xlogctl->LogwrtResult.Write indicates the byte position
>> we've written. But in the current XLogWrite() code, it's updated after
>> XLogWrite() calls issue_xlog_fsync(). No?
>
> issue_xlog_fsync() is only called if the caller requested a flush by
> advancing WriteRqst.Flush.

True. The scenario that I'm concerned about is:

1. A transaction commit causes XLogFlush() to write *and* fsync WAL up to the commit record.
2. XLogFlush() calls XLogWrite(), and xlogctl->LogwrtResult.Write is updated to indicate the LSN bigger than or equal to that of the commit record after XLogWrite() calls issue_xlog_fsync().
3. Then walsender can send WAL up to the commit record.

A transaction commit would need to wait for local fsync and replication in a serial manner, in synchronous replication. IOW, walsender cannot send the commit record until it's fsync'd in XLogWrite(). This scenario will not happen? Am I missing something?

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
On Tue, Jun 15, 2010 at 12:46 AM, Fujii Masao <masao.fujii@gmail.com> wrote: > On Mon, Jun 14, 2010 at 10:13 PM, Robert Haas <robertmhaas@gmail.com> wrote: >> On Mon, Jun 14, 2010 at 8:41 AM, Fujii Masao <masao.fujii@gmail.com> wrote: >>> On Mon, Jun 14, 2010 at 8:10 PM, Robert Haas <robertmhaas@gmail.com> wrote: >>>> Maybe. That sounds like a pretty enormous foot-gun to me, considering >>>> that we have no way of recovering from the situation where the standby >>>> gets ahead of the master. >>> >>> No, we can do that by reconstructing the standby from the backup. >>> >>> And, that situation is not a problem for users including me who prefer to >>> perform a failover when the master goes down. >> >> You don't get to pick - if a backend crashes on the master, it will >> restart right away and come up, but the slave will now be hosed... > > You are concerned about the case where postmaster automatically restarts > the crash recovery, in particular? Yes, this case is more problematic. > If the standby is ahead of the master, the standby might find an invalid > record and run into the infinite retry loop, or keep working without > noticing the inconsistency between the database and the WAL. > > I'm thinking that walreceiver should throw a PANIC when it receives the > record which is in the LSN older than the last WAL receive location, > except the beginning of streaming (because the standby always requests > for streaming from the starting of WAL file at first even if some records > have already been received in previous time). Thought? Yeah, that seems like it would be a good safety check. I wonder if it would be possible to jigger things so that we send the WAL to the standby as soon as it is generated, but somehow arrange things so that the standby knows the last location that the master has fsync'd and never applies beyond that point. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
On Jun 15, 2010, at 10:45 , Fujii Masao wrote:
> A transaction commit would need to wait for local fsync and replication
> in a serial manner, in synchronous replication. IOW, walsender cannot
> send the commit record until it's fsync'd in XLogWrite().

Hm, but since 9.0 won't do synchronous replication anyway, the right thing to do for 9.0 is still to send only fsync'ed WAL, no? Without synchronous replication the overhead seems negligible.

For synchronous replication (and hence for 9.1) I think there are two basic options:

a) Stream only fsync'ed WAL, like in the asynchronous case. Depending on policy, additionally wait for one or more slaves to fsync before reporting success.

b) Stream non-fsync'ed WAL. On COMMIT, wait for at least one node (not necessarily the master, exact count depends on policy) to fsync before reporting success. During recovery of the master, recover up to the latest LSN found on any one of the nodes.

Option (b) requires some additional thought, though. Controlled removal of slave nodes and concurrent crashes of more than one node are the most difficult areas to handle gracefully, it seems.

best regards,
Florian Pflug
> I wonder if it would be possible to jigger things so that we send the > WAL to the standby as soon as it is generated, but somehow arrange > things so that the standby knows the last location that the master has > fsync'd and never applies beyond that point. I can't think of any way which would not require major engineering. And you'd be slowing down replication *in general* to deal with a fairly unlikely corner case. I think the panic is the way to go. -- -- Josh Berkus PostgreSQL Experts Inc. http://www.pgexperts.com
On Tue, Jun 15, 2010 at 3:57 PM, Josh Berkus <josh@agliodbs.com> wrote: >> I wonder if it would be possible to jigger things so that we send the >> WAL to the standby as soon as it is generated, but somehow arrange >> things so that the standby knows the last location that the master has >> fsync'd and never applies beyond that point. > > I can't think of any way which would not require major engineering. And > you'd be slowing down replication *in general* to deal with a fairly > unlikely corner case. > > I think the panic is the way to go. I have yet to convince myself of how likely this is to occur. I tried to reproduce this issue by crashing the database, but I think in 9.0 you need an actual operating system crash to cause this problem, and I haven't yet set up an environment in which I can repeatedly crash the OS. I believe, though, that in 9.1, we're going to want to stream from WAL buffers as proposed in the patch that started out this thread, and then I think this issue can be triggered with just a database crash. In 9.0, I think we can fix this problem by (1) only streaming WAL that has been fsync'd and (2) PANIC-ing if the problem occurs anyway. But in 9.1, with sync rep and the performance demands that entails, I think that we're going to need to rethink it. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
> I have yet to convince myself of how likely this is to occur. I tried
> to reproduce this issue by crashing the database, but I think in 9.0
> you need an actual operating system crash to cause this problem, and I
> haven't yet set up an environment in which I can repeatedly crash the
> OS. I believe, though, that in 9.1, we're going to want to stream
> from WAL buffers as proposed in the patch that started out this
> thread, and then I think this issue can be triggered with just a
> database crash.

Yes, but it still requires:

a) the master must crash with at least one transaction transmitted to the slave and not yet fsync'd
b) the slave must not crash as well
c) the master must come back up without the slave ever having been promoted to master

Note that (a) is fairly improbable to begin with due to both our batching transactions into bundles for transmission, and network latency vs. disk latency.

So, is it possible? Yes. Will it happen anywhere but the highest-txn-rate sites one in 10,000 times? No. This means that we should look for a solution which does not penalize the common case in order to close a very improbable hole, if such a solution exists.

> In 9.0, I think we can fix this problem by (1) only streaming WAL that
> has been fsync'd and

I don't think this is the best solution; it would be a noticeable performance penalty on replication. It also would potentially result in data loss for the user; if the user fails over to the slave in the corner case, they can "rescue" the in-flight transaction. At the least, this would need to become Yet Another Configuration Option.

> (2) PANIC-ing if the problem occurs anyway.

The question is, is detecting out-of-order WAL records *sufficient* to detect a failure? I'm thinking there are possible sequences where there would be no out-of-sequence, but the slave would still have a transaction the master doesn't, which the user wouldn't know about until a page update corrupts their data.

> But
> in 9.1, with sync rep and the performance demands that entails, I
> think that we're going to need to rethink it.

All the more reason to avoid dealing with it now, if we can.

--
Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com
On 6/15/10 5:09 PM, Josh Berkus wrote: >> > In 9.0, I think we can fix this problem by (1) only streaming WAL that >> > has been fsync'd and > > I don't think this is the best solution; it would be a noticeable > performance penalty on replication. Actually, there's an even bigger reason not to mandate waiting for fsync: what if the user turns fsync off? One can certainly imagine users choosing to rely on their replication slaves for crash recovery instead of fsync. -- -- Josh Berkus PostgreSQL Experts Inc. http://www.pgexperts.com
On Tue, Jun 15, 2010 at 8:09 PM, Josh Berkus <josh@agliodbs.com> wrote:
>> I have yet to convince myself of how likely this is to occur. I tried
>> to reproduce this issue by crashing the database, but I think in 9.0
>> you need an actual operating system crash to cause this problem, and I
>> haven't yet set up an environment in which I can repeatedly crash the
>> OS. I believe, though, that in 9.1, we're going to want to stream
>> from WAL buffers as proposed in the patch that started out this
>> thread, and then I think this issue can be triggered with just a
>> database crash.
>
> Yes, but it still requires:
>
> a) the master must crash with at least one transaction transmitted to
> the slave and not yet fsync'd

Bzzzzt. Stop right there. It only requires the master to crash with at least one *WAL record* written but not transmitted, not one transaction. And most WAL record types are not fsync'd immediately. So in theory I think that, for example, an OS crash in the middle of a big bulk insert operation should be sufficient to trigger this.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company
On Wed, Jun 16, 2010 at 5:06 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Jun 15, 2010 at 3:57 PM, Josh Berkus <josh@agliodbs.com> wrote:
>>> I wonder if it would be possible to jigger things so that we send the
>>> WAL to the standby as soon as it is generated, but somehow arrange
>>> things so that the standby knows the last location that the master has
>>> fsync'd and never applies beyond that point.
>>
>> I can't think of any way which would not require major engineering. And
>> you'd be slowing down replication *in general* to deal with a fairly
>> unlikely corner case.
>>
>> I think the panic is the way to go.
>
> I have yet to convince myself of how likely this is to occur. I tried
> to reproduce this issue by crashing the database, but I think in 9.0
> you need an actual operating system crash to cause this problem, and I
> haven't yet set up an environment in which I can repeatedly crash the
> OS. I believe, though, that in 9.1, we're going to want to stream
> from WAL buffers as proposed in the patch that started out this
> thread, and then I think this issue can be triggered with just a
> database crash.
>
> In 9.0, I think we can fix this problem by (1) only streaming WAL that
> has been fsync'd and (2) PANIC-ing if the problem occurs anyway. But
> in 9.1, with sync rep and the performance demands that entails, I
> think that we're going to need to rethink it.

The problem is not that the master streams non-fsync'd WAL, but that the standby can replay that. So I'm thinking that we can send non-fsync'd WAL safely if the standby makes the recovery wait until the master has fsync'd WAL. That is, walsender sends not only non-fsync'd WAL but also WAL flush location to walreceiver, and the standby applies only the WAL which the master has already fsync'd. Thought?

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
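A sketch of the apply-side guard being proposed, assuming the master's flush position is shipped along with the WAL and published in shared memory behind an accessor called GetMasterFlushRecPtr() here; every name in this fragment is an assumption for discussion, not an existing API:

/*
 * Sketch: the startup process refuses to replay a record whose end lies
 * beyond what the master has reported as fsync'd, even though walreceiver
 * may already have streamed (and locally written) WAL past that point.
 */
static bool
record_is_safe_to_apply(XLogRecPtr EndRecPtr)
{
    XLogRecPtr  masterFlushed = GetMasterFlushRecPtr();

    /*
     * Apply only WAL that the master already has safely on disk; otherwise
     * wait for walreceiver to report a newer master flush location.
     */
    return XLByteLE(EndRecPtr, masterFlushed);
}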
On 21/06/10 12:08, Fujii Masao wrote: > On Wed, Jun 16, 2010 at 5:06 AM, Robert Haas<robertmhaas@gmail.com> wrote: >> In 9.0, I think we can fix this problem by (1) only streaming WAL that >> has been fsync'd and (2) PANIC-ing if the problem occurs anyway. But >> in 9.1, with sync rep and the performance demands that entails, I >> think that we're going to need to rethink it. > > The problem is not that the master streams non-fsync'd WAL, but that the > standby can replay that. So I'm thinking that we can send non-fsync'd WAL > safely if the standby makes the recovery wait until the master has fsync'd > WAL. That is, walsender sends not only non-fsync'd WAL but also WAL flush > location to walreceiver, and the standby applies only the WAL which the > master has already fsync'd. Thought? I guess, but you have to be very careful to correctly refrain from applying the WAL. For example, a naive implementation might write the WAL to disk in walreceiver immediately, but refrain from telling the startup process about it. If walreceiver is then killed because the connection is broken (and it will be because the master just crashed), the startup process will read the streamed WAL from the file in pg_xlog, and go ahead to apply it anyway. So maybe there's some room for optimization there, but given the round-trip required for the acknowledgment anyway it might not buy you much, and the implementation is not very straightforward. This is clearly 9.1 material, if worth optimizing at all. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Mon, Jun 21, 2010 at 10:40 AM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote:
> I guess, but you have to be very careful to correctly refrain from applying
> the WAL. For example, a naive implementation might write the WAL to disk in
> walreceiver immediately, but refrain from telling the startup process about
> it. If walreceiver is then killed because the connection is broken (and it
> will be because the master just crashed), the startup process will read the
> streamed WAL from the file in pg_xlog, and go ahead to apply it anyway.

So the goal is that when you *do* failover to the standby it replays these additional records. So whether the startup process obeys this limit would have to be conditional on whether it's still in standby mode.

> So maybe there's some room for optimization there, but given the round-trip
> required for the acknowledgment anyway it might not buy you much, and the
> implementation is not very straightforward. This is clearly 9.1 material, if
> worth optimizing at all.

I don't see any need for a round-trip acknowledgement -- no more than currently. The master just includes the flush location in every response. It might have to send additional responses when fsyncs happen, though, to update the flush location even if no additional records are sent. Otherwise a hot standby might spend a long time with out-dated data even if on failover it would be up to date, which seems non-ideal for hot standby users.

I think this would be a good improvement for databases processing large batch updates, so the standby doesn't have an increased risk of losing a large amount of data if there's a crash after processing such a large query. I agree it's 9.1 material.

Earlier we made a change to the WAL streaming protocol on the basis that we wanted to get the protocol right even if we don't use the change right away. I'm not sure I understand that -- it's not like we're going to stream WAL from 9.0 to 9.1. But if that was true then perhaps we need to add the WAL flush location to the protocol now even if we're not going to use it yet?

--
greg
On Mon, 2010-06-21 at 18:08 +0900, Fujii Masao wrote:
> The problem is not that the master streams non-fsync'd WAL, but that the
> standby can replay that. So I'm thinking that we can send non-fsync'd WAL
> safely if the standby makes the recovery wait until the master has fsync'd
> WAL. That is, walsender sends not only non-fsync'd WAL but also WAL flush
> location to walreceiver, and the standby applies only the WAL which the
> master has already fsync'd. Thought?

Yes, good thought. The patch just applied seems too much.

I had the same thought, though it would mean you'd need to send two xlog end locations, one for write, one for fsync. Though not really clear why we send the "current end of WAL on the server" anyway, so maybe we can just alter that.

--
Simon Riggs  www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Training and Services
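To illustrate carrying two end-of-WAL locations, here is a hypothetical variant of the per-message header walsender already prepends to WAL data in 9.0 (WalDataMessageHeader in the real tree); the extra field and its name are assumptions for discussion only, not a committed wire format:

/*
 * Hypothetical variant of the WAL-data message header, extended to carry
 * both the write and the flush end positions so the standby can buffer
 * everything it receives but replay only up to walFlushEnd.
 */
typedef struct
{
    XLogRecPtr  dataStart;      /* WAL start location of the data that follows */
    XLogRecPtr  walWriteEnd;    /* how far the master has written WAL */
    XLogRecPtr  walFlushEnd;    /* how far the master has fsync'd WAL */
    TimestampTz sendTime;       /* master's clock at the time of transmission */
} WalDataMessageHeaderSketch;

Keeping both fields in every data message avoids any extra round trip: the standby learns the master's flush progress as a side effect of receiving WAL, which is the point Greg makes above about not needing additional acknowledgments.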
Simon Riggs wrote: > On Mon, 2010-06-21 at 18:08 +0900, Fujii Masao wrote: > > > The problem is not that the master streams non-fsync'd WAL, but that the > > standby can replay that. So I'm thinking that we can send non-fsync'd WAL > > safely if the standby makes the recovery wait until the master has fsync'd > > WAL. That is, walsender sends not only non-fsync'd WAL but also WAL flush > > location to walreceiver, and the standby applies only the WAL which the > > master has already fsync'd. Thought? > > Yes, good thought. The patch just applied seems too much. > > I had the same thought, though it would mean you'd need to send two xlog > end locations, one for write, one for fsync. Though not really clear why > we send the "current end of WAL on the server" anyway, so maybe we can > just alter that. Is this a TODO? -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + None of us is going to be here forever. +
On Tue, Jun 29, 2010 at 10:06 PM, Bruce Momjian <bruce@momjian.us> wrote:
> Simon Riggs wrote:
>> On Mon, 2010-06-21 at 18:08 +0900, Fujii Masao wrote:
>>
>>> The problem is not that the master streams non-fsync'd WAL, but that the
>>> standby can replay that. So I'm thinking that we can send non-fsync'd WAL
>>> safely if the standby makes the recovery wait until the master has fsync'd
>>> WAL. That is, walsender sends not only non-fsync'd WAL but also WAL flush
>>> location to walreceiver, and the standby applies only the WAL which the
>>> master has already fsync'd. Thought?
>>
>> Yes, good thought. The patch just applied seems too much.
>>
>> I had the same thought, though it would mean you'd need to send two xlog
>> end locations, one for write, one for fsync. Though not really clear why
>> we send the "current end of WAL on the server" anyway, so maybe we can
>> just alter that.
>
> Is this a TODO?

Maybe. As Heikki pointed out upthread, the standby can't even write the WAL out to the OS until it's been fsync'd on the master without risking the problem under discussion. So we can stream the WAL from master to standby as long as the standby just buffers it in memory (or somewhere other than the usual location in pg_xlog).

Before we get too busy frobnicating this gonkulator, I'd like to see a little more discussion of what kind of performance people are expecting from sync rep. Sounds to me like the best we can expect here is, on every commit: (a) wait for master fsync to complete, (b) send message to standby, (c) wait for a reply from standby indicating that fsync is complete on standby. Even assuming that the network overhead is minimal, that halves the commit rate. Are the people who want sync rep OK with that? Is there any way to do better?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company
On Wed, Jun 30, 2010 at 11:26 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> Maybe. As Heikki pointed out upthread, the standby can't even write
> the WAL out to the OS until it's been fsync'd on the master without
> risking the problem under discussion.

If we change the startup process so that it doesn't go ahead of the master's fsync location even after the walreceiver is terminated, we would have no need to worry about that risk. For further robustness, the walreceiver might be able to zero the WAL records which have not been fsync'd on the master yet, when being terminated.

But, if the standby crashes after the master crashes, restart of the standby might replay that non-fsync'd WAL wrongly because it cannot remember the master's fsync location. In this case, if we promote the standby to the master, we still don't have to worry about that risk. But instead of performing a failover, if we restart the master and make the standby connect to the master again, the database on the standby would get corrupted. For now, I don't have a good idea to avoid that database corruption caused by the double failure (crash of both master and standby)...

> So we can stream the
> WAL from master to standby as long as the standby just buffers it in
> memory (or somewhere other than the usual location in pg_xlog).

Yeah, I was just thinking the same thing. But the problem is that the buffer size might become too big (might be bigger than 16MB). For example, synchronous_commit = off and wal_writer_delay = 10000ms on the master would delay the fsync significantly and increase the buffer size on the standby.

> Before we get too busy frobnicating this gonkulator, I'd like to see a
> little more discussion of what kind of performance people are
> expecting from sync rep. Sounds to me like the best we can expect
> here is, on every commit: (a) wait for master fsync to complete, (b)
> send message to standby, (c) wait for a reply from standby indicating
> that fsync is complete on standby. Even assuming that the network
> overhead is minimal, that halves the commit rate. Are the people who
> want sync rep OK with that? Is there any way to do better?

(c) would depend on the synchronization mode the user chooses:

#1 Wait for WAL to be received by the standby
#2 Wait for WAL to be received and flushed by the standby
#3 Wait for WAL to be received, flushed and replayed by the standby

(a) would depend on synchronous_commit. Personally I'm interested in disabling synchronous_commit on the master and choosing #1 as the sync mode. Though this may be very optimistic configuration :)

The point for performance of sync rep is to parallelize (a) and (b)+(c), I think. If they are performed in a serial manner, the performance overhead on the master would become high.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
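For concreteness, the three acknowledgement levels listed above could be written down as something like the following; the enum and its names are purely illustrative (nothing like it exists in 9.0):

/*
 * Illustrative only: the three standby acknowledgement levels under
 * discussion.  A waiting commit on the master would be released once the
 * configured level has been reached on the required standby(s); the lower
 * the level, the cheaper the commit and the weaker the guarantee.
 */
typedef enum StandbyAckLevel
{
    ACK_RECEIVED,               /* #1: WAL received by the standby             */
    ACK_FSYNCED,                /* #2: ... and flushed to the standby's disk   */
    ACK_APPLIED                 /* #3: ... and replayed by the startup process */
} StandbyAckLevel;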
On Wed, Jun 30, 2010 at 5:36 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
>> Before we get too busy frobnicating this gonkulator, I'd like to see a
>> little more discussion of what kind of performance people are
>> expecting from sync rep. Sounds to me like the best we can expect
>> here is, on every commit: (a) wait for master fsync to complete, (b)
>> send message to standby, (c) wait for a reply from standby indicating
>> that fsync is complete on standby. Even assuming that the network
>> overhead is minimal, that halves the commit rate. Are the people who
>> want sync rep OK with that? Is there any way to do better?
>
> (c) would depend on the synchronization mode the user chooses:
>
> #1 Wait for WAL to be received by the standby
> #2 Wait for WAL to be received and flushed by the standby
> #3 Wait for WAL to be received, flushed and replayed by the standby
>
> (a) would depend on synchronous_commit. Personally I'm interested in
> disabling synchronous_commit on the master and choosing #1 as the sync
> mode. Though this may be very optimistic configuration :)
>
> The point for performance of sync rep is to parallelize (a) and (b)+(c),
> I think. If they are performed in a serial manner, the performance
> overhead on the master would become high.

Right. So we need to try to come up with a design that permits that, which must be robust in the face of any number of crashes on the two machines, in any order. Until we have that, we're just going around in circles.

One thought that occurred to me is that if the master and standby were more tightly coupled, you could recover after a crash by making the one with the further-advanced WAL position the master, and the other one the standby. That would get around this problem, though at the cost of considerable additional complexity. But then if one of the servers comes up and can't talk to the other, you need some mechanism for preventing split-brain syndrome.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company
On Wed, Jun 30, 2010 at 12:37 PM, Robert Haas <robertmhaas@gmail.com> wrote: > One thought that occurred to me is that if the master and standby were > more tightly coupled, you could recover after a crash by making the > one with the further-advanced WAL position the master, and the other > one the standby. That would get around this problem, though at the > cost of considerable additional complexity. But then if one of the > servers comes up and can't talk to the other, you need some mechanism > for preventing split-brain syndrome. Users should be free to build infrastructure to allow that. But we can't just switch ourselves -- we don't know what other pieces of their systems need to be updated when the master changes. We also need to stop thinking in terms of one master and one slave. They could have dozens of slaves and in case of failover would want to pick the slave with the most recent WAL position. The way I picture that happening they're monitoring all their slaves in some monitoring tool and use that data to pick the new master. Some external tool picks the new master and tells that host, all the other slaves, and all the rest of the their infrastructure where to find the new master and does whatever is necessary to restart or reload configurations. The question I think is what interfaces do we need in Postgres to make this easy. The monitoring tool needs a way to find the current WAL position from the slaves even when the master is down. That means potentially needing to start up the slaves in read-only mode with no master at all. It also means making it easy for an external tool to switch a node from slave to primary and change a slave's master. And it also means a slave should be able to change master and pick up where it left off easily. I'm not sure what the recommended interfaces for these operations would be currently for an external tool. -- greg
On Fri, Jun 11, 2010 at 9:14 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
> In 9.0, walsender reads WAL always from the disk and sends it to the standby.
> That is, we cannot send WAL until it has been written (and flushed) to the disk.
> This degrades the performance of synchronous replication very much since a
> transaction commit must wait for the WAL write time *plus* the replication time.
>
> The attached patch enables walsender to read data from WAL buffers in addition
> to the disk. Since we can write and send WAL simultaneously, in synchronous
> replication, a transaction commit has only to wait for either of them. So the
> performance would significantly increase.

To recap the previous discussion on this thread, we ended up changing the behavior of 9.0 so that it only sends WAL which has been written to the OS *and flushed*, because sending unflushed WAL to the standby is unsafe. The standby can get ahead of the master while still believing that the databases are in sync, due to the fact that after an SR reconnect we rewind to the start of the current WAL segment. This results in a silently corrupt standby database.

If it's unsafe to send written but unflushed WAL to the standby, then for the same reasons we can't send unwritten WAL either. Therefore, I believe that this entire patch in its current form is a nonstarter and we should mark it Rejected in the CF app so that reviewers don't unnecessarily spend time on it.

Having said that, I do think we urgently need some high-level design discussion on how sync rep is actually going to handle this issue (perhaps on a new thread). If we can't resolve this issue, sync rep is going to be really slow; but there are no easy solutions to this problem in sight, so if we want to have sync rep for 9.1 we'd better agree on one of the difficult solutions soon so that work can begin.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company
Robert Haas <robertmhaas@gmail.com> writes:
> If it's unsafe to send written but unflushed WAL to the standby, then
> for the same reasons we can't send unwritten WAL either.
[...]
> Having said that, I do think we urgently need some high-level design
> discussion on how sync rep is actually going to handle this issue

Stop me if I'm all wrong already, but I thought we said that we should handle this case by decoupling what we can send to the standby and what it can apply. We could do this by sending the current WAL fsync'ed position on the master in the WAL sender protocol, either in the WAL itself or as out-of-band messages, I guess.

Now, this can be made safe, how to make it fast (low-latency) is yet to be addressed.

Regards,
--
dim
On Wed, Jul 7, 2010 at 4:40 AM, Dimitri Fontaine <dfontaine@hi-media.com> wrote:
> Stop me if I'm all wrong already, but I thought we said that we should
> handle this case by decoupling what we can send to the standby and what
> it can apply. We could do this by sending the current WAL fsync'ed
> position on the master in the WAL sender protocol, either in the WAL
> itself or as out-of-band messages, I guess.
>
> Now, this can be made safe, how to make it fast (low-latency) is yet to
> be addressed.

Yeah, that's the trick, isn't it?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company
Dimitri Fontaine <dfontaine@hi-media.com> writes:
> Stop me if I'm all wrong already, but I thought we said that we should
> handle this case by decoupling what we can send to the standby and what
> it can apply.

What's the point of that? It won't make the standby apply any faster. What it will do is make the protocol more complicated, hence slower (more messages) and more at risk of bugs.

			regards, tom lane
Tom Lane <tgl@sss.pgh.pa.us> writes:
> Dimitri Fontaine <dfontaine@hi-media.com> writes:
>> Stop me if I'm all wrong already, but I thought we said that we should
>> handle this case by decoupling what we can send to the standby and what
>> it can apply.
>
> What's the point of that? It won't make the standby apply any faster.

True, but it allows sending the WAL content before acking its fsync.

Regards.
--
dim
On 7/6/10 4:44 PM, Robert Haas wrote: > To recap the previous discussion on this thread, we ended up changing > the behavior of 9.0 so that it only sends WAL which has been written > to the OS *and flushed*, because sending unflushed WAL to the standby > is unsafe. The standby can get ahead of the master while still > believing that the databases are in sync, due to the fact that after > an SR reconnect we rewind to the start of the current WAL segment. > This results in a silently corrupt standby database. What was the final decision on behavior if fsync=off? -- -- Josh Berkus PostgreSQL Experts Inc. http://www.pgexperts.com
On Wed, Jul 7, 2010 at 6:44 PM, Josh Berkus <josh@agliodbs.com> wrote: > On 7/6/10 4:44 PM, Robert Haas wrote: >> To recap the previous discussion on this thread, we ended up changing >> the behavior of 9.0 so that it only sends WAL which has been written >> to the OS *and flushed*, because sending unflushed WAL to the standby >> is unsafe. The standby can get ahead of the master while still >> believing that the databases are in sync, due to the fact that after >> an SR reconnect we rewind to the start of the current WAL segment. >> This results in a silently corrupt standby database. > > What was the final decision on behavior if fsync=off? I'm not sure we made any decision, per se, but if you use fsync=off in combination with SR and experience an unexpected crash-and-reboot on the master, you will be sad. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
> Having said that, I do think we urgently need some high-level design
> discussion on how sync rep is actually going to handle this issue
> (perhaps on a new thread). If we can't resolve this issue, sync rep
> is going to be really slow; but there are no easy solutions to this
> problem in sight, so if we want to have sync rep for 9.1 we'd better
> agree on one of the difficult solutions soon so that work can begin.

When standbys reconnect after a crash, they could send the ahead-of-the-master WAL to the master. This is an alternative to choosing the most-ahead standby as the new master, as suggested elsewhere.

Greetings
Marcin Mańk
On Thu, Jul 8, 2010 at 7:55 AM, Robert Haas <robertmhaas@gmail.com> wrote: >> What was the final decision on behavior if fsync=off? > > I'm not sure we made any decision, per se, but if you use fsync=off in > combination with SR and experience an unexpected crash-and-reboot on > the master, you will be sad. True. But, without SR, an unexpected crash-and-reboot in the master would make you sad ;) So I'm not sure whether we really need to take action for the case of SR + fsync=off. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center