Thread: Synchronous replication - patch status inquiry

Synchronous replication - patch status inquiry

From
fazool mein
Date:
Hello everyone,

I'm interested in benchmarking synchronous replication, to see how performance degrades
compared to asynchronous streaming replication.

I browsed through the archive of emails, but things still seem unclear. Do we have a final
agreed upon patch that I can use? Any links for that?

Thanks.

OS = Linux Suse, sles 11, 64-bit
Postgres version = 9.0 beta-4

Re: Synchronous replication - patch status inquiry

From
Bruce Momjian
Date:
fazool mein wrote:
> Hello everyone,
> 
> I'm interested in benchmarking synchronous replication, to see how
> performance degrades compared to asynchronous streaming replication.
> 
> I browsed through the archive of emails, but things still seem unclear. Do
> we have a final agreed upon patch that I can use? Any links for that?

No.

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +


Re: Synchronous replication - patch status inquiry

From
David Fetter
Date:
On Tue, Aug 31, 2010 at 05:44:15PM -0400, Bruce Momjian wrote:
> fazool mein wrote:
> > Hello everyone,
> > 
> > I'm interested in benchmarking synchronous replication, to see how
> > performance degrades compared to asynchronous streaming replication.
> > 
> > I browsed through the archive of emails, but things still seem unclear. Do
> > we have a final agreed upon patch that I can use? Any links for that?
> 
> No.

That was a mite brusque and not super informative.

There are patches, and the latest from Fujii Masao is probably worth
looking at :)

Cheers,
David.
-- 
David Fetter <david@fetter.org> http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter      XMPP: david.fetter@gmail.com
iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate


Re: Synchronous replication - patch status inquiry

From
Robert Haas
Date:
On Tue, Aug 31, 2010 at 6:24 PM, David Fetter <david@fetter.org> wrote:
> On Tue, Aug 31, 2010 at 05:44:15PM -0400, Bruce Momjian wrote:
>> fazool mein wrote:
>> > Hello everyone,
>> >
>> > I'm interested in benchmarking synchronous replication, to see how
>> > performance degrades compared to asynchronous streaming replication.
>> >
>> > I browsed through the archive of emails, but things still seem unclear. Do
>> > we have a final agreed upon patch that I can use? Any links for that?
>>
>> No.
>
> That was a mite brusque and not super informative.
>
> There are patches, and the latest from Fujii Masao is probably worth
> looking at :)

I am pretty sure, however, that the performance will be terrible at
this point.  Heikki is working on fixing that, but it ain't done yet.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company


Re: Synchronous replication - patch status inquiry

From
David Fetter
Date:
On Tue, Aug 31, 2010 at 08:34:31PM -0400, Robert Haas wrote:
> On Tue, Aug 31, 2010 at 6:24 PM, David Fetter <david@fetter.org> wrote:
> > On Tue, Aug 31, 2010 at 05:44:15PM -0400, Bruce Momjian wrote:
> >> fazool mein wrote:
> >> > Hello everyone,
> >> >
> >> > I'm interested in benchmarking synchronous replication, to see
> >> > how performance degrades compared to asynchronous streaming
> >> > replication.
> >> >
> >> > I browsed through the archive of emails, but things still seem
> >> > unclear. Do we have a final agreed upon patch that I can use?
> >> > Any links for that?
> >>
> >> No.
> >
> > That was a mite brusque and not super informative.
> >
> > There are patches, and the latest from Fujii Masao is probably
> > worth looking at :)
> 
> I am pretty sure, however, that the performance will be terrible at
> this point.  Heikki is working on fixing that, but it ain't done
> yet.

Is this something for an eDB feature, or for community PostgreSQL,
or...?

Cheers,
David.
-- 
David Fetter <david@fetter.org> http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter      XMPP: david.fetter@gmail.com
iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate


Re: Synchronous replication - patch status inquiry

From
Robert Haas
Date:
On Tue, Aug 31, 2010 at 8:45 PM, David Fetter <david@fetter.org> wrote:
>> I am pretty sure, however, that the performance will be terrible at
>> this point.  Heikki is working on fixing that, but it ain't done
>> yet.
>
> Is this something for an eDB feature, or for community PostgreSQL,
> or...?

It's an EDB feature in the sense that Heikki is developing it as part
of his employment with EDB, but it will be committed to community
PostgreSQL.  See the thread on interruptible sleeps.  The problem
right now is that there are some polling loops that act to throttle
the maximum rate at which a node doing sync rep can make forward
progress, independent of the capabilities of the hardware.  Those need
to be replaced with a system that doesn't inject unnecessary delays
into the process, which is what Heikki is working on.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company


Re: Synchronous replication - patch status inquiry

From
Fujii Masao
Date:
On Wed, Sep 1, 2010 at 9:34 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> There are patches, and the latest from Fujii Masao is probably worth
>> looking at :)
>
> I am pretty sure, however, that the performance will be terrible at
> this point.  Heikki is working on fixing that, but it ain't done yet.

Yep. The latest WIP code is available in my git repository, but it's
not worth benchmarking yet. I'll need to merge Heikki's effort and
the synchronous replication patch.
   git://git.postgresql.org/git/users/fujii/postgres.git
   branch: synchrep

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


Re: Synchronous replication - patch status inquiry

From
fazool mein
Date:
Thanks!

I'll wait for the merging then; there is no point in benchmarking otherwise.

Regards

On Tue, Aug 31, 2010 at 6:06 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
> On Wed, Sep 1, 2010 at 9:34 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> >> There are patches, and the latest from Fujii Masao is probably worth
> >> looking at :)
> >
> > I am pretty sure, however, that the performance will be terrible at
> > this point.  Heikki is working on fixing that, but it ain't done yet.
>
> Yep. The latest WIP code is available in my git repository, but it's
> not worth benchmarking yet. I'll need to merge Heikki's effort and
> the synchronous replication patch.
>
>     git://git.postgresql.org/git/users/fujii/postgres.git
>     branch: synchrep
>
> Regards,
>
> --
> Fujii Masao
> NIPPON TELEGRAPH AND TELEPHONE CORPORATION
> NTT Open Source Software Center

Re: Synchronous replication - patch status inquiry

From
Heikki Linnakangas
Date:
On 01/09/10 04:02, Robert Haas wrote:
>  See the thread on interruptible sleeps.  The problem
> right now is that there are some polling loops that act to throttle
> the maximum rate at which a node doing sync rep can make forward
> progress, independent of the capabilities of the hardware.

To be precise, the polling doesn't affect the "bandwidth" the
replication can handle, but it introduces a delay.

>  Those need
> to be replaced with a system that doesn't inject unnecessary delays
> into the process, which is what Heikki is working on.

Right.

Once we're done with that, all the big questions are still left. How to 
configure it? What does synchronous replication mean, when is a 
transaction acknowledged as committed? What to do if a standby server 
dies and never acknowledges a commit? All these issues have been 
discussed, but there is no consensus yet.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: Synchronous replication - patch status inquiry

From
Fujii Masao
Date:
On Wed, Sep 1, 2010 at 2:33 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
> Once we're done with that, all the big questions are still left.

Yeah, let's discuss those topics :)

> How to configure it?

Before discussing about that, we should determine whether registering
standbys in master is really required. It affects configuration a lot.
Heikki thinks that it's required, but I'm still unclear about why and
how.

Why do standbys need to be registered in master? What information
should be registered?

> What does synchronous replication mean, when is a transaction
> acknowledged as committed?

I proposed four synchronization levels:

1. async
   doesn't make transaction commit wait for replication, i.e.,
   asynchronous replication. This mode has been already supported in
   9.0.

2. recv
   makes transaction commit wait until the standby has received WAL
   records.

3. fsync
   makes transaction commit wait until the standby has received and
   flushed WAL records to disk

4. replay
   makes transaction commit wait until the standby has replayed WAL
   records after receiving and flushing them to disk
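
Purely as an illustration (not code from any patch, and the name WalSyncLevel
is made up), the four levels map naturally onto a simple enum:

    typedef enum WalSyncLevel
    {
        SYNC_LEVEL_ASYNC,    /* don't wait for the standby at all (the 9.0 behaviour) */
        SYNC_LEVEL_RECV,     /* wait until the standby has received the WAL */
        SYNC_LEVEL_FSYNC,    /* wait until the standby has flushed the WAL to disk */
        SYNC_LEVEL_REPLAY    /* wait until the standby has replayed the WAL */
    } WalSyncLevel;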

OTOH, Simon proposed the quorum commit feature. I think that both
are required for our various use cases. Thoughts?

> What to do if a standby server dies and never
> acknowledges a commit?

The master's reaction to that situation should be configurable. So
I'd propose new configuration parameter specifying the reaction.
Valid values are:

- standalone
   When the master has waited for the ACK much longer than the timeout
   (or detected the failure of the standby), it closes the connection
   to the standby and restarts transactions.

- down
   When that situation occurs, the master shuts down immediately.
   Though this is unsafe for the system requiring high availability,
   as far as I recall, some people wanted this mode in the previous
   discussion.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


Re: Synchronous replication - patch status inquiry

From
Heikki Linnakangas
Date:
On 01/09/10 10:53, Fujii Masao wrote:
> Before discussing about that, we should determine whether registering
> standbys in master is really required. It affects configuration a lot.
> Heikki thinks that it's required, but I'm still unclear about why and
> how.
>
> Why do standbys need to be registered in master? What information
> should be registered?

That requirement falls out from the handling of disconnected standbys. 
If a standby is not connected, what does the master do with commits? If 
the answer is anything else than acknowledge them to the client 
immediately, as if the standby never existed, the master needs to know 
what standby servers exist. Otherwise it can't know if all the standbys 
are connected or not.

>> What does synchronous replication mean, when is a transaction
>> acknowledged as committed?
>
> I proposed four synchronization levels:
>
> 1. async
>    doesn't make transaction commit wait for replication, i.e.,
>    asynchronous replication. This mode has been already supported in
>    9.0.
>
> 2. recv
>    makes transaction commit wait until the standby has received WAL
>    records.
>
> 3. fsync
>    makes transaction commit wait until the standby has received and
>    flushed WAL records to disk
>
> 4. replay
>    makes transaction commit wait until the standby has replayed WAL
>    records after receiving and flushing them to disk
>
> OTOH, Simon proposed the quorum commit feature. I think that both
> are required for our various use cases. Thoughts?

I'd like to keep this as simple as possible, yet flexible so that with 
enough scripting and extensions, you can get all sorts of behavior. I 
think quorum commit falls into the "extension" category; if your setup 
is complex enough, it's going to be impossible to represent that in our 
config files no matter what. But if you write a little proxy, you can 
implement arbitrary rules there.

I think recv/fsync/replay should be specified in the standby. It has no 
direct effect on the master, the master would just relay the setting to 
the standby when it connects, or the standby would send multiple 
XLogRecPtrs and let the master decide when the WAL is persistent enough. 
And what if you write a proxy that has some other meaning of "persistent 
enough"? Like when it has been written to the OS buffers but not yet 
fsync'd, or when it has been fsync'd to at least one standby and 
received by at least three others. recv/fsync/replay is not going to 
represent that behavior well.

"sync vs async" on the other hand should be specified in the master, 
because it has a direct impact on the behavior of commits in the master.

I propose a configuration file standbys.conf, in the master:

# STANDBY NAME    SYNCHRONOUS   TIMEOUT
importantreplica  yes           100ms
tempcopy          no            10s

Or perhaps this should be stored in a system catalog.

>> What to do if a standby server dies and never
>> acknowledges a commit?
>
> The master's reaction to that situation should be configurable. So
> I'd propose new configuration parameter specifying the reaction.
> Valid values are:
>
> - standalone
>    When the master has waited for the ACK much longer than the timeout
>    (or detected the failure of the standby), it closes the connection
>    to the standby and restarts transactions.
>
> - down
>    When that situation occurs, the master shuts down immediately.
>    Though this is unsafe for the system requiring high availability,
>    as far as I recall, some people wanted this mode in the previous
>    discussion.

Yeah, though of course you might want to set that per-standby too..


Let's step back a bit and ask what would be the simplest thing that you 
could call "synchronous replication" in good conscience, and also be 
useful at least to some people. Let's leave out the "down" mode, because 
that requires registration. We'll probably have to do registration at 
some point, but let's take as small steps as possible.

Without the "down" mode in the master, frankly I don't see the point of 
the "recv" and "fsync" levels in the standby. Either way, when the 
master acknowledges a commit to the client, you don't know if it has 
made it to the standby yet because the replication connection might be 
down for some reason.

That leaves us the 'replay' mode, which *is* useful, because it gives 
you the guarantee that when the master acknowledges a commit, it will 
appear committed in all hot standby servers that are currently 
connected. With that guarantee you can build a reliable cluster with 
something pgpool-II where all writes go to one node, and reads are 
distributed to multiple nodes.

I'm not sure what we should aim for in the first phase. But if you want 
as little code as possible yet have something useful, I think 'replay' 
mode with no standby registration is the way to go.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: Synchronous replication - patch status inquiry

From
Robert Haas
Date:
On Wed, Sep 1, 2010 at 6:23 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
> I'm not sure what we should aim for in the first phase. But if you want as
> little code as possible yet have something useful, I think 'replay' mode
> with no standby registration is the way to go.

IMHO, less is more.  Trying to do too much at once can cause us to
miss the release window (and can also create more bugs).  We just need
to leave the door open to adding later whatever we leave out now.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company


Re: Synchronous replication - patch status inquiry

From
Simon Riggs
Date:
On Wed, 2010-09-01 at 08:33 +0300, Heikki Linnakangas wrote:
> On 01/09/10 04:02, Robert Haas wrote:
> >  See the thread on interruptible sleeps.  The problem
> > right now is that there are some polling loops that act to throttle
> > the maximum rate at which a node doing sync rep can make forward
> > progress, independent of the capabilities of the hardware.
> 
> To be precise, the polling doesn't affect the "bandwidth" the
> replication can handle, but it introduces a delay.

We're sending the WAL data in batches. We can't really escape from the
fact that we're effectively using group commit when we use synch rep.
That will necessarily increase delay and require more sessions to get
the same throughput.

> >  Those need
> > to be replaced with a system that doesn't inject unnecessary delays
> > into the process, which is what Heikki is working on.
> 
> Right.

> Once we're done with that, all the big questions are still left. How to 
> configure it? What does synchronous replication mean, when is a 
> transaction acknowledged as committed? What to do if a standby server 
> dies and never acknowledges a commit? All these issues have been 
> discussed, but there is no consensus yet.

That sounds an awful lot like performance tuning first and the feature
additions last.

And if you're in the middle of performance tuning, surely some objective
performance tests would help us, no?

IMHO we should be concentrating on how to add the next features because
it's clear to me that if you do things in the wrong order you'll be
wasting time. And we don't have much of that, ever.

-- 
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Development, 24x7 Support, Training and Services



Re: Synchronous replication - patch status inquiry

From
Simon Riggs
Date:
On Wed, 2010-09-01 at 13:23 +0300, Heikki Linnakangas wrote:
> On 01/09/10 10:53, Fujii Masao wrote:
> > Before discussing about that, we should determine whether registering
> > standbys in master is really required. It affects configuration a lot.
> > Heikki thinks that it's required, but I'm still unclear about why and
> > how.
> >
> > Why do standbys need to be registered in master? What information
> > should be registered?
> 
> That requirement falls out from the handling of disconnected standbys. 
> If a standby is not connected, what does the master do with commits? If 
> the answer is anything else than acknowledge them to the client 
> immediately, as if the standby never existed, the master needs to know 
> what standby servers exist. Otherwise it can't know if all the standbys 
> are connected or not.

"All the standbys" presupposes that we know what they are, i.e. we have
registered them, so I see that argument as circular. Quorum commit does
not need registration, so quorum commit is the "easy to implement"
option and registration is the more complex later feature. I don't have
a problem with adding registration later and believe it can be done
later without issues.

> >> What does synchronous replication mean, when is a transaction
> >> acknowledged as committed?
> >
> > I proposed four synchronization levels:
> >
> > 1. async
> >    doesn't make transaction commit wait for replication, i.e.,
> >    asynchronous replication. This mode has been already supported in
> >    9.0.
> >
> > 2. recv
> >    makes transaction commit wait until the standby has received WAL
> >    records.
> >
> > 3. fsync
> >    makes transaction commit wait until the standby has received and
> >    flushed WAL records to disk
> >
> > 4. replay
> >    makes transaction commit wait until the standby has replayed WAL
> >    records after receiving and flushing them to disk
> >
> > OTOH, Simon proposed the quorum commit feature. I think that both
> > are required for our various use cases. Thoughts?
> 
> I'd like to keep this as simple as possible, yet flexible so that with 
> enough scripting and extensions, you can get all sorts of behavior. I 
> think quorum commit falls into the "extension" category; if your setup 
> is complex enough, it's going to be impossible to represent that in our 
> config files no matter what. But if you write a little proxy, you can 
> implement arbitrary rules there.
> 
> I think recv/fsync/replay should be specified in the standby. 

I think the wait mode (i.e. recv/fsync/replay or others) should be
specified in the master. This allows the application to specify whatever
level of protection it requires, and also allows the behaviour to be
different for user-specifiable parts of the application. As soon as you
set this on the standby then you have a one-size-fits-all approach to
synchronisation.

We already know performance of synchronous rep is poor, which is exactly
why I want to be able to control it at the application level. Fine
grained control is important, otherwise we may as well just use DRBD and
skip this project completely, since we already have that. It will also
be a feature that no other database has, taking us truly beyond what has
gone before.

The master/standby decision is not something that is easily changed.
Whichever we decide now will be the thing we stick with.

> It has no 
> direct effect on the master, the master would just relay the setting to 
> the standby when it connects, or the standby would send multiple 
> XLogRecPtrs and let the master decide when the WAL is persistent enough. 
> And what if you write a proxy that has some other meaning of "persistent 
> enough"? Like when it has been written to the OS buffers but not yet 
> fsync'd, or when it has been fsync'd to at least one standby and 
> received by at least three others. recv/fsync/replay is not going to 
> represent that behavior well.
> 
> "sync vs async" on the other hand should be specified in the master, 
> because it has a direct impact on the behavior of commits in the master.
> 



> I propose a configuration file standbys.conf, in the master:
> 
> # STANDBY NAME    SYNCHRONOUS   TIMEOUT
> importantreplica  yes           100ms
> tempcopy          no            10s
> 
> Or perhaps this should be stored in a system catalog.

That part sounds like complexity that can wait until later. I would not
object if you really want this, but would prefer it to look like this:

# STANDBY NAME    DEFAULT_WAIT_MODE   TIMEOUT
importantreplica  sync               100ms
tempcopy          async                10s

You don't *have* to use the application level control if you don't want
it. But it's an important capability for real-world apps, since the
alternative is deliberately splitting an application across two database
servers each with different wait modes.

> >> What to do if a standby server dies and never
> >> acknowledges a commit?
> >
> > The master's reaction to that situation should be configurable. So
> > I'd propose new configuration parameter specifying the reaction.
> > Valid values are:
> >
> > - standalone
> >    When the master has waited for the ACK much longer than the timeout
> >    (or detected the failure of the standby), it closes the connection
> >    to the standby and restarts transactions.
> >
> > - down
> >    When that situation occurs, the master shuts down immediately.
> >    Though this is unsafe for the system requiring high availability,
> >    as far as I recall, some people wanted this mode in the previous
> >    discussion.
> 
> Yeah, though of course you might want to set that per-standby too..
> 
> 
> Let's step back a bit and ask what would be the simplest thing that you 
> could call "synchronous replication" in good conscience, and also be 
> useful at least to some people. Let's leave out the "down" mode, because 
> that requires registration. We'll probably have to do registration at 
> some point, but let's take as small steps as possible.
> 
> Without the "down" mode in the master, frankly I don't see the point of 
> the "recv" and "fsync" levels in the standby. Either way, when the 
> master acknowledges a commit to the client, you don't know if it has 
> made it to the standby yet because the replication connection might be 
> down for some reason.
> 
> That leaves us the 'replay' mode, which *is* useful, because it gives 
> you the guarantee that when the master acknowledges a commit, it will 
> appear committed in all hot standby servers that are currently 
> connected. With that guarantee you can build a reliable cluster with 
> something pgpool-II where all writes go to one node, and reads are 
> distributed to multiple nodes.
> 
> I'm not sure what we should aim for in the first phase. But if you want 
> as little code as possible yet have something useful, I think 'replay' 
> mode with no standby registration is the way to go.

I don't see it as any more code to implement.

When the standby replies, it can return
* latest LSN received
* latest LSN fsynced
* latest LSN replayed
etc

We then release waiting committers on the master according to which of
the above they said they want to wait for. The standby does *not* need
to know the wishes of transactions on the master.

Note that means that receiving, fsyncing and replaying can all progress
as an asynchronous pipeline, giving great overall throughput.

Once you accept that there are multiple modes, then the actual number of
wait modes is unimportant. It's just an array of [NUM_WAIT_MODES], so
the project need not be delayed just because we have 2, 3 or 4 wait
modes.
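
To sketch that concretely (purely illustrative C with made-up names such as
StandbyReply, SyncRepWaiter and release_waiters; this is not code from any
patch): the reply carries one LSN per wait mode, and the master releases each
waiting committer once the LSN for the mode it asked for has passed its
commit record.

    #include <stdbool.h>
    #include <stdint.h>

    typedef uint64_t XLogRecPtr;

    enum { WAIT_RECV = 0, WAIT_FSYNC = 1, WAIT_REPLAY = 2, NUM_WAIT_MODES = 3 };

    /* One reply from the standby: how far it has progressed, per wait mode. */
    typedef struct StandbyReply
    {
        XLogRecPtr lsn[NUM_WAIT_MODES];     /* received, fsynced, replayed */
    } StandbyReply;

    /* A backend waiting for its commit record to be acknowledged. */
    typedef struct SyncRepWaiter
    {
        XLogRecPtr wait_lsn;    /* LSN of the backend's commit record */
        int        wait_mode;   /* which of the three LSNs it cares about */
        bool       released;
    } SyncRepWaiter;

    /* On each standby reply, wake every waiter whose chosen LSN has been reached. */
    static void
    release_waiters(SyncRepWaiter *waiters, int nwaiters, const StandbyReply *reply)
    {
        for (int i = 0; i < nwaiters; i++)
        {
            if (!waiters[i].released &&
                reply->lsn[waiters[i].wait_mode] >= waiters[i].wait_lsn)
                waiters[i].released = true;     /* in reality: signal the backend */
        }
    }

    int
    main(void)
    {
        SyncRepWaiter waiters[2] = {
            { .wait_lsn = 100, .wait_mode = WAIT_FSYNC,  .released = false },
            { .wait_lsn = 100, .wait_mode = WAIT_REPLAY, .released = false },
        };
        /* standby has received up to 120, fsynced up to 110, replayed up to 90 */
        StandbyReply reply = { .lsn = { 120, 110, 90 } };

        release_waiters(waiters, 2, &reply);
        /* waiters[0] (fsync) is now released, waiters[1] (replay) keeps waiting */
        return (waiters[0].released && !waiters[1].released) ? 0 : 1;
    }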

-- 
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Development, 24x7 Support, Training and Services



Re: Synchronous replication - patch status inquiry

From
Fujii Masao
Date:
On Wed, Sep 1, 2010 at 7:23 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
> That requirement falls out from the handling of disconnected standbys. If a
> standby is not connected, what does the master do with commits? If the
> answer is anything else than acknowledge them to the client immediately, as
> if the standby never existed, the master needs to know what standby servers
> exist. Otherwise it can't know if all the standbys are connected or not.

Thanks. I understood why the registration is required.

> I'd like to keep this as simple as possible, yet flexible so that with
> enough scripting and extensions, you can get all sorts of behavior. I think
> quorum commit falls into the "extension" category; if your setup is
> complex enough, it's going to be impossible to represent that in our config
> files no matter what. But if you write a little proxy, you can implement
> arbitrary rules there.

Agreed.

> I think recv/fsync/replay should be specified in the standby. It has no
> direct effect on the master, the master would just relay the setting to the
> standby when it connects, or the standby would send multiple XLogRecPtrs and
> let the master decide when the WAL is persistent enough.

The latter seems wasteful since the master uses only one XLogRecPtr even if
the standby sends multiple ones. So I prefer the former design, which also
makes the code and design very simple and lets us easily write the proxy.

> "sync vs async" on the other hand should be specified in the master, because
> it has a direct impact on the behavior of commits in the master.
>
> I propose a configuration file standbys.conf, in the master:
>
> # STANDBY NAME    SYNCHRONOUS   TIMEOUT
> importantreplica  yes           100ms
> tempcopy          no            10s

Seems good. In fact, instead of yes/no, should async/recv/fsync/replay be
specified in the SYNCHRONOUS field?

OTOH, something like a standby_name parameter should be introduced in
recovery.conf.

Should we allow multiple standbys with the same name? Probably yes.
We might need to add a NUMBER field to standbys.conf in the future.

> Yeah, though of course you might want to set that per-standby too..

Yep.

> Let's step back a bit and ask what would be the simplest thing that you
> could call "synchronous replication" in good conscience, and also be useful
> at least to some people. Let's leave out the "down" mode, because that
> requires registration. We'll probably have to do registration at some point,
> but let's take as small steps as possible.

Agreed.

> Without the "down" mode in the master, frankly I don't see the point of the
> "recv" and "fsync" levels in the standby. Either way, when the master
> acknowledges a commit to the client, you don't know if it has made it to the
> standby yet because the replication connection might be down for some
> reason.

True. We cannot know whether the standby can be promoted to the master
without any data loss when the master crashes, because the standby might
have been disconnected earlier for some reason and so might lack the
latest data.

But the situation would be the same even when 'replay' mode is chosen.
Though we might be able to check whether the latest transaction has been
replicated to the standby by running a read-only query on the standby,
it's actually difficult to do that. How can we know the content of the
latest transaction?

Also, even when 'recv' or 'fsync' is chosen, we might be able to check
that by calling pg_last_xlog_receive_location() on the standby. But a
similar question occurs to me: how can we know the LSN of the latest
transaction?

I'm thinking of introducing a new parameter specifying a command to be
executed when the standby is disconnected. This command would be executed
by walsender before resuming the transaction processing that was
suspended by the disconnection. For example, if STONITH against
the standby is supplied as the command, we can prevent a standby that
lacks the latest data from becoming the master by forcibly shutting
such a delayed standby down. Thoughts?

> That leaves us the 'replay' mode, which *is* useful, because it gives you
> the guarantee that when the master acknowledges a commit, it will appear
> committed in all hot standby servers that are currently connected. With that
> guarantee you can build a reliable cluster with something pgpool-II where
> all writes go to one node, and reads are distributed to multiple nodes.

I'm concerned that conflicts between read-only queries and recovery might
harm performance on the master in 'replay' mode. If a conflict
occurs, all running transactions on the master have to wait for it to
disappear, which can take very long. Of course, even without a conflict,
waiting until the standby has received, fsync'd, read and replayed the WAL
would take long. So I'd like to also support 'recv' and 'fsync'.
I believe that implementing those two modes is neither complicated nor
difficult.

> I'm not sure what we should aim for in the first phase. But if you want as
> little code as possible yet have something useful, I think 'replay' mode
> with no standby registration is the way to go.

What about recv/fsync/replay mode with no standby registration?

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


Re: Synchronous replication - patch status inquiry

From
Simon Riggs
Date:
On Thu, 2010-09-02 at 19:24 +0900, Fujii Masao wrote:
> On Wed, Sep 1, 2010 at 7:23 PM, Heikki Linnakangas
> <heikki.linnakangas@enterprisedb.com> wrote:
> > That requirement falls out from the handling of disconnected standbys. If a
> > standby is not connected, what does the master do with commits? If the
> > answer is anything else than acknowledge them to the client immediately, as
> > if the standby never existed, the master needs to know what standby servers
> > exist. Otherwise it can't know if all the standbys are connected or not.
> 
> Thanks. I understood why the registration is required.

I don't. There is a simpler design that does not require registration.

Please explain why we need registration, with an explanation that does
not presume it as a requirement.

-- 
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Development, 24x7 Support, Training and Services



Re: Synchronous replication - patch status inquiry

From
Heikki Linnakangas
Date:
On 02/09/10 15:03, Simon Riggs wrote:
> On Thu, 2010-09-02 at 19:24 +0900, Fujii Masao wrote:
>> On Wed, Sep 1, 2010 at 7:23 PM, Heikki Linnakangas
>> <heikki.linnakangas@enterprisedb.com>  wrote:
>>> That requirement falls out from the handling of disconnected standbys. If a
>>> standby is not connected, what does the master do with commits? If the
>>> answer is anything else than acknowledge them to the client immediately, as
>>> if the standby never existed, the master needs to know what standby servers
>>> exist. Otherwise it can't know if all the standbys are connected or not.
>>
>> Thanks. I understood why the registration is required.
>
> I don't. There is a simpler design that does not require registration.
>
> Please explain why we need registration, with an explanation that does
> not presume it as a requirement.

Please explain how you would implement "don't acknowledge commits until 
they're replicated to all standbys" without standby registration.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: Synchronous replication - patch status inquiry

From
Simon Riggs
Date:
On Thu, 2010-09-02 at 15:15 +0300, Heikki Linnakangas wrote:
> On 02/09/10 15:03, Simon Riggs wrote:
> > On Thu, 2010-09-02 at 19:24 +0900, Fujii Masao wrote:
> >> On Wed, Sep 1, 2010 at 7:23 PM, Heikki Linnakangas
> >> <heikki.linnakangas@enterprisedb.com>  wrote:
> >>> That requirement falls out from the handling of disconnected standbys. If a
> >>> standby is not connected, what does the master do with commits? If the
> >>> answer is anything else than acknowledge them to the client immediately, as
> >>> if the standby never existed, the master needs to know what standby servers
> >>> exist. Otherwise it can't know if all the standbys are connected or not.
> >>
> >> Thanks. I understood why the registration is required.
> >
> > I don't. There is a simpler design that does not require registration.
> >
> > Please explain why we need registration, with an explanation that does
> > not presume it as a requirement.
> 
> Please explain how you would implement "don't acknowledge commits until 
> they're replicated to all standbys" without standby registration.

"All standbys" has no meaning without registration. It is not a question
that needs an answer.

-- 
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Development, 24x7 Support, Training and Services



Re: Synchronous replication - patch status inquiry

From
Robert Haas
Date:
On Thu, Sep 2, 2010 at 8:44 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> "All standbys" has no meaning without registration. It is not a question
> that needs an answer.

Tell that to the DBA.  I bet s/he knows what "all standbys" means.
The fact that the system doesn't know something doesn't make it
unimportant.

I agree that we don't absolutely need standby registration for some
really basic version of synchronous replication.  But I think we'd be
better off biting the bullet and adding it.  I think that without it
we're going to resort to a series of increasingly grotty and
user-unfriendly hacks to make this work.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company


Re: Synchronous replication - patch status inquiry

From
Dimitri Fontaine
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> Tell that to the DBA.  I bet s/he knows what "all standbys" means.
> The fact that the system doesn't know something doesn't make it
> unimportant.

Well as a DBA I think I'd much prefer to attribute "votes" to each
standby so that each ack is weighted. Let me explain in more detail the
setup I'm thinking about.

The transaction on the master wants a certain "service level" (async,
recv, fsync, replay) and a certain number of votes. As proposed earlier,
the standby would feed back the last XID known locally in each state
(received, synced, replayed) and its current weight, and the master
would arbitrate based on that information.

That's highly flexible: you can have slaves join the party at any point
in time, and change two user GUCs (set by session, transaction, function,
database, role, or in postgresql.conf) to set up the service level target
you want to ensure, from the master.
 (We could go as far as wanting fsync:2,replay:1 as a service level.)

From that you have either the "fail when a slave disappears" or the
"please don't shut the service down if a slave disappears" setting, per
transaction, and per slave too (that depends on its weight, remember).
 (You can set up the slave weights as powers of 2 and have the service
 level be masks to allow you to choose precisely which slave will ack
 your fsync service level, and you can switch this slave at run time
 easily — sounds cleverer, but sounds also easier to implement given
 the flexibility it gives — precedents in PostgreSQL? The PITR and WAL
 Shipping facilities are hard to use, full of traps, but very
 flexible.)

You can even give some more weight to one slave while you're maintaining
another so that the master just doesn't complain.
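
As a rough illustration of the powers-of-two idea (names and values invented
here, nothing from a patch): each standby owns one weight bit, the transaction
asks for a mask, and the ack is sufficient once the acked weights cover that
mask.

    #include <stdbool.h>

    /* Illustrative only: one weight bit per standby. */
    #define SLAVE_A 0x1
    #define SLAVE_B 0x2
    #define SLAVE_C 0x4

    /* The commit can be released once every standby named in the required
     * mask has acked at the requested service level. */
    static bool
    quorum_satisfied(unsigned acked_mask, unsigned required_mask)
    {
        return (acked_mask & required_mask) == required_mask;
    }

    int
    main(void)
    {
        unsigned acked = SLAVE_A | SLAVE_C;     /* A and C have acked so far */

        return quorum_satisfied(acked, SLAVE_A | SLAVE_C) ? 0 : 1;
    }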

I see a need for a very dynamic *and decentralized* replication topology
setup; I fail to see a need for a centralized, registration-based setup.

> I agree that we don't absolutely need standby registration for some
> really basic version of synchronous replication.  But I think we'd be
> better off biting the bullet and adding it.

What does that mechanism allow us to implement that we can't do without it?
--
dim


Re: Synchronous replication - patch status inquiry

From
Simon Riggs
Date:
On Thu, 2010-09-02 at 08:59 -0400, Robert Haas wrote:
> On Thu, Sep 2, 2010 at 8:44 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> > "All standbys" has no meaning without registration. It is not a question
> > that needs an answer.
> 
> Tell that to the DBA.  I bet s/he knows what "all standbys" means.
> The fact that the system doesn't know something doesn't make it
> unimportant.

> I agree that we don't absolutely need standby registration for some
> really basic version of synchronous replication.  But I think we'd be
> better off biting the bullet and adding it.  I think that without it
> we're going to resort to a series of increasingly grotty and
> user-unfriendly hacks to make this work.

I'm personally quite happy to have server registration.

My interest is in ensuring we have master-controlled robustness, which
is so far being ignored because "we need simple". Referring to the above, we
are clearly quite willing to go beyond the most basic implementation, so
there's no further argument to exclude it for that reason.

The implementation of master-controlled robustness is no more difficult
than the alternative.

-- 
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Development, 24x7 Support, Training and Services



Re: Synchronous replication - patch status inquiry

From
Robert Haas
Date:
On Thu, Sep 2, 2010 at 10:06 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On Thu, 2010-09-02 at 08:59 -0400, Robert Haas wrote:
>> On Thu, Sep 2, 2010 at 8:44 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> > "All standbys" has no meaning without registration. It is not a question
>> > that needs an answer.
>>
>> Tell that to the DBA.  I bet s/he knows what "all standbys" means.
>> The fact that the system doesn't know something doesn't make it
>> unimportant.
>
>> I agree that we don't absolutely need standby registration for some
>> really basic version of synchronous replication.  But I think we'd be
>> better off biting the bullet and adding it.  I think that without it
>> we're going to resort to a series of increasingly grotty and
>> user-unfriendly hacks to make this work.
>
> I'm personally quite happy to have server registration.

OK, thanks for clarifying.

> My interest is in ensuring we have master-controlled robustness, which
> is so far being ignored because "we need simple". Referring to the above, we
> are clearly quite willing to go beyond the most basic implementation, so
> there's no further argument to exclude it for that reason.
>
> The implementation of master-controlled robustness is no more difficult
> than the alternative.

But I'm not sure I quite follow this part.  I don't think I know what
you mean by "master-controlled robustness".

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company


Re: Synchronous replication - patch status inquiry

From
Heikki Linnakangas
Date:
On 02/09/10 17:06, Simon Riggs wrote:
> On Thu, 2010-09-02 at 08:59 -0400, Robert Haas wrote:
>> On Thu, Sep 2, 2010 at 8:44 AM, Simon Riggs<simon@2ndquadrant.com>  wrote:
>>> "All standbys" has no meaning without registration. It is not a question
>>> that needs an answer.
>>
>> Tell that to the DBA.  I bet s/he knows what "all standbys" means.
>> The fact that the system doesn't know something doesn't make it
>> unimportant.
>
>> I agree that we don't absolutely need standby registration for some
>> really basic version of synchronous replication.  But I think we'd be
>> better off biting the bullet and adding it.  I think that without it
>> we're going to resort to a series of increasingly grotty and
>> user-unfriendly hacks to make this work.
>
> I'm personally quite happy to have server registration.
>
> My interest is in ensuring we have master-controlled robustness, which
> is so far being ignored because "we need simple". Referring to the above, we
> are clearly quite willing to go beyond the most basic implementation, so
> there's no further argument to exclude it for that reason.
>
> The implementation of master-controlled robustness is no more difficult
> than the alternative.

I understand what you're after, the idea of being able to set 
synchronization level on a per-transaction basis is cool. But I haven't 
seen a satisfactory design for it. I don't understand how it would work 
in practice. Even though it's cool, having different kinds of standbys 
connected is a more common scenario, and the design needs to accommodate 
that too. I'm all ears if you can sketch a design that can do that.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: Synchronous replication - patch status inquiry

From
Joshua Tolley
Date:
On Wed, Sep 01, 2010 at 04:53:38PM +0900, Fujii Masao wrote:
> - down
>   When that situation occurs, the master shuts down immediately.
>   Though this is unsafe for the system requiring high availability,
>   as far as I recall, some people wanted this mode in the previous
>   discussion.

Oracle provides this, among other possible configurations; perhaps that's why
it came up earlier.

--
Joshua Tolley / eggyknap
End Point Corporation
http://www.endpoint.com

Re: Synchronous replication - patch status inquiry

From
Fujii Masao
Date:
On Thu, Sep 2, 2010 at 11:32 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
> I understand what you're after, the idea of being able to set
> synchronization level on a per-transaction basis is cool. But I haven't seen
> a satisfactory design for it. I don't understand how it would work in
> practice. Even though it's cool, having different kinds of standbys
> connected is a more common scenario, and the design needs to accommodate
> that too. I'm all ears if you can sketch a design that can do that.

That design would affect what the standby should reply. If we choose
async/recv/fsync/replay on a per-transaction basis, the standby
should send multiple LSNs and the master needs to decide when
replication has been completed. OTOH, if we choose just sync/async,
the standby has only to send one LSN.

The former seems to be more useful, but triples the number of ACK
from the standby. I'm not sure whether its overhead is ignorable,
especially when the distance between the master and the standby is
very long.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


Re: Synchronous replication - patch status inquiry

From
Simon Riggs
Date:
On Fri, 2010-09-03 at 12:50 +0900, Fujii Masao wrote:
> On Thu, Sep 2, 2010 at 11:32 PM, Heikki Linnakangas
> <heikki.linnakangas@enterprisedb.com> wrote:
> > I understand what you're after, the idea of being able to set
> > synchronization level on a per-transaction basis is cool. But I haven't seen
> > a satisfactory design for it. I don't understand how it would work in
> > practice. Even though it's cool, having different kinds of standbys
> > connected is a more common scenario, and the design needs to accommodate
> > that too. I'm all ears if you can sketch a design that can do that.
> 
> That design would affect what the standby should reply. If we choose
> async/recv/fsync/replay on a per-transaction basis, the standby
> should send multiple LSNs and the master needs to decide when
> replication has been completed. OTOH, if we choose just sync/async,
> the standby has only to send one LSN.
> 
> The former seems to be more useful, but triples the number of ACK
> from the standby. I'm not sure whether its overhead is ignorable,
> especially when the distance between the master and the standby is
> very long.

No, it doesn't. There is no requirement for additional messages. It just
adds 16 bytes onto the reply message, maybe 24. If there is a noticeable
overhead from that, shoot me. 
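
For what it's worth, a purely illustrative sketch of such a reply (the struct
name is invented, not from any patch): three 8-byte LSNs come to 24 bytes.

    #include <stdint.h>
    #include <stdio.h>

    typedef uint64_t XLogRecPtr;

    /* Illustrative only: one reply carrying all three positions at once. */
    typedef struct WalStandbyReply
    {
        XLogRecPtr received_lsn;    /* last location received */
        XLogRecPtr fsynced_lsn;     /* last location flushed to disk */
        XLogRecPtr replayed_lsn;    /* last location replayed */
    } WalStandbyReply;

    int
    main(void)
    {
        printf("reply payload: %zu bytes\n", sizeof(WalStandbyReply));   /* 24 */
        return 0;
    }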

-- 
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Development, 24x7 Support, Training and Services



Re: Synchronous replication - patch status inquiry

From
Fujii Masao
Date:
On Thu, Sep 2, 2010 at 7:24 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
>> I propose a configuration file standbys.conf, in the master:
>>
>> # STANDBY NAME    SYNCHRONOUS   TIMEOUT
>> importantreplica  yes           100ms
>> tempcopy          no            10s
>
> Seems good. In fact, instead of yes/no, should async/recv/fsync/replay be
> specified in the SYNCHRONOUS field?
>
> OTOH, something like a standby_name parameter should be introduced in
> recovery.conf.
>
> Should we allow multiple standbys with the same name? Probably yes.
> We might need to add a NUMBER field to standbys.conf in the future.

Here is the proposed detailed design:

standbys.conf
=============
# This is not initialized by initdb, so users need to create it under $PGDATA.
   * The template is located in the PREFIX/share directory.

# This is read by postmaster at startup, as pg_hba.conf is.
   * In an EXEC_BACKEND environment, each walsender must read it at startup.
   * This is ignored when max_wal_senders is zero.
   * FATAL is emitted when standbys.conf doesn't exist even if max_wal_senders
     is positive.

# SIGHUP makes only the postmaster re-read standbys.conf.
   * New configuration doesn't affect the existing connections to the standbys,
     i.e., it's used only for subsequent connections.
   * XXX: Should the existing connections react to new configuration? What if
     the new standbys.conf doesn't have the standby_name of the existing
     connection?

# The connection from the standby is rejected if its standby_name is not
  listed in standbys.conf.
   * Multiple standbys with the same name are allowed.

# The valid values of the SYNCHRONOUS field are async, recv, fsync and replay.

standby_name
============
# This is a new string-typed parameter in recovery.conf.
   * XXX: Should standby_name and standby_mode be merged?

# Walreceiver sends this to the master when establishing the connection.

Comments? Is the above too complicated for the first step? If so, I'd
propose to just introduce a new recovery.conf parameter like replication_mode
specifying the synchronization level, instead.
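
For example, under the design above the two files might end up looking
something like this (names per the proposal, values purely illustrative):

    # standbys.conf on the master
    # STANDBY NAME    SYNCHRONOUS
    importantreplica  replay
    tempcopy          async

    # recovery.conf on the 'importantreplica' standby
    standby_mode = 'on'
    standby_name = 'importantreplica'
    primary_conninfo = 'host=master port=5432'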

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


Re: Synchronous replication - patch status inquiry

From
Heikki Linnakangas
Date:
On 03/09/10 09:36, Simon Riggs wrote:
> On Fri, 2010-09-03 at 12:50 +0900, Fujii Masao wrote:
>> That design would affect what the standby should reply. If we choose
>> async/recv/fsync/replay on a per-transaction basis, the standby
>> should send multiple LSNs and the master needs to decide when
>> replication has been completed. OTOH, if we choose just sync/async,
>> the standby has only to send one LSN.
>>
>> The former seems to be more useful, but triples the number of ACK
>> from the standby. I'm not sure whether its overhead is ignorable,
>> especially when the distance between the master and the standby is
>> very long.
>
> No, it doesn't. There is no requirement for additional messages.

Please explain how you do it then. When a commit record is sent to the 
standby, it needs to acknowledge it 1) when it has received it, 2) when 
it fsyncs it to disk and 3) when it's replayed. I don't see how you can 
get around that.

Perhaps you can save a bit by combining multiple messages together, like 
in Nagle's algorithm, but then you introduce extra delays which is 
exactly what you don't want.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: Synchronous replication - patch status inquiry

From
Fujii Masao
Date:
On Fri, Sep 3, 2010 at 3:36 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> The former seems to be more useful, but triples the number of ACK
>> from the standby. I'm not sure whether its overhead is ignorable,
>> especially when the distance between the master and the standby is
>> very long.
>
> No, it doesn't. There is no requirement for additional messages. It just
> adds 16 bytes onto the reply message, maybe 24. If there is a noticeable
> overhead from that, shoot me.

The reply message would be sent at least three times per WAL chunk,
i.e., when the standby has received, synced and replayed it. So ISTM
that additional messages do happen. Though I'm not sure if this really
harms performance...

You'd like to choose async/recv/fsync/replay on a per-transaction basis
rather than just async/sync?

Even when async is chosen as the synchronization level in standbys.conf,
can it be changed to another level within a transaction? If so, the
standby has to send the reply even if async is chosen, and most replies
might be ignored by the master.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


Re: Synchronous replication - patch status inquiry

From
Simon Riggs
Date:
On Fri, 2010-09-03 at 09:55 +0300, Heikki Linnakangas wrote:
> On 03/09/10 09:36, Simon Riggs wrote:
> > On Fri, 2010-09-03 at 12:50 +0900, Fujii Masao wrote:
> >> That design would affect what the standby should reply. If we choose
> >> async/recv/fsync/replay on a per-transaction basis, the standby
> >> should send multiple LSNs and the master needs to decide when
> >> replication has been completed. OTOH, if we choose just sync/async,
> >> the standby has only to send one LSN.
> >>
> >> The former seems to be more useful, but triples the number of ACK
> >> from the standby. I'm not sure whether its overhead is ignorable,
> >> especially when the distance between the master and the standby is
> >> very long.
> >
> > No, it doesn't. There is no requirement for additional messages.
> 
> Please explain how you do it then. When a commit record is sent to the 
> standby, it needs to acknowledge it 1) when it has received it, 2) when 
> it fsyncs it to disk and 3) when it's replayed. I don't see how you can 
> get around that.
> 
> Perhaps you can save a bit by combining multiple messages together, like 
> in Nagle's algorithm, but then you introduce extra delays which is 
> exactly what you don't want.

From my perspective, you seem to be struggling to find reasons why this
should not happen, rather than seeing the alternatives that would
obviously present themselves if your attitude was a positive one. We
won't make any progress with this style of discussion.

-- 
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Development, 24x7 Support, Training and Services



Re: Synchronous replication - patch status inquiry

From
Heikki Linnakangas
Date:
On 03/09/10 10:45, Simon Riggs wrote:
> On Fri, 2010-09-03 at 09:55 +0300, Heikki Linnakangas wrote:
>> On 03/09/10 09:36, Simon Riggs wrote:
>>> On Fri, 2010-09-03 at 12:50 +0900, Fujii Masao wrote:
>>>> That design would affect what the standby should reply. If we choose
>>>> async/recv/fsync/replay on a per-transaction basis, the standby
>>>> should send multiple LSNs and the master needs to decide when
>>>> replication has been completed. OTOH, if we choose just sync/async,
>>>> the standby has only to send one LSN.
>>>>
>>>> The former seems to be more useful, but triples the number of ACK
>>>> from the standby. I'm not sure whether its overhead is ignorable,
>>>> especially when the distance between the master and the standby is
>>>> very long.
>>>
>>> No, it doesn't. There is no requirement for additional messages.
>>
>> Please explain how you do it then. When a commit record is sent to the
>> standby, it needs to acknowledge it 1) when it has received it, 2) when
>> it fsyncs it to disk and 3) when it's replayed. I don't see how you can
>> get around that.
>>
>> Perhaps you can save a bit by combining multiple messages together, like
>> in Nagle's algorithm, but then you introduce extra delays which is
>> exactly what you don't want.
>
> From my perspective, you seem to be struggling to find reasons why this
> should not happen, rather than seeing the alternatives that would
> obviously present themselves if your attitude was a positive one. We
> won't make any progress with this style of discussion.

Huh? You made a very clear claim above that you don't need additional 
messages. I explained why I don't think that's true, and asked you to 
explain why you think it is true. Whether the claim is true or not does 
not depend on my attitude.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: Synchronous replication - patch status inquiry

From
Simon Riggs
Date:
On Fri, 2010-09-03 at 12:33 +0300, Heikki Linnakangas wrote:
> On 03/09/10 10:45, Simon Riggs wrote:
> > On Fri, 2010-09-03 at 09:55 +0300, Heikki Linnakangas wrote:
> >> On 03/09/10 09:36, Simon Riggs wrote:
> >>> On Fri, 2010-09-03 at 12:50 +0900, Fujii Masao wrote:
> >>>> That design would affect what the standby should reply. If we choose
> >>>> async/recv/fsync/replay on a per-transaction basis, the standby
> >>>> should send multiple LSNs and the master needs to decide when
> >>>> replication has been completed. OTOH, if we choose just sync/async,
> >>>> the standby has only to send one LSN.
> >>>>
> >>>> The former seems to be more useful, but triples the number of ACK
> >>>> from the standby. I'm not sure whether its overhead is ignorable,
> >>>> especially when the distance between the master and the standby is
> >>>> very long.
> >>>
> >>> No, it doesn't. There is no requirement for additional messages.
> >>
> >> Please explain how you do it then. When a commit record is sent to the
> >> standby, it needs to acknowledge it 1) when it has received it, 2) when
> >> it fsyncs it to disk and 3) when it's replayed. I don't see how you can
> >> get around that.
> >>
> >> Perhaps you can save a bit by combining multiple messages together, like
> >> in Nagle's algorithm, but then you introduce extra delays which is
> >> exactly what you don't want.
> >
> > From my perspective, you seem to be struggling to find reasons why this
> > should not happen, rather than seeing the alternatives that would
> > obviously present themselves if your attitude was a positive one. We
> > won't make any progress with this style of discussion.
> 
> Huh? You made a very clear claim above that you don't need additional 
> messages. I explained why I don't think that's true, and asked you to 
> explain why you think it is true. Whether the claim is true or not does 
> not depend on my attitude.

Why exactly would we need to send 3 messages when we could send 1? 
Replace your statements of "it needs to" with "why would it" instead.

--
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Development, 24x7 Support, Training and Services



Re: Synchronous replication - patch status inquiry

From
Heikki Linnakangas
Date:
On 03/09/10 13:20, Simon Riggs wrote:
> On Fri, 2010-09-03 at 12:33 +0300, Heikki Linnakangas wrote:
>> On 03/09/10 10:45, Simon Riggs wrote:
>>> On Fri, 2010-09-03 at 09:55 +0300, Heikki Linnakangas wrote:
>>>> On 03/09/10 09:36, Simon Riggs wrote:
>>>>> On Fri, 2010-09-03 at 12:50 +0900, Fujii Masao wrote:
>>>>>> That design would affect what the standby should reply. If we choose
>>>>>> async/recv/fsync/replay on a per-transaction basis, the standby
>>>>>> should send multiple LSNs and the master needs to decide when
>>>>>> replication has been completed. OTOH, if we choose just sync/async,
>>>>>> the standby has only to send one LSN.
>>>>>>
> >>>>>> The former seems to be more useful, but triples the number of ACKs
>>>>>> from the standby. I'm not sure whether its overhead is ignorable,
>>>>>> especially when the distance between the master and the standby is
>>>>>> very long.
>>>>>
>>>>> No, it doesn't. There is no requirement for additional messages.
>>>>
>>>> Please explain how you do it then. When a commit record is sent to the
>>>> standby, it needs to acknowledge it 1) when it has received it, 2) when
>>>> it fsyncs it to disk and 3) when it's replayed. I don't see how you can
>>>> get around that.
>>>>
>>>> Perhaps you can save a bit by combining multiple messages together, like
>>>> in Nagle's algorithm, but then you introduce extra delays which is
>>>> exactly what you don't want.
>>>
>>> From my perspective, you seem to be struggling to find reasons why this
>>> should not happen, rather than seeing the alternatives that would
>>> obviously present themselves if your attitude was a positive one. We
>>> won't make any progress with this style of discussion.
>>
>> Huh? You made a very clear claim above that you don't need additional
>> messages. I explained why I don't think that's true, and asked you to
>> explain why you think it is true. Whether the claim is true or not does
>> not depend on my attitude.
>
> Why exactly would we need to send 3 messages when we could send 1?
> Replace your statements of "it needs to" with "why would it" instead.

(scratches head..) What's the point of differentiating 
received/fsynced/replayed, if the master receives the ack for all of 
them at the same time?

Let's try this with an example: In the master, I do stuff and commit a 
transaction. I want to know when the transaction is fsynced in the 
standby. The WAL is sent to the standby, up to the commit record.

Upthread you said that:
> The standby does *not* need to know the wishes of transactions on the master.

So, when does standby send the single message back to the master?

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: Synchronous replication - patch status inquiry

From
Dimitri Fontaine
Date:
Disclaimer: I have understood things in a way that allows me to answer
here; I don't know at all if that's the way it's meant to be understood.

Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> (scratches head..) What's the point of differentiating
> received/fsynced/replayed, if the master receives the ack for all of them at
> the same time?

It wouldn't be, the way I understand Simon's proposal.

What's happening is that the feedback channel periodically sends an
array of 3 LSNs: the most recently received, fsync()ed and applied ones.

Now, what you're saying is that we should feed back this information
after each recovery step forward, whereas what Simon is saying is that we
could have a looser coupling between the slave's activity and the feedback
channel to the master.

That means the master will not see all of the slave's restore activity,
but as the LSNs form a monotonic sequence that's not a problem: we can use
<= rather than = in the wait-and-wakeup loop on the master.

> Let's try this with an example: In the master, I do stuff and commit a
> transaction. I want to know when the transaction is fsynced in the
> standby. The WAL is sent to the standby, up to the commit record.
[...]
> So, when does standby send the single message back to the master?

The standby is sending a stream of messages to the master with current
LSN positions at the time the message is sent. Given a synchronous
transaction, the master would wait until the feedback stream reports
that the current transaction is in the past compared to the streamed
last known synced one (or the same).

Hope this helps, regards,
--
dim
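
To make the scheme described above concrete, here is a minimal, standalone
sketch. It is not taken from any posted patch and every name in it is
invented; it only illustrates the master-side test implied by the <= rule:
the standby's feedback carries the three positions, and a waiting backend
merely needs the position for its requested level to have reached its
commit LSN, so batched or skipped reports are harmless.

    #include <stdint.h>
    #include <stdbool.h>

    typedef uint64_t LSN;           /* stand-in for XLogRecPtr */

    typedef enum { SYNC_RECV, SYNC_FSYNC, SYNC_REPLAY } SyncLevel;

    /* Latest positions reported by the standby on the feedback channel. */
    typedef struct StandbyFeedback
    {
        LSN received;   /* last WAL position received by the standby    */
        LSN fsynced;    /* last position flushed to disk on the standby */
        LSN applied;    /* last position replayed on the standby        */
    } StandbyFeedback;

    /*
     * Could the committing backend proceed, given the last feedback
     * message?  Uses <= (i.e. "reported >= commit LSN"), so the standby
     * is free to batch reports or skip intermediate positions.
     */
    static bool
    commit_is_acknowledged(const StandbyFeedback *fb, LSN commit_lsn,
                           SyncLevel level)
    {
        switch (level)
        {
            case SYNC_RECV:
                return commit_lsn <= fb->received;
            case SYNC_FSYNC:
                return commit_lsn <= fb->fsynced;
            case SYNC_REPLAY:
                return commit_lsn <= fb->applied;
        }
        return false;
    }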


Re: Synchronous replication - patch status inquiry

From
Heikki Linnakangas
Date:
On 06/09/10 16:03, Dimitri Fontaine wrote:
> Heikki Linnakangas<heikki.linnakangas@enterprisedb.com>  writes:
>> (scratches head..) What's the point of differentiating
>> received/fsynced/replayed, if the master receives the ack for all of them at
>> the same time?
>
> It wouldn't the way I understand Simon's proposal.
>
> What's happening is that the feedback channel is periodically sending an
> array of 3 LSN, the currently last received, fsync()ed and applied ones.

"Periodically" is a performance problem. The bottleneck in synchronous 
replication is typically the extra round-trip between master and 
standby, as the master needs to wait for the acknowledgment. Any delays 
in sending that acknowledgment lead directly to a decrease in 
performance. That's also why we need to eliminate the polling loops in 
walsender and walreceiver, and make them react immediately when there's 
work to do.

>> Let's try this with an example: In the master, I do stuff and commit a
>> transaction. I want to know when the transaction is fsynced in the
>> standby. The WAL is sent to the standby, up to the commit record.
> [...]
>> So, when does standby send the single message back to the master?
>
> The standby is sending a stream of messages to the master with current
> LSN positions at the time the message is sent. Given a synchronous
> transaction, the master would wait until the feedback stream reports
> that the current transaction is in the past compared to the streamed
> last known synced one (or the same).

That doesn't really answer the question: *when* does standby send back 
the acknowledgment?

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com
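
As a generic illustration of that latency cost (this is not PostgreSQL
code; a condition variable stands in here for the latch facility discussed
later in the thread), a polling waiter adds up to a full polling interval
of dead time to every acknowledgment, while a signalled waiter returns as
soon as the event arrives:

    #include <pthread.h>
    #include <stdbool.h>
    #include <unistd.h>

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  more_work = PTHREAD_COND_INITIALIZER;
    static bool            work_pending = false;

    /* Producer side: note that work has arrived and wake any sleeper. */
    static void
    notify_work(void)
    {
        pthread_mutex_lock(&lock);
        work_pending = true;
        pthread_cond_signal(&more_work);
        pthread_mutex_unlock(&lock);
    }

    /* Polling style: checks, then sleeps for a fixed interval regardless. */
    static void
    wait_polling(useconds_t poll_interval_us)
    {
        for (;;)
        {
            bool ready;

            pthread_mutex_lock(&lock);
            ready = work_pending;
            work_pending = false;
            pthread_mutex_unlock(&lock);

            if (ready)
                return;
            usleep(poll_interval_us);   /* pure dead time added to the ack */
        }
    }

    /* Event-driven style: blocks until notify_work() signals. */
    static void
    wait_signalled(void)
    {
        pthread_mutex_lock(&lock);
        while (!work_pending)
            pthread_cond_wait(&more_work, &lock);
        work_pending = false;
        pthread_mutex_unlock(&lock);
    }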


Re: Synchronous replication - patch status inquiry

From
Simon Riggs
Date:
On Mon, 2010-09-06 at 16:14 +0300, Heikki Linnakangas wrote:
> >
> > The standby is sending a stream of messages to the master with current
> > LSN positions at the time the message is sent. Given a synchronous
> > transaction, the master would wait until the feedback stream reports
> > that the current transaction is in the past compared to the streamed
> > last known synced one (or the same).
> 
> That doesn't really answer the question: *when* does standby send back 
> the acknowledgment?

I think you should explain when you think this happens in your proposal.

Are you saying that you think the standby should send back one message
for every transaction? That you do not think we should buffer the return
messages?

You seem to be proposing a design for responsiveness to a single
transaction, not for overall throughput. That's certainly a design
choice, but it wouldn't be my recommendation that we did that.

--
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Development, 24x7 Support, Training and Services



Re: Synchronous replication - patch status inquiry

From
Robert Haas
Date:
On Mon, Sep 6, 2010 at 10:14 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> That doesn't really answer the question: *when* does standby send back
>> the acknowledgment?
>
> I think you should explain when you think this happens in your proposal.
>
> Are you saying that you think the standby should send back one message
> for every transaction? That you do not think we should buffer the return
> messages?

That's certainly what I was assuming - I can't speak for anyone else, of course.

> You seem to be proposing a design for responsiveness to a single
> transaction, not for overall throughput. That's certainly a design
> choice, but it wouldn't be my recommendation that we did that.

Gee, I thought that if we tried to buffer the messages, you'd end up
*reducing* overall throughput.  Suppose we have a busy system.  The
number of simultaneous transactions in flight is limited by
max_connections.  So it seems to me that if each transaction takes X%
longer to commit, then throughput will be reduced by X%.  And as
you've said, batching responses will make individual transactions less
responsive.  The corresponding advantage of batching the responses is
that you reduce consumption of network bandwidth, but I don't think
that's normally where the bottleneck will be.

Of course, you might be able to opportunistically combine messages, if
additional transactions become ready to acknowledge after the first
one has become ready but before the acknowledgement has actually been
sent.  But waiting to try to increase the batch size doesn't seem
right.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company
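
A back-of-envelope illustration of that bound, with purely hypothetical
numbers rather than measurements: if a fixed pool of clients each commits
serially, throughput cannot exceed clients divided by per-commit latency,
so any extra wait spent building an acknowledgment batch comes straight
out of tps.

    #include <stdio.h>

    int main(void)
    {
        double clients = 100.0;          /* hypothetical busy connections       */
        double commit_ms = 5.0;          /* hypothetical per-commit latency     */
        double extra_wait_ms = 1.0;      /* hypothetical wait added by batching */

        printf("tps bound, no added wait:   %.0f\n",
               clients / (commit_ms / 1000.0));
        printf("tps bound, with added wait: %.0f\n",
               clients / ((commit_ms + extra_wait_ms) / 1000.0));
        return 0;
    }

With these made-up figures, one extra millisecond of batching delay costs
roughly a sixth of the theoretical throughput (20000 vs. about 16667 tps).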


Re: Synchronous replication - patch status inquiry

From
Heikki Linnakangas
Date:
On 06/09/10 17:14, Simon Riggs wrote:
> On Mon, 2010-09-06 at 16:14 +0300, Heikki Linnakangas wrote:
>>>
>>> The standby is sending a stream of messages to the master with current
>>> LSN positions at the time the message is sent. Given a synchronous
>>> transaction, the master would wait until the feedback stream reports
>>> that the current transaction is in the past compared to the streamed
>>> last known synced one (or the same).
>>
>> That doesn't really answer the question: *when* does standby send back
>> the acknowledgment?
>
> I think you should explain when you think this happens in your proposal.
>
> Are you saying that you think the standby should send back one message
> for every transaction? That you do not think we should buffer the return
> messages?

For the sake of argument, yes that's what I was thinking. Now please 
explain how *you're* thinking it should work.

> You seem to be proposing a design for responsiveness to a single
> transaction, not for overall throughput. That's certainly a design
> choice, but it wouldn't be my recommendation that we did that.

Sure, if there's more traffic, you can combine things. For example, if 
one fsync in the standby flushes more than one commit record, you only 
need one acknowledgment for all of them.

But don't dodge the question!

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com
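
One way to picture the coalescing Heikki describes, as a sketch only (the
helper names are invented, and this is not code from any posted patch): if
the standby reports a WAL position rather than acknowledging individual
transactions, one fsync that makes several commit records durable is
naturally followed by a single report that covers all of them, with no
artificial waiting to build a batch.

    typedef unsigned long long LSN;

    extern LSN  flush_pending_wal(void);    /* fsync; returns newest durable LSN */
    extern void send_feedback(LSN flushed); /* one small message to the master   */

    void
    standby_flush_and_report(void)
    {
        LSN flushed = flush_pending_wal();

        /*
         * A single report releases every master-side waiter whose commit
         * LSN is <= 'flushed'; no per-transaction acknowledgment is
         * needed, and no extra waiting is required to build a batch.
         */
        send_feedback(flushed);
    }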


Re: Synchronous replication - patch status inquiry

From
Simon Riggs
Date:
On Tue, 2010-09-07 at 09:27 +0300, Heikki Linnakangas wrote:
> On 06/09/10 17:14, Simon Riggs wrote:
> > On Mon, 2010-09-06 at 16:14 +0300, Heikki Linnakangas wrote:
> >>>
> >>> The standby is sending a stream of messages to the master with current
> >>> LSN positions at the time the message is sent. Given a synchronous
> >>> transaction, the master would wait until the feedback stream reports
> >>> that the current transaction is in the past compared to the streamed
> >>> last known synced one (or the same).
> >>
> >> That doesn't really answer the question: *when* does standby send back
> >> the acknowledgment?
> >
> > I think you should explain when you think this happens in your proposal.
> >
> > Are you saying that you think the standby should send back one message
> > for every transaction? That you do not think we should buffer the return
> > messages?
> 
> For the sake of argument, yes that's what I was thinking. Now please 
> explain how *you're* thinking it should work.

The WAL is sent from master to standby in 8192 byte chunks, frequently
including multiple commits. From standby, one reply per chunk. If we
need to wait for apply while nothing else is received, we do. 

> > You seem to be proposing a design for responsiveness to a single
> > transaction, not for overall throughput. That's certainly a design
> > choice, but it wouldn't be my recommendation that we did that.
> 
> Sure, if there's more traffic, you can combine things. For example, if 
> one fsync in the standby flushes more than one commit record, you only 
> need one acknowledgment for all of them.

> But don't dodge the question!

Given that I've previously outlined the size and contents of the request
packets, their role and their frequency, I don't think I've dodged anything;
in fact, I've almost outlined the whole design for you.

I am coding something to demonstrate the important aspects I've
espoused, just as you have done in the past when I didn't appreciate
and/or understand your ideas. That seems like the best way forward,
rather than wrangling through all the "that can't work" responses, which
actually takes longer.

--
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Development, 24x7 Support, Training and Services




Re: Synchronous replication - patch status inquiry

From
Heikki Linnakangas
Date:
On 07/09/10 12:47, Simon Riggs wrote:
> The WAL is sent from master to standby in 8192 byte chunks, frequently
> including multiple commits. From standby, one reply per chunk. If we
> need to wait for apply while nothing else is received, we do.

Ok, thank you. The obvious performance problem is that even if you 
define a transaction to use synchronization level 'recv', and there are no 
other concurrent transactions running, you actually need to wait until 
it's applied. If you have only one client, there is no difference 
between the levels: you always get the same performance hit you get with 
'apply'. With more clients, you get some benefit, but there's still 
plenty of delay compared to the optimum.

Also remember that there can be a very big gap between when a record is 
fsync'd and when it's applied, if the recovery needs to wait for a hot 
standby transaction to finish.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: Synchronous replication - patch status inquiry

From
Simon Riggs
Date:
On Tue, 2010-09-07 at 13:11 +0300, Heikki Linnakangas wrote:
> The obvious performance problem 

Is not obvious at all, and you misunderstand again. This emphasises the
need for me to show code.

--
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Development, 24x7 Support, Training and Services



Re: Synchronous replication - patch status inquiry

From
Tom Lane
Date:
Simon Riggs <simon@2ndQuadrant.com> writes:
> On Tue, 2010-09-07 at 09:27 +0300, Heikki Linnakangas wrote:
>> For the sake of argument, yes that's what I was thinking. Now please 
>> explain how *you're* thinking it should work.

> The WAL is sent from master to standby in 8192 byte chunks, frequently
> including multiple commits. From standby, one reply per chunk. If we
> need to wait for apply while nothing else is received, we do. 

That premise is completely false.  SR does not send WAL in page units.
If it did, it would have the same performance problems as the old
WAL-file-at-a-time implementation, just with slightly smaller
granularity.
        regards, tom lane


Re: Synchronous replication - patch status inquiry

From
Simon Riggs
Date:
On Tue, 2010-09-07 at 10:47 -0400, Tom Lane wrote:
> Simon Riggs <simon@2ndQuadrant.com> writes:
> > On Tue, 2010-09-07 at 09:27 +0300, Heikki Linnakangas wrote:
> >> For the sake of argument, yes that's what I was thinking. Now please 
> >> explain how *you're* thinking it should work.
> 
> > The WAL is sent from master to standby in 8192 byte chunks, frequently
> > including multiple commits. From standby, one reply per chunk. If we
> > need to wait for apply while nothing else is received, we do. 
> 
> That premise is completely false.  SR does not send WAL in page units.
> If it did, it would have the same performance problems as the old
> WAL-file-at-a-time implementation, just with slightly smaller
> granularity.

There's no dependence on pages in that proposal, so I don't understand.

What aspect of the above would you change? and to what?

--
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Development, 24x7 Support, Training and Services



Re: Synchronous replication - patch status inquiry

From
Tom Lane
Date:
Simon Riggs <simon@2ndQuadrant.com> writes:
> On Tue, 2010-09-07 at 10:47 -0400, Tom Lane wrote:
>> Simon Riggs <simon@2ndQuadrant.com> writes:
>>> The WAL is sent from master to standby in 8192 byte chunks, frequently
>>> including multiple commits. From standby, one reply per chunk. If we
>>> need to wait for apply while nothing else is received, we do. 
>> 
>> That premise is completely false.  SR does not send WAL in page units.
>> If it did, it would have the same performance problems as the old
>> WAL-file-at-a-time implementation, just with slightly smaller
>> granularity.

> There's no dependence on pages in that proposal, so I don't understand.

Oh, well you certainly didn't explain it well then.

What I *think* you're saying is that the slave doesn't send per-commit
messages, but instead processes the WAL as it's received and then sends
a heres-where-I-am status message back upstream immediately before going
to sleep waiting for the next chunk.  That's fine as far as the protocol
goes, but I'm not convinced that it really does all that much in terms
of improving performance.  You still have the problem that the master
has to fsync its WAL before it can send it to the slave.  Also, the
slave won't know whether it ought to fsync its own WAL before replying.
        regards, tom lane
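
For readers following along, that reading can be sketched as the loop
below; the helpers are invented and this is not the actual walreceiver
code. The standby drains whatever WAL is pending and sends one
here's-where-I-am report just before it would block, so status messages
piggyback on natural pauses rather than on each commit.

    /* Sketch of the "report before sleeping" reading above. */
    extern int  receive_wal_nowait(char *buf, int buflen); /* 0 if nothing pending */
    extern void write_and_maybe_fsync(const char *buf, int len);
    extern void send_status(void);        /* reports received/fsynced/applied LSNs */
    extern void wait_for_more_wal(void);  /* blocks until the master sends more    */

    void
    standby_receive_loop(void)
    {
        char buf[8192];

        for (;;)
        {
            int len = receive_wal_nowait(buf, (int) sizeof(buf));

            if (len > 0)
            {
                write_and_maybe_fsync(buf, len);
                continue;                /* keep draining while WAL is pending */
            }

            send_status();               /* here's-where-I-am, then go to sleep */
            wait_for_more_wal();
        }
    }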


Re: Synchronous replication - patch status inquiry

From
Robert Haas
Date:
On Tue, Sep 7, 2010 at 11:41 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Oh, well you certainly didn't explain it well then.
>
> What I *think* you're saying is that the slave doesn't send per-commit
> messages, but instead processes the WAL as it's received and then sends
> a heres-where-I-am status message back upstream immediately before going
> to sleep waiting for the next chunk.  That's fine as far as the protocol
> goes, but I'm not convinced that it really does all that much in terms
> of improving performance.  You still have the problem that the master
> has to fsync its WAL before it can send it to the slave.

We have that problem in all of these proposals, don't we?  We
certainly have no infrastructure to handle the slave getting ahead of
the master in the WAL stream.

> Also, the
> slave won't know whether it ought to fsync its own WAL before replying.

Right.  And whether it ought to replay it before replying.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company


Re: Synchronous replication - patch status inquiry

From
Simon Riggs
Date:
On Tue, 2010-09-07 at 11:41 -0400, Tom Lane wrote:
> Simon Riggs <simon@2ndQuadrant.com> writes:
> > On Tue, 2010-09-07 at 10:47 -0400, Tom Lane wrote:
> >> Simon Riggs <simon@2ndQuadrant.com> writes:
> >>> The WAL is sent from master to standby in 8192 byte chunks, frequently
> >>> including multiple commits. From standby, one reply per chunk. If we
> >>> need to wait for apply while nothing else is received, we do. 
> >> 
> >> That premise is completely false.  SR does not send WAL in page units.
> >> If it did, it would have the same performance problems as the old
> >> WAL-file-at-a-time implementation, just with slightly smaller
> >> granularity.
> 
> > There's no dependence on pages in that proposal, so I don't understand.
> 
> Oh, well you certainly didn't explain it well then.
> 
> What I *think* you're saying is that the slave doesn't send per-commit
> messages, but instead processes the WAL as it's received and then sends
> a heres-where-I-am status message back upstream immediately before going
> to sleep waiting for the next chunk.  That's fine as far as the protocol
> goes, but I'm not convinced that it really does all that much in terms
> of improving performance.  You still have the problem that the master
> has to fsync its WAL before it can send it to the slave.  Also, the
> slave won't know whether it ought to fsync its own WAL before replying.

Yes, apart from last sentence. Please wait for the code.

--
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Development, 24x7 Support, Training and Services



Re: Synchronous replication - patch status inquiry

From
Robert Haas
Date:
On Tue, Sep 7, 2010 at 11:59 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> What I *think* you're saying is that the slave doesn't send per-commit
>> messages, but instead processes the WAL as it's received and then sends
>> a heres-where-I-am status message back upstream immediately before going
>> to sleep waiting for the next chunk.  That's fine as far as the protocol
>> goes, but I'm not convinced that it really does all that much in terms
>> of improving performance.  You still have the problem that the master
>> has to fsync its WAL before it can send it to the slave.  Also, the
>> slave won't know whether it ought to fsync its own WAL before replying.
>
> Yes, apart from last sentence. Please wait for the code.

So, we're going around and around in circles here because you're
repeatedly refusing to explain how the slave will know WHEN to send
acknowledgments back to the master without knowing which sync rep
level is in use.  It seems to be perfectly evident to everyone else
here that there are only two ways for this to work: either the value
is configured on the standby, or there's a registration system on the
master and the master tells the standby its wishes.  Instead of asking
the entire community to wait for an unspecified period of time for you
to write code that will handle this in an unspecified way, how about
answering the question?  We've wasted far too much time arguing about
this already.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company


Re: Synchronous replication - patch status inquiry

From
Simon Riggs
Date:
On Tue, 2010-09-07 at 12:07 -0400, Robert Haas wrote:
> On Tue, Sep 7, 2010 at 11:59 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> >> What I *think* you're saying is that the slave doesn't send per-commit
> >> messages, but instead processes the WAL as it's received and then sends
> >> a heres-where-I-am status message back upstream immediately before going
> >> to sleep waiting for the next chunk.  That's fine as far as the protocol
> >> goes, but I'm not convinced that it really does all that much in terms
> >> of improving performance.  You still have the problem that the master
> >> has to fsync its WAL before it can send it to the slave.  Also, the
> >> slave won't know whether it ought to fsync its own WAL before replying.
> >
> > Yes, apart from last sentence. Please wait for the code.
> 
> So, we're going around and around in circles here because you're
> repeatedly refusing to explain how the slave will know WHEN to send
> acknowledgments back to the master without knowing which sync rep
> level is in use.  It seems to be perfectly evident to everyone else
> here that there are only two ways for this to work: either the value
> is configured on the standby, or there's a registration system on the
> master and the master tells the standby its wishes.  Instead of asking
> the entire community to wait for an unspecified period of time for you
> to write code that will handle this in an unspecified way, how about
> answering the question?  We've wasted far too much time arguing about
> this already.

Every time I explain anything, I get someone running around shouting "but
that can't work!". I'm sorry, but again your logic is poor and the bias
against properly considering viable alternatives is the only thing
perfectly evident. So yes, I agree, it is a waste of time discussing it
until I show working code.

--
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Development, 24x7 Support, Training and Services



Re: Synchronous replication - patch status inquiry

From
Robert Haas
Date:
On Tue, Sep 7, 2010 at 2:15 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> Every time I explain anything, I get someone running around shouting "but
> that can't work!". I'm sorry, but again your logic is poor and the bias
> against properly considering viable alternatives is the only thing
> perfectly evident. So yes, I agree, it is a waste of time discussing it
> until I show working code.

Obviously you don't "agree", because that's the exact opposite of what
I just said.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company


Re: Synchronous replication - patch status inquiry

From
Bruce Momjian
Date:
Robert Haas wrote:
> On Tue, Sep 7, 2010 at 11:59 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> >> What I *think* you're saying is that the slave doesn't send per-commit
> >> messages, but instead processes the WAL as it's received and then sends
> >> a heres-where-I-am status message back upstream immediately before going
> >> to sleep waiting for the next chunk.  That's fine as far as the protocol
> >> goes, but I'm not convinced that it really does all that much in terms
> >> of improving performance.  You still have the problem that the master
> >> has to fsync its WAL before it can send it to the slave.  Also, the
> >> slave won't know whether it ought to fsync its own WAL before replying.
> >
> > Yes, apart from last sentence. Please wait for the code.
> 
> So, we're going around and around in circles here because you're
> repeatedly refusing to explain how the slave will know WHEN to send
> acknowledgments back to the master without knowing which sync rep
> level is in use.  It seems to be perfectly evident to everyone else
> here that there are only two ways for this to work: either the value
> is configured on the standby, or there's a registration system on the
> master and the master tells the standby its wishes.  Instead of asking
> the entire community to wait for an unspecified period of time for you
> to write code that will handle this in an unspecified way, how about
> answering the question?  We've wasted far too much time arguing about
> this already.

Ideally I would like the sync method to be set on each slave, and have
some method for the master to query the sync mode of all the slaves, e.g.
appname.

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com
 + It's impossible for everything to be true. +


Re: Synchronous replication - patch status inquiry

From
Fujii Masao
Date:
On Fri, Sep 3, 2010 at 3:42 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
> Here is the proposed detailed design:
>
> standbys.conf
> =============
> # This is not initialized by initdb, so users need to create it under $PGDATA.
>    * The template is located in the PREFIX/share directory.
>
> # This is read by the postmaster at startup, just as pg_hba.conf is.
>    * In an EXEC_BACKEND environment, each walsender must read it at startup.
>    * This is ignored when max_wal_senders is zero.
>    * FATAL is emitted when standbys.conf doesn't exist even if max_wal_senders
>      is positive.
>
> # SIGHUP makes only the postmaster re-read standbys.conf.
>    * New configuration doesn't affect the existing connections to the standbys,
>      i.e., it's used only for subsequent connections.
>    * XXX: Should the existing connections react to new configuration? What if
>      new standbys.conf doesn't have the standby_name of the existing
> connection?
>
> # The connection from the standby is rejected if its standby_name is not listed
>  in standbys.conf.
>    * Multiple standbys with the same name are allowed.
>
> # The valid values of SYNCHRONOUS field are async, recv, fsync and replay.
>
> standby_name
> ============
> # This is new string-typed parameter in recovery.conf.
>    * XXX: Should standby_name and standby_mode be merged?
>
> # Walreceiver sends this to the master when establishing the connection.

The attached patch implements the above and a simple synchronous replication
feature, which doesn't include quorum commit capability. The replication
mode (async, recv, fsync, replay) can be specified on a per-standby basis,
in standbys.conf.

The patch still uses a poll loop in the backend, walsender, startup process
and walreceiver. Once the latch feature Heikki proposed has been committed,
I'll replace those with latches.

The documentation has not been fully updated yet. I'll keep working on the
document until the deadline of the next CF.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

Attachment
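
The thread does not show the actual layout of standbys.conf, so the
following is only a guess at how a pg_hba.conf-style file and the matching
recovery.conf entry might look; the field names follow the design quoted
above, but the exact syntax in the posted patch may differ.

    # $PGDATA/standbys.conf  (hypothetical layout; one standby per line)
    # STANDBY_NAME        SYNCHRONOUS   (async | recv | fsync | replay)
    reporting_standby     async
    nearby_standby        recv
    dr_standby            replay

    # recovery.conf on the standby (standby_name is the new parameter)
    standby_mode     = 'on'
    standby_name     = 'nearby_standby'
    primary_conninfo = 'host=master port=5432'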

Re: Synchronous replication - patch status inquiry

From
Fujii Masao
Date:
On Fri, Sep 10, 2010 at 11:52 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
> The attached patch implements the above and a simple synchronous replication
> feature, which doesn't include quorum commit capability. The replication
> mode (async, recv, fsync, replay) can be specified on a per-standby basis,
> in standbys.conf.
>
> The patch still uses a poll loop in the backend, walsender, startup process
> and walreceiver. Once the latch feature Heikki proposed has been committed,
> I'll replace those with latches.
>
> The documentation has not been fully updated yet. I'll keep working on the
> document until the deadline of the next CF.

BTW, the latest code is available in my git repository too:
  git://git.postgresql.org/git/users/fujii/postgres.git  branch: synchrep

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


Re: Synchronous replication - patch status inquiry

From
David Fetter
Date:
On Fri, Sep 10, 2010 at 11:52:20AM +0900, Fujii Masao wrote:
> On Fri, Sep 3, 2010 at 3:42 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
> > Here is the proposed detailed design:
> >
> > standbys.conf
> > =============
> > # This is not initialized by initdb, so users need to create it under $PGDATA.
> >    * The template is located in the PREFIX/share directory.
> >
> > # This is read by the postmaster at startup, just as pg_hba.conf is.
> >    * In an EXEC_BACKEND environment, each walsender must read it at startup.
> >    * This is ignored when max_wal_senders is zero.
> >    * FATAL is emitted when standbys.conf doesn't exist even if max_wal_senders
> >      is positive.
> >
> > # SIGHUP makes only the postmaster re-read standbys.conf.
> >    * New configuration doesn't affect the existing connections to the standbys,
> >      i.e., it's used only for subsequent connections.
> >    * XXX: Should the existing connections react to new configuration? What if
> >      new standbys.conf doesn't have the standby_name of the existing
> > connection?
> >
> > # The connection from the standby is rejected if its standby_name is not listed
> >  in standbys.conf.
> >    * Multiple standbys with the same name are allowed.
> >
> > # The valid values of SYNCHRONOUS field are async, recv, fsync and replay.
> >
> > standby_name
> > ============
> > # This is new string-typed parameter in recovery.conf.
> >    * XXX: Should standby_name and standby_mode be merged?
> >
> > # Walreceiver sends this to the master when establishing the connection.
> 
> The attached patch implements the above and a simple synchronous replication
> feature, which doesn't include quorum commit capability. The replication
> mode (async, recv, fsync, replay) can be specified on a per-standby basis,
> in standbys.conf.
> 
> The patch still uses a poll loop in the backend, walsender, startup process
> and walreceiver. Once the latch feature Heikki proposed has been committed,
> I'll replace those with latches.

Now that the latch patch is in, when do you think you'll be able to use it
instead of the poll loop?

Cheers,
David.
-- 
David Fetter <david@fetter.org> http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter      XMPP: david.fetter@gmail.com
iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate


Re: Synchronous replication - patch status inquiry

From
Fujii Masao
Date:
On Wed, Sep 15, 2010 at 6:38 AM, David Fetter <david@fetter.org> wrote:
> Now that the latch patch is in, when do you think you'll be able to use it
> instead of the poll loop?

Here is the updated version, which uses a latch in the communication from
walsender to backend. I've not changed the others, because walsender
already uses a latch in HEAD, and Heikki has already proposed a patch which
replaces the poll loop between walreceiver and the startup process with
a latch.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

Attachment
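
As background on what the latch buys here, a schematic only: SetLatch,
ResetLatch and WaitLatch are real PostgreSQL primitives, but their exact
signatures and all of the surrounding bookkeeping are simplified, and every
helper below is invented. The committing backend sleeps on its latch
instead of polling, and the walsender wakes it whenever the standby reports
a newer position.

    typedef unsigned long long LSN;

    extern void reset_my_latch(void);            /* roughly ResetLatch           */
    extern void wait_on_my_latch(void);          /* roughly WaitLatch            */
    extern void wake_backend(int backend_id);    /* roughly SetLatch on a waiter */
    extern LSN  shared_acked_lsn(void);          /* newest LSN acked by standby  */
    extern void set_shared_acked_lsn(LSN lsn);
    extern int  list_waiting_backends(int *ids, int max);

    /* Backend side: sleep until the standby's reported LSN covers our commit. */
    void
    wait_for_sync_rep(LSN commit_lsn)
    {
        for (;;)
        {
            reset_my_latch();                     /* reset before re-checking, */
            if (shared_acked_lsn() >= commit_lsn) /* so a wakeup is never lost */
                return;
            wait_on_my_latch();
        }
    }

    /* Walsender side: run once per feedback message from the standby. */
    void
    process_standby_feedback(LSN newest_acked_lsn)
    {
        int ids[64];
        int n = list_waiting_backends(ids, 64);

        set_shared_acked_lsn(newest_acked_lsn);
        for (int i = 0; i < n; i++)
            wake_backend(ids[i]);                 /* wake sleepers to re-check */
    }

Resetting the latch before re-checking the shared position is what makes
this pattern race-free: a wakeup that arrives between the check and the
wait is not lost.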

Re: Synchronous replication - patch status inquiry

From
Fujii Masao
Date:
On Wed, Sep 15, 2010 at 6:58 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
> On Wed, Sep 15, 2010 at 6:38 AM, David Fetter <david@fetter.org> wrote:
>> Now that the latch patch is in, when do you think you'll be able to use it
>> instead of the poll loop?
>
> Here is the updated version, which uses a latch in the communication from
> walsender to backend. I've not changed the others, because walsender
> already uses a latch in HEAD, and Heikki has already proposed a patch which
> replaces the poll loop between walreceiver and the startup process with
> a latch.

I rebased the patch against current HEAD because it conflicted with
recent commits about a latch.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

Attachment

Re: Synchronous replication - patch status inquiry

From
"Erik Rijkers"
Date:
On Wed, September 15, 2010 11:58, Fujii Masao wrote:
> On Wed, Sep 15, 2010 at 6:38 AM, David Fetter <david@fetter.org> wrote:
>> Now that the latch patch is in, when do you think you'll be able to use it
>> instead of the poll loop?
>
> Here is the updated version, which uses a latch in the communication from
> walsender to backend. I've not changed the others, because walsender
> already uses a latch in HEAD, and Heikki has already proposed a patch which
> replaces the poll loop between walreceiver and the startup process with
> a latch.
>

( synchrep_0915-2.patch; patch applies cleanly;
compile, check and install are without problem)

How does one enable synchronous replication with this patch?
With previous versions I could do (in standby's recovery.conf):

replication_mode = 'recv'

but not anymore, apparently.

(sorry, I have probably overlooked part of the discussion;
-hackers is getting too high-volume for me... )

thanks,


Erik Rijkers



Re: Synchronous replication - patch status inquiry

From
"Erik Rijkers"
Date:
nevermind...  I see standbys.conf is now used.

sorry for the noise...


Erik Rijkers

On Thu, September 16, 2010 01:12, Erik Rijkers wrote:
> On Wed, September 15, 2010 11:58, Fujii Masao wrote:
>> On Wed, Sep 15, 2010 at 6:38 AM, David Fetter <david@fetter.org> wrote:
>>> Now that the latch patch is in, when do you think you'll be able to use it
>>> instead of the poll loop?
>>
>> Here is the updated version, which uses a latch in the communication from
>> walsender to backend. I've not changed the others, because walsender
>> already uses a latch in HEAD, and Heikki has already proposed a patch which
>> replaces the poll loop between walreceiver and the startup process with
>> a latch.
>>
>
> ( synchrep_0915-2.patch; patch applies cleanly;
> compile, check and install are without problem)
>
> How does one enable synchronous replication with this patch?
> With previous versions I could do (in standby's recovery.conf):
>
> replication_mode = 'recv'
>
> but not anymore, apparently.
>
> (sorry, I have probably overlooked part of the discussion;
> -hackers is getting too high-volume for me... )
>
> thanks,
>
>
> Erik Rijkers
>
>