Thread: Re: [COMMITTERS] pgsql: Efficient transaction-controlled synchronous replication.

On Sun, 2011-03-06 at 18:09 -0500, Andrew Dunstan wrote:
> 
> On 03/06/2011 05:51 PM, Simon Riggs wrote:
> > Efficient transaction-controlled synchronous replication.
> >
> 
> I'm glad this is in, but I thought we agreed NOT to call it "synchronous 
> replication".

The discussion on the thread was that it's not sync rep unless we have
the strictest guarantees. We have the strictest guarantees, so it
qualifies as sync rep. 

Relaxations are possible and, to some people, desirable.

Perhaps there is a more marketable term, and if so, we can rebrand. It
wouldn't be the first time things got renamed in beta.

-- 
Simon Riggs           http://www.2ndQuadrant.com/books/
PostgreSQL Development, 24x7 Support, Training and Services



Re: Re: [COMMITTERS] pgsql: Efficient transaction-controlled synchronous replication.

From: Heikki Linnakangas
On 07.03.2011 01:28, Simon Riggs wrote:
> On Sun, 2011-03-06 at 18:09 -0500, Andrew Dunstan wrote:
>>
>> On 03/06/2011 05:51 PM, Simon Riggs wrote:
>>> Efficient transaction-controlled synchronous replication.
>>
>> I'm glad this is in, but I thought we agreed NOT to call it "synchronous
>> replication".
>
> The discussion on the thread was that its not sync rep unless we have
> the strictest guarantees. We have the strictest guarantees, so it
> qualifies as sync rep.

What do you mean by "strictest guarantees"?

I don't see allow_synchronous_standby setting in the committed patch. I 
presume you didn't make allow_synchronous_standby=off the default 
behavior. Also, the documentation that describes this as two-safe 
replication and claims that "the only possibility that data can be lost 
is if both the primary and the standby suffer crashes at the same time" 
needs big fat caveats to clarify that this doesn't actually achieve 
those guarantees.

Please change the name.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


On Mon, 2011-03-07 at 09:29 +0200, Heikki Linnakangas wrote:
> I presume you didn't make allow_synchronous_standby=off the default 
> behavior.

You presume incorrectly.

-- 
Simon Riggs           http://www.2ndQuadrant.com/books/
PostgreSQL Development, 24x7 Support, Training and Services



Re: Re: [COMMITTERS] pgsql: Efficient transaction-controlled synchronous replication.

From: Heikki Linnakangas
On 07.03.2011 09:48, Simon Riggs wrote:
> On Mon, 2011-03-07 at 09:29 +0200, Heikki Linnakangas wrote:
>
>> I presume you didn't make allow_synchronous_standby=off the default
>> behavior.

Sorry, s/allow_synchronous_standby/allow_standalone_master

> You presume incorrectly.

Ok, ok then. Thank you! Looks like I need to git pull and get myself 
up-to-speed with these latest developments :-).

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com



On 03/07/2011 02:29 AM, Heikki Linnakangas wrote:
> On 07.03.2011 01:28, Simon Riggs wrote:
>> On Sun, 2011-03-06 at 18:09 -0500, Andrew Dunstan wrote:
>>>
>>> On 03/06/2011 05:51 PM, Simon Riggs wrote:
>>>> Efficient transaction-controlled synchronous replication.
>>>
>>> I'm glad this is in, but I thought we agreed NOT to call it 
>>> "synchronous
>>> replication".
>>
>> The discussion on the thread was that its not sync rep unless we have
>> the strictest guarantees. We have the strictest guarantees, so it
>> qualifies as sync rep.
>
> What do you mean by "strictes guarantees"?
>
> I don't see allow_synchronous_standby setting in the committed patch. 
> I presume you didn't make allow_synchronous_standby=off the default 
> behavior. Also, the documentation that describes this as two-safe 
> replication and claims that "the only possibility that data can be 
> lost is if both the primary and the standby suffer crashes at the same 
> time" needs big fat caveats to clarify that this doesn't actually 
> achieve those guarantees.
>
> Please change the name.
>

Previously, Simon said:

> Truly "synchronous" requires two-phase commit, which this never was.

So I too am confused about how it's now become "truly synchronous". Are 
we saying this gives the same or better guarantees than a 2PC setup?

cheers

andrew



Re: Re: [COMMITTERS] pgsql: Efficient transaction-controlled synchronous replication.

From: Heikki Linnakangas
On 07.03.2011 15:30, Andrew Dunstan wrote:
> Previously, Simon said:
>
>> Truly "synchronous" requires two-phase commit, which this never was.
>
> So I too am confused about how it's now become "truly synchronous". Are
> we saying this give the same or better guarantees than a 2PC setup?

The guarantee we have now with synchronous_replication=on is that when 
the server acknowledges a commit to the client (ie. when COMMIT command 
returns), the transaction is safely flushed to disk on the master and at 
least one synchronous standby server.

What you don't get is a guarantee on what happens to transactions that 
were not acknowledged to the client. For example, if you pull the power 
plug, the transaction that was just being committed might be committed 
on the master, but not yet on the standby.

For me, that's enough to call it "synchronous replication". It provides 
a useful guarantee to the client. But you could argue for an even 
stricter definition, requiring atomicity so that if a transaction is not 
successfully replicated for any reason, including crash, it is rolled 
back in the master too. That would require 2PC.
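
The guarantee described above can be sketched as a small Python model. This is only an illustration under stated assumptions, not PostgreSQL code: the Node class, the commit() function and the crash_before_ack flag are hypothetical names invented for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical server that remembers which transactions it has flushed."""
    name: str
    flushed: set = field(default_factory=set)

    def flush(self, xid: int) -> None:
        self.flushed.add(xid)


def commit(xid, master, standbys, crash_before_ack=False):
    """Return True only if the commit is acknowledged to the client.

    The master flushes first. The acknowledgement is sent only after
    one synchronous standby has flushed the same transaction.
    """
    master.flush(xid)
    if crash_before_ack:
        # Power is pulled here: the client never hears back, and the
        # transaction exists on the master but not on the standby.
        return False
    if not standbys:
        return False  # no standby to wait for: never acknowledged in this model
    standbys[0].flush(xid)
    return True


master, standby = Node("master"), Node("standby")
acked = commit(1, master, [standby])
unacked = commit(2, master, [standby], crash_before_ack=True)
```

In this model transaction 1 is acknowledged and present on both nodes, while transaction 2 is never acknowledged and ends up on the master only, which is the case where no guarantee is given.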

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com



On 03/07/2011 09:02 AM, Heikki Linnakangas wrote:
> On 07.03.2011 15:30, Andrew Dunstan wrote:
>> Previously, Simon said:
>>
>>> Truly "synchronous" requires two-phase commit, which this never was.
>>
>> So I too am confused about how it's now become "truly synchronous". Are
>> we saying this give the same or better guarantees than a 2PC setup?
>
> The guarantee we have now with synchronous_replication=on is that when 
> the server acknowledges a commit to the client (ie. when COMMIT 
> command returns), the transaction is safely flushed to disk on the 
> master and at least one synchronous standby server.
>
> What you don't get is a guarantee on what happens to transactions that 
> were not acknowledged to the client. For example, if you pull the 
> power plug, the transaction that was just being committed might be 
> committed on the master, but not yet on the standby.
>
> For me, that's enough to call it "synchronous replication". It 
> provides a useful guarantee to the client. But you could argue for an 
> even stricter definition, requiring atomicity so that if a transaction 
> is not successfully replicated for any reason, including crash, it is 
> rolled back in the master too. That would require 2PC.
>

My worry is that the stricter definition is what many people will 
expect, without reading the fine print.

cheers

andrew


On Mon, Mar 7, 2011 at 2:21 PM, Andrew Dunstan <andrew@dunslane.net> wrote:

>> For me, that's enough to call it "synchronous replication". It provides a
>> useful guarantee to the client. But you could argue for an even stricter
>> definition, requiring atomicity so that if a transaction is not successfully
>> replicated for any reason, including crash, it is rolled back in the master
>> too. That would require 2PC.
>>
>
> My worry is that the stricter definition is what many people will expect,
> without reading the fine print.

Then they are either already hosed or already using 2PC.

a.
--
Aidan Van Dyk                                             Create like a god,
aidan@highrise.ca                                       command like a king,
http://www.highrise.ca/                                   work like a slave.


Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote:

> if you pull the power plug, the transaction that was just being
> committed might be committed on the master, but not yet on the
> standby.

> For me, that's enough to call it "synchronous replication". It
> provides a useful guarantee to the client.

I don't think most people would expect full 2PC behavior from
something called "synchronous replication" -- I agree that a
guarantee that a successful commit means it has been written to the
master and at least one replica is sufficient.

> you could argue for an even stricter definition, requiring
> atomicity so that if a transaction is not successfully replicated
> for any reason, including crash, it is rolled back in the master
> too. That would require 2PC.

I'm not sure you can say it breaks atomicity; if proper procedures
are followed on recovery, all servers will either reflect the
transaction or not, right?  It seems to me what you lose is the
ability to know whether a transaction for which commit was requested
and for which there had not yet been a reply at the time of failure
is going to be in your recovered database.  In this particular
regard it is no different from a standalone or async replication,
and you would need 2PC with a proper transaction manager to do
better.

Getting that additional guarantee may not be worth the performance
hit for most people.  We train our users to save (or make) a paper
copy of what they were entering if a crash occurs (which, of course,
is very rare, but does happen), so they can check the state of it on
recovery.  It is, of course, important for the programmers to use
appropriate database transaction boundaries so that the database is
always in a state with internal integrity and from which users can
determine the state and proceed on their own.

I think we should document the issues, of course.

If there is really a demand for a stricter "sync rep" feature, I
think it must be built on top of 2PC and some particular transaction
manager, which seems as though that makes it pgfoundry material.

-Kevin



On 03/07/2011 09:29 AM, Aidan Van Dyk wrote:
> On Mon, Mar 7, 2011 at 2:21 PM, Andrew Dunstan<andrew@dunslane.net>  wrote:
>
>>> For me, that's enough to call it "synchronous replication". It provides a
>>> useful guarantee to the client. But you could argue for an even stricter
>>> definition, requiring atomicity so that if a transaction is not successfully
>>> replicated for any reason, including crash, it is rolled back in the master
>>> too. That would require 2PC.
>>>
>> My worry is that the stricter definition is what many people will expect,
>> without reading the fine print.
> They they are either already hosed or already using 2PC.
>
>


This is about expectations. The thing that worries me is that the use of 
this term might cause some people NOT to use 2PC because they think they 
are getting an equivalent guarantee, when in fact they are not. And 
that's hardly unreasonable. Here for example is what wikipedia says 
<http://en.wikipedia.org/wiki/Replication_%28computer_science%29>:
   Synchronous replication - guarantees "zero data loss" by the means
   of atomic write operation, i.e. write either completes on both sides
   or not at all. Write is not considered complete until
   acknowledgement by both local and remote storage.
 


cheers

andrew


On Mon, Mar 7, 2011 at 2:29 PM, Aidan Van Dyk <aidan@highrise.ca> wrote:

> They they are either already hosed or already using 2PC.

Sorry, to expand on my all too brief comment, even *without*
replication, they are hosed.

Once you issue commit, you have no knowledge of whether the commit is
durable (or even possibly seen by someone else) until you get the
acknowledgement of the commit.

That's already a possibility with a single-machine database.  Adding
replication just increases the period for which that window exists
(and the possibility of something "Bad" hitting that window).

a.


--
Aidan Van Dyk                                             Create like a god,
aidan@highrise.ca                                       command like a king,
http://www.highrise.ca/                                   work like a slave.


Andrew Dunstan <andrew@dunslane.net> wrote:

>   Synchronous replication - guarantees "zero data loss" by the
>   means of atomic write operation, i.e. write either completes on
>   both sides or not at all.

So far, so good.

>   Write is not considered complete until acknowledgement by both
>   local and remote storage.

OK, *if* we want to live up to this definition, we don't seem to
have that part covered.  Of course, since the connection is broken
during the hypothetical crash, it seems hard to acknowledge it on
recovery, and short of 2PC I don't see how we roll it back.  About
the best we could do is somehow have explicit logging of the
disposition of unacknowledged commit requests upon recovery, and
consider logging of success to be "acknowledgement".  Is this
logging provided by other databases with "synchronous replication"
features?

-Kevin


Re: Re: [COMMITTERS] pgsql: Efficient transaction-controlled synchronous replication.

From: Heikki Linnakangas
On 07.03.2011 17:03, Andrew Dunstan wrote:
> This is about expectations. The thing that worries me is that the use of
> this term might cause some people NOT to use 2PC because they think they
> are getting an equivalent guarantee, when in fact they are not. And
> that's hardly unreasonable. Here for example is what wikipedia says
> <http://en.wikipedia.org/wiki/Replication_%28computer_science%29>:
>
> Synchronous replication - guarantees "zero data loss" by the means
> of atomic write operation, i.e. write either completes on both sides
> or not at all. Write is not considered complete until
> acknowledgement by both local and remote storage.

Hmm, I've read that wikipedia definition before, but the "atomic" part 
never caught my eye. You do get zero data loss with what we have; if a 
meteor strikes the master, no acknowledged transaction is lost. I find 
that definition a bit confusing.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com



On 03/07/2011 10:46 AM, Heikki Linnakangas wrote:
> On 07.03.2011 17:03, Andrew Dunstan wrote:
>> This is about expectations. The thing that worries me is that the use of
>> this term might cause some people NOT to use 2PC because they think they
>> are getting an equivalent guarantee, when in fact they are not. And
>> that's hardly unreasonable. Here for example is what wikipedia says
>> <http://en.wikipedia.org/wiki/Replication_%28computer_science%29>:
>>
>> Synchronous replication - guarantees "zero data loss" by the means
>> of atomic write operation, i.e. write either completes on both sides
>> or not at all. Write is not considered complete until
>> acknowledgement by both local and remote storage.
>
> Hmm, I've read that wikipedia definition before, but the "atomic" part 
> never caught my eye. You do get zero data loss with what we have; if a 
> meteor strikes the master, no acknowledged transaction is lost. I find 
> that definition a bit confusing.

Maybe it is - I agree the difference might be small. I'm just trying to 
make sure we don't use a term that could mislead reasonable people about 
what we're providing. If we're satisfied that we aren't, then keep it.

cheers

andrew


Excerpts from Andrew Dunstan's message of lun mar 07 12:51:49 -0300 2011:
> 
> On 03/07/2011 10:46 AM, Heikki Linnakangas wrote:

> > Hmm, I've read that wikipedia definition before, but the "atomic" part 
> > never caught my eye. You do get zero data loss with what we have; if a 
> > meteor strikes the master, no acknowledged transaction is lost. I find 
> > that definition a bit confusing.
> 
> Maybe it is - I agree the difference might be small. I'm just trying to 
> make sure we don't use a term that could mislead reasonable people about 
> what we're providing. If we're satisfied that we aren't, then keep it.

I think these terms are used inconsistently enough across the industry
that what would make the most sense would be to use the common term and
document accurately what we mean by it, rather than relying on some
external entity's definition, which could change (like wikipedia's).

-- 
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Hi,

sorry for being late to join that bike-shedding discussion.

On 03/07/2011 05:09 PM, Alvaro Herrera wrote:
> I think these terms are used inconsistenly enough across the industry
> that what would make the most sense would be to use the common term and
> document accurately what we mean by it, rather than relying on some
> external entity's definition, which could change (like wikipedia's).

I absolutely agree with Alvaro here.

The Wikipedia definition seems to only speak about one local and one
remote node.  Requiring an ack from "at least one" remote node seems to
cover that.

Not even Wikipedia goes further in their definition and tries to explain
what 'synchronous replication' could mean in case we have more than two
nodes.  A somewhat common expectation is that all nodes would have to
ack.  However, with such a requirement a single node failure brings your
cluster to a full stop.  So this isn't a practical option.

Google invented the term "semi-synchronous" for something that's
essentially the same that we have, now, I think.  However, I full
heartedly hate that term (based on the reasoning that there's no
semi-pregnant, either).

Others (like me) use "synchronous" or (lately rather) "eager" to mean
that only a majority of nodes need to send an ACK.  I have to explain
what I mean every time.
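
The three readings of "synchronous" mentioned above (at least one ACK, all nodes, a majority) can be compared with a short Python sketch. It is only an illustration: the enough_acks() function and the policy names are hypothetical, and counting is done over standbys only, which is an assumption of this example.

```python
def enough_acks(policy: str, acks: int, standbys: int) -> bool:
    """Decide whether a commit may be acknowledged under a given policy.

    policy   -- "any" (at least one ACK), "all", or "majority"
    acks     -- number of standbys that have acknowledged so far
    standbys -- total number of configured standbys
    """
    if policy == "any":
        return acks >= 1
    if policy == "all":
        return acks == standbys
    if policy == "majority":
        return acks > standbys // 2
    raise ValueError(f"unknown policy: {policy}")


# Three standbys, one of which has failed and will never answer.
results = {p: enough_acks(p, acks=2, standbys=3) for p in ("any", "all", "majority")}
```

With one of three standbys down, "any" and "majority" still let commits through, while "all" blocks every commit, which is the full-stop behaviour described above.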

In the end, I don't have a strong opinion either way, anymore.  I'm
happy to think of the replication between the master and the one standby
that's sending an ACK first as "synchronous".  (Even if those may well
be different standbys for different transactions).

Hope to have brought some light into this discussion.

Regards

Markus Wanner


On Fri, Mar 18, 2011 at 9:27 AM, Markus Wanner <markus@bluegap.ch> wrote:
> Google invented the term "semi-syncronous" for something that's
> essentially the same that we have, now, I think.  However, I full
> heartedly hate that term (based on the reasoning that there's no
> semi-pregnant, either).

We didn't invent the term, we just implemented something that Heikki
Tuuri briefly described, for example:
http://bugs.mysql.com/bug.php?id=7440

In the Google patch and official MySQL version, the sequence is:
1) commit on master
2) wait for slave to ack
3) return to user

After step 1 another user on the master can observe the commit and the
following is possible:
1) commit on master
2) other user observes that commit on master
3) master blows up and a user observed a commit that never made it to a slave

I do not think this sequence should be possible in a sync replication
system. But it is possible in what has been implemented for MySQL.
Thus it was named semi-sync rather than sync.
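
The difference between the two orderings can be sketched in Python. This is a toy model of the event sequences listed above, not MySQL or PostgreSQL code: run_commit(), observed_but_lost() and the event strings are hypothetical names chosen for the example.

```python
def run_commit(visible_before_ack: bool, master_crashes: bool):
    """Return the list of events for one commit under the given ordering."""
    events = ["commit on master"]
    if visible_before_ack:
        # Semi-sync ordering: the commit is observable right after step 1.
        events.append("other user observes commit")
    if master_crashes:
        events.append("master blows up before slave ack")
        return events
    events.append("slave acks")
    if not visible_before_ack:
        # Stricter ordering: visibility is delayed until the ack arrives.
        events.append("other user observes commit")
    events.append("return to user")
    return events


def observed_but_lost(events):
    """True if someone saw a commit that never made it to a slave."""
    return "other user observes commit" in events and "slave acks" not in events


semi_sync_crash = run_commit(visible_before_ack=True, master_crashes=True)
wait_first_crash = run_commit(visible_before_ack=False, master_crashes=True)
```

Only the first ordering can produce an observed commit that never reached a slave; in the second one the crash happens before anybody else can see the transaction.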

--
Mark Callaghan
mdcallag@gmail.com


On Fri, Mar 18, 2011 at 9:16 AM, MARK CALLAGHAN <mdcallag@gmail.com> wrote:
> On Fri, Mar 18, 2011 at 9:27 AM, Markus Wanner <markus@bluegap.ch> wrote:
>> Google invented the term "semi-syncronous" for something that's
>> essentially the same that we have, now, I think.  However, I full
>> heartedly hate that term (based on the reasoning that there's no
>> semi-pregnant, either).
>
> We didn't invent the term, we just implemented something that Heikki
> Tuuri briefly described, for example:
> http://bugs.mysql.com/bug.php?id=7440
>
> In the Google patch and official MySQL version, the sequence is:
> 1) commit on master
> 2) wait for slave to ack
> 3) return to user
>
> After step 1 another user on the master can observe the commit and the
> following is possible:
> 1) commit on master
> 2) other user observes that commit on master
> 3) master blows up and a user observed a commit that never made it to a slave
>
> I do not think this sequence should be possible in a sync replication
> system. But it is possible in what has been implemented for MySQL.
> Thus it was named semi-sync rather than sync.

Thanks for the insight.  That can't happen with our implementation, I believe.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


MARK CALLAGHAN <mdcallag@gmail.com> wrote:
> Markus Wanner <markus@bluegap.ch> wrote:

>> Google invented the term "semi-synchronous" for something that's
>> essentially the same that we have, now, I think.  However, I full
>> heartedly hate that term (based on the reasoning that there's no
>> semi-pregnant, either).

To be fair, what we're considering calling semi-synchronous is
something which tries to stay in synchronous mode but switches out
of it when necessary to meet availability targets.  Your analogy
doesn't match up at all well -- at least without getting really
ugly.

> We didn't invent the term, we just implemented something that
> Heikki Tuuri briefly described, for example:
> http://bugs.mysql.com/bug.php?id=7440
> 
> In the Google patch and official MySQL version, the sequence is:
> 1) commit on master
> 2) wait for slave to ack
> 3) return to user
> 
> After step 1 another user on the master can observe the commit and
> the following is possible:
> 1) commit on master
> 2) other user observes that commit on master
> 3) master blows up and a user observed a commit that never made it
> to a slave
> 
> I do not think this sequence should be possible in a sync
> replication system.

Then the only thing you would consider sync replication, as far as I
can see, is two phase commit, which we already have.  So your use
case seems to be covered already, and we're trying to address other
people's needs.  The guarantee that some people are looking for is
that a successful commit means that the data has been persisted on
two separate servers.  Others want to try for that, but are willing
to compromise it for HA; in general I think they want to know when
the guarantee is not there so they can take action to get back to a
safer condition.

-Kevin


On Fri, 2011-03-18 at 13:16 +0000, MARK CALLAGHAN wrote:
> On Fri, Mar 18, 2011 at 9:27 AM, Markus Wanner <markus@bluegap.ch> wrote:
> > Google invented the term "semi-syncronous" for something that's
> > essentially the same that we have, now, I think.  However, I full
> > heartedly hate that term (based on the reasoning that there's no
> > semi-pregnant, either).
> 
> We didn't invent the term, we just implemented something that Heikki
> Tuuri briefly described, for example:
> http://bugs.mysql.com/bug.php?id=7440
> 
> In the Google patch and official MySQL version, the sequence is:
> 1) commit on master
> 2) wait for slave to ack
> 3) return to user
> 
> After step 1 another user on the master can observe the commit and the
> following is possible:
> 1) commit on master
> 2) other user observes that commit on master
> 3) master blows up and a user observed a commit that never made it to a slave
> 
> I do not think this sequence should be possible in a sync replication
> system. But it is possible in what has been implemented for MySQL.
> Thus it was named semi-sync rather than sync.

Thanks for clearing it up Mark.

We should definitely not be calling what we have "semi-sync". The
semantics are very different.

In PostgreSQL other users cannot observe the commit until an
acknowledgement has been received.

-- 
Simon Riggs           http://www.2ndQuadrant.com/books/
PostgreSQL Development, 24x7 Support, Training and Services



Mark,

On 03/18/2011 02:16 PM, MARK CALLAGHAN wrote:
> We didn't invent the term, we just implemented something that Heikki
> Tuuri briefly described, for example:
> http://bugs.mysql.com/bug.php?id=7440

Oh, okay, good to know who to blame ;-)  However, I didn't mean to
offend anybody.

> I do not think this sequence should be possible in a sync replication
> system. But it is possible in what has been implemented for MySQL.
> Thus it was named semi-sync rather than sync.

Sure?

Their documentation [1] isn't entirely clear on that first: "the master
blocks after the commit is done and waits until at least one
semisynchronous slave acknowledges that it has received all events for
the transaction" and the "slave acknowledges receipt of a transaction's
events only after the events have been written to its relay log and
flushed to disk".

But then continues to say that "[the master is] waiting for
acknowledgment from a slave after having performed a commit", so this
indeed sounds like the transaction is visible to other sessions before
the slave ACKs.

So, semi-sync may show temporary inconsistencies in case of a master
failure.  Wow!

Regards

Markus Wanner


[1] MySQL 5.5 reference manual, 17.3.8. Semisynchronous Replication:
http://dev.mysql.com/doc/refman/5.5/en/replication-semisync.html


Hi,

On 03/18/2011 02:40 PM, Kevin Grittner wrote:
> Then the only thing you would consider sync replication, as far as I
> can see, is two phase commit

I think waiting for the ACK before actually making the changes from the
transaction visible (COMMIT) would suffice for disallowing such an
inconsistency to manifest.  But obviously, MySQL decided it's not worth
doing that, as it's such a rare event and a short period of time that
may show inconsistencies...

> people's needs.  The guarantee that some people are looking for is
> that a successful commit means that the data has been persisted on
> two separate servers.

Well, MySQL's semi-sync also seems to guarantee that WRT the client
confirmation.  And transactions always appear committed *before* the
client receives the COMMIT acknowledgement, due to the time it takes for
the ACK to arrive at the client.

It's just the commit *before* receiving the slave's ACK, which might
make a transaction visible that's not durable, yet.  But I guess that
simplified implementation for them...

Regards

Markus Wanner


Simon Riggs <simon@2ndQuadrant.com> wrote:

> In PostgreSQL other users cannot observe the commit until an
> acknowledgement has been received.

Really?  I hadn't picked up on that.  That makes for a lot of
complication on crash-and-recovery of a master, but if we can pull
it off, that's really cool.  If we do that and MySQL doesn't, we
definitely don't want to use the same terminology they do, which
would imply the same behavior.

Apologies for not picking up on that aspect of the implementation.

-Kevin


On 03/18/2011 03:52 PM, Kevin Grittner wrote:
> Really?  I hadn't picked up on that.  That makes for a lot of
> complication on crash-and-recovery of a master

What complication do you have in mind here?

I think of it the opposite way (at least for Postgres, that is):
committing a transaction that's not acknowledged means having to revert
a (locally only) committed transaction if you want to use the current
data to recover to some cluster-agreed state.  (Of course, you can
always simply transfer the whole

If you don't commit the transaction before the ACK in the first place,
you don't have anything special to do upon recovery.

Regards

Markus Wanner


Re: Re: [COMMITTERS] pgsql: Efficient transaction-controlled synchronous replication.

From: Heikki Linnakangas
On 18.03.2011 16:52, Kevin Grittner wrote:
> Simon Riggs<simon@2ndQuadrant.com>  wrote:
>
>> In PostgreSQL other users cannot observe the commit until an
>> acknowledgement has been received.
>
> Really?  I hadn't picked up on that.  That makes for a lot of
> complication on crash-and-recovery of a master, but if we can pull
> it off, that's really cool.  If we do that and MySQL doesn't, we
> definitely don't want to use the same terminology they do, which
> would imply the same behavior.

To be clear: other users cannot observe the commit until standby 
acknowledges it - unless the master crashes while waiting for the 
acknowledgment. If that happens, the commit will be visible to everyone 
after recovery.
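
A small Python model of that exception, offered only as an illustration. The Master class and its methods are hypothetical, and the model assumes that recovery replays every commit record found in the local log without knowing which ones were still waiting for a standby.

```python
class Master:
    """Toy master that separates 'flushed locally' from 'visible to others'."""

    def __init__(self):
        self.wal = []          # commit records flushed to local disk
        self.visible = set()   # transactions other sessions can see

    def begin_commit(self, xid):
        # Commit record is durable locally; now waiting for the standby.
        self.wal.append(xid)

    def standby_ack(self, xid):
        # Only the acknowledgement makes the commit visible.
        self.visible.add(xid)

    def crash_and_recover(self):
        # Recovery replays the local log. It has no record of which
        # commits were still waiting, so all of them become visible.
        self.visible = set(self.wal)


m = Master()
m.begin_commit(7)
visible_while_waiting = 7 in m.visible
m.crash_and_recover()
visible_after_recovery = 7 in m.visible
```

Here transaction 7 is invisible while the master waits, and becomes visible only because the master crashed and was recovered from its own log rather than failing over.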

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


On Fri, Mar 18, 2011 at 2:19 PM, Markus Wanner <markus@bluegap.ch> wrote:

> Their documentation [1] isn't entirely clear on that first: "the master
> blocks after the commit is done and waits until at least one
> semisynchronous slave acknowledges that it has received all events for
> the transaction" and the "slave acknowledges receipt of a transaction's
> events only after the events have been written to its relay log and
> flushed to disk".
>
> But then continues to say that "[the master is] waiting for
> acknowledgment from a slave after having performed a commit", so this
> indeed sounds like the transaction is visible to other sessions before
> the slave ACKs.

Yes, their docs are not clear on this.

-- 
Mark Callaghan
mdcallag@gmail.com


On Fri, Mar 18, 2011 at 2:37 PM, Markus Wanner <markus@bluegap.ch> wrote:
> Hi,
>
> On 03/18/2011 02:40 PM, Kevin Grittner wrote:
>> Then the only thing you would consider sync replication, as far as I
>> can see, is two phase commit
>
> I think waiting for the ACK before actually making the changes from the
> transaction visible (COMMIT) would suffice for disallowing such an
> inconsistency to manifest.  But obviously, MySQL decided it's not worth
> doing that, as it's such a rare event and a short period of time that
> may show inconsistencies...

There are fewer options for implementing this in MySQL because
replication requires a binlog on the master and that requires the
internal use of XA to keep the binlog and InnoDB in sync as they are
separate resource managers. In theory, this can be changed so that
commit is only forced for the binlog and then on a crash missing
transactions could be copied from the binlog to InnoDB but I don't
think this will ever change.

By "fewer options" I mean that commit in MySQL with InnoDB and the
binlog requires:
1) prepare to InnoDB (force transaction log to disk for changes from
this transaction)
2) write binlog events from this transaction to the binlog
3) write XID event to the binlog (at this point transaction commit is
official, will survive a crash)
4) force binlog to disk
5) release row locks held by transaction in innodb
6) write commit record to innodb transaction log
7) force write of commit record to disk

Group commit is done for the fsyncs from steps 1 and 7. It is not done
for the fsync done in step 4.
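
The seven steps and the group-commit remark can be written down as data, which makes the fsync count easy to check. This Python sketch only restates the list above; the Step type and its field names are hypothetical, and it is not taken from MySQL source code.

```python
from typing import NamedTuple


class Step(NamedTuple):
    number: int
    action: str
    forces_disk: bool    # does the step wait for an fsync?
    group_commit: bool   # is that fsync shared across transactions?


COMMIT_STEPS = [
    Step(1, "prepare to InnoDB", True, True),
    Step(2, "write binlog events", False, False),
    Step(3, "write XID event to binlog", False, False),
    Step(4, "force binlog to disk", True, False),
    Step(5, "release InnoDB row locks", False, False),
    Step(6, "write commit record to InnoDB log", False, False),
    Step(7, "force commit record to disk", True, True),
]

# Steps that wait for the disk, and those among them without group commit.
fsync_steps = [s.number for s in COMMIT_STEPS if s.forces_disk]
ungrouped = [s.number for s in COMMIT_STEPS if s.forces_disk and not s.group_commit]
print(fsync_steps, ungrouped)  # prints: [1, 4, 7] [4]
```

Each commit therefore waits on three forced writes, and only the binlog force in step 4 cannot be amortised across concurrent transactions.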

Regardless, the processing above is complicated even without
semi-sync. AFAIK, semi-sync code occurs after step 7 but I have not
looked at the official version of semi-sync code in MySQL and my
memory of the work we did at Google is vague.

It is great if Postgres doesn't have this issue. It wasn't clear to me
from lurking on this list. I hope your docs highlight the behavior as
not having the issue is a big deal.

--
Mark Callaghan
mdcallag@gmail.com


On Fri, 2011-03-18 at 17:47 +0200, Heikki Linnakangas wrote:
> On 18.03.2011 16:52, Kevin Grittner wrote:
> > Simon Riggs<simon@2ndQuadrant.com>  wrote:
> >
> >> In PostgreSQL other users cannot observe the commit until an
> >> acknowledgement has been received.
> >
> > Really?  I hadn't picked up on that.  That makes for a lot of
> > complication on crash-and-recovery of a master, but if we can pull
> > it off, that's really cool.  If we do that and MySQL doesn't, we
> > definitely don't want to use the same terminology they do, which
> > would imply the same behavior.
> 
> To be clear: other users cannot observe the commit until standby 
> acknowledges it - unless the master crashes while waiting for the 
> acknowledgment. If that happens, the commit will be visible to everyone 
> after recovery.

No, only in the case where you choose not to failover to the standby
when you crash, which would be a fairly strange choice after the effort
to set up the standby. In a correctly configured and operated cluster
what I say above is fully correct and needs no addendum.

-- 
Simon Riggs           http://www.2ndQuadrant.com/books/
PostgreSQL Development, 24x7 Support, Training and Services



> On 18.03.2011 16:52, Kevin Grittner wrote:
>> Simon Riggs<simon@2ndQuadrant.com>  wrote:
>> 
>>> In PostgreSQL other users cannot observe the commit until an
>>> acknowledgement has been received.
>> 
>> Really?  I hadn't picked up on that.  That makes for a lot of
>> complication on crash-and-recovery of a master, but if we can
>> pull it off, that's really cool.

Markus Wanner <markus@bluegap.ch> wrote:

> What complication do you have in mind here?

Basically, what Heikki addresses.  It has to be committed after
crash and recovery, and deal with replicas which may or may not have
been notified and may or may not have applied the transaction.

Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote:

> To be clear: other users cannot observe the commit until standby 
> acknowledges it - unless the master crashes while waiting for the 
> acknowledgment. If that happens, the commit will be visible to
> everyone after recovery.

Right.  If other transactions cannot see the transaction before the
COMMIT returns, I was kinda assuming that this was the behavior,
because otherwise one or more replicas could be ahead of the master
after recovery, which would be horribly broken.  I agree that the
behavior which you describe is much better than allowing other
transactions to see the work of the pending COMMIT.

In fact, on further reflection, allowing other transactions to see
work before the committing transaction returns could lead to broken
behavior if that viewing transaction took some action based on
that, the master crashed, recovery was done using a standby, and
that standby hadn't persisted the transaction.  So this behavior is
necessary for good behavior.  Even though that "perfect storm" of
events might be fairly rare, the difference in the level of
confidence in correctness is significant, and certainly something to
brag about.

-Kevin


On Fri, Mar 18, 2011 at 12:19 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On Fri, 2011-03-18 at 17:47 +0200, Heikki Linnakangas wrote:
>> On 18.03.2011 16:52, Kevin Grittner wrote:
>> > Simon Riggs<simon@2ndQuadrant.com>  wrote:
>> >
>> >> In PostgreSQL other users cannot observe the commit until an
>> >> acknowledgement has been received.
>> >
>> > Really?  I hadn't picked up on that.  That makes for a lot of
>> > complication on crash-and-recovery of a master, but if we can pull
>> > it off, that's really cool.  If we do that and MySQL doesn't, we
>> > definitely don't want to use the same terminology they do, which
>> > would imply the same behavior.
>>
>> To be clear: other users cannot observe the commit until standby
>> acknowledges it - unless the master crashes while waiting for the
>> acknowledgment. If that happens, the commit will be visible to everyone
>> after recovery.
>
> No, only in the case where you choose not to failover to the standby
> when you crash, which would be a fairly strange choice after the effort
> to set up the standby. In a correctly configured and operated cluster
> what I say above is fully correct and needs no addendum.

Except it doesn't work that way.  If, say, a backend on the master
core dumps, the system will perform a crash and restart cycle, and the
transaction will become visible whether it's yet been replicated or
not.  Since we now have a GUC to suppress restart after a backend
crash, it's theoretically possible to set up the system so that this
doesn't occur, but it'd take quite a bit of work to make it robust and
automatic, and it's certainly not the default out of the box.

The fundamental problem here is that once you update CLOG and flush
the corresponding WAL record, there is no going backward.  You can
hold the system in some intermediate state where the transaction still
holds locks and is excluded from MVCC snapshots, but there's no way to
back up.  So there are bound to be corner cases where the
wait doesn't last as long as you want, and stuff leaks out around the
edges.  It's fundamentally impossible to guarantee that you'll remain
in that intermediate state forever - what do you do if a meteor hits
the synchronous standby and at the same time you lose power to the
master?  No amount of configuration will save you from coming back on
line with a visible-but-unreplicated transaction.  I'm not knocking
the system; I think what we have is impressively good.  But pretending
that corner cases can't happen gets us nowhere.
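
To make the point above concrete, here is a minimal Python sketch of a toy
master.  It is an illustration only: the names ToyMaster, standby_ack and
crash_and_recover are hypothetical and do not correspond to PostgreSQL
internals.  It assumes, as described above, that a flushed commit record
cannot be undone and that the wait for the standby lives only in memory.

```python
class ToyMaster:
    """Toy model only; names are hypothetical, not PostgreSQL internals."""

    def __init__(self):
        self.wal = []          # commit records flushed locally (survive a crash)
        self.visible = set()   # transactions included in new snapshots
        self.waiting = set()   # flushed locally, still waiting for the standby ack

    def commit(self, xid):
        # From here on the commit record is durable locally; there is no undo.
        self.wal.append(xid)
        self.waiting.add(xid)

    def standby_ack(self, xid):
        # Normal path: the ack releases the waiter and the commit becomes visible.
        self.waiting.discard(xid)
        self.visible.add(xid)

    def crash_and_recover(self):
        # Recovery replays every flushed commit record.  The in-memory wait
        # state is lost, so unacknowledged commits become visible as well.
        self.waiting = set()
        self.visible = set(self.wal)


master = ToyMaster()
master.commit(100)                      # flushed locally, no ack yet
hidden_before_crash = 100 not in master.visible
master.crash_and_recover()              # e.g. a backend crash forces a restart
visible_after_crash = 100 in master.visible
```

In this model transaction 100 is hidden while the master waits, and visible
after the restart even though no standby ever acknowledged it.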

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Robert Haas <robertmhaas@gmail.com> wrote:
> Simon Riggs <simon@2ndquadrant.com> wrote:
>> No, only in the case where you choose not to failover to the
>> standby when you crash, which would be a fairly strange choice
>> after the effort to set up the standby. In a correctly configured
>> and operated cluster what I say above is fully correct and needs
>> no addendum.
>
> what do you do if a meteor hits the synchronous standby and at the
> same time you lose power to the master?  No amount of
> configuration will save you from coming back on line with a
> visible-but-unreplicated transaction.

You don't even need to postulate an extreme condition like that; we
prefer to have a DBA pull the trigger on a failover, rather than
trust the STONITH call to software.  This is particularly true when
the master is local to its primary users and the replica is remote
to them.

-Kevin


On Fri, Mar 18, 2011 at 4:33 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> The fundamental problem here is that once you update CLOG and flush
> the corresponding WAL record, there is no going backward.  You can
> hold the system in some intermediate state where the transaction still
> holds locks and is excluded from MVCC snapshots, but there's no way to
> back up.  So there are bound to be corner cases where the
> wait doesn't last as long as you want, and stuff leaks out around the
> edges.


I'm finding this whole idea of hiding the committed transaction until
the slave acks it kind of strange. It means there are times when the
slave is actually *ahead* of the master which would actually be kind
of hard to code against if you're trying to use the slave as a
possibly-not-up-to-date mirror.

I think promising that the COMMIT doesn't return until the transaction
and all previous transactions are replicated is enough. We don't have
to promise that nobody else will see it either. Those same
transactions eventually have to commit as well and if they want that
level of protection they can block waiting until they're replicated as
well which will imply that anything they depended on will be
replicated.

This is akin to the synchronous_commit=off case where other
transactions can see your data as soon as you commit even before the
xlog is fsynced. If you have synchronous_commit mode enabled then
you'll block until your xlog is fsynced and that will implicitly mean
the other transactions you saw were also fsynced.

--
greg


On 03/18/2011 06:35 PM, Greg Stark wrote:
> I think promising that the COMMIT doesn't return until the transaction
> and all previous transactions are replicated is enough. We don't have
> to promise that nobody else will see it either. Those same
> transactions eventually have to commit as well

No, they don't have to.  They can ROLLBACK, get aborted, lose connection
to the master, etc.  The issue here is that, given the MySQL scheme,
these transactions see a snapshot that's not durable, because at that
point in time, no standby has yet guaranteed that it stored the
transaction being committed.  So in case of a failover, you'd suddenly
see a different snapshot (and lose the changes of that transaction).

> This is akin to the synchronous_commit=off case where other
> transactions can see your data as soon as you commit even before the
> xlog is fsynced. If you have synchronous_commit mode enabled then
> you'll block until your xlog is fsynced and that will implicitly mean
> the other transactions you saw were also fsynced.

Somewhat, yes.  And for exactly that reason, most users run with
synchronous_commit enabled.  They don't want to lose committed transactions.

Regards

Markus Wanner


Simon,

On 03/18/2011 05:19 PM, Simon Riggs wrote:
>>> Simon Riggs<simon@2ndQuadrant.com>  wrote:
>>>> In PostgreSQL other users cannot observe the commit until an
>>>> acknowledgement has been received.

On other nodes as well?  To me that means the standby needs to hold back
COMMIT of an ACKed transaction, until it receives a re-ACK from the master,
that it committed the transaction there.  How else could the slave know
when to commit its ACKed transactions?

> No, only in the case where you choose not to failover to the standby
> when you crash, which would be a fairly strange choice after the effort
> to set up the standby. In a correctly configured and operated cluster
> what I say above is fully correct and needs no addendum.

If you don't failover, how can the standby be ahead of the master, given
it takes measures not to be during normal operation?

Eager to understand... ;-)

Regards

Markus


On 03/18/2011 05:27 PM, Kevin Grittner wrote:
> Basically, what Heikki addresses.  It has to be committed after
> crash and recovery, and deal with replicas which may or may not have
> been notified and may or may not have applied the transaction.

Huh?  I'm not quite following here.  Committing additional transactions
isn't a problem, reverting committed transactions is.

And yes, given that we only wait for ACK from a single standby, you'd
have to failover to exactly *that* standby to guarantee consistency.

> In fact, on further reflection, allowing other transactions to see
> work before the committing transaction returns could lead to broken
> behavior if that viewing transaction took some action based on the
> that, the master crashed, recovery was done using a standby, and
> that standby hadn't persisted the transaction.  So this behavior is
> necessary for good behavior.

I fully agree with that.

Regards

Markus


On Fri, 2011-03-18 at 20:19 +0100, Markus Wanner wrote:
> Simon,
> 
> On 03/18/2011 05:19 PM, Simon Riggs wrote:
> >>> Simon Riggs<simon@2ndQuadrant.com>  wrote:
> >>>> In PostgreSQL other users cannot observe the commit until an
> >>>> acknowledgement has been received.
> 
> On other nodes as well?  To me that means the standby needs to hold back
> COMMIT of an ACKed transaction, until it receives a re-ACK from the master,
> that it committed the transaction there.  How else could the slave know
> when to commit its ACKed transactions?

We could do that easily enough, actually, if we wished.

Do we wish?

> > No, only in the case where you choose not to failover to the standby
> > when you crash, which would be a fairly strange choice after the effort
> > to set up the standby. In a correctly configured and operated cluster
> > what I say above is fully correct and needs no addendum.
> 
> If you don't failover, how can the standby be ahead of the master, given
> it takes measures not to be during normal operation?
> 
> Eager to understand... ;-)
> 
> Regards
> 
> Markus

--
 Simon Riggs           http://www.2ndQuadrant.com/books/
 PostgreSQL Development, 24x7 Support, Training and Services



Simon Riggs <simon@2ndQuadrant.com> wrote:
> On Fri, 2011-03-18 at 20:19 +0100, Markus Wanner wrote:
>> >>> Simon Riggs<simon@2ndQuadrant.com>  wrote:
>> >>>> In PostgreSQL other users cannot observe the commit until an
>> >>>> acknowledgement has been received.
>> 
>> On other nodes as well?  To me that means the standby needs to
>> hold back COMMIT of an ACKed transaction, until it receives a re-ACK
>> from the master, that it committed the transaction there.  How
>> else could the slave know when to commit its ACKed transactions?
> 
> We could do that easily enough, actually, if we wished.
> 
> Do we wish?

+1

If we're going out of our way to suppress it on the master until the
COMMIT returns, it shouldn't be showing on the replicas before that.

-Kevin


On 03/18/2011 08:29 PM, Simon Riggs wrote:
> We could do that easily enough, actually, if we wished.
> 
> Do we wish?

I personally don't see any problem letting a standby show a snapshot
before the master.  I'd consider it unneeded network traffic.  But then
again, I'm completely biased.

Regards

Markus Wanner


On Fri, Mar 18, 2011 at 3:29 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On Fri, 2011-03-18 at 20:19 +0100, Markus Wanner wrote:
>> Simon,
>>
>> On 03/18/2011 05:19 PM, Simon Riggs wrote:
>> >>> Simon Riggs<simon@2ndQuadrant.com>  wrote:
>> >>>> In PostgreSQL other users cannot observe the commit until an
>> >>>> acknowledgement has been received.
>>
>> On other nodes as well?  To me that means the standby needs to hold back
>> COMMIT of an ACKed transaction, until it receives a re-ACK from the master,
>> that it committed the transaction there.  How else could the slave know
>> when to commit its ACKed transactions?
>
> We could do that easily enough, actually, if we wished.
>
> Do we wish?

Seems like it would be nice, but isn't it dreadfully expensive?
Wouldn't you need to prevent the slave from applying the WAL until the
master has released the sync rep waiters?  You'd need a whole new
series of messages back and forth.

Since the current solution is intended to support data-loss-free
failover, but NOT to guarantee a consistent view of the world from a
SQL level, I doubt it's worth paying any price for this.  Certainly in
the hot_standby=off case it's a nonissue.  We might need to think
harder about it when and if someone implements an 'apply' level though,
because this would seem more of a concern in that case (though I
haven't thought through all the details).

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Robert Haas <robertmhaas@gmail.com> wrote:
> Since the current solution is intended to support data-loss-free
> failover, but NOT to guarantee a consistent view of the world from
> a SQL level, I doubt it's worth paying any price for this.

Well, that brings us back to the question of why we would want to
suppress the view of the data on the master until the replica
acknowledges the commit.  It *is* committed on the master, we're
just holding off on telling the committer about it until we can
honor the guarantee of replication.  If it can be seen on the
replica before the committer gets such acknowledgment, why not on the
master?

-Kevin


On Fri, 2011-03-18 at 17:08 -0400, Aidan Van Dyk wrote:
> On Fri, Mar 18, 2011 at 3:41 PM, Markus Wanner <markus@bluegap.ch> wrote:
> > On 03/18/2011 08:29 PM, Simon Riggs wrote:
> >> We could do that easily enough, actually, if we wished.
> >>
> >> Do we wish?
> >
> > I personally don't see any problem letting a standby show a snapshot
> > before the master.  I'd consider it unneeded network traffic.  But then
> > again, I'm completely biased.
> 
> In fact, we *need* to have standbys show a snapshot before the master.
> 
> By the time the master acks the commit to the client, the snapshot
> must be visible to all clients connected to both the master and the
> synchronous slave.
> 
> Even with just a single-server PostgreSQL cluster, other
> clients (backends) can see the commit before the committing client
> receives the ACK.  It's just that on a single server, the time period
> for that is small.
> 
> Sync rep increases that time period by the length of time from when
> the slave reaches the commit point in the WAL stream to when its ack
> of that point gets back to the WAL sender.  Ideally, that ACK time is
> small.
> 
> Adding another round trip in there just for a "go almost to $COMMIT,
> ok, now go to $COMMIT" type of WAL/ack is going to be pessimal for
> performance, and still not improve the *guarantees* it can make.
> 
> It can only slightly reduce, but not eliminate, that window where the
> master has WAL that the slave doesn't, and without a complete
> elimination (where you just switch the problem to be the slave has the
> data that the master doesn't), you haven't changed any of the
> guarantees sync rep can make (or not).

Well explained observation. Agreed.

--
 Simon Riggs           http://www.2ndQuadrant.com/books/
 PostgreSQL Development, 24x7 Support, Training and Services



On Fri, 2011-03-18 at 16:24 -0500, Kevin Grittner wrote:
> Robert Haas <robertmhaas@gmail.com> wrote:
>  
> > Since the current solution is intended to support data-loss-free
> > failover, but NOT to guarantee a consistent view of the world from
> > a SQL level, I doubt it's worth paying any price for this.
>  
> Well, that brings us back to the question of why we would want to
> suppress the view of the data on the master until the replica
> acknowledges the commit.  It *is* committed on the master, we're
> just holding off on telling the committer about it until we can
> honor the guarantee of replication.  If it can be seen on the
> replica before the committer gets such acknowledgment, why not on the
> master?

I think the issue is explicit acknowledgement, not visibility.

--
 Simon Riggs           http://www.2ndQuadrant.com/books/
 PostgreSQL Development, 24x7 Support, Training and Services



On Fri, Mar 18, 2011 at 5:24 PM, Kevin Grittner
<Kevin.Grittner@wicourts.gov> wrote:
> Robert Haas <robertmhaas@gmail.com> wrote:
>
>> Since the current solution is intended to support data-loss-free
>> failover, but NOT to guarantee a consistent view of the world from
>> a SQL level, I doubt it's worth paying any price for this.
>
> Well, that brings us back to the question of why we would want to
> suppress the view of the data on the master until the replica
> acknowledges the commit.  It *is* committed on the master, we're
> just holding off on telling the committer about it until we can
> honor the guarantee of replication.  If it can be seen on the
> replica before the committer gets such acknowledgment, why not on the
> master?

Well, the idea is that we don't want to let people depend on the value
until it's guaranteed to be durably committed.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Robert Haas <robertmhaas@gmail.com> wrote:
> Well, the idea is that we don't want to let people depend on the
> value until it's guaranteed to be durably committed.

OK, so if you see it on the replica, you know it is in at least two
places.  I guess that makes sense.  It kinda "feels" wrong to see a
view of the replica which is ahead of the master, but I guess it's
the least of the evils.  I guess we should document it, though, so
nobody has a false expectation that seeing something on the replica
means that a connection looking at the master will see something
that current.

-Kevin


On Fri, Mar 18, 2011 at 5:48 PM, Kevin Grittner
<Kevin.Grittner@wicourts.gov> wrote:
> Robert Haas <robertmhaas@gmail.com> wrote:
>> Well, the idea is that we don't want to let people depend on the
>> value until it's guaranteed to be durably committed.
>
> OK, so if you see it on the replica, you know it is in at least two
> places.  I guess that makes sense.  It kinda "feels" wrong to see a
> view of the replica which is ahead of the master, but I guess it's
> the least of the evils.  I guess we should document it, though, so
> nobody has a false expectation that seeing something on the replica
> means that a connection looking at the master will see something
> that current.

Yeah, it can go both ways: a snapshot taken on the standby can be
either earlier or later in the commit ordering than the master.
That's counterintuitive, but I see no reason to stress about it.  It's
perfectly reasonable to set up a server with synchronous replication
for enhanced durability and also enable hot standby just for
convenience, but without actually relying on it all that heavily, or
only for non-critical reporting purposes.  Synchronous replication,
like asynchronous replication, is basically a high-availability tool.
As long as it does that well, I'm not going to get worked up about the
fact that it doesn't address every other use case someone might want.
We can always add more frammishes in future releases.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


On 03/18/2011 10:48 PM, Kevin Grittner wrote:
> the least of the evils.  I guess we should document it, though, so
> nobody has a false expectation that seeing something on the replica
> means that a connection looking at the master will see something
> that current.

Agreed.  Note, however, that even if there's no such guarantee, it's
highly unlikely for a user (or application) to ever notice this during
normal operation.

Regards

Markus Wanner


On Fri, Mar 18, 2011 at 3:41 PM, Markus Wanner <markus@bluegap.ch> wrote:
> On 03/18/2011 08:29 PM, Simon Riggs wrote:
>> We could do that easily enough, actually, if we wished.
>>
>> Do we wish?
>
> I personally don't see any problem letting a standby show a snapshot
> before the master.  I'd consider it unneeded network traffic.  But then
> again, I'm completely biased.

In fact, we *need* to have standbys show a snapshot before the master.

By the time the master acks the commit to the client, the snapshot
must be visible to all clients connected to both the master and the
synchronous slave.

Even with just a single-server PostgreSQL cluster, other
clients (backends) can see the commit before the committing client
receives the ACK.  It's just that on a single server, the time period
for that is small.

Sync rep increases that time period by the length of time from when
the slave reaches the commit point in the WAL stream to when its ack
of that point gets back to the WAL sender.  Ideally, that ACK time is
small.

Adding another round trip in there just for a "go almost to $COMMIT,
ok, now go to $COMMIT" type of WAL/ack is going to be pessimal for
performance, and still not improve the *guarantees* it can make.

It can only slightly reduce, but not eliminate, that window where the
master has WAL that the slave doesn't, and without a complete
elimination (where you just switch the problem to be the slave has the
data that the master doesn't), you haven't changed any of the
guarantees sync rep can make (or not).

a.

--
Aidan Van Dyk                                             Create like a god,
aidan@highrise.ca                                       command like a king,
http://www.highrise.ca/                                   work like a slave.



On Fri, Mar 18, 2011 at 5:08 PM, Aidan Van Dyk <aidan@highrise.ca> wrote:
> On Fri, Mar 18, 2011 at 3:41 PM, Markus Wanner <markus@bluegap.ch> wrote:
>> On 03/18/2011 08:29 PM, Simon Riggs wrote:
>>> We could do that easily enough, actually, if we wished.
>>>
>>> Do we wish?
>>
>> I personally don't see any problem letting a standby show a snapshot
>> before the master.  I'd consider it unneeded network traffic.  But then
>> again, I'm completely biased.
>
> In fact, we *need* to have standbys show a snapshot before the master.
>
> By the time the master acks the commit to the client, the snapshot
> must be visible to all clients connected to both the master and the
> synchronous slave.

We might have a version of synchronous replication that works this way
some day, but it's not the version we're shipping with 9.1.  The slave
acknowledges the WAL records when they hit the disk (i.e. fsync) not
when they are applied; WAL apply can lag arbitrarily.  The point is to
guarantee clients that the WAL is on disk somewhere and that it will
be replayed in the event of a failover.  Despite the fact that this
doesn't work as you're describing, it's a useful feature in its own
right.
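
The distinction between "flushed" and "applied" can be sketched in Python as
follows.  This is a toy model under stated assumptions, not the actual
implementation: ToyStandby, receive_and_flush, apply_up_to and
commit_may_return are hypothetical names, and WAL positions are plain
integers.

```python
class ToyStandby:
    """Toy model only; names are hypothetical, not PostgreSQL internals."""

    def __init__(self):
        self.flushed_lsn = 0   # highest WAL position fsynced on the standby
        self.applied_lsn = 0   # highest WAL position replayed (visible to queries)

    def receive_and_flush(self, lsn):
        # The ack reports how far WAL has reached disk, nothing more.
        self.flushed_lsn = max(self.flushed_lsn, lsn)
        return self.flushed_lsn

    def apply_up_to(self, lsn):
        # Replay runs separately and can only replay what has been flushed.
        self.applied_lsn = max(self.applied_lsn, min(lsn, self.flushed_lsn))


def commit_may_return(commit_lsn, acked_lsn):
    # WAL is sequential, so an ack at or beyond the commit position also
    # covers every earlier commit record.
    return acked_lsn >= commit_lsn


standby = ToyStandby()
ack = standby.receive_and_flush(500)
released = commit_may_return(500, ack)            # COMMIT can return on the master
visible_before_apply = standby.applied_lsn >= 500
standby.apply_up_to(500)                          # replay catches up later
visible_after_apply = standby.applied_lsn >= 500
```

Here the committing client is released as soon as the record is flushed on
the standby, while a query on the standby does not see the change until
replay reaches the same position.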

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


On 03/22/2011 09:33 PM, Robert Haas wrote:
> We might have a version of synchronous replication that works this way
> some day, but it's not the version we're shipping with 9.1.  The slave
> acknowledges the WAL records when they hit the disk (i.e. fsync) not
> when they are applied; WAL apply can lag arbitrarily.  The point is to
> guarantee clients that the WAL is on disk somewhere and that it will
> be replayed in the event of a failover.  Despite the fact that this
> doesn't work as you're describing, it's a useful feature in its own
> right.

In that sense, our approach may be more synchronous than most others,
because after the ACK is sent from the slave, the slave still needs to
apply the transaction data from WAL before it gets visible, while the
master needs to wait for the ACK to arrive at its side, before making it
visible there.

Ideally, these two latencies (disk seek and network induced) are just
about equal.  But of course, there's no such guarantee.  So whenever one
of the two is off by an order of magnitude or two (by use case or due to
a temporary overload), either the master or the slave may lag behind the
other machine.

What pleases me is that the guarantee from the slave is somewhat similar
to Postgres-R's: with its ACK, the receiving node doesn't guarantee the
transaction *is* applied locally, it just guarantees that it *will* be
able to do so sometime in the future.  Kind of a mind twister, though...

Regards

Markus


On Wed, Mar 23, 2011 at 3:27 AM, Markus Wanner <markus@bluegap.ch> wrote:
> On 03/22/2011 09:33 PM, Robert Haas wrote:
>> We might have a version of synchronous replication that works this way
>> some day, but it's not the version we're shipping with 9.1.  The slave
>> acknowledges the WAL records when they hit the disk (i.e. fsync) not
>> when they are applied; WAL apply can lag arbitrarily.  The point is to
>> guarantee clients that the WAL is on disk somewhere and that it will
>> be replayed in the event of a failover.  Despite the fact that this
>> doesn't work as you're describing, it's a useful feature in its own
>> right.
>
> In that sense, our approach may be more synchronous than most others,
> because after the ACK is sent from the slave, the slave still needs to
> apply the transaction data from WAL before it gets visible, while the
> master needs to wait for the ACK to arrive at its side, before making it
> visible there.
>
> Ideally, these two latencies (disk seek and network induced) are just
> about equal.  But of course, there's no such guarantee.  So whenever one
> of the two is off by an order of magnitude or two (by use case or due to
> a temporary overload), either the master or the slave may lag behind the
> other machine.
>
> What pleases me is that the guarantee from the slave is somewhat similar
> to Postgres-R's: with its ACK, the receiving node doesn't guarantee the
> transaction *is* applied locally, it just guarantees that it *will* be
> able to do so sometime in the future.  Kind of a mind twister, though...

Yes.  What this won't do is let you build a big load-balancing network
(at least not without great caution about what you assume).  What it
will do is make it really, really hard to lose committed transactions.
Both good things, but different.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


On 03/23/2011 12:52 PM, Robert Haas wrote:
> Yes.  What this won't do is let you build a big load-balancing network
> (at least not without great caution about what you assume).

This sounds too strong to me.  Session-aware load balancing is pretty
common these days.  It's the default mode of PgBouncer, for example.
Not much caution required there, IMO.  Or what pitfalls did you have in
mind?

> What it
> will do is make it really, really hard to lose committed transactions.
> Both good things, but different.

..you can still get both at the same time.  At least as long as you are
happy with session-aware load balancing.  And who really needs finer
grained balancing?

(Note that no matter how fine-grained you balance, you are still bound
to a (single core of a) single node.  That changes with distributed
querying, and things really start to get interesting there... but we are
far from that, yet).

Regards

Markus


On Wed, Mar 23, 2011 at 8:16 AM, Markus Wanner <markus@bluegap.ch> wrote:
> On 03/23/2011 12:52 PM, Robert Haas wrote:
>> Yes.  What this won't do is let you build a big load-balancing network
>> (at least not without great caution about what you assume).
>
> This sounds too strong to me.  Session-aware load balancing is pretty
> common these days.  It's the default mode of PgBouncer, for example.
> Not much caution required there, IMO.  Or what pitfalls did you have in
> mind?

Well, just the one we were talking about: a COMMIT on one node doesn't
guarantee that the transaction is visible on the other node, just
that it will become visible there eventually, even if a crash happens.
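
One way an application or load balancer could exercise that caution is to
compare WAL locations before sending a read to the standby.  The Python
sketch below is an assumption-laden illustration: parse_lsn,
standby_has_replayed and choose_node are hypothetical helpers, and it assumes
the caller has already fetched the textual "X/Y" hexadecimal locations, for
example from pg_current_xlog_location() on the master after COMMIT and
pg_last_xlog_replay_location() on the standby.

```python
def parse_lsn(text):
    """Parse an 'X/Y' hexadecimal WAL location into a comparable tuple."""
    high, low = text.split("/")
    return (int(high, 16), int(low, 16))


def standby_has_replayed(commit_lsn, standby_replay_lsn):
    # True once the standby has replayed at least up to the commit position.
    return parse_lsn(standby_replay_lsn) >= parse_lsn(commit_lsn)


def choose_node(commit_lsn, standby_replay_lsn):
    # Hypothetical routing rule: read from the standby only when it is known
    # to show this session's last commit; otherwise fall back to the master.
    if standby_has_replayed(commit_lsn, standby_replay_lsn):
        return "standby"
    return "master"


lagging = choose_node("0/3000060", "0/3000000")
caught_up = choose_node("0/3000060", "0/A000000")
```

With these sample values the first read is routed to the master and the
second to the standby.  Session-aware balancing avoids the question entirely
as long as a session stays on one node.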

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


On Sat, Mar 19, 2011 at 4:29 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On Fri, 2011-03-18 at 20:19 +0100, Markus Wanner wrote:
>> Simon,
>>
>> On 03/18/2011 05:19 PM, Simon Riggs wrote:
>> >>> Simon Riggs<simon@2ndQuadrant.com>  wrote:
>> >>>> In PostgreSQL other users cannot observe the commit until an
>> >>>> acknowledgement has been received.
>>
>> On other nodes as well?  To me that means the standby needs to hold back
>> COMMIT of an ACKed transaction, until it receives a re-ACK from the master,
>> that it committed the transaction there.  How else could the slave know
>> when to commit its ACKed transactions?
>
> We could do that easily enough, actually, if we wished.
>
> Do we wish?

No.

I'm not sure what the problem is with seeing, on the standby, data which is
not yet visible on the master. And I'm really not sure whether that problem can
be solved by making the data visible on the master before the standby. If we
really want to see consistent data from each node, we should implement
and use a cluster-wide snapshot, as Postgres-XC does.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center