Thread: Re: [COMMITTERS] pgsql: Use a latch to make startup process wake up and replay
Re: [COMMITTERS] pgsql: Use a latch to make startup process wake up and replay
From
Simon Riggs
Date:
On Wed, 2010-09-15 at 20:14 +0900, Fujii Masao wrote:
> On Wed, Sep 15, 2010 at 7:35 PM, Heikki Linnakangas <heikki@postgresql.org> wrote:
> > Log Message:
> > -----------
> > Use a latch to make startup process wake up and replay immediately when new WAL arrives via streaming replication. This reduces the latency, and also allows us to use a longer polling interval, which is good for energy efficiency.
> >
> > We still need to poll to check for the appearance of a trigger file, but the interval is now 5 seconds (instead of 100ms), like when waiting for a new WAL segment to appear in WAL archive.
>
> Good work!

No, not good work.

You both know very well that I'm working on this area also and these commits are not agreed... yet. They might not be contended but they are very likely to break my patch, again.

Please desist while we resolve which are the good ideas and which are not. We won't know that if you keep breaking other people's patches in a stream of commits that prevent anybody completing other options.

--
Simon Riggs           www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Training and Services
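The wake-up mechanism described in the commit message follows the usual latch pattern: the startup process sleeps on a latch with a timeout instead of polling, and walreceiver sets the latch once new WAL has been flushed. A minimal sketch of that pattern, assuming the early latch API (WaitLatch taking a plain timeout argument); the latch variable and the helper functions are illustrative, not the committed code:

    #include <stdbool.h>
    #include "storage/latch.h"          /* PostgreSQL's latch facility */

    /* Hypothetical helpers, declared only so the sketch reads as real code. */
    extern bool new_wal_available(void);
    extern void replay_available_wal(void);
    extern bool trigger_file_exists(void);

    /* Illustrative latch; in a real implementation it would live in shared
     * memory so that walreceiver can set it from another process. */
    static Latch recoveryWakeupLatch;

    /* Startup process: replay loop, sketched. */
    static void
    wait_for_more_wal(void)
    {
        for (;;)
        {
            ResetLatch(&recoveryWakeupLatch);

            if (new_wal_available())
            {
                replay_available_wal();
                continue;
            }

            if (trigger_file_exists())      /* still polled, now every 5 seconds */
                break;

            /* Sleep up to 5 seconds, but wake immediately when the latch is set
             * (timeout shown in microseconds; an assumption about the early API). */
            WaitLatch(&recoveryWakeupLatch, 5000000L);
        }
    }

    /* Walreceiver: after flushing newly received WAL, wake the startup process. */
    static void
    wakeup_startup(void)
    {
        SetLatch(&recoveryWakeupLatch);
    }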
Re: Re: [COMMITTERS] pgsql: Use a latch to make startup process wake up and replay
From
David Fetter
Date:
On Wed, Sep 15, 2010 at 03:35:30PM +0100, Simon Riggs wrote: > On Wed, 2010-09-15 at 20:14 +0900, Fujii Masao wrote: > > On Wed, Sep 15, 2010 at 7:35 PM, Heikki Linnakangas > > <heikki@postgresql.org> wrote: > > > Log Message: > > > ----------- > > > Use a latch to make startup process wake up and replay immediately when > > > new WAL arrives via streaming replication. This reduces the latency, and > > > also allows us to use a longer polling interval, which is good for energy > > > efficiency. > > > > > > We still need to poll to check for the appearance of a trigger file, but > > > the interval is now 5 seconds (instead of 100ms), like when waiting for > > > a new WAL segment to appear in WAL archive. > > > > Good work! > > No, not good work. > > You both know very well that I'm working on this area also and these > commits are not agreed... yet. They might not be contended but they are > very likely to break my patch, again. > > Please desist while we resolve which are the good ideas and which are > not. We won't know that if you keep breaking other people's patches in a > stream of commits that prevent anybody completing other options. Simon, No matter how many times you try, you are not going to get a license to stop all work on anything you might chance to think about. It is quite simply never going to happen, so you need to back off. Cheers, David. -- David Fetter <david@fetter.org> http://fetter.org/ Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter Skype: davidfetter XMPP: david.fetter@gmail.com iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate
Re: Re: [COMMITTERS] pgsql: Use a latch to make startup process wake up and replay
From
Simon Riggs
Date:
On Wed, 2010-09-15 at 07:59 -0700, David Fetter wrote: > On Wed, Sep 15, 2010 at 03:35:30PM +0100, Simon Riggs wrote: > > Please desist while we resolve which are the good ideas and which are > > not. We won't know that if you keep breaking other people's patches in a > > stream of commits that prevent anybody completing other options. > No matter how many times you try, you are not going to get a license > to stop all work on anything you might chance to think about. It is > quite simply never going to happen, so you need to back off. I agree that asking people to stop work is not OK. However, I haven't asked for development work to stop, only that commits into that area stop until proper debate has taken place. Those might be minor commits, but they might not. Had I made those commits, they would have been called premature by others also. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Training and Services
Re: Re: [COMMITTERS] pgsql: Use a latch to make startup process wake up and replay
From
Robert Haas
Date:
On Wed, Sep 15, 2010 at 11:24 AM, Simon Riggs <simon@2ndquadrant.com> wrote: > I agree that asking people to stop work is not OK. However, I haven't > asked for development work to stop, only that commits into that area > stop until proper debate has taken place. Those might be minor commits, > but they might not. Had I made those commits, they would have been > called premature by others also. I do not believe that Heikki has done anything inappropriate. We've spent weeks discussing the latch facility and its various applications. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
Re: [COMMITTERS] pgsql: Use a latch to make startup process wake up and replay
From
Simon Riggs
Date:
On Wed, 2010-09-15 at 11:25 -0400, Tom Lane wrote:
> ... an unspecified patch with no firm delivery date.

I'm happy to post my current work, if it's considered helpful. The sole intent of that work is to help the community understand the benefits of the proposals I have made, so perhaps this patch does serve that purpose.

The attached patch compiles, but I wouldn't bother trying to run it yet. I'm still wading through the latch rewrite. It probably doesn't apply cleanly to head anymore either, hence discussion. I wouldn't normally waste people's time by posting a non-working patch, but the majority of the code is in about the right place of execution. There aren't any unclear aspects in the design, so it's worth looking at.

--
Simon Riggs           www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Training and Services
Attachment
Re: Re: [COMMITTERS] pgsql: Use a latch to make startup process wake up and replay
From
Simon Riggs
Date:
On Wed, 2010-09-15 at 12:45 -0400, Robert Haas wrote: > On Wed, Sep 15, 2010 at 11:24 AM, Simon Riggs <simon@2ndquadrant.com> wrote: > > I agree that asking people to stop work is not OK. However, I haven't > > asked for development work to stop, only that commits into that area > > stop until proper debate has taken place. Those might be minor commits, > > but they might not. Had I made those commits, they would have been > > called premature by others also. > > I do not believe that Heikki has done anything inappropriate. We've > spent weeks discussing the latch facility and its various > applications. Sounds reasonable, but my comments were about this commit, not the one that happened on Saturday. This patch was posted about 32 hours ago, and the commit need not have taken place yet. If I had posted such a patch and committed it knowing other work is happening in that area we both know that you would have objected. It's not actually a major issue, but at some point I have to ask for no more commits, so Fujii and I can finish our patches, compare and contrast, so the best ideas can get into Postgres. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Training and Services
Re: Re: [COMMITTERS] pgsql: Use a latch to make startup process wake up and replay
From
Robert Haas
Date:
On Wed, Sep 15, 2010 at 1:30 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On Wed, 2010-09-15 at 12:45 -0400, Robert Haas wrote:
> > On Wed, Sep 15, 2010 at 11:24 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> > > I agree that asking people to stop work is not OK. However, I haven't asked for development work to stop, only that commits into that area stop until proper debate has taken place. Those might be minor commits, but they might not. Had I made those commits, they would have been called premature by others also.
> >
> > I do not believe that Heikki has done anything inappropriate. We've spent weeks discussing the latch facility and its various applications.
>
> Sounds reasonable, but my comments were about this commit, not the one that happened on Saturday. This patch was posted about 32 hours ago, and the commit need not have taken place yet. If I had posted such a patch and committed it knowing other work is happening in that area we both know that you would have objected.

I've often felt that we ought to have a bit more delay between when committers post patches and when they commit them. I was told 24 hours and I've seen cases where people haven't even waited that long. On the other hand, if we get too strict about it, it can easily get to the point where it just gets in the way of progress, and certainly some patches are far more controversial than others. So I don't know what the best thing to do is. Still, I have to admit that I feel fairly positive about the direction we're going with this particular patch. Clearing away these peripheral issues should make it easier for us to have a rational discussion about the core issues around how this is going to be configured and actually work at the protocol level.

> It's not actually a major issue, but at some point I have to ask for no more commits, so Fujii and I can finish our patches, compare and contrast, so the best ideas can get into Postgres.

I don't think anyone is prepared to agree to that. I think that everyone is prepared to accept a limited amount of further delay in pressing forward with the main part of sync rep, but I expect that no one will be willing to freeze out incremental improvements in the meantime, even if it does induce a certain amount of rebasing. It's also worth noting that Fujii Masao's patch has been around for months, and yours isn't finished yet. That's not to say that we don't want to consider your ideas, because we do: and you've had more than your share of good ones. At the same time, it would be unfair and unreasonable to expect work on a patch that is done, and has been done for some time, to wait on one that isn't.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company
Re: Re: [COMMITTERS] pgsql: Use a latch to make startup process wake up and replay
From
Heikki Linnakangas
Date:
On 15/09/10 20:58, Robert Haas wrote: > On Wed, Sep 15, 2010 at 1:30 PM, Simon Riggs<simon@2ndquadrant.com> wrote: >> On Wed, 2010-09-15 at 12:45 -0400, Robert Haas wrote: >>> On Wed, Sep 15, 2010 at 11:24 AM, Simon Riggs<simon@2ndquadrant.com> wrote: >>>> I agree that asking people to stop work is not OK. However, I haven't >>>> asked for development work to stop, only that commits into that area >>>> stop until proper debate has taken place. Those might be minor commits, >>>> but they might not. Had I made those commits, they would have been >>>> called premature by others also. >>> >>> I do not believe that Heikki has done anything inappropriate. We've >>> spent weeks discussing the latch facility and its various >>> applications. >> >> Sounds reasonable, but my comments were about this commit, not the one >> that happened on Saturday. This patch was posted about 32 hours ago, and >> the commit need not have taken place yet. If I had posted such a patch >> and committed it knowing other work is happening in that area we both >> know that you would have objected. > > I've often felt that we ought to have a bit more delay between when > committers post patches and when they commit them. I was told 24 > hours and I've seen cases where people haven't even waited that long. > On the other hand, if we get to strict about it, it can easily get to > the point where it just gets in the way of progress, and certainly > some patches are far more controversial than others. So I don't know > what the best thing to do is. With anything non-trivial, I try to "sleep on it" before committing. More with complicated patches, but it's really up to your own comfort level with the patch, and whether you think anyone might have different opinions on it. I don't mind quick commits if it's something that has been discussed in the past and the committer thinks it's non-controversial. There's always the option of complaining afterwards. If it comes to that, though, it wasn't really ripe for committing yet. (That doesn't apply to gripes about typos or something like that, because that happens to me way too often ;-) ) > Still, I have to admit that I feel > fairly positive about the direction we're going with this particular > patch. Clearing away these peripheral issues should make it easier > for us to have a rational discussion about the core issues around how > this is going to be configured and actually work at the protocol > level. Yeah, I don't think anyone has any qualms about the substance of these patches. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Re: Re: [COMMITTERS] pgsql: Use a latch to make startup process wake up and replay
From
Simon Riggs
Date:
On Wed, 2010-09-15 at 13:58 -0400, Robert Haas wrote:
> > It's not actually a major issue, but at some point I have to ask for no more commits, so Fujii and I can finish our patches, compare and contrast, so the best ideas can get into Postgres.
>
> I don't think anyone is prepared to agree to that. I think that everyone is prepared to accept a limited amount of further delay in pressing forward with the main part of sync rep, but I expect that no one will be willing to freeze out incremental improvements in the meantime, even if it does induce a certain amount of rebasing. It's also worth noting that Fujii Masao's patch has been around for months, and yours isn't finished yet. That's not to say that we don't want to consider your ideas, because we do: and you've had more than your share of good ones. At the same time, it would be unfair and unreasonable to expect work on a patch that is done, and has been done for some time, to wait on one that isn't.

I understand your viewpoint there. I'm sure we all agree sync rep is a very important feature that must get into the next release.

The only reason my patch exists is because debate around my ideas was ruled out on various grounds. One of those was that it would take so long to develop we shouldn't risk not getting sync rep in this release. I am amenable to such arguments (and I make the same one on MERGE, btw, where I am getting seriously worried) but the reality is that there is actually very little code here and we can definitely do this, whatever ideas we pick. I've shown this by providing an almost working version in about 4 days' work. Will finishing it help?

We definitely have the time, so the question is, what are the best ideas? We must discuss the ideas properly, not just plough forwards claiming time pressure when it isn't actually an issue at all. We *need* to put the tools down and talk in detail about the best way forwards.

Before, I had no patch. Now mine "isn't finished". At what point will my ideas be reviewed without instant dismissal? If we accept your seniority argument, then "never", because even if I finish it you'll say "Fujii was there first". If who mentioned it first were important, then I'd say I've been discussing this for literally years (late 2006) and have regularly explained the benefits of the master-side approach I've outlined on list every time this has come up (every few months). I have also explained the implementation details many times as well, and I'm happy to say that latches are pretty much exactly what I described earlier. (I called them LSN queues, similar to lwlocks, IIRC.)

But that's not the whole deal. If we simply wanted a patch that was "done" we would have gone with Zoltan's, wouldn't we, based on the seniority argument you use above? Zoltan's patch didn't perform well at all. Fujii's performs much better. However, my proposed approach offers even better performance, so whatever argument you use to include Fujii's also applies to mine, doesn't it? But that's silly and divisive; it's not about whose patch "wins", is it?

Do we have to benchmark multiple patches to prove which is best? If that's the criterion, I'll finish my patch and demonstrate that. But it doesn't make sense to start committing pieces of Fujii's patch, so that I can't ever keep up and as a result "Simon never finished his patch, but it sounded good".

Next steps should be: tools down, discuss what to do. Then go forwards. We have time, so let's discuss all of the ideas on the table, not just some of them.

For me this is not about the number or names of parameters; it's about master-side control of sync rep and having very good performance.

--
Simon Riggs           www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Training and Services
Re: Re: [COMMITTERS] pgsql: Use a latch to make startup process wake up and replay
From
Robert Haas
Date:
On Wed, Sep 15, 2010 at 3:18 PM, Simon Riggs <simon@2ndquadrant.com> wrote: > Will finishing it help? Yes, I expect that to help a lot. > Before, I had no patch. Now mine "isn't finished". At what point will my > ideas be reviewed without instant dismissal? If we accept your seniority > argument, then "never" because even if I finish it you'll say "Fujii was > there first". I said very clearly in my previous email that "I think that everyone is prepared to accept a limited amount of further delay in pressing forward with the main part of sync rep". In other words, I think everyone is willing to consider your ideas provided that they are submitted in a form which everyone can understand and think through sometime soon. I am not, nor do I think anyone is, saying that we don't wish to consider your ideas. I'm actually really pleased that you are only a day or two from having a working patch. It can be much easier to conceptualize a patch than to find the time to finish it (unfortunately, this problem has overtaken me rather badly in the last few weeks, which is why I have no new patches in this CommitFest) and if you can finish it up and get it out in front of everyone I expect that to be a good thing for this feature and our community. > Do we have to benchmark multiple patches to prove which is best? If > that's the criteria I'll finish my patch and demonstrate that. I was thinking about that earlier today. I think it's definitely possible that we'll need to do some benchmarking, although I expect that people will want to read the code first. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
Re: Re: [COMMITTERS] pgsql: Use a latch to make startup process wake up and replay
From
Fujii Masao
Date:
On Thu, Sep 16, 2010 at 4:18 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> We definitely have the time, so the question is, what are the best ideas?

Before advancing the review of each patch, we must determine what should be committed in 9.1, and what's in this CF.

The "synchronization level per transaction" feature is included in Simon's patch, but not in mine. This is the most important difference, which would have wide-reaching impact on the implementation, e.g., the protocol between walsender and walreceiver. So first we should determine whether we'll commit the feature in 9.1. Then we need to determine how far we should implement in this CF. Thoughts?

Each patch provides a "synchronization level per standby" feature. In Simon's patch, that level is specified in the standby's recovery.conf. In mine, it's in the master's standbys.conf. I think that the former is simpler. But if we support the capability to register the standbys, the latter would be required. Which is best?

Simon's patch seems to include a simple quorum commit feature (correct me if I'm wrong). That is, when there are multiple synchronous standbys, the master waits until an ACK has arrived from at least one standby. OTOH, in my patch, the master waits until ACKs have arrived from all the synchronous standbys. Which should we choose? I think that we should commit my straightforward approach first, and enable quorum commit on top of that. Thoughts?

Simon proposes to invoke walwriter in the standby. This is not included in my patch, but looks like a good idea. ISTM that this is not an essential feature for synchronous replication, so how about detaching the walwriter part from the patch and reviewing it independently?

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
Re: Re: [COMMITTERS] pgsql: Use a latch to make startup process wake up and replay
From
Simon Riggs
Date:
On Fri, 2010-09-17 at 14:33 +0900, Fujii Masao wrote: > On Thu, Sep 16, 2010 at 4:18 AM, Simon Riggs <simon@2ndquadrant.com> wrote: > > We definitely have the time, so the question is, what are the best > > ideas? > > Before advancing the review of each patch, we must determine what > should be committed in 9.1, and what's in this CF. Thank you for starting the discussion. > "Synchronization level on per-transaction" feature is included in Simon's > patch, but not in mine. This is most important difference Agreed. It's also a very important option for users. > which would > have wide-reaching impact on the implementation, e.g., protocol between > walsender and walreceiver. So, at first we should determine whether we'll > commit the feature in 9.1. Then we need to determine how far we should > implement in this CF. Thought? Yes, sync rep specified per-transaction changes many things at a low level. Basically, we have a choice of two mostly incompatible implementations, plus some other options common to both. There is no danger that we won't commit in 9.1. We have time for discussion and thought. We also have time for performance testing and since many of my design proposals are performance related that seems essential to properly reviewing the patches. I don't think we can determine how far to implement without considering both approaches in detail. With regard to your points below, I don't think any of those points could be committed first. > Each patch provides "synchronization level on per-standby" feature. In > Simon's patch, that level is specified in the standbys's recovery.conf. > In mine, it's in the master's standbys.conf. I think that the former is simpler. > But if we support the capability to register the standbys, the latter would > be required. Which is the best? Either approach is OK for me. Providing both options is also possible. My approach was just less code and less change to existing mechanisms, so I did it that way. There are some small optimisations possible on standby if the standby knows what role it's being asked to play. It doesn't matter to me whether we let standby tell master or master tell standby and the code is about the same either way. > Simon's patch seems to include simple quorum commit feature (correct > me if I'm wrong). That is, when there are multiple synchronous standbys, > the master waits until ACK has arrived from at least one standby. OTOH, > in my patch, the master waits until ACK has arrived from all the synchronous > standbys. Which should we choose? I think that we should commit my > straightforward approach first, and enable the quorum commit on that. > Thought? Yes, my approach is simple. For those with Oracle knowledge, my approach (first-reply-releases-waiter) is equivalent to Oracle's Maximum Protection mode (= 'fsync' in my design). Providing even higher levels of protection would not be the most common case. Your approach of waiting for all replies is much slower and requires more complex code, since we need to track intermediate states. It also has additional complexities of behaviour, such as how long do we wait for second acknowledgement when we already have one, and what happens when a second ack is not received? More failure modes == less stable. ISTM that it would require more effort to do this also, since every ack needs to check all WAL sender data to see if it is the last ack. None of that seems straightforward. I don't agree we should commit your approach to that aspect. 
In my proposal, such additional features would be possible as a plugin. The majority of users would not need this facility, and the plugin leaves the way open for high-end users that need this.

> Simon proposes to invoke walwriter in the standby. This is not included in my patch, but looks like a good idea. ISTM that this is not an essential feature for synchronous replication, so how about detaching the walwriter part from the patch and reviewing it independently?

I regard it as an essential feature for implementing 'recv' mode of sync rep, which is the fastest mode. At present WALreceiver does all of these: receive, write and fsync. Of those, the fsync is the slowest and increases response time significantly.

Of course the 'recv' option doesn't need to be part of the first commit, but splitting commits doesn't seem likely to make this go quicker or easier in the early stages. In particular, splitting some features out could make it much harder to put them back in again later. That point is why my patch even exists.

I would like to express my regret that the main feature proposal from me necessitates low-level changes that cause our two patches to be in conflict. Nobody should take this as a sign that there is a personal or professional problem between Fujii-san and myself.

--
Simon Riggs           www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Training and Services
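To make the contrast above concrete, here is a self-contained sketch of the two release policies being debated: first-reply-releases-waiter versus waiting for acknowledgments from every synchronous standby. The types, field names and function are illustrative only, not taken from either patch:

    #include <stdbool.h>
    #include <stdint.h>

    /* Illustrative stand-ins; not the data structures of either patch. */
    typedef uint64_t lsn_t;

    typedef struct
    {
        bool   is_synchronous;   /* does this standby offer a sync level? */
        lsn_t  acked_lsn;        /* highest LSN this standby has acknowledged */
    } StandbyState;

    /*
     * Decide whether a backend waiting for commit_lsn may be released.
     * wait_for_all = true models waiting for every synchronous standby
     * (the approach described for Fujii's patch above); false models
     * first-reply-releases-waiter (Simon's approach).
     */
    static bool
    commit_is_released(const StandbyState *standbys, int nstandbys,
                       lsn_t commit_lsn, bool wait_for_all)
    {
        int  n_sync = 0;
        int  n_acked = 0;

        for (int i = 0; i < nstandbys; i++)
        {
            if (!standbys[i].is_synchronous)
                continue;
            n_sync++;
            if (standbys[i].acked_lsn >= commit_lsn)
                n_acked++;
        }

        if (n_sync == 0)
            return true;    /* no synchronous standby connected; a policy question in itself */

        return wait_for_all ? (n_acked == n_sync) : (n_acked >= 1);
    }

The first-reply policy only has to check a single counter; the wait-for-all policy must track every synchronous standby's progress, which is where the extra failure modes discussed above come from.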
(changed subject again.)

On 17/09/10 10:06, Simon Riggs wrote:
> I don't think we can determine how far to implement without considering both approaches in detail. With regard to your points below, I don't think any of those points could be committed first.

Yeah, I think we need to decide on the desired feature set first, before we dig deeper into the patches. The design and implementation will fall out of that.

That said, there are a few small things that can be progressed regardless of the details of synchronous replication. There are the changes to trigger failover with a signal, and it seems that we'll need some libpq changes to allow acknowledgments to be sent back to the master regardless of the rest of the design. We can discuss those in separate threads in parallel.

So the big question is what the user interface looks like. How does one configure synchronous replication, and what options are available? Here's a list of features that have been discussed. We don't necessarily need all of them in the first phase, but let's avoid painting ourselves into a corner.

* Support multiple standbys with various synchronization levels.

* What happens if a synchronous standby isn't connected at the moment? Return immediately vs. wait forever.

* Per-transaction control. Some transactions are important, others are not.

* Quorum commit. Wait until n standbys acknowledge. n=1 and n=all servers can be seen as important special cases of this.

* async, recv, fsync and replay levels of synchronization.

So what should the user interface be like? Given the 1st and 2nd requirements, we need standby registration. If some standbys are important and others are not, the master needs to distinguish between them to be able to determine that a transaction is safely delivered to the important standbys.

For per-transaction control, ISTM it would be enough to have a simple user-settable GUC like synchronous_commit. Let's call it "synchronous_replication_commit" for now. For non-critical transactions, you can turn it off. That's very simple for developers to understand and use. I don't think we need more fine-grained control than that at transaction level; in all the use cases I can think of, you have a stream of important transactions, mixed with non-important ones like log messages that you want to finish fast in a best-effort fashion. I'm actually tempted to tie that to the existing synchronous_commit GUC, since the use case seems exactly the same.

OTOH, if we do want fine-grained per-transaction control, a simple boolean or even an enum GUC doesn't really cut it. For truly fine-grained control you want to be able to specify exceptions like "wait until this is replayed in the slave named 'reporting'" or "don't wait for acknowledgment from the slave named 'uk-server'". With standby registration, we can invent a syntax for specifying overriding rules in the transaction. Something like SET replication_exceptions = 'reporting=replay, uk-server=async'.

For the control between async/recv/fsync/replay, I like to think in terms of:
a) asynchronous vs. synchronous
b) if it's synchronous, how synchronous is it? recv, fsync or replay?

I think it makes most sense to set sync vs. async in the master, and the level of synchronicity in the slave. Although I have sympathy for the argument that it's simpler if you configure it all from the master side as well.

Putting all of that together, I think Fujii-san's standby.conf is pretty close. What it needs is the additional GUC for transaction-level control.
-- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Fri, 2010-09-17 at 11:09 +0300, Heikki Linnakangas wrote:
> That said, there are a few small things that can be progressed regardless of the details of synchronous replication. There are the changes to trigger failover with a signal, and it seems that we'll need some libpq changes to allow acknowledgments to be sent back to the master regardless of the rest of the design. We can discuss those in separate threads in parallel.

Agree to both of those points.

--
Simon Riggs           www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Training and Services
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> * Support multiple standbys with various synchronization levels.
>
> * What happens if a synchronous standby isn't connected at the moment? Return immediately vs. wait forever.
>
> * Per-transaction control. Some transactions are important, others are not.
>
> * Quorum commit. Wait until n standbys acknowledge. n=1 and n=all servers can be seen as important special cases of this.
>
> * async, recv, fsync and replay levels of synchronization.
>
> So what should the user interface be like? Given the 1st and 2nd requirements, we need standby registration. If some standbys are important and others are not, the master needs to distinguish between them to be able to determine that a transaction is safely delivered to the important standbys.

Well, the 1st point can be handled in a distributed fashion, where the sync level is set up at the slave. Ditto for the second point: you can get the exact same behavior control attached to the quorum facility. What I think your description is missing is the implicit feature that you want to be able to set up the "ignore-or-wait" failure behavior per standby. I'm not sure we need that, or more precisely that we need to have that level of detail in the master's setup. Maybe what we need instead is a more detailed quorum facility, but as you're talking about something similar later in the mail, let's follow you.

> For per-transaction control, ISTM it would be enough to have a simple user-settable GUC like synchronous_commit. Let's call it "synchronous_replication_commit" for now. For non-critical transactions, you can turn it off. That's very simple for developers to understand and use. I don't think we need more fine-grained control than that at transaction level; in all the use cases I can think of, you have a stream of important transactions, mixed with non-important ones like log messages that you want to finish fast in a best-effort fashion. I'm actually tempted to tie that to the existing synchronous_commit GUC, since the use case seems exactly the same.

Well, that would be an oversimplification. In my applications I set the "sessions" transactions to synchronous_commit = off, but the business transactions to synchronous_commit = on. Now, among the latter, I have backoffice editing and money transactions. I'm not willing to be forced to endure the same performance penalty for both when I know the distributed durability needs aren't the same.

> OTOH, if we do want fine-grained per-transaction control, a simple boolean or even an enum GUC doesn't really cut it. For truly fine-grained control you want to be able to specify exceptions like "wait until this is replayed in the slave named 'reporting'" or "don't wait for acknowledgment from the slave named 'uk-server'". With standby registration, we can invent a syntax for specifying overriding rules in the transaction. Something like SET replication_exceptions = 'reporting=replay, uk-server=async'.

Then you want to be able to have more than one reporting server and need only one of them at the "replay" level, but you don't need to know which it is. Or, on the contrary, you have a failover server and you want to be sure this one is at the replay level whatever happens.

Then you want topology flexibility: you need to be able to replace a reporting server with another, ditto for the failover one.

Did I tell you my current thinking on how to tackle that yet? :) Using a distributed setup, where each slave has a weight (several votes per transaction) and a level offering, would allow that, I think.

Now, something similar to your idea that I can see a need for is being able to have a multi-part quorum target: when you currently say that you want 2 votes for sync, you would be able to say you want 2 votes for recv, 2 for fsync and 1 for replay. Remember that any slave is set up to offer only one level of synchronicity but can offer multiple votes.

What would this look like in the setup? Best would be to register the different service levels your application needs. Time to bikeshed a little?

  sync_rep_services = {critical: recv=2, fsync=2, replay=1;
                       important: fsync=3;
                       reporting: recv=2, apply=1}

Well, you get the idea. It could maybe get stored in a catalog somewhere with nice SQL commands etc. The goal is then to be able to handle a much simpler GUC in the application, sync_rep_service = important for example. The reserved label would be off, the default value.

> For the control between async/recv/fsync/replay, I like to think in terms of
> a) asynchronous vs synchronous
> b) if it's synchronous, how synchronous is it? recv, fsync or replay?

Same here.

> I think it makes most sense to set sync vs. async in the master, and the level of synchronicity in the slave.

Yeah, exactly. If you add a weight to each slave and then a quorum commit, you don't change the implementation complexity and you offer a lot of setup flexibility. If the slave sync-level and weight are SIGHUP, it even becomes rather easy to switch roles online, or to add new servers, or to organise a maintenance window — the quorum to reach is a per-transaction GUC on the master, too, right?

Regards,
--
dim
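A small, self-contained illustration of the weighted-vote idea sketched above, checking a service definition such as "critical: recv=2, fsync=2, replay=1" against the acknowledgments gathered so far. All names are hypothetical, and the choice to count an acknowledgment at a stronger level toward the weaker levels as well is an assumption, not something spelled out in the proposal:

    #include <stdbool.h>

    /* Synchronization levels, ordered from weakest to strongest. */
    typedef enum { SYNC_RECV, SYNC_FSYNC, SYNC_REPLAY, NUM_SYNC_LEVELS } SyncLevel;

    /* Illustrative stand-in for one connected standby's offer and current state. */
    typedef struct
    {
        SyncLevel offered_level;   /* the single level this standby offers */
        int       weight;          /* votes this standby contributes */
        bool      has_acked;       /* has it acknowledged the commit in question? */
    } StandbyVote;

    /*
     * Check whether the acknowledgments gathered so far satisfy a service
     * definition; required[level] holds the votes needed at each level.
     * An acknowledgment at a stronger level is assumed to count toward all
     * weaker levels too.
     */
    static bool
    quorum_satisfied(const StandbyVote *votes, int nvotes,
                     const int required[NUM_SYNC_LEVELS])
    {
        int gathered[NUM_SYNC_LEVELS] = {0};

        for (int i = 0; i < nvotes; i++)
        {
            if (!votes[i].has_acked)
                continue;
            for (int lvl = 0; lvl <= (int) votes[i].offered_level; lvl++)
                gathered[lvl] += votes[i].weight;
        }

        for (int lvl = 0; lvl < NUM_SYNC_LEVELS; lvl++)
        {
            if (gathered[lvl] < required[lvl])
                return false;
        }
        return true;
    }

With something like this, the per-transaction knob reduces to selecting which required[] profile (critical, important, reporting, ...) the commit must satisfy.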
On Fri, 2010-09-17 at 09:15 +0100, Simon Riggs wrote:
> On Fri, 2010-09-17 at 11:09 +0300, Heikki Linnakangas wrote:
> > That said, there are a few small things that can be progressed regardless of the details of synchronous replication. There are the changes to trigger failover with a signal, and it seems that we'll need some libpq changes to allow acknowledgments to be sent back to the master regardless of the rest of the design. We can discuss those in separate threads in parallel.
>
> Agree to both of those points.

But I don't agree that those things should be committed just yet.

--
Simon Riggs           www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Training and Services
On 17/09/10 12:10, Dimitri Fontaine wrote:
> Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> > * Support multiple standbys with various synchronization levels.
> >
> > * What happens if a synchronous standby isn't connected at the moment? Return immediately vs. wait forever.
> >
> > * Per-transaction control. Some transactions are important, others are not.
> >
> > * Quorum commit. Wait until n standbys acknowledge. n=1 and n=all servers can be seen as important special cases of this.
> >
> > * async, recv, fsync and replay levels of synchronization.
> >
> > So what should the user interface be like? Given the 1st and 2nd requirements, we need standby registration. If some standbys are important and others are not, the master needs to distinguish between them to be able to determine that a transaction is safely delivered to the important standbys.
>
> Well, the 1st point can be handled in a distributed fashion, where the sync level is set up at the slave.

If the synchronicity is configured in the standby, how does the master know that there's a synchronous slave out there that it should wait for, if that slave isn't connected at the moment?

> > OTOH, if we do want fine-grained per-transaction control, a simple boolean or even an enum GUC doesn't really cut it. For truly fine-grained control you want to be able to specify exceptions like "wait until this is replayed in the slave named 'reporting'" or "don't wait for acknowledgment from the slave named 'uk-server'". With standby registration, we can invent a syntax for specifying overriding rules in the transaction. Something like SET replication_exceptions = 'reporting=replay, uk-server=async'.
>
> Then you want to be able to have more than one reporting server and need only one of them at the "replay" level, but you don't need to know which it is. Or, on the contrary, you have a failover server and you want to be sure this one is at the replay level whatever happens.
>
> Then you want topology flexibility: you need to be able to replace a reporting server with another, ditto for the failover one.
>
> Did I tell you my current thinking on how to tackle that yet? :) Using a distributed setup, where each slave has a weight (several votes per transaction) and a level offering, would allow that, I think.

Yeah, the quorum stuff. That's all good, but doesn't change the way you would do per-transaction control. By specifying overrides on a per-transaction basis, you can have as fine-grained control as you possibly can. Anything you can specify in a configuration file can then also be specified per-transaction with overrides. The syntax just needs to be flexible enough.

If we buy into the concept of per-transaction exceptions, we can put that issue aside for the moment, and just consider how to configure things in a config file. Anything you can express in the config file can also be expressed per-transaction with the exceptions GUC.

> Now, something similar to your idea that I can see a need for is being able to have a multi-part quorum target: when you currently say that you want 2 votes for sync, you would be able to say you want 2 votes for recv, 2 for fsync and 1 for replay. Remember that any slave is set up to offer only one level of synchronicity but can offer multiple votes.
>
> What would this look like in the setup? Best would be to register the different service levels your application needs. Time to bikeshed a little?
>
>   sync_rep_services = {critical: recv=2, fsync=2, replay=1;
>                        important: fsync=3;
>                        reporting: recv=2, apply=1}
>
> Well, you get the idea. It could maybe get stored in a catalog somewhere with nice SQL commands etc. The goal is then to be able to handle a much simpler GUC in the application, sync_rep_service = important for example. The reserved label would be off, the default value.

So ignoring the quorum stuff for a moment, the general idea is that you have predefined sets of configurations (or exceptions to the general config) specified in a config file, and in the application you just choose among those with "sync_rep_service=XXX". Yeah, I like that; it allows you to isolate the details of the topology from the application.

> If you add a weight to each slave and then a quorum commit, you don't change the implementation complexity and you offer a lot of setup flexibility. If the slave sync-level and weight are SIGHUP, it even becomes rather easy to switch roles online, or to add new servers, or to organise a maintenance window — the quorum to reach is a per-transaction GUC on the master, too, right?

I haven't bought into the quorum idea yet, but yeah, if we have quorum support, then it would be configurable on a per-transaction basis too with the above mechanism.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
On Fri, 2010-09-17 at 11:09 +0300, Heikki Linnakangas wrote:
> (changed subject again.)
>
> On 17/09/10 10:06, Simon Riggs wrote:
> > I don't think we can determine how far to implement without considering both approaches in detail. With regard to your points below, I don't think any of those points could be committed first.
>
> Yeah, I think we need to decide on the desired feature set first, before we dig deeper into the patches. The design and implementation will fall out of that.

Well, we've discussed these things many times and talking hasn't got us very far on its own. We need measurements and neutral assessments. The patches are simple and we have time.

This isn't just about UI; there are significant and important differences between the proposals in terms of the capability and control they offer.

I propose we develop both patches further and performance test them. Many of the features I have proposed are performance related and people need to be able to see what is important, and what is not. But not through mere discussion: we need numbers to show which things matter and which things don't. And those need to be derived objectively.

> * Support multiple standbys with various synchronization levels.
>
> * What happens if a synchronous standby isn't connected at the moment? Return immediately vs. wait forever.
>
> * Per-transaction control. Some transactions are important, others are not.
>
> * Quorum commit. Wait until n standbys acknowledge. n=1 and n=all servers can be seen as important special cases of this.
>
> * async, recv, fsync and replay levels of synchronization.

That's a reasonable starting list of points; there may be others.

> So what should the user interface be like? Given the 1st and 2nd requirements, we need standby registration. If some standbys are important and others are not, the master needs to distinguish between them to be able to determine that a transaction is safely delivered to the important standbys.

My patch provides those two requirements without standby registration, so we very clearly don't "need" standby registration. The question is: do we want standby registration on the master, and if so, why?

> For per-transaction control, ISTM it would be enough to have a simple user-settable GUC like synchronous_commit. Let's call it "synchronous_replication_commit" for now.

If you wish to change the name of the GUC away from the one I have proposed, fine. Please note that aspect isn't important to me and I will happily concede all such points to the majority view.

> For non-critical transactions, you can turn it off. That's very simple for developers to understand and use. I don't think we need more fine-grained control than that at transaction level; in all the use cases I can think of, you have a stream of important transactions, mixed with non-important ones like log messages that you want to finish fast in a best-effort fashion.

Sounds like we're getting somewhere. See below.

> I'm actually tempted to tie that to the existing synchronous_commit GUC, since the use case seems exactly the same.

http://archives.postgresql.org/pgsql-hackers/2008-07/msg01001.php

Check the date!

I think that particular point is going to confuse us. It will draw much bike shedding and won't help us decide between patches. It's a nicety that can be left to a time after we have the core feature committed.

> OTOH, if we do want fine-grained per-transaction control, a simple boolean or even an enum GUC doesn't really cut it. For truly fine-grained control you want to be able to specify exceptions like "wait until this is replayed in the slave named 'reporting'" or "don't wait for acknowledgment from the slave named 'uk-server'". With standby registration, we can invent a syntax for specifying overriding rules in the transaction. Something like SET replication_exceptions = 'reporting=replay, uk-server=async'.
>
> For the control between async/recv/fsync/replay, I like to think in terms of
> a) asynchronous vs synchronous
> b) if it's synchronous, how synchronous is it? recv, fsync or replay?
>
> I think it makes most sense to set sync vs. async in the master, and the level of synchronicity in the slave. Although I have sympathy for the argument that it's simpler if you configure it all from the master side as well.

I have catered for such requests by suggesting a plugin that allows you to implement that complexity without overburdening the core code.

This strikes me as an "ad absurdum" argument. Since the above over-complexity would doubtless be seen as insane by Tom et al, it attempts to persuade that we don't need recv, fsync and apply either.

Fujii has long talked about 4 levels of service also. Why change? I had thought that part was pretty much agreed between all of us.

Without performance tests to demonstrate "why", these do sound hard to understand. But we should note that DRBD offers recv ("B") and fsync ("C") as separate options. And Oracle implements all 3 of recv, fsync and apply. Neither of them describes those options so simply and easily as the way we are proposing with a 4-valued enum (with async as the fourth option).

If we have only one option for sync_rep = 'on', which of recv | fsync | apply would it implement? You don't mention that. Which do you choose? For what reason do you make that restriction? The code doesn't get any simpler, in my patch at least; from my perspective it would be a restriction without benefit.

I no longer seek to persuade by words alone. The existence of my patch means that I think that only measurements and tests will show why I have been saying these things. We need performance tests. I'm not ready for them today, but will be very soon. I suspect you aren't either, since from earlier discussions you didn't appear to have much on overall throughput, only about response times for single transactions. I'm happy to be proved wrong there.

> Putting all of that together, I think Fujii-san's standby.conf is pretty close. What it needs is the additional GUC for transaction-level control.

The difference between the patches is not a simple matter of a GUC.

My proposal allows a single standby to provide efficient replies to multiple requested durability levels all at the same time, with efficient use of network resources. ISTM that because the other patch cannot provide that you'd like to persuade us that we don't need that, ever. You won't sell me on that point, cos I can see lots of uses for it.

Another use case for you:

* customer orders are important, but we want lots of them, so we use recv mode for those.

* pricing data hardly ever changes, but when it does we need it to be applied across the cluster so we don't get read mismatches, so those rare transactions use apply mode.

If you don't want multiple modes at once, you don't need to use that feature. But there is no reason to prevent people having the choice, when a design exists that can provide it.

(A separate and later point is that I would one day like to annotate specific tables and functions with different modes, so a sysadmin can point out which data is important at table level - which is what MySQL provides by allowing choice of storage engine for particular tables. Nobody cares about the specific engine; they care about the durability implications of those choices. This isn't part of the current proposal, just a later statement of direction.)

--
Simon Riggs           www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Training and Services
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes: > If the synchronicity is configured in the standby, how does the master know > that there's a synchronous slave out there that it should wait for, if that > slave isn't connected at the moment? That's what quorum is trying to solve. The master knows how many votes per sync level the transaction needs. If no slave is acknowledging any vote, that's all you need to know to ROLLBACK (after the timeout), right? — if setup says so, on the master. > Yeah, the quorum stuff. That's all good, but doesn't change the way you > would do per-transaction control. That's when I bought in on the feature. It's all dynamic and distributed, and it offers per-transaction control. Regards, -- Dimitri Fontaine PostgreSQL DBA, Architecte
On Fri, 2010-09-17 at 12:30 +0300, Heikki Linnakangas wrote:
> If the synchronicity is configured in the standby, how does the master know that there's a synchronous slave out there that it should wait for, if that slave isn't connected at the moment?

That isn't a question you need standby registration to answer.

In my proposal, the user requests a certain level of confirmation and will wait until timeout to see if it is received. The standby can crash and restart, come back and provide the answer, and it will still work. So it is the user request that informs the master that there would normally be a synchronous slave out there it should wait for.

So far, I have added the point that if a user requests a level of confirmation that is currently unavailable, then it will use the highest level of confirmation available now. That stops us from waiting for timeout for every transaction we run if standby goes down hard, which just freezes the application for long periods to no real benefit. It also prevents applications from requesting durability levels the cluster cannot satisfy, in the opinion of the sysadmin, since the sysadmin specifies the max level on each standby.

--
Simon Riggs           www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Training and Services
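The fallback behaviour described above (use the highest level of confirmation currently available when the requested level cannot be satisfied) can be sketched as a simple clamp. This is an illustration only; the enum and function names are not taken from the actual patch:

    /* Synchronization levels, ordered from weakest to strongest; names illustrative. */
    typedef enum { SYNC_ASYNC, SYNC_RECV, SYNC_FSYNC, SYNC_APPLY } SyncLevel;

    /*
     * A transaction asks for "requested", each connected standby advertises
     * the maximum service level it offers (e.g. sync_replication_service),
     * and the commit waits at the highest level actually available right
     * now, never more.
     */
    static SyncLevel
    effective_sync_level(SyncLevel requested, const SyncLevel *offered, int nstandbys)
    {
        SyncLevel best_available = SYNC_ASYNC;

        for (int i = 0; i < nstandbys; i++)
        {
            if (offered[i] > best_available)
                best_available = offered[i];
        }

        return (requested <= best_available) ? requested : best_available;
    }

For example, a transaction requesting apply while the only connected standby offers fsync would wait at fsync level rather than hanging until timeout.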
On 17/09/10 12:49, Simon Riggs wrote:
> This isn't just about UI; there are significant and important differences between the proposals in terms of the capability and control they offer.

Sure. The point of focusing on the UI is that the UI demonstrates what capability and control a proposal offers.

> > So what should the user interface be like? Given the 1st and 2nd requirements, we need standby registration. If some standbys are important and others are not, the master needs to distinguish between them to be able to determine that a transaction is safely delivered to the important standbys.
>
> My patch provides those two requirements without standby registration, so we very clearly don't "need" standby registration.

It's still not clear to me how you would configure things like "wait for ack from reporting slave, but not other slaves" or "wait until replayed in the server on the west coast" in your proposal. Maybe it's possible, but doesn't seem very intuitive, requiring careful configuration in both the master and the slaves.

In your proposal, you also need to be careful not to connect e.g. a test slave with "synchronous_replication_service = apply" to the master, or it will possibly shadow a real production slave, acknowledging transactions that are not yet received by the real slave. It's certainly possible to screw up with standby registration too, but you have more direct control of the master behavior in the master, instead of distributing it across all slaves.

> The question is: do we want standby registration on the master, and if so, why?

Well, aside from how to configure synchronous replication, standby registration would help with retaining the right amount of WAL in the master. wal_keep_segments doesn't guarantee that enough is retained, and OTOH when all standbys are connected you retain much more than might be required.

Giving names to slaves also allows you to view their status in the master in a more intuitive format. Something like:

  postgres=# SELECT * FROM pg_slave_status ;
      name    | connected |  received  |   fsyncd   |  applied
  ------------+-----------+------------+------------+------------
   reporting  | t         | 0/26000020 | 0/26000020 | 0/25550020
   ha-standby | t         | 0/26000020 | 0/26000020 | 0/26000020
   testserver | f         |            | 0/15000020 |
  (3 rows)

> > For the control between async/recv/fsync/replay, I like to think in terms of
> > a) asynchronous vs synchronous
> > b) if it's synchronous, how synchronous is it? recv, fsync or replay?
> >
> > I think it makes most sense to set sync vs. async in the master, and the level of synchronicity in the slave. Although I have sympathy for the argument that it's simpler if you configure it all from the master side as well.
>
> I have catered for such requests by suggesting a plugin that allows you to implement that complexity without overburdening the core code.

Well, plugins are certainly one possibility, but then we need to design the plugin API. I've been thinking along the lines of a proxy, which can implement whatever logic you want to decide when to send the acknowledgment. With a proxy as well, if we push any features that people want to a proxy or plugin, we need to make sure that the proxy/plugin has all the necessary information available.

> This strikes me as an "ad absurdum" argument. Since the above over-complexity would doubtless be seen as insane by Tom et al, it attempts to persuade that we don't need recv, fsync and apply either.
>
> Fujii has long talked about 4 levels of service also. Why change? I had thought that part was pretty much agreed between all of us.

Now you lost me. I agree that we need 4 levels of service (at least ultimately, not necessarily in the first phase).

> Without performance tests to demonstrate "why", these do sound hard to understand. But we should note that DRBD offers recv ("B") and fsync ("C") as separate options. And Oracle implements all 3 of recv, fsync and apply. Neither of them describes those options so simply and easily as the way we are proposing with a 4-valued enum (with async as the fourth option).
>
> If we have only one option for sync_rep = 'on', which of recv | fsync | apply would it implement? You don't mention that. Which do you choose?

You would choose between recv, fsync and apply in the slave, with a GUC.

> I no longer seek to persuade by words alone. The existence of my patch means that I think that only measurements and tests will show why I have been saying these things. We need performance tests.

I don't expect any meaningful differences in terms of performance between any of the discussed options. The big question right now is what features we provide and how they're configured. Performance will depend primarily on the mode you use, and secondarily on the implementation of the mode. It would be completely premature to do performance testing yet IMHO.

> > Putting all of that together, I think Fujii-san's standby.conf is pretty close. What it needs is the additional GUC for transaction-level control.
>
> The difference between the patches is not a simple matter of a GUC.
>
> My proposal allows a single standby to provide efficient replies to multiple requested durability levels all at the same time, with efficient use of network resources. ISTM that because the other patch cannot provide that you'd like to persuade us that we don't need that, ever. You won't sell me on that point, cos I can see lots of uses for it.

Simon, how the replies are sent is an implementation detail I haven't given much thought yet. The reason we delved into that discussion earlier was that you seemed to contradict yourself with the claims that you don't need to send more than one reply per transaction, and that the standby doesn't need to know the synchronization level. Other than the curiosity about that contradiction, it doesn't seem like a very interesting detail to me right now. It's not a question that drives the rest of the design, but the other way round.

But FWIW, something like your proposal of sending 3 XLogRecPtrs in each reply seems like a good approach.

I'm not sure about using walwriter. I can see that it helps with getting the 'recv' and 'replay' acknowledgments out faster, but I still have the scars from starting bgwriter during recovery.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
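The "3 XLogRecPtrs in each reply" approach mentioned above amounts to a reply message carrying the standby's received, fsynced and applied WAL positions, so the master can release waiters at whichever level each transaction asked for. A hypothetical sketch; the struct, field names and the uint64 stand-in for XLogRecPtr are illustrative, not the wire format of either patch:

    #include <stdbool.h>
    #include <stdint.h>

    /* Stand-in for XLogRecPtr. */
    typedef uint64_t WalPosition;

    /* One reply from standby to master, carrying three WAL positions. */
    typedef struct StandbyReply
    {
        WalPosition received_upto;   /* written to disk on the standby */
        WalPosition fsynced_upto;    /* flushed (fsync'd) on the standby */
        WalPosition applied_upto;    /* replayed on the standby */
    } StandbyReply;

    typedef enum { WAIT_RECV, WAIT_FSYNC, WAIT_APPLY } WaitLevel;

    /* Can a waiter for commit_pos at the given level be released by this reply? */
    static bool
    reply_satisfies(const StandbyReply *reply, WalPosition commit_pos, WaitLevel level)
    {
        switch (level)
        {
            case WAIT_RECV:  return reply->received_upto >= commit_pos;
            case WAIT_FSYNC: return reply->fsynced_upto  >= commit_pos;
            case WAIT_APPLY: return reply->applied_upto  >= commit_pos;
        }
        return false;
    }

Because one reply carries all three positions, a single standby can serve transactions waiting at different levels without sending extra messages, which is the network-efficiency point Simon makes above.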
Simon Riggs <simon@2ndQuadrant.com> writes: > So far, I have added the point that if a user requests a level of > confirmation that is currently unavailable, then it will use the highest > level of confirmation available now. That stops us from waiting for > timeout for every transaction we run if standby goes down hard, which > just freezes the application for long periods to no real benefit. It > also prevents applications from requesting durability levels the cluster > cannot satisfy, in the opinion of the sysadmin, since the sysadmin > specifies the max level on each standby. That sounds like the commit-or-rollback when slave are gone question. I think this behavior should be user-setable, again per-transaction. I agree with you that the general case looks like your proposed default, but we already know that some will need "don't ack if not replied before the timeout", and they even will go as far as asking for it to be reported as a serialisation error of some sort, I guess… Regards, -- Dimitri Fontaine PostgreSQL DBA, Architecte
On Fri, 2010-09-17 at 13:41 +0300, Heikki Linnakangas wrote: > On 17/09/10 12:49, Simon Riggs wrote: > > This isn't just about UI, there are significant and important > > differences between the proposals in terms of the capability and control > > they offer. > > Sure. The point of focusing on the UI is that the UI demonstrates what > capability and control a proposal offers. My patch does not include server registration. It could be added later on top of my patch without any issues. The core parts of my patch are the fine grained transaction-level control and the ability to mix them dynamically with good performance. To me server registration is not a core issue. I'm not actively against it, I just don't see the need for it at all. Certainly not committed first, especially since its not actually needed by either of our patches. Standby registration doesn't provide *any* parameter that can't be supplied from standby recovery.conf. The only thing standby registration allows you to do is know whether there was supposed to be a standby there, but yet it isn't there now. I don't see that point as being important because it seems strange to me to want to wait for a standby that ought to be there, but isn't anymore. What happens if it never comes back? Manual intervention required. (We agree on how to handle a standby that *is* "connected", yet never returns a reply or takes too long to do so). > >> So what should the user interface be like? Given the 1st and 2nd > >> requirement, we need standby registration. If some standbys are > >> important and others are not, the master needs to distinguish between > >> them to be able to determine that a transaction is safely delivered to > >> the important standbys. > > > > My patch provides those two requirements without standby registration, > > so we very clearly don't "need" standby registration. > > It's still not clear to me how you would configure things like "wait for > ack from reporting slave, but not other slaves" or "wait until replayed > in the server on the west coast" in your proposal. Maybe it's possible, > but doesn't seem very intuitive, requiring careful configuration in both > the master and the slaves. In the use cases we discussed we had simple 2 or 3 server configs. master standby1 - preferred sync target - set to recv, fsync or apply standby2 - non-preferred sync target, maybe test server - set to async So in the two cases you mention we might set "wait for ack from reporting slave" master: sync_replication = 'recv' #as default, can be changed reporting-slave: sync_replication_service = 'recv' #gives max level "wait until replayed in the server on the west coast" master: sync_replication = 'recv' #as default, can be changed west-coast: sync_replication_service = 'apply' #gives max level The absence of registration in my patch makes some things easier and some things harder. For example, you can add a new standby without editing the config on the master. If you had 2 standbys, both offering the same level of protection, my proposal would *not* allow you to specify that you preferred one master over another. But we could add a priority parameter as well if that's an issue. > In your proposal, you also need to be careful not to connect e.g a test > slave with "synchronous_replication_service = apply" to the master, or > it will possible shadow a real production slave, acknowledging > transactions that are not yet received by the real slave. 
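To make the two examples above easier to follow, here is a minimal sketch of how those settings might be spelled out in the configuration files, using the GUC names from Simon's proposal (sync_replication on the master, sync_replication_service on each standby). Which file the standby-side setting would live in is an assumption here, not something the message specifies:

    # master, postgresql.conf -- default request level, changeable per transaction
    sync_replication = 'recv'

    # reporting slave -- offers acknowledgments up to 'recv'
    sync_replication_service = 'recv'

    # west-coast slave -- offers acknowledgments up to 'apply'
    sync_replication_service = 'apply'

    # test slave -- never acts as a synchronous target
    sync_replication_service = 'async'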
It's certainly > possible to screw up with standby registration too, but you have more > direct control of the master behavior in the master, instead of > distributing it across all slaves. > > > The question is do we want standby registration on master and if so, > > why? > > Well, aside from how to configure synchronous replication, standby > registration would help with retaining the right amount of WAL in the > master. wal_keep_segments doesn't guarantee that enough is retained, and > OTOH when all standbys are connected you retain much more than might be > required. > > Giving names to slaves also allows you to view their status in the > master in a more intuitive format. Something like: We can give servers a name without registration. It actually makes more sense to set the name in the standby and it can be passed through from standby when we connect. I very much like the idea of server names and think this next SRF looks really cool. > postgres=# SELECT * FROM pg_slave_status ; > name | connected | received | fsyncd | applied > ------------+-----------+------------+------------+------------ > reporting | t | 0/26000020 | 0/26000020 | 0/25550020 > ha-standby | t | 0/26000020 | 0/26000020 | 0/26000020 > testserver | f | | 0/15000020 | > (3 rows) That could be added on top of my patch also. -- Simon Riggs www.2ndQuadrant.comPostgreSQL Development, 24x7 Support, Training and Services
On Fri, Sep 17, 2010 at 6:41 AM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote: >>> So what should the user interface be like? Given the 1st and 2nd >>> requirement, we need standby registration. If some standbys are >>> important and others are not, the master needs to distinguish between >>> them to be able to determine that a transaction is safely delivered to >>> the important standbys. >> >> My patch provides those two requirements without standby registration, >> so we very clearly don't "need" standby registration. > > It's still not clear to me how you would configure things like "wait for ack > from reporting slave, but not other slaves" or "wait until replayed in the > server on the west coast" in your proposal. Maybe it's possible, but doesn't > seem very intuitive, requiring careful configuration in both the master and > the slaves. Agreed. I think this will be much simpler if all the configuration is in one place (on the master). > In your proposal, you also need to be careful not to connect e.g a test > slave with "synchronous_replication_service = apply" to the master, or it > will possible shadow a real production slave, acknowledging transactions > that are not yet received by the real slave. It's certainly possible to > screw up with standby registration too, but you have more direct control of > the master behavior in the master, instead of distributing it across all > slaves. Similarly agreed. >> The question is do we want standby registration on master and if so, >> why? > > Well, aside from how to configure synchronous replication, standby > registration would help with retaining the right amount of WAL in the > master. wal_keep_segments doesn't guarantee that enough is retained, and > OTOH when all standbys are connected you retain much more than might be > required. +1. > Giving names to slaves also allows you to view their status in the master in > a more intuitive format. Something like: > > postgres=# SELECT * FROM pg_slave_status ; > name | connected | received | fsyncd | applied > ------------+-----------+------------+------------+------------ > reporting | t | 0/26000020 | 0/26000020 | 0/25550020 > ha-standby | t | 0/26000020 | 0/26000020 | 0/26000020 > testserver | f | | 0/15000020 | > (3 rows) +1. Having said all of the above, I am not in favor your (Heikki's) proposal to configure sync/async on the slave and the level on the master. That seems like a somewhat bizarre division of labor, splitting what is essentially one setting across two machines. >>> For the control between async/recv/fsync/replay, I like to think in >>> terms of >>> a) asynchronous vs synchronous >>> b) if it's synchronous, how synchronous is it? recv, fsync or replay? >>> >>> I think it makes most sense to set sync vs. async in the master, and the >>> level of synchronicity in the slave. Although I have sympathy for the >>> argument that it's simpler if you configure it all from the master side >>> as well. >> >> I have catered for such requests by suggesting a plugin that allows you >> to implement that complexity without overburdening the core code. > > Well, plugins are certainly one possibility, but then we need to design the > plugin API. I've been thinking along the lines of a proxy, which can > implement whatever logic you want to decide when to send the acknowledgment. > With a proxy as well, if we push any features people that want to a proxy or > plugin, we need to make sure that the proxy/plugin has all the necessary > information available. 
I'm not really sold on the proxy idea. That seems like it adds a lot of configuration complexity, not to mention additional hops. Of course, the plug-in idea also won't be suitable for any but the most advanced users. I think of the two I prefer the idea of a plug-in, slightly, but maybe this doesn't have to be done in version 1. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
On Fri, Sep 17, 2010 at 7:31 AM, Simon Riggs <simon@2ndquadrant.com> wrote: > The only thing standby registration allows you to do is know whether > there was supposed to be a standby there, but yet it isn't there now. I > don't see that point as being important because it seems strange to me > to want to wait for a standby that ought to be there, but isn't anymore. > What happens if it never comes back? Manual intervention required. > > (We agree on how to handle a standby that *is* "connected", yet never > returns a reply or takes too long to do so). Doesn't Oracle provide a mode where it shuts down if this occurs? > The absence of registration in my patch makes some things easier and > some things harder. For example, you can add a new standby without > editing the config on the master. That's actually one of the reasons why I like the idea of registration. It seems rather scary to add a new standby without editing the config on the master. Actually, adding a new fully-async slave without touching the master seems reasonable, but adding a new sync slave without touching the master gives me the willies. The behavior of the system could change quite sharply when you do this, and it might not be obvious what has happened. (Imagine DBA #1 makes the change and DBA #2 is then trying to figure out what's happened - he checks the configs of all the machines he knows about and finds them all unchanged... head-scratching ensues.) -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
On Fri, Sep 17, 2010 at 7:41 PM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote: >> The question is do we want standby registration on master and if so, >> why? > > Well, aside from how to configure synchronous replication, standby > registration would help with retaining the right amount of WAL in the > master. wal_keep_segments doesn't guarantee that enough is retained, and > OTOH when all standbys are connected you retain much more than might be > required. Yep. And standby registration is required when we support "wait forever when synchronous standby isn't connected at the moment" option that Heikki explained upthread. Though I don't think that standby registration is required in the first phase since "wait forever" option is not used in basic use case. Synchronous replication is basically used to reduce the downtime, and "wait forever" option opposes that. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
On Fri, Sep 17, 2010 at 8:31 PM, Simon Riggs <simon@2ndquadrant.com> wrote: > The only thing standby registration allows you to do is know whether > there was supposed to be a standby there, but yet it isn't there now. I > don't see that point as being important because it seems strange to me > to want to wait for a standby that ought to be there, but isn't anymore. According to what I heard, some people want to guarantee that all the transactions are *always* written in *all* the synchronous standbys. IOW, they want to keep the transaction waiting until it has been written in all the synchronous standbys. Standby registration is required to support such a use case. Without the registration, the master cannot determine whether the transaction has been written in all the synchronous standbys. > What happens if it never comes back? Manual intervention required. Yep. > In the use cases we discussed we had simple 2 or 3 server configs. > > master > standby1 - preferred sync target - set to recv, fsync or apply > standby2 - non-preferred sync target, maybe test server - set to async > > So in the two cases you mention we might set > > "wait for ack from reporting slave" > master: sync_replication = 'recv' #as default, can be changed > reporting-slave: sync_replication_service = 'recv' #gives max level > > "wait until replayed in the server on the west coast" > master: sync_replication = 'recv' #as default, can be changed > west-coast: sync_replication_service = 'apply' #gives max level What synchronization level does each combination of sync_replication and sync_replication_service lead to? I'd like to see something like the following table.

 sync_replication | sync_replication_service | result
------------------+---------------------------+--------
 async            | async                     | ???
 async            | recv                      | ???
 async            | fsync                     | ???
 async            | apply                     | ???
 recv             | async                     | ???
 ...

Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
On Fri, 2010-09-17 at 21:20 +0900, Fujii Masao wrote: > What synchronization level does each combination of sync_replication > and sync_replication_service lead to? I'd like to see something like > the following table. > > sync_replication | sync_replication_service | result > ------------------+--------------------------+-------- > async | async | ??? > async | recv | ??? > async | fsync | ??? > async | apply | ??? > recv | async | ??? > ... Good question. There are only 4 possible outcomes. There is no combination, so we don't need a table like that above. The "service" specifies the highest request type available from that specific standby. If someone requests a higher service than is currently offered by this standby, they will either a) get that service from another standby that does offer that level b) automatically downgrade the sync rep mode to the highest available. For example, if you request recv but there is only one standby and it only offers async, then you get downgraded to async. In all cases, if you request async then we act same as 9.0. -- Simon Riggs www.2ndQuadrant.comPostgreSQL Development, 24x7 Support, Training and Services
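Taking Simon's explanation literally, the table Fujii asked for would collapse, for a single connected standby, to "the effective level is the lower of what the transaction requests and what that standby offers". That is a reading of the message above, not something either patch spells out in this form:

 sync_replication (master) | sync_replication_service (standby) | effective level
---------------------------+-------------------------------------+---------------------
 async                     | (any)                               | async
 recv                      | async                               | async (downgraded)
 recv                      | recv / fsync / apply                | recv
 fsync                     | recv                                | recv (downgraded)
 fsync                     | fsync / apply                       | fsync
 apply                     | fsync                               | fsync (downgraded)
 apply                     | apply                               | apply

With more than one standby, option a) applies first: the request is satisfied by whichever standby offers the requested level.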
On Fri, Sep 17, 2010 at 5:09 PM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote: > * Quorum commit. Wait until n standbys acknowledge. n=1 and n=all servers > can be seen as important special cases of this. I think that we should skip quorum commit at the first phase because the design seems to be still poorly-thought-out. I'm concerned about the case where the faster synchronous standby goes down and the lagged synchronous one remains when n=1. In this case, some transactions marked as committed in a client might not be replicated to the remaining synchronous standby yet. What if the master goes down at this point? How can we determine whether promoting the remaining standby to the master causes data loss? Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
On Fri, Sep 17, 2010 at 8:43 AM, Fujii Masao <masao.fujii@gmail.com> wrote: > On Fri, Sep 17, 2010 at 5:09 PM, Heikki Linnakangas > <heikki.linnakangas@enterprisedb.com> wrote: >> * Quorum commit. Wait until n standbys acknowledge. n=1 and n=all servers >> can be seen as important special cases of this. > > I think that we should skip quorum commit at the first phase > because the design seems to be still poorly-thought-out. > > I'm concerned about the case where the faster synchronous standby > goes down and the lagged synchronous one remains when n=1. In this > case, some transactions marked as committed in a client might not > be replicated to the remaining synchronous standby yet. What if > the master goes down at this point? How can we determine whether > promoting the remaining standby to the master causes data loss? Yep. That issue has been raised before, and I think it's quite valid. That's not to say the feature isn't valid, but I think trying to include it in the first commit is going to lead to endless wrangling about design. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
On Fri, 2010-09-17 at 21:20 +0900, Fujii Masao wrote: > On Fri, Sep 17, 2010 at 8:31 PM, Simon Riggs <simon@2ndquadrant.com> wrote: > > The only thing standby registration allows you to do is know whether > > there was supposed to be a standby there, but yet it isn't there now. I > > don't see that point as being important because it seems strange to me > > to want to wait for a standby that ought to be there, but isn't anymore. > > According to what I heard, some people want to guarantee that all the > transactions are *always* written in *all* the synchronous standbys. > IOW, they want to keep the transaction waiting until it has been written > in all the synchronous standbys. Standby registration is required to > support such a use case. Without the registration, the master cannot > determine whether the transaction has been written in all the synchronous > standbys. You don't need standby registration at all. You can do that with a single parameter, already proposed: quorum_commit = N. But most people said they didn't want it. If they do we can put it back later. I don't think we're getting anywhere here. I just don't see any *need* to have it. Some people might *want* to set things up that way, and if that's true, that's enough for me to agree with them. The trouble is, I know some people have said they *want* to set it in the standby and we definitely *need* to set it somewhere. After this discussion, I think "both" is easily done and quite cool. -- Simon Riggs www.2ndQuadrant.comPostgreSQL Development, 24x7 Support, Training and Services
On Fri, 2010-09-17 at 20:56 +0900, Fujii Masao wrote: > On Fri, Sep 17, 2010 at 7:41 PM, Heikki Linnakangas > <heikki.linnakangas@enterprisedb.com> wrote: > >> The question is do we want standby registration on master and if so, > >> why? > > > > Well, aside from how to configure synchronous replication, standby > > registration would help with retaining the right amount of WAL in the > > master. wal_keep_segments doesn't guarantee that enough is retained, and > > OTOH when all standbys are connected you retain much more than might be > > required. > > Yep. Setting wal_keep_segments is difficult, but its not a tunable. The sysadmin needs to tell us what is the maximum number of files she'd like to keep. Otherwise we may fill up a disk, use space intended for use by another app, etc.. The server cannot determine what limits the sysadmin may wish to impose. The only sane default is 0, because "store everything, forever" makes no sense. Similarly, if we register a server, it goes down and we forget to deregister it then we will attempt to store everything, forever and our system will go down. The bigger problem is base backups, not server restarts. We don't know how to get that right because we don't register base backups automatically. If we did dynamically alter the number of WALs we store then we'd potentially screw up new base backups. Server registration won't help with that at all, so you'd need to add a base backup registration scheme as well. But even if you had that, you'd still need a "max" setting defined by sysadmin. So the only sane thing to do is to set wal_keep_segments as high as possible. And doing that doesn't need server reg. > And standby registration is required when we support "wait forever when > synchronous standby isn't connected at the moment" option that Heikki > explained upthread. Though I don't think that standby registration is > required in the first phase since "wait forever" option is not used in > basic use case. Synchronous replication is basically used to reduce the > downtime, and "wait forever" option opposes that. Agreed, but I'd say "if" we support that. -- Simon Riggs www.2ndQuadrant.comPostgreSQL Development, 24x7 Support, Training and Services
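For reference, the cap Simon is describing is the existing GUC on the master; a minimal example (the number is arbitrary):

    # postgresql.conf on the master
    # With the default 16 MB segment size this retains roughly 4 GB of WAL
    # for disconnected standbys before segments are recycled.
    wal_keep_segments = 256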
On Fri, 2010-09-17 at 13:41 +0300, Heikki Linnakangas wrote: > On 17/09/10 12:49, Simon Riggs wrote: > > Fujii has long talked about 4 levels of service also. Why change? I had > > thought that part was pretty much agreed between all of us. > > Now you lost me. I agree that we need 4 levels of service (at least > ultimately, not necessarily in the first phase). OK, good. > > Without performance tests to demonstrate "why", these do sound hard to > > understand. But we should note that DRBD offers recv ("B") and fsync > > ("C") as separate options. And Oracle implements all 3 of recv, fsync > > and apply. Neither of them describe those options so simply and easily > > as the way we are proposing with a 4 valued enum (with async as the > > fourth option). > > > > If we have only one option for sync_rep = 'on' which of recv | fsync | > > apply would it implement? You don't mention that. Which do you choose? > > You would choose between recv, fsync and apply in the slave, with a GUC. So you would have both registration on the master and parameter settings on the standby? I doubt you mean that, so possibly need more explanation there for me to understand what you mean and also why you would do that. > > I no longer seek to persuade by words alone. The existence of my patch > > means that I think that only measurements and tests will show why I have > > been saying these things. We need performance tests. > > I don't expect any meaningful differences in terms of performance > between any of the discussed options. The big question right now is... This is the critical point. Politely, I would observe that *You* do not think there is a meaningful difference. *I* do, and evidence suggests that both Oracle and DRBD think so too. So we differ on what the "big question" is here. It's sounding to me that if we don't know these things, then we're quite a long way from committing something. This is basic research. > what > features we provide and how they're configured. Performance will depend > primarily on the mode you use, and secondarily on the implementation of > the mode. It would be completely premature to do performance testing yet > IMHO. If a patch is "ready" then we should be able to performance test it *before* we commit it. From what you say it sounds like Fujii's patch might yet require substantial tuning, so it might even be the case that my patch is closer in terms of readiness to commit. Whatever the case, we have two patches and I can't see any benefit in avoiding performance tests. > >> Putting all of that together. I think Fujii-san's standby.conf is pretty > >> close. > > > >> What it needs is the additional GUC for transaction-level control. > > > > The difference between the patches is not a simple matter of a GUC. > > > > My proposal allows a single standby to provide efficient replies to > > multiple requested durability levels all at the same time. With > > efficient use of network resources. ISTM that because the other patch > > cannot provide that you'd like to persuade us that we don't need that, > > ever. You won't sell me on that point, cos I can see lots of uses for > > it. > > Simon, how the replies are sent is an implementation detail I haven't > given much thought yet. It seems clear we've thought about different details around these topics. Now I understand your work on latches, I see it is an important contribution and I very much respect that. IMHO, each of us has seen something important that the other has not. 
> The reason we delved into that discussion > earlier was that you seemed to contradict yourself with the claims that > you don't need to send more than one reply per transaction, and that the > standby doesn't need to know the synchronization level. Other than that > the curiosity about that contradiction, it doesn't seem like a very > interesting detail to me right now. It's not a question that drives the > rest of the design, but the other way round. There was no contradiction. You just didn't understand how it could be possible, so dismissed it. It's a detail, yes. Some are critical, some are not. (e.g. latches.) My view is that it is critical and drives the design. So I don't agree with you on "the other way around". > But FWIW, something like your proposal of sending 3 XLogRecPtrs in each > reply seems like a good approach. I'm not sure about using walwriter. I > can see that it helps with getting the 'recv' and 'replay' > acknowledgments out faster, but > I still have the scars from starting > bgwriter during recovery. I am happy to apologise for those problems. I was concentrating on HS at the time, not on that aspect. You sorted out those problems for me and I thank you for that. With that in mind, I will remove the aspect of my patch that relate to starting wal writer. Small amount of code only. That means we will effectively disable recv mode for now, but I definitely want to be able to put it back later. -- Simon Riggs www.2ndQuadrant.comPostgreSQL Development, 24x7 Support, Training and Services
On Fri, 2010-09-17 at 21:43 +0900, Fujii Masao wrote: > On Fri, Sep 17, 2010 at 5:09 PM, Heikki Linnakangas > <heikki.linnakangas@enterprisedb.com> wrote: > > * Quorum commit. Wait until n standbys acknowledge. n=1 and n=all servers > > can be seen as important special cases of this. > > I think that we should skip quorum commit at the first phase > because the design seems to be still poorly-thought-out. Agreed > I'm concerned about the case where the faster synchronous standby > goes down and the lagged synchronous one remains when n=1. In this > case, some transactions marked as committed in a client might not > be replicated to the remaining synchronous standby yet. What if > the master goes down at this point? How can we determine whether > promoting the remaining standby to the master causes data loss? In that config if the faster sync standby goes down then your application performance goes down dramatically. That would be fragile. So you would set up like this master - requests are > async standby1 - fast - so use recv | fsync | apply standby2 - async So if standby1 goes down we don't wait for standby2, but we do continue to stream to it. -- Simon Riggs www.2ndQuadrant.comPostgreSQL Development, 24x7 Support, Training and Services
On 17/09/10 15:56, Simon Riggs wrote: > On Fri, 2010-09-17 at 13:41 +0300, Heikki Linnakangas wrote: >> On 17/09/10 12:49, Simon Riggs wrote: >>> Without performance tests to demonstrate "why", these do sound hard to >>> understand. But we should note that DRBD offers recv ("B") and fsync >>> ("C") as separate options. And Oracle implements all 3 of recv, fsync >>> and apply. Neither of them describe those options so simply and easily >>> as the way we are proposing with a 4 valued enum (with async as the >>> fourth option). >>> >>> If we have only one option for sync_rep = 'on' which of recv | fsync | >>> apply would it implement? You don't mention that. Which do you choose? >> >> You would choose between recv, fsync and apply in the slave, with a GUC. > > So you would have both registration on the master and parameter settings > on the standby? I doubt you mean that, so possibly need more explanation > there for me to understand what you mean and also why you would do that. Yes, that's what I meant. No-one else seems to think that's a good idea :-). >> I don't expect any meaningful differences in terms of performance >> between any of the discussed options. The big question right now is... > > This is the critical point. Politely, I would observe that *You* do not > think there is a meaningful difference. *I* do, and evidence suggests > that both Oracle and DRBD think so too. So we differ on what the "big > question" is here. We must be talking about different things again. There's certainly big differences in the different synchronization levels and configurations, but I don't expect there to be big performance differences between patches to implement those levels. Once we got rid of the polling loops, I expect the network and disk latencies to dominate. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
* Robert Haas <robertmhaas@gmail.com> [100917 07:44]: > On Fri, Sep 17, 2010 at 7:31 AM, Simon Riggs <simon@2ndquadrant.com> wrote: > > The only thing standby registration allows you to do is know whether > > there was supposed to be a standby there, but yet it isn't there now. I > > don't see that point as being important because it seems strange to me > > to want to wait for a standby that ought to be there, but isn't anymore. > > What happens if it never comes back? Manual intervention required. > > The absence of registration in my patch makes some things easier and > > some things harder. For example, you can add a new standby without > > editing the config on the master. > > That's actually one of the reasons why I like the idea of > registration. It seems rather scary to add a new standby without > editing the config on the master. Actually, adding a new fully-async > slave without touching the master seems reasonable, but adding a new > sync slave without touching the master gives me the willies. The > behavior of the system could change quite sharply when you do this, > and it might not be obvious what has happened. (Imagine DBA #1 makes > the change and DBA #2 is then trying to figure out what's happened - > he checks the configs of all the machines he knows about and finds > them all unchanged... head-scratching ensues.) So, those both give me the willies too... I've had a rack lose all power. Now, let's say I've got two servers (plus trays of disks for each) in the same rack. Ya, I know, I should move them to separate racks, preferably in separate buildings on the same campus, but realistically... I want to have them configured with fsync-style WAL sync rep, and I want to make sure that if the master comes up first after I get power back, it's not going to be claiming transactions are committed while the slave (which happens to have 4x the disks because it keeps PITR backups for a period too) is still chugging away on SCSI probes, not having gotten PostgreSQL up yet... And I want to make sure the dev box that some other DBA is testing another slave setup on, in some test area but not in the same rack, *can't* through some mis-configuration make my master think that its production slave has properly fsync'ed the replicated WAL. </hopes & dreams> -- Aidan Van Dyk Create like a god, aidan@highrise.ca command like a king, http://www.highrise.ca/ work like a slave.
* Fujii Masao <masao.fujii@gmail.com> [100917 07:57]: > Synchronous replication is basically used to reduce the > downtime, and "wait forever" option opposes that. Hm... I'm not sure I'ld agree with that. I'ld rather have some downtime, and my data available, then have less downtime, but find that I'm missing valuable data that was committed, but happend to not be replicated because no slave was available "yet". Sync rep is about "data availability", "data recoverability", *and* "downtime". The three are definitely related, but each use has their own tradeoffs. a. -- Aidan Van Dyk Create like a god, aidan@highrise.ca command like a king, http://www.highrise.ca/ work like a slave.
On Fri, 2010-09-17 at 09:36 -0400, Aidan Van Dyk wrote: > I want to have them configured in a fsync WAL/style sync rep, I want to > make sure that if the master comes up first after I get power back, it's > not going to be claiming transactions are committed while the slave > (which happens to have 4x the disks because it keeps PITR backups for a > period too) it still chugging away on SCSI probes yet, not gotten to > having PostgreSQL up yet... Nobody has mentioned the ability to persist the not-committed state across a crash before, and I think it's an important discussion point. We already have it: its called "two phase commit". (2PC) If you run 2PC on 3 servers and one goes down, you can just commit the in-flight transactions and continue. But it doesn't work on hot standby. It could: If we want that we could prepare the transaction on the master and don't allow commit until we get positive confirmation from standby. All of the machinery is there. I'm not sure if that's a 5th sync rep mode, or that idea is actually good enough to replace all the ideas we've had up until now. I would say probably not, but we should think about this. A slightly modified idea would be avoid writing the transaction prepare file as a separate file, just write the WAL for the prepare. We then remember the LSN of the prepare so we can re-access the WAL copy of it by re-reading the WAL files on master. Make sure we don't get rid of WAL that refers to waiting transactions. That would then give us the option to commit or abort depending upon whether we receive a reply within timeout. -- Simon Riggs www.2ndQuadrant.comPostgreSQL Development, 24x7 Support, Training and Services
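For readers who have not used it, the existing two-phase commit machinery Simon refers to looks like this at the SQL level (it requires max_prepared_transactions > 0; the table and transaction identifier are made up for the example). In the idea sketched above, the server would effectively hold the transaction at the prepared stage until the standby confirms, rather than the client doing it by hand:

    BEGIN;
    UPDATE accounts SET balance = balance - 100 WHERE id = 1;  -- example work
    PREPARE TRANSACTION 'sync_rep_demo';  -- durable on the master, survives a crash, not yet visible as committed
    -- ... positive confirmation arrives from the standby ...
    COMMIT PREPARED 'sync_rep_demo';      -- or ROLLBACK PREPARED 'sync_rep_demo' if it never arrives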
On Fri, Sep 17, 2010 at 11:22 AM, Simon Riggs <simon@2ndquadrant.com> wrote: > On Fri, 2010-09-17 at 09:36 -0400, Aidan Van Dyk wrote: > >> I want to have them configured in a fsync WAL/style sync rep, I want to >> make sure that if the master comes up first after I get power back, it's >> not going to be claiming transactions are committed while the slave >> (which happens to have 4x the disks because it keeps PITR backups for a >> period too) it still chugging away on SCSI probes yet, not gotten to >> having PostgreSQL up yet... > > Nobody has mentioned the ability to persist the not-committed state > across a crash before, and I think it's an important discussion point. Eh? I think all Aidan is asking for is the ability to have a mode where sync rep is really always sync, or nothing commits. Rather than timing out and continuing merrily on its way... -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
* Robert Haas <robertmhaas@gmail.com> [100917 11:24]: > On Fri, Sep 17, 2010 at 11:22 AM, Simon Riggs <simon@2ndquadrant.com> wrote: > > On Fri, 2010-09-17 at 09:36 -0400, Aidan Van Dyk wrote: > > > >> I want to have them configured in a fsync WAL/style sync rep, I want to > >> make sure that if the master comes up first after I get power back, it's > >> not going to be claiming transactions are committed while the slave > >> (which happens to have 4x the disks because it keeps PITR backups for a > >> period too) it still chugging away on SCSI probes yet, not gotten to > >> having PostgreSQL up yet... > > > > Nobody has mentioned the ability to persist the not-committed state > > across a crash before, and I think it's an important discussion point. > > Eh? I think all Aidan is asking for is the ability to have a mode > where sync rep is really always sync, or nothing commits. Rather than > timing out and continuing merrily on its way... Right, I'm not asking for a "new" mode. I'm just hope that there will be a way to guarantee my "sync rep" is actually replicating. Having it "not replicate" simply because no slave has (yet) connected means I have to dance jigs around pg_hba.conf so that it won't allow non-replication connections until I've manual verified that the replication slave is connected... a. -- Aidan Van Dyk Create like a god, aidan@highrise.ca command like a king, http://www.highrise.ca/ work like a slave.
On Fri, 2010-09-17 at 11:30 -0400, Aidan Van Dyk wrote: > * Robert Haas <robertmhaas@gmail.com> [100917 11:24]: > > On Fri, Sep 17, 2010 at 11:22 AM, Simon Riggs <simon@2ndquadrant.com> wrote: > > > On Fri, 2010-09-17 at 09:36 -0400, Aidan Van Dyk wrote: > > > > > >> I want to have them configured in a fsync WAL/style sync rep, I want to > > >> make sure that if the master comes up first after I get power back, it's > > >> not going to be claiming transactions are committed while the slave > > >> (which happens to have 4x the disks because it keeps PITR backups for a > > >> period too) it still chugging away on SCSI probes yet, not gotten to > > >> having PostgreSQL up yet... > > > > > > Nobody has mentioned the ability to persist the not-committed state > > > across a crash before, and I think it's an important discussion point. > > > > Eh? I think all Aidan is asking for is the ability to have a mode > > where sync rep is really always sync, or nothing commits. Rather than > > timing out and continuing merrily on its way... > > Right, I'm not asking for a "new" mode. I'm just hope that there will > be a way to guarantee my "sync rep" is actually replicating. Having it > "not replicate" simply because no slave has (yet) connected means I have > to dance jigs around pg_hba.conf so that it won't allow non-replication > connections until I've manual verified that the replication slave > is connected... I agree that aspect is a problem. One solution, to me, would be to have a directive included in the pg_hba.conf that says entries below it are only allowed if it passes the test. So your hba file looks like this local postgres postgres host replication ... need replication host any any So the "need" test is an extra option in the first column. We might want additional "need" tests before we allow other rules also. Text following the "need" verb will be additional info for that test, sufficient to allow some kind of execution on the backend. I definitely don't like the idea that anyone that commits will just sit there waiting until the standby comes up. That just sounds an insane way of doing it. -- Simon Riggs www.2ndQuadrant.comPostgreSQL Development, 24x7 Support, Training and Services
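Laid out one entry per line, the pg_hba.conf sketch in the message above would look roughly like this; the "need" verb is proposed syntax only, it does not exist today, and the elided columns are left as Simon wrote them:

    local  postgres     postgres
    host   replication  ...
    need   replication
    host   any          any

Entries below the "need replication" line would only be allowed once that test passes, i.e. once a replication connection is present.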
On Fri, 2010-09-17 at 16:09 +0300, Heikki Linnakangas wrote: > >> I don't expect any meaningful differences in terms of performance > >> between any of the discussed options. The big question right now is... > > > > This is the critical point. Politely, I would observe that *You* do not > > think there is a meaningful difference. *I* do, and evidence suggests > > that both Oracle and DRBD think so too. So we differ on what the "big > > question" is here. > > We must be talking about different things again. There's certainly big > differences in the different synchronization levels and configurations, > but I don't expect there to be big performance differences between > patches to implement those levels. Once we got rid of the polling loops, > I expect the network and disk latencies to dominate. So IIUC you seem to agree with * 4 levels of synchronous replication (specified on master) * transaction-controlled replication from the master * sending 3 LSN values back from standby Well, then that pretty much is my patch, except for the parameter UI. Did I misunderstand? We also agree that we need a standby to master protocol change; I used Zoltan's directly and I've had zero problems with it in testing. The only disagreement has been about * the need for standby registration (I understand "want") which seems to boil down to whether we wait for servers that *ought* to be there, but currently aren't. * whether to have wal writer active (I'm happy to add that later in this release, so we get the "recv" option also) * whether we have a parameter for quorum commit > 1 (happy to add later) Not sure if there is debate about whether quorum_commit = 1 is the default. * whether we provide replication_exceptions as core feature or as a plugin The only area of doubt is when we send replies, which you haven't thought about yet. So presumably you've no design-level objection to what I've proposed. Things we all seem to like are * different standbys can offer different sync levels * standby names * a set returning function which tells you current LSNs of all standbys * the rough idea of being able to specify a "service" and have that equate to a more complex config underneath the covers, without needing to have the application know the details - I think we need more details on that before we could say "we agree". So seems like a good days work. -- Simon Riggs www.2ndQuadrant.comPostgreSQL Development, 24x7 Support, Training and Services
On 2010-09-17 10:09, Heikki Linnakangas wrote: > I think it makes most sense to set sync vs. async in the master, and > the level of synchronicity in the slave. Although I have sympathy > for the argument that it's simpler if you configure it all from the > master side as well. Just a comment as a sysadmin: it would be hugely beneficial if the master and slaves were all able to run from the exact same configuration file. This would leave out any doubt about the configuration of the "complete cluster" in terms of debugging. A slave would be able to just copy over the master's configuration, etc. I don't know if it is doable or has any huge downsides. -- Jesper
Simon Riggs <simon@2ndQuadrant.com> writes: > On Fri, 2010-09-17 at 21:20 +0900, Fujii Masao wrote: >> According to what I heard, some people want to guarantee that all the >> transactions are *always* written in *all* the synchronous standbys. > > You don't need standby registration at all. You can do that with a > single parameter, already proposed: > > quorum_commit = N. I think you also need another parameter to control the behavior upon timeout. You received less than N votes, now what? Your current idea seems to be COMMIT, Aidan says ROLLBACK, and I say that should be a GUC set at the transaction level. As far as registration goes, I see no harm in having the master maintain a list of known standby systems; it's just maintaining that list from the master that I don't understand the use case for. Regards, -- dim
Simon Riggs <simon@2ndQuadrant.com> writes: > On Fri, 2010-09-17 at 21:20 +0900, Fujii Masao wrote: >> What synchronization level does each combination of sync_replication >> and sync_replication_service lead to? > > There are only 4 possible outcomes. There is no combination, so we don't > need a table like that above. > > The "service" specifies the highest request type available from that > specific standby. If someone requests a higher service than is currently > offered by this standby, they will either > a) get that service from another standby that does offer that level > b) automatically downgrade the sync rep mode to the highest available. I like the a) part, but I can't say the same about the b) part. There's no reason to go ahead and COMMIT a transaction when the requested durability is known not to have been reached, unless the user said so. > For example, if you request recv but there is only one standby and it > only offers async, then you get downgraded to async. If you so choose, but with a net slowdown, as you're now reaching the timeout for each transaction, with what I have in mind, and I don't see how you can avoid that. Even if you set up the replication from the master, you still can mess it up the same way, right? Regards, -- dim
On Fri, 2010-09-17 at 21:32 +0200, Dimitri Fontaine wrote: > Simon Riggs <simon@2ndQuadrant.com> writes: > > On Fri, 2010-09-17 at 21:20 +0900, Fujii Masao wrote: > >> According to what I heard, some people want to guarantee that all the > >> transactions are *always* written in *all* the synchronous standbys. > > > > You don't need standby registration at all. You can do that with a > > single parameter, already proposed: > > > > quorum_commit = N. > > I think you also need another parameter to control the behavior upon > timeout. You received less than N votes, now what? You're current idea > seems to be COMMIT, Aidan says ROLLBACK, and I say that's to be a GUC > set at the transaction level. I've said COMMIT with no option because I believe that we have only two choices: commit or wait (perhaps forever), and IMHO waiting is not good. We can't ABORT, because we sent a commit to the standby. If we abort, then we're saying the standby can't ever come back because it will have received and potentially replayed a different transaction history. I had some further thoughts around that but you end up with the byzantine generals problem always. Waiting might sound attractive. In practice, waiting will make all of your connections lock up and it will look to users as if their master has stopped working as well. (It has!). I can't imagine why anyone would ever want an option to select that; its the opposite of high availability. Just sounds like a serious footgun. Having said that Oracle offers Maximum Protection mode, which literally shuts down the master when you lose a standby. I can't say anything apart from "LOL". > As far as registration goes, I see no harm to have the master maintain a > list of known standby systems, of course, it's just maintaining that > list from the master that I don't understand the use case for. Yes, the master needs to know about all currently connected standbys. The only debate is what happens about ones that "ought" to be there. Given my comments above, I don't see the need. -- Simon Riggs www.2ndQuadrant.comPostgreSQL Development, 24x7 Support, Training and Services
Simon Riggs <simon@2ndQuadrant.com> writes: > I've said COMMIT with no option because I believe that we have only two > choices: commit or wait (perhaps forever), and IMHO waiting is not good. > > We can't ABORT, because we sent a commit to the standby. Ah yes, I keep forgetting Sync Rep is not about 2PC. Sorry about that. > Waiting might sound attractive. In practice, waiting will make all of > your connections lock up and it will look to users as if their master > has stopped working as well. (It has!). I can't imagine why anyone would > ever want an option to select that; its the opposite of high > availability. Just sounds like a serious footgun. I guess that if there's a timeout GUC it can still be set to infinite somehow. Unclear as the use case might be. Regards, -- dim
On Sat, Sep 18, 2010 at 4:50 AM, Simon Riggs <simon@2ndquadrant.com> wrote: > Waiting might sound attractive. In practice, waiting will make all of > your connections lock up and it will look to users as if their master > has stopped working as well. (It has!). I can't imagine why anyone would > ever want an option to select that; its the opposite of high > availability. Just sounds like a serious footgun. Nevertheless, it seems that some people do want exactly that behavior, no matter how crazy it may seem to you. I'm not exactly sure what we're in disagreement about, TBH. You've previously said that you don't think standby registration is necessary, but that you don't object to it if others want it. So it seems like this might be mostly academic. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
All, I'm answering this strictly from the perspective of my company's customers and what they've asked for. It does not reflect on what features are reflected in whatever patch. > * Support multiple standbys with various synchronization levels. Essential. We already have two customers who want to have one synch and several async standbys. > * What happens if a synchronous standby isn't connected at the moment? > Return immediately vs. wait forever. Essential. Actually, we need a replication_return_timeout. That is, wait X seconds on the standby and then give up. Again, in the systems I'm working with, we'd want to wait 5 seconds and then abort replication. > * Per-transaction control. Some transactions are important, others are not. Low priority. I see this as a 9.2 feature. Nobody I know is asking for it yet, and I think we need to get the other stuff right first. > * Quorum commit. Wait until n standbys acknowledge. n=1 and n=all > servers can be seen as important special cases of this. Medium priority. This would go together with having a registry of standbies. The only reason I don't call this low priority is that it would catapult PostgreSQL into the realm of CAP databases, assuming that we could deal with the re-mastering issue as well. > * async, recv, fsync and replay levels of synchronization. Fsync vs. Replay is low priority (as in, we could live with just one or the other), but the others are all high priority. Again, this should be settable *per standby*. > So what should the user interface be like? Given the 1st and 2nd > requirement, we need standby registration. If some standbys are > important and others are not, the master needs to distinguish between > them to be able to determine that a transaction is safely delivered to > the important standbys. There are considerable benefits to having a standby registry with a table-like interface. Particularly, one where we could change replication via UPDATE (or ALTER STANDBY) statements. a) we could eliminate a bunch of GUCs and control standby behavior instead via the table interface. b) DBAs and monitoring tools could see at a glance what the status of their replication network was. c) we could easily add new features (like quorum groups) without breaking prior setups. d) it would become easy rather than a PITA to construct GUI replication management tools. e) as previously mentioned, we could use it to have far more intelligent control over what WAL segments to keep, both on the master and in some distributed archive. Note, however, that the data from this pseudo-table would need to be replicated to the standby servers somehow in order to support re-mastering. Take all the above with a grain of salt, though. The important thing is to get *some kind* of synch rep into 9.1, and get 9.1 out on time. -- -- Josh Berkus PostgreSQL Experts Inc. http://www.pgexperts.com
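A sketch of what the master-side settings implied by this wish list might look like; both names are taken from the thread (quorum_commit from Simon, replication_return_timeout from Josh's description above), neither exists in any patch in this form, so treat this purely as an illustration:

    # postgresql.conf on the master
    quorum_commit = 1                     # how many standbys must acknowledge
    replication_return_timeout = '5s'     # give up waiting on a standby after 5 seconds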
On Sat, Sep 18, 2010 at 5:42 PM, Josh Berkus <josh@agliodbs.com> wrote: > There are considerable benefits to having a standby registry with a > table-like interface. Particularly, one where we could change > replication via UPDATE (or ALTER STANDBY) statements. I think that using a system catalog for this is going to be a non-starter, but we could use a flat file that is designed to be machine-editable (and thus avoid repeating the mistake we've made with postgresql.conf). -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
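Purely as an illustration of "machine-editable flat file", something along these lines; the format and field names are invented here (no patch in this thread defines one), and the standby names are borrowed from Heikki's pg_slave_status example:

    # standbys.conf -- one standby per line: name  sync_level  wait_forever
    reporting    recv    off
    ha-standby   apply   on
    testserver   async   off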
> I think that using a system catalog for this is going to be a > non-starter, Technically improbable? Darn. > but we could use a flat file that is designed to be > machine-editable (and thus avoid repeating the mistake we've made with > postgresql.conf). Well, even if we can't update it through the command line, at least the existing configuration (and node status) ought to be queryable. -- -- Josh Berkus PostgreSQL Experts Inc. http://www.pgexperts.com
On Sat, 2010-09-18 at 14:42 -0700, Josh Berkus wrote: > > * Per-transaction control. Some transactions are important, others are not. > > Low priority. > I see this as a 9.2 feature. Nobody I know is asking for it yet, and I > think we need to get the other stuff right first. I understand completely why anybody that has never used sync replication would think per-transaction control is a small deal. I fully expect your clients to try sync rep and then 5 minutes later say "Oh Crap, this sync rep is so slow it's unusable. Isn't there a way to tune it?". I've designed a way to tune sync rep so it is usable and useful. And putting that feature into 9.1 costs very little, if anything. My patch to do this is actually smaller than any other attempt to implement this and I claim faster too. You don't need to use the per-transaction controls, but they'll be there if you need them. -- Simon Riggs www.2ndQuadrant.comPostgreSQL Development, 24x7 Support, Training and Services
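As an illustration of what per-transaction control means in practice, a sketch using SET LOCAL with the GUC name from Simon's earlier messages (the exact spelling in the patch may differ, and the tables are made up):

    BEGIN;
    SET LOCAL sync_replication = 'apply';                    -- important: wait until applied on a standby
    UPDATE orders SET status = 'paid' WHERE order_id = 42;
    COMMIT;

    BEGIN;
    SET LOCAL sync_replication = 'async';                    -- bulk/unimportant work: don't wait
    INSERT INTO audit_log (msg) VALUES ('nightly batch');
    COMMIT;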
> I've designed a way to tune sync rep so it is usable and useful. And > putting that feature into 9.1 costs very little, if anything. My patch > to do this is actually smaller than any other attempt to implement this > and I claim faster too. You don't need to use the per-transaction > controls, but they'll be there if you need them. Well, if you already have the code, that's a different story ... -- -- Josh Berkus PostgreSQL Experts Inc. http://www.pgexperts.com
On 18/09/10 22:59, Robert Haas wrote: > On Sat, Sep 18, 2010 at 4:50 AM, Simon Riggs<simon@2ndquadrant.com> wrote: >> Waiting might sound attractive. In practice, waiting will make all of >> your connections lock up and it will look to users as if their master >> has stopped working as well. (It has!). I can't imagine why anyone would >> ever want an option to select that; its the opposite of high >> availability. Just sounds like a serious footgun. > > Nevertheless, it seems that some people do want exactly that behavior, > no matter how crazy it may seem to you. Yeah, I agree with both of you. I have a hard time imagining a situation where you would actually want that. It's not high availability, it's high durability. When a transaction is acknowledged as committed, you know it's never ever going to disappear even if a meteor strikes the current master server within the next 10 milliseconds. In practice, people want high availability instead. That said, the timeout option also feels a bit wishy-washy to me. With a timeout, acknowledgment of a commit means "your transaction is safely committed in the master and slave. Or not, if there was some glitch with the slave". That doesn't seem like a very useful guarantee; if you're happy with that, why not just use async replication? However, the "wait forever" behavior becomes useful if you have a monitoring application outside the DB that decides when enough is enough and tells the DB that the slave can be considered dead. So "wait forever" actually means "wait until I tell you that you can give up". The monitoring application can STONITH to ensure that the slave stays down, before letting the master proceed with the commit. With that in mind, we have to make sure that a transaction that's waiting for acknowledgment of the commit from a slave is woken up if the configuration changes. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On 19/09/10 01:20, Robert Haas wrote: > On Sat, Sep 18, 2010 at 5:42 PM, Josh Berkus<josh@agliodbs.com> wrote: >> There are considerable benefits to having a standby registry with a >> table-like interface. Particularly, one where we could change >> replication via UPDATE (or ALTER STANDBY) statements. > > I think that using a system catalog for this is going to be a > non-starter, but we could use a flat file that is designed to be > machine-editable (and thus avoid repeating the mistake we've made with > postgresql.conf). Yeah, that needs some careful design. We also need to record transient information about each slave, like how far it has received WAL already. Ideally that information would survive database restart too, but maybe we can live without that. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Hi, On 09/17/2010 01:56 PM, Fujii Masao wrote: > And standby registration is required when we support "wait forever when > synchronous standby isn't connected at the moment" option that Heikki > explained upthread. That requirement can be reduced to saying that the master only needs to know how many synchronous standbys *should* be connected. IIUC that's pretty much exactly the quorum_commit GUC that Simon proposed, because it doesn't make sense to have more synchronous standbys connected than quorum_commit (as Simon pointed out downthread). I'm unsure about what's better, the full list (giving a good overview, but more to configure) or the single sum GUC (being very flexible and closer to how things work internally). But that seems to be a UI question exclusively. Regarding the "wait forever" option: I don't think continuing is a viable alternative, as it silently ignores the requested level of persistence. The only alternative I can see is to abort with an error. As far as comparison is allowed, that's what Postgres-R currently does if there's no majority of nodes. It allows emitting an error message and helpful hints, as opposed to letting the admin figure out what and where it's hanging. Not throwing false errors has the same requirements as "waiting forever", so that's an orthogonal issue, IMO. Regards Markus Wanner
On Mon, 2010-09-20 at 09:27 +0300, Heikki Linnakangas wrote: > On 18/09/10 22:59, Robert Haas wrote: > > On Sat, Sep 18, 2010 at 4:50 AM, Simon Riggs<simon@2ndquadrant.com> wrote: > >> Waiting might sound attractive. In practice, waiting will make all of > >> your connections lock up and it will look to users as if their master > >> has stopped working as well. (It has!). I can't imagine why anyone would > >> ever want an option to select that; its the opposite of high > >> availability. Just sounds like a serious footgun. > > > > Nevertheless, it seems that some people do want exactly that behavior, > > no matter how crazy it may seem to you. > > Yeah, I agree with both of you. I have a hard time imaging a situation > where you would actually want that. It's not high availability, it's > high durability. When a transaction is acknowledged as committed, you > know it's never ever going to disappear even if a meteor strikes the > current master server within the next 10 milliseconds. In practice, > people want high availability instead. > > That said, the timeout option also feels a bit wishy-washy to me. With a > timeout, acknowledgment of a commit means "your transaction is safely > committed in the master and slave. Or not, if there was some glitch with > the slave". That doesn't seem like a very useful guarantee; if you're > happy with that why not just use async replication? > > However, the "wait forever" behavior becomes useful if you have a > monitoring application outside the DB that decides when enough is enough > and tells the DB that the slave can be considered dead. So "wait > forever" actually means "wait until I tell you that you can give up". > The monitoring application can STONITH to ensure that the slave stays > down, before letting the master proceed with the commit. err... what is the difference between a timeout and stonith? None. We still proceed without the slave in both cases after the decision point. In all cases, we would clearly have a user accessible function to stop particular sessions, or all sessions, from waiting for standby to return. You would have 3 choices: * set automatic timeout * set wait forever and then wait for manual resolution * set wait forever and then trust to external clusterware Many people have asked for timeouts and I agree it's probably the easiest thing to do if you just have 1 standby. > With that in mind, we have to make sure that a transaction that's > waiting for acknowledgment of the commit from a slave is woken up if the > configuration changes. There's a misunderstanding here of what I've said and its a subtle one. My patch supports a timeout of 0, i.e. wait forever. Which means I agree that functionality is desired and should be included. This operates by saying that if a currently-connected-standby goes down we will wait until the timeout. So I agree all 3 choices should be available to users. Discussion has been about what happens to ought-to-have-been-connected standbys. Heikki had argued we need standby registration because if a server *ought* to have been there, yet isn't currently there when we wait for sync rep, we would still wait forever for it to return. To do this you require standby registration. But there is a hidden issue there: If you care about high availability AND sync rep you have two standbys. If one goes down, the other is still there. In general, if you want high availability on N servers then you have N+1 standbys. 
If one goes down, the other standbys provide the required level of durability and we do not wait. So the only case where standby registration is required is where you deliberately choose to *not* have N+1 redundancy and yet still require all N standbys to acknowledge. That is a suicidal config and nobody would sanely choose that. It's not a large or useful use case for standby reg. (But it does raise the question again of whether we need quorum commit). My take is that if the above use case occurs it is because one standby has just gone down and the standby is, for a hopefully short period, in a degraded state, and the service responds to that. So in my proposal, if a standby is not there *now* we don't wait for it. Which cuts out a huge bag of code, specification and suchlike that isn't required to support sane use cases. More stuff to get wrong and regret in later releases. The KISS principle, just like we apply in all other cases. If we did have standby registration, then I would implement it in a table, not in an external config file. That way when we performed a failover the data would be accessible on the new master. But I don't suggest we have CREATE/ALTER STANDBY syntax. We already have CREATE/ALTER SERVER if we wanted to do it in SQL. If we did that, ISTM we should choose functions. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Training and Services
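To make Simon's three choices concrete: with a single timeout parameter of the kind his patch describes (where 0 means wait forever), the settings might look roughly like the sketch below. The parameter name is invented for illustration; only the 0-means-wait-forever behaviour is taken from the discussion above.

-------
sync_rep_timeout = 30s   # choice 1: give up waiting on a standby after 30 seconds
sync_rep_timeout = 0     # choices 2 and 3: wait forever, until an operator
                         # (or external clusterware) releases the waiting sessions
-------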
On 20/09/10 12:17, Simon Riggs wrote: > err... what is the difference between a timeout and stonith? STONITH ("Shoot The Other Node In The Head") means that the other node is somehow disabled so that it won't unexpectedly come back alive. A timeout means that the slave hasn't been seen for a while, but it might reconnect just after the timeout has expired. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Mon, 2010-09-20 at 15:16 +0300, Heikki Linnakangas wrote: > On 20/09/10 12:17, Simon Riggs wrote: > > err... what is the difference between a timeout and stonith? > > STONITH ("Shoot The Other Node In The Head") means that the other node > is somehow disabled so that it won't unexpectedly come back alive. A > timeout means that the slave hasn't been seen for a while, but it might > reconnect just after the timeout has expired. You've edited my reply to change the meaning of what was a rhetorical question, as well as completely ignoring the main point of my reply. Please respond to the main point: Following some thought and analysis, AFAICS there is no sensible use case that requires standby registration. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Training and Services
On 20/09/10 15:50, Simon Riggs wrote: > On Mon, 2010-09-20 at 15:16 +0300, Heikki Linnakangas wrote: >> On 20/09/10 12:17, Simon Riggs wrote: >>> err... what is the difference between a timeout and stonith? >> >> STONITH ("Shoot The Other Node In The Head") means that the other node >> is somehow disabled so that it won't unexpectedly come back alive. A >> timeout means that the slave hasn't been seen for a while, but it might >> reconnect just after the timeout has expired. > > You've edited my reply to change the meaning of what was a rhetorical > question, as well as completely ignoring the main point of my reply. > > Please respond to the main point: Following some thought and analysis, > AFAICS there is no sensible use case that requires standby registration. Ok, I had completely missed your point then. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Mon, Sep 20, 2010 at 8:50 AM, Simon Riggs <simon@2ndquadrant.com> wrote: > Please respond to the main point: Following some thought and analysis, > AFAICS there is no sensible use case that requires standby registration. I disagree. You keep analyzing away the cases that require standby registration, but I don't believe that they're not real. Aidan Van Dyk's case upthread of wanting to make sure that the standby is up and replicating synchronously before the master starts processing transactions seems perfectly legitimate to me. Sure, it's paranoid, but so what? We're all about paranoia, at least as far as data loss is concerned. So the "wait forever" case is, in my opinion, sufficient to demonstrate that we need it, but it's not even my primary reason for wanting to have it. The most important reason why I think we should have standby registration is for simplicity of configuration. Yes, it adds another configuration file, but that configuration file contains ALL of the information about which standbys are synchronous. Without standby registration, this information will inevitably be split between the master config and the various slave configs and you'll have to look at all the configurations to be certain you understand how it's going to end up working. As a particular manifestation of this, and as previously argued and +1'd upthread, the ability to change the set of standbys to which the master is replicating synchronously without changing the configuration on the master or any of the existing slaves seems dangerous. Another reason why I think we should have standby registration is to eventually allow the "streaming WAL backwards" configuration which has previously been discussed. IOW, you could stream the WAL to the slave in advance of fsync-ing it on the master. After a power failure, the machines in the cluster can talk to each other and figure out which one has the furthest-advanced WAL pointer and stream from that machine to all the others. This is an appealing configuration for people using sync rep because it would allow the fsyncs to be done in parallel rather than sequentially as is currently necessary - but if you're using it, you're certainly not going to want the master to enter normal running without waiting to hear from the slave. Just to be clear, that is a list of three independent reasons any one of which I think is sufficient for wanting standby registration. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
Hi, I'm somewhat sorry to have to play this game, as I sure don't feel smarter by composing this email. Quite the contrary. Robert Haas <robertmhaas@gmail.com> writes: > So the "wait forever" case is, in my opinion, > sufficient to demonstrate that we need it, but it's not even my > primary reason for wanting to have it. You're talking about standby registration on the master. You can solve this case without it, because when a slave is not connected it's not giving any feedback (vote, weight, ack) to the master. All you have to do is have the quorum setup in a way that disconnecting your slave means you can't reach the quorum any more. Have it SIGHUP and you can even choose to fix the setup, rather than fix the standby. So no need for registration here, it's just another way to solve the problem. Not saying it's better or worse, just another. Now we could have a summary function on the master showing all the known slaves, their last time of activity, their known current setup, etc, all from the master, but read-only. Would that be useful enough? > The most important reason why I think we should have standby > registration is for simplicity of configuration. Yes, it adds another > configuration file, but that configuration file contains ALL of the > information about which standbys are synchronous. Without standby > registration, this information will inevitably be split between the > master config and the various slave configs and you'll have to look at > all the configurations to be certain you understand how it's going to > end up working. So, here, we have two quite different things to be concerned about. First is the configuration, and I say that managing a distributed setup will be easier for the DBA. Then there's how to obtain a nice view about the distributed system, which again we can achieve from the master without manually registering the standbys. After all, the information you want needs to be there. > As a particular manifestation of this, and as > previously argued and +1'd upthread, the ability to change the set of > standbys to which the master is replicating synchronously without > changing the configuration on the master or any of the existing slaves > seems seems dangerous. Well, you still need to open the HBA for the new standby to be able to connect, and to somehow take a base backup, right? We're not exactly transparent there, yet, are we? > Another reason why I think we should have standby registration is to > allow eventually allow the "streaming WAL backwards" configuration > which has previously been discussed. IOW, you could stream the WAL to > the slave in advance of fsync-ing it on the master. After a power > failure, the machines in the cluster can talk to each other and figure > out which one has the furthest-advanced WAL pointer and stream from > that machine to all the others. This is an appealing configuration > for people using sync rep because it would allow the fsyncs to be done > in parallel rather than sequentially as is currently necessary - but > if you're using it, you're certainly not going to want the master to > enter normal running without waiting to hear from the slave. I love the idea. Now it seems to me that all you need here is the master sending one more piece of information with each WAL "segment", the currently fsync'ed position, which pre-9.1 is implied as being the current LSN from the stream, right? Here I'm not sure I follow you in the details, but it seems to me registering the standbys is just another way of achieving the same.
To be honest, I don't understand at all how it helps implement your idea. Regards, -- Dimitri Fontaine PostgreSQL DBA, Architecte
On Mon, Sep 20, 2010 at 4:10 PM, Dimitri Fontaine <dfontaine@hi-media.com> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> So the "wait forever" case is, in my opinion, >> sufficient to demonstrate that we need it, but it's not even my >> primary reason for wanting to have it. > > You're talking about standby registration on the master. You can solve > this case without it, because when a slave is not connected it's not > giving any feedback (vote, weight, ack) to the master. All you have to > do is have the quorum setup in a way that disconnecting your slave means > you can't reach the quorum any more. Have it SIGHUP and you can even > choose to fix the setup, rather than fix the standby. I suppose that could work. >> The most important reason why I think we should have standby >> registration is for simplicity of configuration. Yes, it adds another >> configuration file, but that configuration file contains ALL of the >> information about which standbys are synchronous. Without standby >> registration, this information will inevitably be split between the >> master config and the various slave configs and you'll have to look at >> all the configurations to be certain you understand how it's going to >> end up working. > > So, here, we have two quite different things to be concerned > about. First is the configuration, and I say that managing a distributed > setup will be easier for the DBA. Yeah, I disagree with that, but I suppose it's a question of opinion. > Then there's how to obtain a nice view about the distributed system, > which again we can achieve from the master without manually registering > the standbys. After all, the information you want needs to be there. I think that without standby registration it will be tricky to display information like "the last time that standby foo was connected". Yeah, you could set a standby name on the standby server and just have the master remember details for every standby name it's ever seen, but then how do you prune the list? Heikki mentioned another application for having a list of the current standbys only (rather than "every standby that has ever existed") upthread: you can compute the exact amount of WAL you need to keep around. >> As a particular manifestation of this, and as >> previously argued and +1'd upthread, the ability to change the set of >> standbys to which the master is replicating synchronously without >> changing the configuration on the master or any of the existing slaves >> seems seems dangerous. > > Well, you still need to open the HBA for the new standby to be able to > connect, and to somehow take a base backup, right? We're not exactly > transparent there, yet, are we? Sure, but you might have that set relatively open on a trusted network. >> Another reason why I think we should have standby registration is to >> allow eventually allow the "streaming WAL backwards" configuration >> which has previously been discussed. IOW, you could stream the WAL to >> the slave in advance of fsync-ing it on the master. After a power >> failure, the machines in the cluster can talk to each other and figure >> out which one has the furthest-advanced WAL pointer and stream from >> that machine to all the others. This is an appealing configuration >> for people using sync rep because it would allow the fsyncs to be done >> in parallel rather than sequentially as is currently necessary - but >> if you're using it, you're certainly not going to want the master to >> enter normal running without waiting to hear from the slave. 
> > I love the idea. > > Now it seems to me that all you need here is the master sending one more > information with each WAL "segment", the currently fsync'ed position, > which pre-9.1 is implied as being the current LSN from the stream, > right? I don't see how that would help you. > Here I'm not sure to follow you in details, but it seems to me > registering the standbys is just another way of achieving the same. To > be honest, I don't understand a bit how it helps implement your idea. Well, if you need to talk to "all the other standbys" and see who has the furthest-advanced xlog pointer, it seems like you have to have a list somewhere of who they all are. Maybe there's some way to get this to work without standby registration, but I don't really understand the resistance to the idea, and I fear it's going to do nothing good for our reputation for ease of use (or lack thereof). The idea of making this all work without standby registration strikes me as akin to the notion of having someone decide whether they're running a three-legged race by checking whether their leg is currently tied to someone else's leg. You can probably make that work by patching around the various failure cases, but why isn't it simpler to just tell the poor guy "Hi, Joe. You're running a three-legged race with Jane today. Hans and Juanita will be following you across the field, too, but don't worry about whether they're keeping up."? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
On 20 September 2010 22:14, Robert Haas <robertmhaas@gmail.com> wrote: > Well, if you need to talk to "all the other standbys" and see who has > the furtherst-advanced xlog pointer, it seems like you have to have a > list somewhere of who they all are. When they connect to the master to get the stream, don't they, in effect, already talk to the primary with the XLogRecPtr being relayed? Can the connection IP, port, XLogRecPtr and request time of the standby be stored from this communication to track the states of each standby? They would in effect be registering upon WAL stream request... and no doubt this is a horrifically naive view of how it works. -- Thom Brown Twitter: @darkixion IRC (freenode): dark_ixion Registered Linux user: #516935
On Mon, Sep 20, 2010 at 5:42 PM, Thom Brown <thom@linux.com> wrote: > On 20 September 2010 22:14, Robert Haas <robertmhaas@gmail.com> wrote: >> Well, if you need to talk to "all the other standbys" and see who has >> the furtherst-advanced xlog pointer, it seems like you have to have a >> list somewhere of who they all are. > > When they connect to the master to get the stream, don't they in > effect, already talk to the primary with the XLogRecPtr being relayed? > Can the connection IP, port, XLogRecPtr and request time of the > standby be stored from this communication to track the states of each > standby? They would in effect be registering upon WAL stream > request... and no doubt this is a horrifically naive view of how it > works. Sure, but the point is that we can want DISCONNECTED slaves to affect master behavior in a variety of ways (master retains WAL for when they reconnect, master waits for them to connect before acking commits, master shuts down if they're not there, master tries to stream WAL backwards from them before entering normal running). I just work here, but it seems to me that such things will be easier if the master has an explicit notion of what's out there. Can we make it all work without that? Possibly, but I think it will be harder to understand. With standby registration, you can DECLARE the behavior you want. You can tell the master "replicate synchronously to Bob". And that's it. Without standby registration, what's being proposed is basically that you can tell the master "replicate synchronously to one server" and you can tell Bob "you are a server to which the master can replicate synchronously" and you can tell the other servers "you are not a server to which Bob can replicate synchronously". That works, but to me it seems less straightforward. And that's actually a relatively simple example. Suppose you want to tell the master "keep enough WAL for Bob to catch up when he reconnects, but if he gets more than 1GB behind, forget about him". I'm sure someone can devise a way of making that work without standby registration, too, but I'm not too sure off the top of my head what it will be. With standby registration, you can just write something like this in standbys.conf (syntax invented): [bob] wal_keep_segments=64 I feel like that's really nice and simple. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
On Mon, 2010-09-20 at 22:42 +0100, Thom Brown wrote: > On 20 September 2010 22:14, Robert Haas <robertmhaas@gmail.com> wrote: > > Well, if you need to talk to "all the other standbys" and see who has > > the furtherst-advanced xlog pointer, it seems like you have to have a > > list somewhere of who they all are. > > When they connect to the master to get the stream, don't they in > effect, already talk to the primary with the XLogRecPtr being relayed? > Can the connection IP, port, XLogRecPtr and request time of the > standby be stored from this communication to track the states of each > standby? They would in effect be registering upon WAL stream > request... and no doubt this is a horrifically naive view of how it > works. It's not viable to record information at the chunk level in that way. But the overall idea is fine. We can track who was connected and how to access their LSNs. They don't need to be registered ahead of time on the master to do that. They can register and deregister each time they connect. This discussion is reminiscent of the discussion we had when Fujii first suggested that the standby should connect to the master. At first I thought "don't be stupid, the master needs to connect to the standby!". It stood everything I had thought about on its head and that hurt, but there was no logical reason to oppose. We could have used standby registration on the master to handle that, but we didn't. I'm happy that we have a more flexible system as a result. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Training and Services
On Sat, Sep 18, 2010 at 4:36 AM, Dimitri Fontaine <dfontaine@hi-media.com> wrote: > Simon Riggs <simon@2ndQuadrant.com> writes: >> On Fri, 2010-09-17 at 21:20 +0900, Fujii Masao wrote: >>> What synchronization level does each combination of sync_replication >>> and sync_replication_service lead to? >> >> There are only 4 possible outcomes. There is no combination, so we don't >> need a table like that above. >> >> The "service" specifies the highest request type available from that >> specific standby. If someone requests a higher service than is currently >> offered by this standby, they will either >> a) get that service from another standby that does offer that level >> b) automatically downgrade the sync rep mode to the highest available. > > I like the a) part, I can't say the same about the b) part. There's no > reason to accept to COMMIT a transaction when the requested durability > is known not to have been reached, unless the user said so. Yep, I can imagine that some people want to ensure that *all* the transactions are synchronously replicated to the synchronous standby, without regard to sync_replication. So I'm not sure if automatic downgrade/upgrade of the mode makes sense. Should we introduce a new parameter specifying whether to allow automatic degrade/upgrade or not? It seems complicated though. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
Robert Haas <robertmhaas@gmail.com> writes: >> So, here, we have two quite different things to be concerned >> about. First is the configuration, and I say that managing a distributed >> setup will be easier for the DBA. > > Yeah, I disagree with that, but I suppose it's a question of opinion. I'd be willing to share your view if it were only about the initial setup. This one is hard enough to sketch out on paper that you prefer an easy way to implement it afterwards, and in some cases a central setup would be just that. The problem is that I'm concerned with upgrading the setup once the system is live. Not at the best time for that in the project, either, but when you finally get the budget to expand the number of servers. From experience with skytools, having no manual registering works best. But… > I think that without standby registration it will be tricky to display > information like "the last time that standby foo was connected". > Yeah, you could set a standby name on the standby server and just have > the master remember details for every standby name it's ever seen, but > then how do you prune the list? … I now realize there are 2 parts under the registration bit. What I don't see helping is manual registration. For some of the use cases you're talking about, maintaining a list of known servers sounds important, and that's also what londiste is doing. Pruning the list would be done with some admin function. You need one to see the current state already, add some other one to unregister a known standby. In londiste, that's how it works, and events are kept in the queues for all known subscribers. For the ones that won't ever connect again, that's of course a problem, so you SELECT pgq.unregister_consumer(…);. > Heikki mentioned another application for having a list of the current > standbys only (rather than "every standby that has ever existed") > upthread: you can compute the exact amount of WAL you need to keep > around. Well, either way, the system can not decide on its own whether a currently not available standby is going to join the party again later on. >> Now it seems to me that all you need here is the master sending one more >> information with each WAL "segment", the currently fsync'ed position, >> which pre-9.1 is implied as being the current LSN from the stream, >> right? > > I don't see how that would help you. I think you want to refrain from applying any WAL segment you receive at the standby and instead only advance as far as the master is known to have reached. And you want this information to be safe against slave restart, too: don't replay any WAL you have in pg_xlog or in the archive. The other part of your proposal is another story (having slaves talk to each other at master crash). > Well, if you need to talk to "all the other standbys" and see who has > the furtherst-advanced xlog pointer, it seems like you have to have a > list somewhere of who they all are. Ah sorry, I was thinking only of the other part of the proposal (sending WAL segments that have not been fsync'ed yet on the master). So, yes. But I thought you were saying that replicating a (shared?) catalog of standbys is technically hard (or impossible), so how would you go about it? As it's all about making things simpler for the users, you're not saying that they should keep the main setup in sync manually on all the standby servers, right?
> Maybe there's some way to get > this to work without standby registration, but I don't really > understand the resistance to the idea In fact, I'm now realising that what I don't like is having to manually do the registration work: as I already have to set up the slaves, it only appears like a useless burden on me, giving information the system already has. Automatic registration I'm fine with, I now realize. Regards, -- Dimitri Fontaine PostgreSQL DBA, Architecte
On Sun, Sep 19, 2010 at 7:20 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Sat, Sep 18, 2010 at 5:42 PM, Josh Berkus <josh@agliodbs.com> wrote: >> There are considerable benefits to having a standby registry with a >> table-like interface. Particularly, one where we could change >> replication via UPDATE (or ALTER STANDBY) statements. > > I think that using a system catalog for this is going to be a > non-starter, but we could use a flat file that is designed to be > machine-editable (and thus avoid repeating the mistake we've made with > postgresql.conf). Yep, the standby registration information should be accessible and changeable while the server is not running. So using only a system catalog is not an answer. My patch has implemented standbys.conf which was proposed before. This format is almost the same as pg_hba.conf's. Is this machine-editable, do you think? If not, should we change the format to something like XML? Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
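To make the pg_hba.conf comparison concrete, a one-standby-per-line standbys.conf would presumably look something like the sketch below. The column names and values are invented for illustration only and are not taken from the actual patch (the sync levels recv/fsync/async are the ones discussed elsewhere in this thread):

-------
# STANDBY-NAME    SYNC-LEVEL    OPTIONS
standby1          recv          timeout=30s
standby2          fsync
standby3          async
-------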
On 21 September 2010 09:29, Fujii Masao <masao.fujii@gmail.com> wrote: > On Sun, Sep 19, 2010 at 7:20 AM, Robert Haas <robertmhaas@gmail.com> wrote: >> On Sat, Sep 18, 2010 at 5:42 PM, Josh Berkus <josh@agliodbs.com> wrote: >>> There are considerable benefits to having a standby registry with a >>> table-like interface. Particularly, one where we could change >>> replication via UPDATE (or ALTER STANDBY) statements. >> >> I think that using a system catalog for this is going to be a >> non-starter, but we could use a flat file that is designed to be >> machine-editable (and thus avoid repeating the mistake we've made with >> postgresql.conf). > > Yep, the standby registration information should be accessible and > changable while the server is not running. So using only system > catalog is not an answer. > > My patch has implemented standbys.conf which was proposed before. > This format is the almost same as the pg_hba.conf. Is this > machine-editable, you think? If not, we should the format to > something like xml? I really don't think an XML config would improve anything. In fact it would just introduce more ways to break the config by the mere fact it has to be well-formed. I'd be in favour of one similar to pg_hba.conf, because then, at least, we'd still only have 2 formats of configuration. -- Thom Brown Twitter: @darkixion IRC (freenode): dark_ixion Registered Linux user: #516935
On Tue, Sep 21, 2010 at 9:34 AM, Thom Brown <thom@linux.com> wrote: > I really don't think an XML config would improve anything. In fact it > would just introduce more ways to break the config by the mere fact it > has to be well-formed. I'd be in favour of one similar to > pg_hba.conf, because then, at least, we'd still only have 2 formats of > configuration. Want to spend a few days hacking on a config editor for pgAdmin, and then re-evaluate that comment? :-) -- Dave Page Blog: http://pgsnake.blogspot.com Twitter: @pgsnake EnterpriseDB UK: http://www.enterprisedb.com The Enterprise Postgres Company
On 21 September 2010 09:37, Dave Page <dpage@pgadmin.org> wrote: > On Tue, Sep 21, 2010 at 9:34 AM, Thom Brown <thom@linux.com> wrote: >> I really don't think an XML config would improve anything. In fact it >> would just introduce more ways to break the config by the mere fact it >> has to be well-formed. I'd be in favour of one similar to >> pg_hba.conf, because then, at least, we'd still only have 2 formats of >> configuration. > > Want to spend a few days hacking on a config editor for pgAdmin, and > then re-evaluate that comment? It would be quicker to add in support for a config format we don't use yet than to duplicate support for a new config in the same format as an existing one? Plus it's a compromise between user-screw-up-ability and machine-readability. My fear would be standby.conf would be edited by users who don't really know XML and then we'd have 3 different styles of config to tell the user to edit. -- Thom Brown Twitter: @darkixion IRC (freenode): dark_ixion Registered Linux user: #516935
On Mon, Sep 20, 2010 at 3:27 PM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote: > However, the "wait forever" behavior becomes useful if you have a monitoring > application outside the DB that decides when enough is enough and tells the > DB that the slave can be considered dead. So "wait forever" actually means > "wait until I tell you that you can give up". The monitoring application can > STONITH to ensure that the slave stays down, before letting the master > proceed with the commit. This is also useful for preventing a failover from causing some data loss by promoting the lagged standby to the master. To avoid any data loss, we must STONITH the standby before any transactions resume on the master when the replication connection is terminated or the standby crashes. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
On 21/09/10 11:52, Thom Brown wrote: > My fear would be standby.conf would be edited by users who don't > really know XML and then we'd have 3 different styles of config to > tell the user to edit. I'm not a big fan of XML either. That said, the format could use some hierarchy. If we add many more per-server options, one server per line will quickly become unreadable. Perhaps something like the ini-file syntax Robert Haas just made up elsewhere in this thread: ------- globaloption1 = value [servername1] synchronization_level = async option1 = value [servername2] synchronization_level = replay option2 = value1 ------- I'm not sure I like the ini-file style much, but the two-level structure it provides seems like a perfect match. Then again, maybe we should go with something like json or yaml that would allow deeper hierarchies for the sake of future expandability. Oh, and there's Dimitri's idea of "service levels" for per-transaction control (http://archives.postgresql.org/message-id/m2sk1868hb.fsf@hi-media.com): > sync_rep_services = {critical: recv=2, fsync=2, replay=1; > important: fsync=3; > reporting: recv=2, apply=1} We'll need to accommodate something like that too. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes: > On 21/09/10 11:52, Thom Brown wrote: >> My fear would be standby.conf would be edited by users who don't >> really know XML and then we'd have 3 different styles of config to >> tell the user to edit. > I'm not a big fan of XML either. > ... > Then again, maybe we should go with something like json or yaml The fundamental problem with all those "machine editable" formats is that they aren't "people editable". If you have to have a tool (other than a text editor) to change a config file, you're going to be very unhappy when things are broken at 3AM and you're trying to fix it while ssh'd in from your phone. I think the "ini file" format suggestion is probably a good one; it seems to fit this problem, and it's something that people are used to. We could probably shoehorn the info into a pg_hba-like format, but I'm concerned about whether we'd be pushing that format beyond what it can reasonably handle. regards, tom lane
On Tue, Sep 21, 2010 at 11:12 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes: >> On 21/09/10 11:52, Thom Brown wrote: >>> My fear would be standby.conf would be edited by users who don't >>> really know XML and then we'd have 3 different styles of config to >>> tell the user to edit. > >> I'm not a big fan of XML either. >> ... >> Then again, maybe we should go with something like json or yaml > > The fundamental problem with all those "machine editable" formats is > that they aren't "people editable". If you have to have a tool (other > than a text editor) to change a config file, you're going to be very > unhappy when things are broken at 3AM and you're trying to fix it > while ssh'd in from your phone. Agreed. Although, if things are broken at 3AM and I'm trying to fix it while ssh'd in from my phone, I reserve the right to be VERY unhappy no matter what format the file is in. :-) > I think the "ini file" format suggestion is probably a good one; it > seems to fit this problem, and it's something that people are used to. > We could probably shoehorn the info into a pg_hba-like format, but > I'm concerned about whether we'd be pushing that format beyond what > it can reasonably handle. It's not clear how many attributes we'll want to associate with a server. Simon seems to think we can keep it to zero; I think it's positive but I can't say for sure how many there will eventually be. It may also be that a lot of the values will be optional things that are frequently left unspecified. Both of those make me think that a columnar format is probably not best. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
On Tue, 2010-09-21 at 16:58 +0900, Fujii Masao wrote: > On Sat, Sep 18, 2010 at 4:36 AM, Dimitri Fontaine > <dfontaine@hi-media.com> wrote: > > Simon Riggs <simon@2ndQuadrant.com> writes: > >> On Fri, 2010-09-17 at 21:20 +0900, Fujii Masao wrote: > >>> What synchronization level does each combination of sync_replication > >>> and sync_replication_service lead to? > >> > >> There are only 4 possible outcomes. There is no combination, so we don't > >> need a table like that above. > >> > >> The "service" specifies the highest request type available from that > >> specific standby. If someone requests a higher service than is currently > >> offered by this standby, they will either > >> a) get that service from another standby that does offer that level > >> b) automatically downgrade the sync rep mode to the highest available. > > > > I like the a) part, I can't say the same about the b) part. There's no > > reason to accept to COMMIT a transaction when the requested durability > > is known not to have been reached, unless the user said so. Hmm, no reason? The reason is that the alternative is that the session would hang until a standby arrived that offered that level of service. Why would you want that behaviour? Would you really request that option? > Yep, I can imagine that some people want to ensure that *all* the > transactions are synchronously replicated to the synchronous standby, > without regard to sync_replication. So I'm not sure if automatic > downgrade/upgrade of the mode makes sense. We should introduce new > parameter specifying whether to allow automatic degrade/upgrade or not? > It seems complicated though. I agree, but I'm not against additional parameters if people say they really want them *after* the consequences of those choices have been highlighted. IMHO we should focus on the parameters that deliver key use cases. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Training and Services
Robert Haas wrote: > On Tue, Sep 21, 2010 at 11:12 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes: > >> On 21/09/10 11:52, Thom Brown wrote: > >>> My fear would be standby.conf would be edited by users who don't > >>> really know XML and then we'd have 3 different styles of config to > >>> tell the user to edit. > > > >> I'm not a big fan of XML either. > >> ... > >> Then again, maybe we should go with something like json or yaml > > > > The fundamental problem with all those "machine editable" formats is > > that they aren't "people editable". If you have to have a tool (other > > than a text editor) to change a config file, you're going to be very > > unhappy when things are broken at 3AM and you're trying to fix it > > while ssh'd in from your phone. > > Agreed. Although, if things are broken at 3AM and I'm trying to fix > it while ssh'd in from my phone, I reserve the right to be VERY > unhappy no matter what format the file is in. :-) > > > I think the "ini file" format suggestion is probably a good one; it > > seems to fit this problem, and it's something that people are used to. > > We could probably shoehorn the info into a pg_hba-like format, but > > I'm concerned about whether we'd be pushing that format beyond what > > it can reasonably handle. > > It's not clear how many attributes we'll want to associate with a > server. Simon seems to think we can keep it to zero; I think it's > positive but I can't say for sure how many there will eventually be. > It may also be that a lot of the values will be optional things that > are frequently left unspecified. Both of those make me think that a > columnar format is probably not best. Crazy idea, but could we use a format like postgresql.conf by extending postgresql.conf syntax, e.g.: server1.failover = false server1.keep_connect = true -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
> That said, the timeout option also feels a bit wishy-washy to me. With a > timeout, acknowledgment of a commit means "your transaction is safely > committed in the master and slave. Or not, if there was some glitch with > the slave". That doesn't seem like a very useful guarantee; if you're > happy with that why not just use async replication? Ah, I wasn't clear. My thought was that a standby which exceeds the timeout would be marked as "nonresponsive" and no longer included in the list of standbys which needed to be synchronized. That is, the timeout would be a timeout which says "this standby is down". > So the only case where standby registration is required is where you > deliberately choose to *not* have N+1 redundancy and then yet still > require all N standbys to acknowledge. That is a suicidal config and > nobody would sanely choose that. It's not a large or useful use case for > standby reg. (But it does raise the question again of whether we need > quorum commit). Thinking of this as a sysadmin, what I want is to have *one place* I can go and troubleshoot my standby setup. If I have 12 synch standbys and they're creating too much load on the master, and I want to change half of them to async, I don't want to have to ssh into 6 different machines to do so. If one standby needs to be taken out of the network because it's too slow, I want to be able to log in to the master and instantly identify which standby is lagging and remove it there. -- -- Josh Berkus PostgreSQL Experts Inc. http://www.pgexperts.com
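As a concrete illustration of the "one place" point: with a registration file on the master, demoting some standbys to async would be a single edit there, roughly along the lines of the hypothetical snippet below (reusing the ini-style syntax and synchronization_level option sketched earlier in the thread; the section names are invented), instead of touching six separate machines.

-------
[standby7]
synchronization_level = async   # was fsync; demoted to reduce load on the master
[standby8]
synchronization_level = async
-------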
> Crazy idea, but could we use format like postgresql.conf by extending > postgresql.conf syntax, e.g.: > > server1.failover = false > server1.keep_connect = true > Why is this in the config file at all. It should be: synchronous_replication = TRUE/FALSE then ALTER CLUSTER ENABLE REPLICATION FOR FOO; ALTER CLUSTER SET keep_connect ON FOO TO TRUE; Or some such thing. Sincerely, Joshua D. Drake > -- > Bruce Momjian <bruce@momjian.us> http://momjian.us > EnterpriseDB http://enterprisedb.com > > + It's impossible for everything to be true. + -- PostgreSQL - XMPP: jdrake(at)jabber(dot)postgresql(dot)org Consulting, Development, Support, Training 503-667-4564 - http://www.commandprompt.com/ The PostgreSQL Company, serving since 1997
On 22/09/10 03:25, Joshua D. Drake wrote: > Why is this in the config file at all. It should be: > > synchronous_replication = TRUE/FALSE Umm, what does this do? > then > > ALTER CLUSTER ENABLE REPLICATION FOR FOO; > ALTER CLUSTER SET keep_connect ON FOO TO TRUE; > > Or some such thing. I like a configuration file more because you can easily add comments, comment out lines, etc. It also makes it easier to have a different configuration in master and standby. We don't support cascading slaves, yet, but you might still want a different configuration in master and slave, waiting for the moment that the slave is promoted to a new master. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On 21/09/10 18:12, Tom Lane wrote: > Heikki Linnakangas<heikki.linnakangas@enterprisedb.com> writes: >> On 21/09/10 11:52, Thom Brown wrote: >>> My fear would be standby.conf would be edited by users who don't >>> really know XML and then we'd have 3 different styles of config to >>> tell the user to edit. > >> I'm not a big fan of XML either. >> ... >> Then again, maybe we should go with something like json or yaml > > The fundamental problem with all those "machine editable" formats is > that they aren't "people editable". If you have to have a tool (other > than a text editor) to change a config file, you're going to be very > unhappy when things are broken at 3AM and you're trying to fix it > while ssh'd in from your phone. I'm not very familiar with any of those formats, but I agree it needs to be easy to edit by hand first and foremost. > I think the "ini file" format suggestion is probably a good one; it > seems to fit this problem, and it's something that people are used to. > We could probably shoehorn the info into a pg_hba-like format, but > I'm concerned about whether we'd be pushing that format beyond what > it can reasonably handle. The ini file format seems to be enough for the features proposed this far, but I'm a bit concerned that even that might not be flexible enough for future features. I guess we'll cross the bridge when we get there and go with an ini file for now. It should be possible to extend it in various ways, and in the worst case that we have to change to a completely different format, we can provide a how to guide on converting existing config files to the new format. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Hi, On 09/21/2010 08:05 PM, Simon Riggs wrote: > Hmm, no reason? The reason is that the alternative is that the session > would hang until a standby arrived that offered that level of service. > Why would you want that behaviour? Would you really request that option? I think I now agree with Simon on that point. It's only an issue in multi-master replication, where continued operation would lead to a split-brain situation. With master-slave, you only need to make sure your master stays the master even if the standby crash(es) are followed by a master crash. If your cluster-ware is too clever and tries a fail-over on a slave that's quicker to come up, you get the same split-brain situation. Put another way: if you let your master continue, don't ever try a fail-over after a full-cluster crash. Regards Markus Wanner
On 09/22/2010 04:18 AM, Heikki Linnakangas wrote: > On 21/09/10 18:12, Tom Lane wrote: >> Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes: >>> On 21/09/10 11:52, Thom Brown wrote: >>>> My fear would be standby.conf would be edited by users who don't >>>> really know XML and then we'd have 3 different styles of config to >>>> tell the user to edit. >>> I'm not a big fan of XML either. >>> ... >>> Then again, maybe we should go with something like json or yaml >> The fundamental problem with all those "machine editable" formats is >> that they aren't "people editable". If you have to have a tool (other >> than a text editor) to change a config file, you're going to be very >> unhappy when things are broken at 3AM and you're trying to fix it >> while ssh'd in from your phone. > I'm not very familiar with any of those formats, but I agree it needs to be easy to edit by hand first and foremost. >> I think the "ini file" format suggestion is probably a good one; it >> seems to fit this problem, and it's something that people are used to. >> We could probably shoehorn the info into a pg_hba-like format, but >> I'm concerned about whether we'd be pushing that format beyond what >> it can reasonably handle. > The ini file format seems to be enough for the features proposed this far, but I'm a bit concerned that even that might not be flexible enough for future features. I guess we'll cross the bridge when we get there and go with an ini file for now. It should be possible to extend it in various ways, and in the worst case that we have to change to a completely different format, we can provide a how to guide on converting existing config files to the new format. The ini file format is not flexible enough, IMNSHO. If we're going to adopt a new config file format it should have these characteristics, among others: well known (let's not invent a new one); supports hierarchical structure; reasonably readable. I realize that the last is very subjective. Personally, I'm very comfortable with XML, but then I do a *lot* of work with it, and have for many years. I know I'm in a minority on that, and some people just go bananas when they see it. Since we're just about to add a JSON parser to the backend, by the look of it, that looks like a reasonable bet. Maybe it uses a few too many quotes, but that's not really so hard to get your head around, even if it offends you a bit aesthetically. And it is certainly fairly widely known. cheers andrew
On Wed, Sep 22, 2010 at 9:47 AM, Andrew Dunstan <andrew@dunslane.net> wrote: > > The ini file format is not flexible enough, IMNSHO. If we're going to adopt > a new config file format it should have these characteristics, among others: > > well known (let's not invent a new one) > supports hierarchical structure > reasonably readable The ini format meets all of those requirements - and it's certainly far more readable/editable than XML and friends. -- Dave Page Blog: http://pgsnake.blogspot.com Twitter: @pgsnake EnterpriseDB UK: http://www.enterprisedb.com The Enterprise Postgres Company
On 09/22/2010 04:54 AM, Dave Page wrote: > On Wed, Sep 22, 2010 at 9:47 AM, Andrew Dunstan <andrew@dunslane.net> wrote: >> The ini file format is not flexible enough, IMNSHO. If we're going to adopt >> a new config file format it should have these characteristics, among others: >> >> well known (let's not invent a new one) >> supports hierarchical structure >> reasonably readable > The ini format meets all of those requirements - and it's certainly > far more readable/editable than XML and friends. > No, it's really not hierarchical. It only goes one level deep. cheers andrew
On Wed, Sep 22, 2010 at 12:07 PM, Andrew Dunstan <andrew@dunslane.net> wrote: > > > On 09/22/2010 04:54 AM, Dave Page wrote: >> >> On Wed, Sep 22, 2010 at 9:47 AM, Andrew Dunstan<andrew@dunslane.net> >> wrote: >>> >>> The ini file format is not flexible enough, IMNSHO. If we're going to >>> adopt >>> a new config file format it should have these characteristics, among >>> others: >>> >>> well known (let's not invent a new one) >>> supports hierarchical structure >>> reasonably readable >> >> The ini format meets all of those requirements - and it's certainly >> far more readable/editable than XML and friends. >> > > No, it's really not hierarchical. It only has goes one level deep. I guess pgAdmin/wxWidgets are broken then :-) [Servers] Count=5 [Servers/1] Server=localhost Description=PostgreSQL 8.3 ServiceID= DiscoveryID=/PostgreSQL/8.3 Port=5432 StorePwd=true Restore=false Database=postgres Username=postgres LastDatabase=postgres LastSchema=public DbRestriction= Colour=#FFFFFF SSL=0 Group=PPAS Rolename= [Servers/1/Databases] [Servers/1/Databases/postgres] SchemaRestriction= [Servers/1/Databases/pphq] SchemaRestriction= [Servers/1/Databases/template_postgis] SchemaRestriction= [Servers/2] ... ... -- Dave Page Blog: http://pgsnake.blogspot.com Twitter: @pgsnake EnterpriseDB UK: http://www.enterprisedb.com The Enterprise Postgres Company
On ons, 2010-09-22 at 12:20 +0100, Dave Page wrote: > > No, it's really not hierarchical. It only has goes one level deep. > > I guess pgAdmin/wxWidgets are broken then :-) > > [Servers] > Count=5 > [Servers/1] > Server=localhost Well, by that logic, even what we have now for postgresql.conf is hierarchical. I think the criterion was rather meant to be - can represent hierarchies without repeating intermediate node names (Note: no opinion on which format is better for the task at hand)
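A tiny illustration of the criterion Peter states, using the names from Dave's pgAdmin listing; the syntax on both sides is invented here purely for comparison. The flat form has to repeat the intermediate node names on every line, while a nested form names them once:

-------
# flat, repeating intermediate names:
Servers/1/Databases/postgres/SchemaRestriction =
Servers/1/Databases/pphq/SchemaRestriction =

# nested, naming them once (hypothetical syntax):
Servers/1 { Databases { postgres { SchemaRestriction = "" } pphq { SchemaRestriction = "" } } }
-------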
On 09/22/2010 07:20 AM, Dave Page wrote: > On Wed, Sep 22, 2010 at 12:07 PM, Andrew Dunstan<andrew@dunslane.net> wrote: >> >> On 09/22/2010 04:54 AM, Dave Page wrote: >>> On Wed, Sep 22, 2010 at 9:47 AM, Andrew Dunstan<andrew@dunslane.net> >>> wrote: >>>> The ini file format is not flexible enough, IMNSHO. If we're going to >>>> adopt >>>> a new config file format it should have these characteristics, among >>>> others: >>>> >>>> well known (let's not invent a new one) >>>> supports hierarchical structure >>>> reasonably readable >>> The ini format meets all of those requirements - and it's certainly >>> far more readable/editable than XML and friends. >>> >> No, it's really not hierarchical. It only has goes one level deep. > I guess pgAdmin/wxWidgets are broken then :-) > > [Servers] > Count=5 > [Servers/1] > Server=localhost > Description=PostgreSQL 8.3 > ServiceID= > DiscoveryID=/PostgreSQL/8.3 > Port=5432 > StorePwd=true > Restore=false > Database=postgres > Username=postgres > LastDatabase=postgres > LastSchema=public > DbRestriction= > Colour=#FFFFFF > SSL=0 > Group=PPAS > Rolename= > [Servers/1/Databases] > [Servers/1/Databases/postgres] > SchemaRestriction= > [Servers/1/Databases/pphq] > SchemaRestriction= > [Servers/1/Databases/template_postgis] > SchemaRestriction= > [Servers/2] > ... > ... Well, that's not what I'd call a hierarchy, in any sane sense. I've often had to dig all over the place in ini files to find related bits of information in disparate parts of the file. Compared to a meaningful tree structure this is utterly woeful. In a sensible hierarchical format, all the information relating to, say, Servers/1 above, would be under a stanza with that heading, instead of having separate and unnested stanzas like Servers/1/Databases/template_postgis. If you could nest stanzas in ini file format it would probably do, but you can't, leading to the above major ugliness. cheers andrew
On Wed, Sep 22, 2010 at 12:50 PM, Peter Eisentraut <peter_e@gmx.net> wrote: > On ons, 2010-09-22 at 12:20 +0100, Dave Page wrote: >> > No, it's really not hierarchical. It only has goes one level deep. >> >> I guess pgAdmin/wxWidgets are broken then :-) >> >> [Servers] >> Count=5 >> [Servers/1] >> Server=localhost > > Well, by that logic, even what we have now for postgresql.conf is > hierarchical. Well, yes - if you consider add-in GUCs which use prefixing like foo.setting=... > I think the criterion was rather meant to be > > - can represent hierarchies without repeating intermediate node names If this were data, I could understand that as it could lead to tremendous bloat, but as a config file, I'd rather have the readability of the ini format, despite the repeated node names, than have to hack XML files by hand. -- Dave Page Blog: http://pgsnake.blogspot.com Twitter: @pgsnake EnterpriseDB UK: http://www.enterprisedb.com The Enterprise Postgres Company
On 09/22/2010 07:57 AM, Dave Page wrote: > On Wed, Sep 22, 2010 at 12:50 PM, Peter Eisentraut <peter_e@gmx.net> wrote: >> On ons, 2010-09-22 at 12:20 +0100, Dave Page wrote: >>>> No, it's really not hierarchical. It only has goes one level deep. >>> I guess pgAdmin/wxWidgets are broken then :-) >>> [Servers] >>> Count=5 >>> [Servers/1] >>> Server=localhost >> Well, by that logic, even what we have now for postgresql.conf is hierarchical. > Well, yes - if you consider add-in GUCs which use prefixing like foo.setting=... >> I think the criterion was rather meant to be >> - can represent hierarchies without repeating intermediate node names > If this were data, I could understand that as it could lead to tremendous bloat, but as a config file, I'd rather have the readability of the ini format, despite the repeated node names, than have to hack XML files by hand. XML is not the only alternative - please don't use it as a straw man. For example, here is a fragment from the Bacula docs using their hierarchical format: FileSet { Name = Test Include { File = /home/xxx/test Options { regex = ".*\.c$" } } } Or here is a piece from the buildfarm client config (which is in fact perl, but could also be JSON or similar fairly easily): mail_events => { all => [], fail => [], change => ['foo@bar.com', 'baz@blurfl.org' ], green => [], }, build_env => { CCACHE_DIR => "/home/andrew/pgfarmbuild/ccache/$branch", }, cheers andrew
On Wed, Sep 22, 2010 at 1:25 PM, Andrew Dunstan <andrew@dunslane.net> wrote: > XML is not the only alternative - please don't use it as a straw man. For > example, here is a fragment from the Bacula docs using their hierarchical > format: > > FileSet { > Name = Test > Include { > File = /home/xxx/test > Options { > regex = ".*\.c$" > } > } > } > > Or here is a piece from the buildfarm client config (which is in fact perl, > but could also be JSON or similar fairly easily): > > mail_events => > { > all => [], > fail => [], > change => ['foo@bar.com', 'baz@blurfl.org' ], > green => [], > }, > build_env => > { > CCACHE_DIR => "/home/andrew/pgfarmbuild/ccache/$branch", > }, Both of which I've also used in the past, and also find uncomfortable and awkward for configuration files. -- Dave Page Blog: http://pgsnake.blogspot.com Twitter: @pgsnake EnterpriseDB UK: http://www.enterprisedb.com The Enterprise Postgres Company
On 09/22/2010 08:32 AM, Dave Page wrote: > On Wed, Sep 22, 2010 at 1:25 PM, Andrew Dunstan<andrew@dunslane.net> wrote: >> XML is not the only alternative - please don't use it as a straw man. For >> example, here is a fragment from the Bacula docs using their hierarchical >> format: >> >> FileSet { >> Name = Test >> Include { >> File = /home/xxx/test >> Options { >> regex = ".*\.c$" >> } >> } >> } >> >> Or here is a piece from the buildfarm client config (which is in fact perl, >> but could also be JSON or similar fairly easily): >> >> mail_events => >> { >> all => [], >> fail => [], >> change => ['foo@bar.com', 'baz@blurfl.org' ], >> green => [], >> }, >> build_env => >> { >> CCACHE_DIR => "/home/andrew/pgfarmbuild/ccache/$branch", >> }, > Both of which I've also used in the past, and also find uncomfortable > and awkward for configuration files. > > I can't imagine trying to configure Bacula using ini file format - the mind just boggles. Frankly, I'd rather stick with our current config format than change to something as inadequate as ini file format. cheers andrew
On Wed, Sep 22, 2010 at 9:01 AM, Andrew Dunstan <andrew@dunslane.net> wrote: > I can't imagine trying to configure Bacula using ini file format - the mind > just boggles. Frankly, I'd rather stick with our current config format than > change to something as inadequate as ini file format. Perhaps we need to define a little better what information we think we might eventually need to represent in the config file. With one exception, nobody has suggested anything that would actually require hierarchical structure. The exception is defining the policy for deciding when a commit has been sufficiently acknowledged by an adequate quorum of standbys, and it seems to me that doing that in its full generality is going to require not so much a hierarchical structure as a small programming language. The efforts so far have centered around reducing the use cases that $AUTHOR cares about to a set of GUCs which would satisfy that person's needs, but not necessarily everyone else's needs. I think efforts to encode arbitrary algorithms using configuration settings are doomed to failure, so I'm unimpressed by the argument that we should design the config file to support our attempts to do so. For everything else, no one has suggested that we need anything more complex than, essentially, a group of GUCs per server. So we could do: [server] guc=value or server.guc=value ...or something else. Designing this to support: server.hypothesis.experimental.unproven.imaginary.what-in-the-world-could-this-possibly-be = 42 ...seems pretty speculative at this point, unless someone can imagine what we'd want it for. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
On Tue, 2010-09-21 at 17:04 -0700, Josh Berkus wrote: > > That said, the timeout option also feels a bit wishy-washy to me. With a > > timeout, acknowledgment of a commit means "your transaction is safely > > committed in the master and slave. Or not, if there was some glitch with > > the slave". That doesn't seem like a very useful guarantee; if you're > > happy with that why not just use async replication? > > Ah, I wasn't clear. My thought was that a standby which exceeds the > timeout would be marked as "nonresponsive" and no longer included in the > list of standbys which needed to be synchronized. That is, the timeout > would be a timeout which says "this standby is down". > > > So the only case where standby registration is required is where you > > deliberately choose to *not* have N+1 redundancy and then yet still > > require all N standbys to acknowledge. That is a suicidal config and > > nobody would sanely choose that. It's not a large or useful use case for > > standby reg. (But it does raise the question again of whether we need > > quorum commit). This is becoming very confusing. Some people advocating "standby registration" have claimed it allows capabilities which aren't possible any other way; all but one of those claims has so far been wrong - the remaining case is described above. If I'm the one that is wrong, please tell me where I erred. > Thinking of this as a sysadmin, what I want is to have *one place* I can > go an troubleshoot my standby setup. If I have 12 synch standbys and > they're creating too much load on the master, and I want to change half > of them to async, I don't want to have to ssh into 6 different machines > to do so. If one standby needs to be taken out of the network because > it's too slow, I want to be able to log in to the master and instantly > identify which standby is lagging and remove it there. The above case is one where I can see your point and it does sound easier in that case. But I then think: "What happens after failover?". We would then need to have 12 different standby.conf files, one on each standby that describes what the setup would look like if that standby became the master. And guess what, every time we made a change on the master, you'd need to re-edit all 12 standby.conf files to reflect the new configuration. So we're still back to having to edit in multiple places, ISTM. Please, please, somebody write down what the design proposal is *before* we make a decision on whether it is a sensible way to proceed. It would be good to see a few options written down and some objective analysis of which way is best to let people decide. -- Simon Riggs www.2ndQuadrant.comPostgreSQL Development, 24x7 Support, Training and Services
Robert Haas wrote: > [server] > guc=value > > or > > server.guc=value ^^^^^^^^^^^^^^^^ Yes, this was my idea too. It uses our existing config file format. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
On 22 September 2010 17:23, Bruce Momjian <bruce@momjian.us> wrote: > Robert Haas wrote: >> [server] >> guc=value >> >> or >> >> server.guc=value > ^^^^^^^^^^^^^^^^ > > Yes, this was my idea too. It uses our existing config file format. > So... sync_rep_services = {critical: recv=2, fsync=2, replay=1; important: fsync=3; reporting:recv=2, apply=1} becomes ... sync_rep_services.critical.recv = 2 sync_rep_services.critical.fsync = 2 sync_rep_services.critical.replay = 2 sync_rep_services.important.fsync = 3 sync_rep_services.reporting.recv = 2 sync_rep_services.reporting.apply = 1 I actually started to give this example to demonstrate how cumbersome it would look... but now that I've just typed it out, I've changed my mind. I actually like it! -- Thom Brown Twitter: @darkixion IRC (freenode): dark_ixion Registered Linux user: #516935
Thom Brown wrote: > On 22 September 2010 17:23, Bruce Momjian <bruce@momjian.us> wrote: > > Robert Haas wrote: > >> [server] > >> guc=value > >> > >> or > >> > >> server.guc=value > > ^^^^^^^^^^^^^^^^ > > > > Yes, this was my idea too. It uses our existing config file format. > > > > So... > > sync_rep_services = {critical: recv=2, fsync=2, replay=1; > important: fsync=3; > reporting: recv=2, apply=1} > > becomes ... > > sync_rep_services.critical.recv = 2 > sync_rep_services.critical.fsync = 2 > sync_rep_services.critical.replay = 2 > sync_rep_services.important.fsync = 3 > sync_rep_services.reporting.recv = 2 > sync_rep_services.reporting.apply = 1 > > I actually started to give this example to demonstrate how cumbersome > it would look... but now that I've just typed it out, I've changed my > mind. I actually like it! It can be prone to mistyping, but it seems simple enough. We already throw a nice error for mistypes in the server logs. :-) I don't think we support 3rd-level specifications, but we could. Looks very Java-ish. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
On Wed, 2010-09-22 at 17:43 +0100, Thom Brown wrote: > So... > > sync_rep_services = {critical: recv=2, fsync=2, replay=1; > important: fsync=3; > reporting: recv=2, apply=1} > > becomes ... > > sync_rep_services.critical.recv = 2 > sync_rep_services.critical.fsync = 2 > sync_rep_services.critical.replay = 2 > sync_rep_services.important.fsync = 3 > sync_rep_services.reporting.recv = 2 > sync_rep_services.reporting.apply = 1 > > I actually started to give this example to demonstrate how cumbersome > it would look... but now that I've just typed it out, I've changed my > mind. I actually like it! With respect, this is ugly. Very ugly. Why do we insist on cryptic parameters within a config file which should be set within the database by a superuser? I mean really? ALTER CLUSTER ENABLE [SYNC] REPLICATION ON db.foobar.com PORT 5432 ALIAS CRITICAL; ALTER CLUSTER SET REPLICATION CRITICAL RECEIVE FOR 2; ALTER CLUSTER SET REPLICATION CRITICAL FSYNC FOR 2; ALTER CLUSTER SET REPLICATION CRITICAL REPLAY FOR 2; Or some such thing. I saw Heikki's reply but really the idea that we are shoving this all into the postgresql.conf is cumbersome. Sincerely, Joshua D. Drake -- PostgreSQL.org Major Contributor Command Prompt, Inc: http://www.commandprompt.com/ - 509.416.6579 Consulting, Training, Support, Custom Development, Engineering http://twitter.com/cmdpromptinc | http://identi.ca/commandprompt
On Wed, Sep 22, 2010 at 12:51 PM, Joshua D. Drake <jd@commandprompt.com> wrote: > On Wed, 2010-09-22 at 17:43 +0100, Thom Brown wrote: > >> So... >> >> sync_rep_services = {critical: recv=2, fsync=2, replay=1; >> important: fsync=3; >> reporting: recv=2, apply=1} >> >> becomes ... >> >> sync_rep_services.critical.recv = 2 >> sync_rep_services.critical.fsync = 2 >> sync_rep_services.critical.replay = 2 >> sync_rep_services.important.fsync = 3 >> sync_rep_services.reporting.recv = 2 >> sync_rep_services.reporting.apply = 1 >> >> I actually started to give this example to demonstrate how cumbersome >> it would look... but now that I've just typed it out, I've changed my >> mind. I actually like it! > > With respect, this is ugly. Very ugly. Why do we insist on cryptic > parameters within a config file which should be set within the database > by a super user. > > I mean really? > > ALTER CLUSTER ENABLE [SYNC] REPLICATION ON db.foobar.com PORT 5432 ALIAS > CRITICAL; > ALTER CLUSTER SET REPLICATION CRITICAL RECEIVE FOR 2; > ALTER CLUSTER SET REPLICATION CRITICAL FSYNC FOR 2; > ALTER CLUSTER SET REPLICATION CRITICAL REPLAY FOR 2; > > Or some such thing. I saw Heiiki's reply but really the idea that we are > shoving this all into the postgresql.conf is cumbersome. I think it should be a separate config file, and I think it should be a config file that can be edited using DDL commands as you propose. But it CAN'T be a system catalog, because, among other problems, that rules out cascading slaves, which are a feature a lot of people probably want to eventually have. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
On 22/09/10 20:00, Robert Haas wrote: > But it CAN'T be a system catalog, because, among other problems, that > rules out cascading slaves, which are a feature a lot of people > probably want to eventually have. FWIW it could be a system catalog backed by a flat file. But I'm not in favor of that for the other reasons I stated earlier. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Wed, Sep 22, 2010 at 8:12 AM, Simon Riggs <simon@2ndquadrant.com> wrote: Not speaking to the necessity of standby registration, but... >> Thinking of this as a sysadmin, what I want is to have *one place* I can >> go an troubleshoot my standby setup. If I have 12 synch standbys and >> they're creating too much load on the master, and I want to change half >> of them to async, I don't want to have to ssh into 6 different machines >> to do so. If one standby needs to be taken out of the network because >> it's too slow, I want to be able to log in to the master and instantly >> identify which standby is lagging and remove it there. > > The above case is one where I can see your point and it does sound > easier in that case. But I then think: "What happens after failover?". > We would then need to have 12 different standby.conf files, one on each > standby that describes what the setup would look like if that standby > became the master. And guess what, every time we made a change on the > master, you'd need to re-edit all 12 standby.conf files to reflect the > new configuration. So we're still back to having to edit in multiple > places, ISTM. An interesting option here might be to have "replication.conf" (instead of standby.conf) which would list all servers, and a postgresql.conf setting which would set the "local name" the master would then ignore. Then all PG servers (master+slave) would be able to have identical replication.conf files, only having to know their own "name". Their own name could be GUC, from postgresql.conf, or from command line options, or default to hostname, whatever.
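A rough sketch of that idea, with every file and parameter name invented purely for illustration: all nodes could share one identical replication.conf and differ only in a single local setting that names the node itself.

    # replication.conf, identical on every node
    nodeA.host = 10.0.0.1
    nodeA.sync = on
    nodeB.host = 10.0.0.2
    nodeB.sync = off

    # postgresql.conf, the only per-node difference
    node_name = 'nodeA'

Each server would then simply skip (or treat specially) the entry that matches its own node_name.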
On Wed, 2010-09-22 at 13:00 -0400, Robert Haas wrote: > > I mean really? > > > > ALTER CLUSTER ENABLE [SYNC] REPLICATION ON db.foobar.com PORT 5432 ALIAS > > CRITICAL; > > ALTER CLUSTER SET REPLICATION CRITICAL RECEIVE FOR 2; > > ALTER CLUSTER SET REPLICATION CRITICAL FSYNC FOR 2; > > ALTER CLUSTER SET REPLICATION CRITICAL REPLAY FOR 2; > > > > Or some such thing. I saw Heiiki's reply but really the idea that we are > > shoving this all into the postgresql.conf is cumbersome. > > I think it should be a separate config file, and I think it should be > a config file that can be edited using DDL commands as you propose. > But it CAN'T be a system catalog, because, among other problems, that > rules out cascading slaves, which are a feature a lot of people > probably want to eventually have. I guarantee you there is a way around the cascade slave problem. I believe there will be "some" postgresql.conf pollution. I don't see any other way around that but the conf should be limited to things that literally have to be expressed in a conf for specific static purposes. I was talking with Bruce on Jabber and one of his concerns with my approach is "polluting the SQL space for non-admins". I certainly appreciate that my solution puts code in more places and that it may be more of a burden for the hackers. However, we aren't building this for hackers. Most hackers don't even use the product. We are building it for our community, which are by far user space developers and dbas. Sincerely, Joshua D. Drake -- PostgreSQL.org Major Contributor Command Prompt, Inc: http://www.commandprompt.com/ - 509.416.6579 Consulting, Training, Support, Custom Development, Engineering http://twitter.com/cmdpromptinc | http://identi.ca/commandprompt
Heikki Linnakangas wrote: > On 22/09/10 20:00, Robert Haas wrote: > > But it CAN'T be a system catalog, because, among other problems, that > > rules out cascading slaves, which are a feature a lot of people > > probably want to eventually have. > > FWIW it could be a system catalog backed by a flat file. But I'm not in > favor of that for the other reasons I stated earlier. I thought we just eliminated flat file backing store for tables to improve replication behavior --- I don't see returning to that as a win. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
On 22/09/10 20:02, Heikki Linnakangas wrote: > On 22/09/10 20:00, Robert Haas wrote: >> But it CAN'T be a system catalog, because, among other problems, that >> rules out cascading slaves, which are a feature a lot of people >> probably want to eventually have. > > FWIW it could be a system catalog backed by a flat file. But I'm not in > favor of that for the other reasons I stated earlier. Huh, I just realized that my reply didn't make any sense. For some reason I thought you were saying that it can't be a catalog because backends need to access it without attaching to a database. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Wed, Sep 22, 2010 at 1:09 PM, Joshua D. Drake <jd@commandprompt.com> wrote: > On Wed, 2010-09-22 at 13:00 -0400, Robert Haas wrote: > >> > I mean really? >> > >> > ALTER CLUSTER ENABLE [SYNC] REPLICATION ON db.foobar.com PORT 5432 ALIAS >> > CRITICAL; >> > ALTER CLUSTER SET REPLICATION CRITICAL RECEIVE FOR 2; >> > ALTER CLUSTER SET REPLICATION CRITICAL FSYNC FOR 2; >> > ALTER CLUSTER SET REPLICATION CRITICAL REPLAY FOR 2; >> > >> > Or some such thing. I saw Heiiki's reply but really the idea that we are >> > shoving this all into the postgresql.conf is cumbersome. >> >> I think it should be a separate config file, and I think it should be >> a config file that can be edited using DDL commands as you propose. >> But it CAN'T be a system catalog, because, among other problems, that >> rules out cascading slaves, which are a feature a lot of people >> probably want to eventually have. > > I guarantee you there is a way around the cascade slave problem. And that would be...? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
Robert Haas <robertmhaas@gmail.com> writes: > On Wed, Sep 22, 2010 at 1:09 PM, Joshua D. Drake <jd@commandprompt.com> wrote: >> On Wed, 2010-09-22 at 13:00 -0400, Robert Haas wrote: >>> But it CAN'T be a system catalog, because, among other problems, that >>> rules out cascading slaves, which are a feature a lot of people >>> probably want to eventually have. >> >> I guarantee you there is a way around the cascade slave problem. > And that would be...? Indeed. If it's a catalog then it has to be exactly the same on the master and every slave; which is probably a constraint we don't want for numerous reasons, not only cascade arrangements. regards, tom lane
On Wed, 2010-09-22 at 13:26 -0400, Tom Lane wrote: > Robert Haas <robertmhaas@gmail.com> writes: > > On Wed, Sep 22, 2010 at 1:09 PM, Joshua D. Drake <jd@commandprompt.com> wrote: > >> On Wed, 2010-09-22 at 13:00 -0400, Robert Haas wrote: > >>> But it CAN'T be a system catalog, because, among other problems, that > >>> rules out cascading slaves, which are a feature a lot of people > >>> probably want to eventually have. > >> > >> I guarantee you there is a way around the cascade slave problem. > > > And that would be...? > > Indeed. If it's a catalog then it has to be exactly the same on the > master and every slave; which is probably a constraint we don't want > for numerous reasons, not only cascade arrangements. Unless I am missing something the catalog only needs information for its specific cluster. E.g; My Master is, I am master for. Joshua D. Drake -- PostgreSQL.org Major Contributor Command Prompt, Inc: http://www.commandprompt.com/ - 509.416.6579 Consulting, Training, Support, Custom Development, Engineering http://twitter.com/cmdpromptinc | http://identi.ca/commandprompt
"Joshua D. Drake" <jd@commandprompt.com> writes: > Unless I am missing something the catalog only needs information for its > specific cluster. E.g; My Master is, I am master for. I think the "cluster" here is composed of all and any server partaking into the replication network, whatever its role and cascading level, because we only support one master. As soon as the setup is replicated too, you can edit the setup from the one true master and from nowhere else, so the single authority must contain the whole setup. Now that doesn't mean all lines in the setup couldn't refer to a provider which could be different from the master in the case of cascading. What I don't understand is why the replication network topology can't get serialized into a catalog? Then again, assuming that a catalog ain't possible, I guess any file based setup will mean manual syncing of the whole setup at all the servers participating in the replication? If that's the case, I'll say it again, it looks like a nightmare to admin and I'd much prefer having a distributed setup, where any standby's setup is simple and directed to a single remote node, its provider. Please note also that such an arrangement doesn't preclude from having a way to register the standbys (automatically please) and requiring some action to enable the replication from their provider, and possibly from the master. But as there's already the hba to setup, I'd think paranoid sites are covered already. Regards, -- dim
All: I feel compelled to point out that, to date, there have been three times as many comments on what format the configuration file should be as there have been on what options it should support and how large numbers of replicas should be managed. -- -- Josh Berkus PostgreSQL Experts Inc. http://www.pgexperts.com P.S. You folks aren't imaginative enough. Tab-delimited files. Random dot images. Ogham!
Tom Lane wrote: > Robert Haas <robertmhaas@gmail.com> writes: > >> On Wed, Sep 22, 2010 at 1:09 PM, Joshua D. Drake <jd@commandprompt.com> wrote: >> >>> On Wed, 2010-09-22 at 13:00 -0400, Robert Haas wrote: >>> >>>> But it CAN'T be a system catalog, because, among other problems, that >>>> rules out cascading slaves, which are a feature a lot of people >>>> probably want to eventually have. >>>> >>> I guarantee you there is a way around the cascade slave problem. >>> > > >> And that would be...? >> > > Indeed. If it's a catalog then it has to be exactly the same on the > master and every slave; which is probably a constraint we don't want > for numerous reasons, not only cascade arrangements. > It might be an idea to store the replication information outside of all clusters involved in the replication, to not depend on any failure of the master or any of the slaves. We've been using Apache's zookeeper http://hadoop.apache.org/zookeeper/ to keep track of configuration-like knowledge that must be distributed over a number of servers. While Zookeeper itself is probably not fit (java) to use in core Postgres to keep track of configuration information, what it provides seems like the perfect solution, especially group membership and a replicated directory-like database (with per directory node a value). regards, Yeb Havinga
On 22 September 2010 19:50, Josh Berkus <josh@agliodbs.com> wrote: > All: > > I feel compelled to point out that, to date, there have been three times > as many comments on what format the configuration file should be as > there have been on what options it should support and how large numbers > of replicas should be managed. I know, it's terrible!... I think it should be green. -- Thom Brown Twitter: @darkixion IRC (freenode): dark_ixion Registered Linux user: #516935
On Wed, 2010-09-22 at 21:05 +0100, Thom Brown wrote: > On 22 September 2010 19:50, Josh Berkus <josh@agliodbs.com> wrote: > > All: > > > > I feel compelled to point out that, to date, there have been three times > > as many comments on what format the configuration file should be as > > there have been on what options it should support and how large numbers > > of replicas should be managed. > > I know, it's terrible!... I think it should be green. Remove the shadow please. > > -- > Thom Brown > Twitter: @darkixion > IRC (freenode): dark_ixion > Registered Linux user: #516935 > -- PostgreSQL.org Major Contributor Command Prompt, Inc: http://www.commandprompt.com/ - 509.416.6579 Consulting, Training, Support, Custom Development, Engineering http://twitter.com/cmdpromptinc | http://identi.ca/commandprompt
> The above case is one where I can see your point and it does sound > easier in that case. But I then think: "What happens after failover?". > We would then need to have 12 different standby.conf files, one on each > standby that describes what the setup would look like if that standby > became the master. And guess what, every time we made a change on the > master, you'd need to re-edit all 12 standby.conf files to reflect the > new configuration. So we're still back to having to edit in multiple > places, ISTM. Unless we can make the standby.conf files identical on all servers in the group. If we can do that, then conf file management utilities, fileshares, or a simple automated rsync could easily take care of things. But ... any setup which involves each standby being *required* to have a different configuration on each standby server, which has to be edited separately, is going to be fatally difficult to manage for anyone who has more than a couple of standbys. So I'd like to look at what it takes to get away from that. -- -- Josh Berkus PostgreSQL Experts Inc. http://www.pgexperts.com
On Mon, 2010-09-20 at 18:24 -0400, Robert Haas wrote: > I feel like that's really nice and simple. There are already 5 separate places to configure to make streaming rep work in a 2 node cluster (master.pg_hba.conf, master.postgresql.conf, standby.postgresql.conf, standby.recovery.conf, password file/ssh key). I haven't heard anyone say we would be removing controls from those existing areas, so it isn't clear to me how adding a 6th place will make things "nice and simple". Put simply, Standby registration is not required for most use cases. If some people want it, I'm happy that it can be optional. Personally, I want to make very sure that any behaviour that involves waiting around indefinitely can be turned off and should be off by default. ISTM very simple to arrange things so you can set parameters on the master OR on the standby, whichever is most convenient or desirable. Passing parameters around at handshake is pretty trivial. I do also understand that some parameters *must* be set in certain locations to gain certain advantages. Those can be documented. I would be happier if we could separate the *list* of control parameters we need from the issue of *where* we set those parameters. I would be even happier if we could agree on the top 3-5 parameters so we can implement those first. -- Simon Riggs www.2ndQuadrant.comPostgreSQL Development, 24x7 Support, Training and Services
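For reference, the five places mentioned here look roughly like the following in a minimal 9.0-style streaming-replication pair; the host names, user name and values are only examples:

    # 1. master pg_hba.conf
    host  replication  repuser  192.168.1.2/32  md5

    # 2. master postgresql.conf
    wal_level = hot_standby
    max_wal_senders = 1
    wal_keep_segments = 32

    # 3. standby postgresql.conf
    hot_standby = on

    # 4. standby recovery.conf
    standby_mode = 'on'
    primary_conninfo = 'host=192.168.1.1 port=5432 user=repuser'

    # 5. a ~/.pgpass entry (or ssh key) so the standby can authenticate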
On 23/09/10 11:34, Csaba Nagy wrote: > In the meantime our DBs are not able to keep in sync via WAL > replication, that would need some kind of parallel WAL restore on the > slave I guess, or I'm not able to configure it properly - in any case > now we use slony which is working. It would be interesting to debug that case a bit more. Was it bottlenecked by CPU or I/O, or network capacity perhaps? -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Hi all, Some time ago I was also interested in this feature, and at that time I also thought about the possibility of doing the complete setup via postgres connections, meaning that the transfer of the files and all configuration/slave registration would be done through normal backend connections. In the meantime our DBs are not able to keep in sync via WAL replication, that would need some kind of parallel WAL restore on the slave I guess, or I'm not able to configure it properly - in any case now we use slony which is working. In fact the way slony does its configuration could be a good place to look... On Wed, 2010-09-22 at 13:16 -0400, Robert Haas wrote: > > I guarantee you there is a way around the cascade slave problem. > > And that would be...? * restrict the local file configuration to a replication ID; * make all configuration refer to the replica ID; * keep all configuration in a shared catalog: it can be kept exactly the same on all replicas, as each replication "node" will only care about the configuration concerning its own replica ID; * added advantage: after take-over the slave will change the configured master to its own replica ID, and if the old master would ever connect again, it could easily notice that and give up; Cheers, Csaba.
On Thu, 2010-09-23 at 12:02 +0300, Heikki Linnakangas wrote: > On 23/09/10 11:34, Csaba Nagy wrote: > > In the meantime our DBs are not able to keep in sync via WAL > > replication, that would need some kind of parallel WAL restore on the > > slave I guess, or I'm not able to configure it properly - in any case > > now we use slony which is working. > > It would be interesting to debug that case a bit more. Was it bottlenecked > by CPU or I/O, or network capacity perhaps? Unfortunately it was quite a long time ago that we last tried, and I don't remember exactly what was bottlenecked. Our application is quite write-intensive, the ratio of writes to reads that actually reach the disk is about 50-200% (according to the disk stats - yes, sometimes we write more to the disk than we read, probably due to the relatively large RAM installed). If I remember correctly, the standby was about the same as the master regarding IO/CPU power, but it was not able to process the WAL files as fast as they were coming in, which excludes at least the network as a bottleneck. What I actually suppose happens is that the one single process applying the WAL on the slave is not able to match the full IO the master is able to do with all its processors. If you're interested, I could set up another try, but it would be on 8.3.7 (that's what we still run). 9.x would also be interesting, but that would be a test system and I can't possibly reproduce there the load we have on production... Cheers, Csaba.
On 23/09/10 15:26, Csaba Nagy wrote: > Unfortunately it was quite long time ago we last tried, and I don't > remember exactly what was bottlenecked. Our application is quite > write-intensive, the ratio of writes to reads which actually reaches the > disk is about 50-200% (according to the disk stats - yes, sometimes we > write more to the disk than we read, probably due to the relatively > large RAM installed). If I remember correctly, the standby was about the > same regarding IO/CPU power as the master, but it was not able to > process the WAL files as fast as they were coming in, which excludes at > least the network as a bottleneck. What I actually suppose happens is > that the one single process applying the WAL on the slave is not able to > match the full IO the master is able to do with all it's processors. There's a program called pg_readahead somewhere on pgfoundry by NTT that will help if it's the single-threadedness of I/O. Before handing the WAL file to the server, it scans it through and calls posix_fadvise for all the blocks that it touches. When the server then replays it, the data blocks are already being fetched by the OS, using the whole RAID array. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Wed, 2010-09-22 at 13:00 -0400, Robert Haas wrote: > I think it should be a separate config file, and I think it should be > a config file that can be edited using DDL commands as you propose. > But it CAN'T be a system catalog, because, among other problems, that > rules out cascading slaves, which are a feature a lot of people > probably want to eventually have. ISTM that we can have a system catalog and still have cascading slaves. If we administer the catalog via the master, why can't we administer all slaves, however they cascade, via the master too? What other problems are there that mean we *must* have a file? I can't see any. Elsewhere, we've established that we can have unregistered standbys, so max_wal_senders cannot go away. If we do have a file, it will be a problem after failover since the file will be either absent or potentially out of date. -- Simon Riggs www.2ndQuadrant.comPostgreSQL Development, 24x7 Support, Training and Services
Simon Riggs <simon@2ndQuadrant.com> writes: > ISTM that we can have a system catalog and still have cascading slaves. > If we administer the catalog via the master, why can't we administer all > slaves, however they cascade, via the master too? > What other problems are there that mean we *must* have a file? Well, for one thing, how do you add a new slave? If its configuration comes from a system catalog, it seems that it has to already be replicating before it knows what its configuration is. regards, tom lane
On Thu, 2010-09-23 at 11:43 -0400, Tom Lane wrote: > Simon Riggs <simon@2ndQuadrant.com> writes: > > ISTM that we can have a system catalog and still have cascading slaves. > > If we administer the catalog via the master, why can't we administer all > > slaves, however they cascade, via the master too? > > > What other problems are there that mean we *must* have a file? > > Well, for one thing, how do you add a new slave? If its configuration > comes from a system catalog, it seems that it has to already be > replicating before it knows what its configuration is. At the moment, I'm not aware of any proposed parameters that need to be passed from master to standby, since that was one of the arguments for standby registration in the first place. If that did occur, when the standby connects it would get told what parameters to use by the master as part of the handshake. It would have to work exactly that way with standby.conf on the master also. -- Simon Riggs www.2ndQuadrant.comPostgreSQL Development, 24x7 Support, Training and Services
On Thu, Sep 23, 2010 at 11:32 AM, Simon Riggs <simon@2ndquadrant.com> wrote: > On Wed, 2010-09-22 at 13:00 -0400, Robert Haas wrote: > >> I think it should be a separate config file, and I think it should be >> a config file that can be edited using DDL commands as you propose. >> But it CAN'T be a system catalog, because, among other problems, that >> rules out cascading slaves, which are a feature a lot of people >> probably want to eventually have. > > ISTM that we can have a system catalog and still have cascading slaves. > If we administer the catalog via the master, why can't we administer all > slaves, however they cascade, via the master too? Well, I guess we could, but is that really convenient? My gut feeling is no, but of course it's subjective. > What other problems are there that mean we *must* have a file? I can't > see any. Elsewhere, we've established that we can have unregistered > standbys, so max_wal_senders cannot go away. > > If we do have a file, it will be a problem after failover since the file > will be either absent or potentially out of date. I'm not sure about that. I wonder if we can actually turn this into a feature, with careful design. Suppose that you have the common configuration of two machines, A and B. At any given time, one is the master and one is the slave. And let's say you've opted for sync rep, apply mode, don't wait for disconnected standbys. Well, you can have a config file on A that defines B as the slave, and a config file on B that defines A as the slave. When failover happens, you still have to worry about taking a new base backup, removing recovery.conf from the new master and adding it to the slave, and all that stuff, but the standby config just works. Now, admittedly, in more complex topologies, and especially if you're using configuration options that pertain to the behavior of disconnected standbys (e.g. wait for them, or retain WAL for them), you're going to need to adjust the configs. But I think that's likely to be true anyway, even with a catalog. If A is doing sync rep and waiting for B even when B is disconnected, and the machines switch roles, it's hard to see how any configuration isn't going to need some adjustment. One thing that's nice about the flat file system is that you can make the configuration changes on the new master before you promote it (perhaps you had A replicating synchronously to B and B replicating asynchronously to C, but now that A is dead and B is promoted, you want the latter replication to become synchronous). Being able to make those kinds of changes before you start processing live transactions is possibly useful to some people. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
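A minimal sketch of that symmetric two-node arrangement, reusing the hypothetical dotted syntax from earlier in the thread: each machine's standby.conf names the other machine, so neither file has to change when the roles flip.

    # standby.conf on A
    B.host = b.example.com
    B.sync = on

    # standby.conf on B
    A.host = a.example.com
    A.sync = on

What does move at failover is the recovery.conf (and, where needed, a fresh base backup), as described above.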
Simon Riggs <simon@2ndQuadrant.com> writes: > On Thu, 2010-09-23 at 11:43 -0400, Tom Lane wrote: >> Well, for one thing, how do you add a new slave? If its configuration >> comes from a system catalog, it seems that it has to already be >> replicating before it knows what its configuration is. > At the moment, I'm not aware of any proposed parameters that need to be > passed from master to standby, since that was one of the arguments for > standby registration in the first place. > If that did occur, when the standby connects it would get told what > parameters to use by the master as part of the handshake. It would have > to work exactly that way with standby.conf on the master also. Um ... so how does this standby know what master to connect to, what password to offer, etc? I don't think that "pass down parameters after connecting" is likely to cover anything but a small subset of the configuration problem. regards, tom lane
On Thu, Sep 23, 2010 at 12:52 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Simon Riggs <simon@2ndQuadrant.com> writes: >> On Thu, 2010-09-23 at 11:43 -0400, Tom Lane wrote: >>> Well, for one thing, how do you add a new slave? If its configuration >>> comes from a system catalog, it seems that it has to already be >>> replicating before it knows what its configuration is. > >> At the moment, I'm not aware of any proposed parameters that need to be >> passed from master to standby, since that was one of the arguments for >> standby registration in the first place. > >> If that did occur, when the standby connects it would get told what >> parameters to use by the master as part of the handshake. It would have >> to work exactly that way with standby.conf on the master also. > > Um ... so how does this standby know what master to connect to, what > password to offer, etc? I don't think that "pass down parameters after > connecting" is likely to cover anything but a small subset of the > configuration problem. Huh? We have that stuff already. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
Robert Haas <robertmhaas@gmail.com> writes: > Now, admittedly, in more complex topologies, and especially if you're > using configuration options that pertain to the behavior of > disconnected standbys (e.g. wait for them, or retain WAL for them), > you're going to need to adjust the configs. But I think that's likely > to be true anyway, even with a catalog. If A is doing sync rep and > waiting for B even when B is disconnected, and the machines switch > roles, it's hard to see how any configuration isn't going to need some > adjustment. One thing that's nice about the flat file system is that > you can make the configuration changes on the new master before you > promote it Actually, that's the killer argument in this whole thing. If the configuration information is in a system catalog, you can't change it without the master being up and running. Let us suppose for example that you've configured hard synchronous replication such that the master can't commit without slave acks. Now your slaves are down and you'd like to change that setting. Guess what. regards, tom lane
Robert Haas <robertmhaas@gmail.com> writes: > On Thu, Sep 23, 2010 at 12:52 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Um ... so how does this standby know what master to connect to, what >> password to offer, etc? I don't think that "pass down parameters after >> connecting" is likely to cover anything but a small subset of the >> configuration problem. > Huh? We have that stuff already. Oh, I thought part of the objective here was to try to centralize that stuff. If we're assuming that slaves will still have local replication configuration files, then I think we should just add any necessary info to those files and drop this entire conversation. We're expending a tremendous amount of energy on something that won't make any real difference to the overall complexity of configuring a replication setup. AFAICS the only way you make a significant advance in usability is if you can centralize all the configuration information in some fashion. regards, tom lane
On Thu, Sep 23, 2010 at 1:03 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> On Thu, Sep 23, 2010 at 12:52 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>> Um ... so how does this standby know what master to connect to, what >>> password to offer, etc? I don't think that "pass down parameters after >>> connecting" is likely to cover anything but a small subset of the >>> configuration problem. > >> Huh? We have that stuff already. > > Oh, I thought part of the objective here was to try to centralize that > stuff. If we're assuming that slaves will still have local replication > configuration files, then I think we should just add any necessary info > to those files and drop this entire conversation. We're expending a > tremendous amount of energy on something that won't make any real > difference to the overall complexity of configuring a replication setup. > AFAICS the only way you make a significant advance in usability is if > you can centralize all the configuration information in some fashion. Well, it's quite fanciful to suppose that the slaves aren't going to need to have local configuration for how to connect to the master. The configuration settings we're talking about here are the things that affect either the behavior of the master-slave system as a unit (like what kind of ACK the master needs to get from the slave before ACKing the commit back to the user) or the master alone (like tracking how much WAL needs to be retained for a particular disconnected slave, rather than as presently always retaining a fixed amount). -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
On Thu, 2010-09-23 at 16:18 +0300, Heikki Linnakangas wrote: > There's a program called pg_readahead somewhere on pgfoundry by NTT that > will help if it's the single-threadedness of I/O. Before handing the WAL > file to the server, it scans it through and calls posix_fadvise for all > the blocks that it touches. When the server then replays it, the data > blocks are already being fetched by the OS, using the whole RAID array. That sounds useful, thanks for the hint! But couldn't this also be built directly into the WAL recovery process? It would probably help a lot for recovering from a crash too. We recently had a crash and it took hours to recover. I will try it out as soon as I get the time to set it up... [searching pgfoundry] Unfortunately I can't find it, and google is also not very helpful. Do you happen to have some links to it? Cheers, Csaba.
On Thu, 2010-09-23 at 11:43 -0400, Tom Lane wrote: > > What other problems are there that mean we *must* have a file? > > Well, for one thing, how do you add a new slave? If its configuration > comes from a system catalog, it seems that it has to already be > replicating before it knows what its configuration is. Or the slave gets a connection string to the master, and reads the configuration from there - it has to connect there anyway... The ideal bootstrap for a slave creation would be: get the params to connect to the master + the replica ID, and the rest should be done by connecting to the master and getting all the needed thing from there, including configuration. Maybe you see some merit for this idea: it wouldn't hurt to get the interfaces done so that the master could be impersonated by some WAL repository serving a PITR snapshot, and that the same WAL repository could connect as a slave to the master and instead of recovering the WAL stream, archive it. Such a WAL repository would possibly connect to multiple masters and could also get regularly snapshots too. This would provide a nice complement to WAL replication as PITR solution using the same protocols as the WAL standby. I have no idea if this would be easy to implement or useful for anybody. Cheers, Csaba.
On 23/09/10 20:03, Tom Lane wrote: > Robert Haas<robertmhaas@gmail.com> writes: >> On Thu, Sep 23, 2010 at 12:52 PM, Tom Lane<tgl@sss.pgh.pa.us> wrote: >>> Um ... so how does this standby know what master to connect to, what >>> password to offer, etc? I don't think that "pass down parameters after >>> connecting" is likely to cover anything but a small subset of the >>> configuration problem. > >> Huh? We have that stuff already. > > Oh, I thought part of the objective here was to try to centralize that > stuff. If we're assuming that slaves will still have local replication > configuration files, then I think we should just add any necessary info > to those files and drop this entire conversation. We're expending a > tremendous amount of energy on something that won't make any real > difference to the overall complexity of configuring a replication setup. > AFAICS the only way you make a significant advance in usability is if > you can centralize all the configuration information in some fashion. If you want the behavior where the master doesn't acknowledge a commit to the client until the standby (or all standbys, or one of them etc.) acknowledges it, even if the standby is not currently connected, the master needs to know what standby servers exist. *That's* why synchronous replication needs a list of standby servers in the master. If you're willing to downgrade to a mode where commit waits for acknowledgment only from servers that are currently connected, then you don't need any new configuration files. But that's not what I call synchronous replication, it doesn't give you the guarantees that textbook synchronous replication does. (Gosh, I wish the terminology was more standardized in this area) -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Thu, 2010-09-23 at 13:07 -0400, Tom Lane wrote: > Robert Haas <robertmhaas@gmail.com> writes: > > Now, admittedly, in more complex topologies, and especially if you're > > using configuration options that pertain to the behavior of > > disconnected standbys (e.g. wait for them, or retain WAL for them), > > you're going to need to adjust the configs. But I think that's likely > > to be true anyway, even with a catalog. If A is doing sync rep and > > waiting for B even when B is disconnected, and the machines switch > > roles, it's hard to see how any configuration isn't going to need some > > adjustment. Well, its not at all hard to see how that could be configured, because I already proposed a simple way of implementing parameters that doesn't suffer from those problems. My proposal did not give roles to named standbys and is symmetrical, so switchovers won't cause a problem. Earlier you argued that centralizing parameters would make this nice and simple. Now you're pointing out that we aren't centralizing this at all, and it won't be simple. We'll have to have a standby.conf set up that is customised in advance for each standby that might become a master. Plus we may even need multiple standby.confs in case that we have multiple nodes down. This is exactly what I was seeking to avoid and exactly what I meant when I asked for an analysis of the failure modes. This proposal is a configuration nightmare, no question, and that is not the right way to go if you want high availability that works when you need it to. > One thing that's nice about the flat file system is that > > you can make the configuration changes on the new master before you > > promote it > > Actually, that's the killer argument in this whole thing. If the > configuration information is in a system catalog, you can't change it > without the master being up and running. Let us suppose for example > that you've configured hard synchronous replication such that the master > can't commit without slave acks. Now your slaves are down and you'd > like to change that setting. Guess what. If we have standby registration and I respect that some people want it, a table seems to be the best place for them. In a table the parameters are passed through from master to slave automatically without needing to synchronize multiple files manually. They can only be changed on a master, true. But since they only effect the behaviour of a master (commits => writes) then that doesn't matter at all. As soon as you promote a new master you'll be able to change them again, if required. Configuration options that differ on each node, depending upon the current state of others nodes are best avoided. -- Simon Riggs www.2ndQuadrant.comPostgreSQL Development, 24x7 Support, Training and Services
On Thu, Sep 23, 2010 at 3:46 PM, Simon Riggs <simon@2ndquadrant.com> wrote: > Well, its not at all hard to see how that could be configured, because I > already proposed a simple way of implementing parameters that doesn't > suffer from those problems. My proposal did not give roles to named > standbys and is symmetrical, so switchovers won't cause a problem. I know you proposed a way, but my angst is all around whether it was actually simple. I found it somewhat difficult to understand, so possibly other people might have the same problem. > Earlier you argued that centralizing parameters would make this nice and > simple. Now you're pointing out that we aren't centralizing this at all, > and it won't be simple. We'll have to have a standby.conf set up that is > customised in advance for each standby that might become a master. Plus > we may even need multiple standby.confs in case that we have multiple > nodes down. This is exactly what I was seeking to avoid and exactly what > I meant when I asked for an analysis of the failure modes. If you're operating on the notion that no reconfiguration will be necessary when nodes go down, then we have very different notions of what is realistic. I think that "copy the new standby.conf file in place" is going to be the least of the fine admin's problems. >> One thing that's nice about the flat file system is that >> > you can make the configuration changes on the new master before you >> > promote it >> >> Actually, that's the killer argument in this whole thing. If the >> configuration information is in a system catalog, you can't change it >> without the master being up and running. Let us suppose for example >> that you've configured hard synchronous replication such that the master >> can't commit without slave acks. Now your slaves are down and you'd >> like to change that setting. Guess what. > > If we have standby registration and I respect that some people want it, > a table seems to be the best place for them. In a table the parameters > are passed through from master to slave automatically without needing to > synchronize multiple files manually. > > They can only be changed on a master, true. But since they only effect > the behaviour of a master (commits => writes) then that doesn't matter > at all. As soon as you promote a new master you'll be able to change > them again, if required. Configuration options that differ on each node, > depending upon the current state of others nodes are best avoided. I think maybe you missed Tom's point, or else you just didn't respond to it. If the master is wedged because it is waiting for a standby, then you cannot commit transactions on the master. Therefore you cannot update the system catalog which you must update to unwedge it. Failing over in that situation is potentially a huge nuisance and extremely undesirable. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
On Wed, 2010-09-22 at 15:31 -0700, Josh Berkus wrote: > > The above case is one where I can see your point and it does sound > > easier in that case. But I then think: "What happens after failover?". > > We would then need to have 12 different standby.conf files, one on each > > standby that describes what the setup would look like if that standby > > became the master. And guess what, every time we made a change on the > > master, you'd need to re-edit all 12 standby.conf files to reflect the > > new configuration. So we're still back to having to edit in multiple > > places, ISTM. > > Unless we can make the standby.conf files identical on all servers in > the group. If we can do that, then conf file management utilities, > fileshares, or a simple automated rsync could easily take care of things. Would prefer table. > But ... any setup which involves each standby being *required* to have a > different configuration on each standby server, which has to be edited > separately, is going to be fatally difficult to manage for anyone who > has more than a couple of standbys. So I'd like to look at what it > takes to get away from that. Agreed. -- Simon Riggs www.2ndQuadrant.comPostgreSQL Development, 24x7 Support, Training and Services
On Thu, 2010-09-23 at 20:42 +0300, Heikki Linnakangas wrote: > If you want the behavior where the master doesn't acknowledge a > commit > to the client until the standby (or all standbys, or one of them > etc.) > acknowledges it, even if the standby is not currently connected, the > master needs to know what standby servers exist. *That's* why > synchronous replication needs a list of standby servers in the master. > > If you're willing to downgrade to a mode where commit waits for > acknowledgment only from servers that are currently connected, then > you don't need any new configuration files. As I keep pointing out, waiting for an acknowledgement from something that isn't there might just take a while. The only guarantee that provides is that you will wait a long time. Is my data more safe? No. To get zero data loss *and* continuous availability, you need two standbys offering sync rep and reply-to-first behaviour. You don't need standby registration to achieve that. > But that's not what I call synchronous replication, it doesn't give > you the guarantees that > textbook synchronous replication does. Which textbook? -- Simon Riggs www.2ndQuadrant.comPostgreSQL Development, 24x7 Support, Training and Services
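A sketch of the configuration Simon describes, again with made-up parameter names: two synchronous standbys where the first acknowledgement releases the commit, so a single standby failure costs neither durability nor availability.

    standby1.sync = on
    standby2.sync = on
    sync_wait_for = first    # commit returns as soon as one standby has acknowledged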
On 09/23/2010 10:09 PM, Robert Haas wrote: > I think maybe you missed Tom's point, or else you just didn't respond > to it. If the master is wedged because it is waiting for a standby, > then you cannot commit transactions on the master. Therefore you > cannot update the system catalog which you must update to unwedge it. > Failing over in that situation is potentially a huge nuisance and > extremely undesirable. Well, Simon is arguing that there's no need to wait for a disconnected standby. So that's not much of an issue. Regards Markus Wanner
Simon, On 09/24/2010 12:11 AM, Simon Riggs wrote: > As I keep pointing out, waiting for an acknowledgement from something > that isn't there might just take a while. The only guarantee that > provides is that you will wait a long time. Is my data more safe? No. By now I agree that waiting for disconnected standbys is useless in master-slave replication. However, it makes me wonder where you draw the line between just temporarily unresponsive and disconnected. > To get zero data loss *and* continuous availability, you need two > standbys offering sync rep and reply-to-first behaviour. You don't need > standby registration to achieve that. Well, if your master reaches the false conclusion that both standbys are disconnected and happily continues without their ACKs (and the idiot admin being happy about having boosted database performance with whatever measure he recently took) you certainly no longer have a zero data loss guarantee. So for one, this needs a big fat warning that gets slapped on the admin's forehead in case of a disconnect. And second, the timeout for considering a standby to be disconnected should be large enough not to produce false negatives. IIUC the master still waits for an ACK during that timeout. An infinite timeout doesn't have either of these issues, because there's no such distinction between temporarily unresponsive and disconnected. Regards Markus Wanner
On 24/09/10 01:11, Simon Riggs wrote: >> But that's not what I call synchronous replication, it doesn't give >> you the guarantees that >> textbook synchronous replication does. > > Which textbook? I was using that word metaphorically, but for example: Wikipedia http://en.wikipedia.org/wiki/Replication_%28computer_science%29 (includes a caveat that many commercial systems skimp on it) Oracle docs http://download.oracle.com/docs/cd/B10500_01/server.920/a96567/repoverview.htm Scroll to "Synchronous Replication" Googling for "synchronous replication textbook" also turns up this actual textbook: Database Management Systems by R. Ramakrishnan & others which uses synchronous replication with this meaning, although in the context of multi-master replication. Interestingly, "Transaction Processing: Concepts and techniques" by Gray, Reuter, chapter 12.6.3, defines three levels: 1-safe - what we call asynchronous 2-safe - commit is acknowledged after the slave acknowledges it, but if the slave is down, fall back to asynchronous mode. 3-safe - commit is acknowledged only after slave acknowledges it. If it is down, refuse to commit In the context of multi-master replication, "eager replication" seems to be commonly used to mean synchronous replication. If we just want *something* that's useful, and want to avoid the hassle of registration and all that, I proposed a while back (http://archives.postgresql.org/message-id/4C7E29BC.3020902@enterprisedb.com) that we could aim for behavior that would be useful for distributing read-only load to slaves. The use case is specifically that you have one master and one or more hot standby servers. You also have something like pgpool that distributes all read-only queries across all the nodes, and routes updates to the master server. In this scenario, you want the master node not to acknowledge a commit to the client until all currently connected standby servers have replayed the commit. Furthermore, you want a standby server to stop accepting queries if it loses connection to the master, to avoid giving out-of-date responses. With suitable timeouts in the master and the standby, it seems possible to guarantee that you can connect to any node in the system and get an up-to-date result. It does not give zero data loss like synchronous replication does, but it keeps hot standby servers trustworthy for queries. It bothers me that no-one seems to have a clear use case in mind. People want "synchronous replication", but don't seem to care much what guarantees it should provide. I wish the terminology was better standardized in this area. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
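As a point of reference for that "trustworthy for queries" use case: the WAL positions involved can already be inspected by hand with the 9.0 functions, which is essentially what such a mode would automate. A minimal check, assuming one master and one standby:

    -- on the master: where WAL generation currently stands
    SELECT pg_current_xlog_location();

    -- on a standby: how much WAL it has received and how much it has replayed
    SELECT pg_last_xlog_receive_location(), pg_last_xlog_replay_location();

A pooler that only routes read-only queries to standbys whose replay location has reached the commit's location would get much the same effect, just without any help from the server.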
On 24/09/10 01:11, Simon Riggs wrote: > On Thu, 2010-09-23 at 20:42 +0300, Heikki Linnakangas wrote: >> If you want the behavior where the master doesn't acknowledge a >> commit >> to the client until the standby (or all standbys, or one of them >> etc.) >> acknowledges it, even if the standby is not currently connected, the >> master needs to know what standby servers exist. *That's* why >> synchronous replication needs a list of standby servers in the master. >> >> If you're willing to downgrade to a mode where commit waits for >> acknowledgment only from servers that are currently connected, then >> you don't need any new configuration files. > > As I keep pointing out, waiting for an acknowledgement from something > that isn't there might just take a while. The only guarantee that > provides is that you will wait a long time. Is my data more safe? No. It provides zero data loss, at the expense of availability. That's what synchronous replication is all about. > To get zero data loss *and* continuous availability, you need two > standbys offering sync rep and reply-to-first behaviour. Yes, that is a good point. I'm starting to understand what your proposal was all about. It makes sense when you think of a three node system configured for high availability with zero data loss like that. The use case of keeping hot standby servers up to date in a cluster where read-only queries are distributed across all nodes seems equally important though. What's the simplest method of configuration that supports both use cases? > You don't need standby registration to achieve that. Not necessarily I guess, but it creeps me out that a standby can just connect to the master and act as a synchronous slave, and there are no controls in the master on what standby servers there are. More complicated scenarios with quorums and different numbers of votes get increasingly hard to manage if there is no central place to configure them. But maybe we can ignore the more complicated setups for now. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Thu, 2010-09-23 at 14:26 +0200, Csaba Nagy wrote: > Unfortunately it was quite long time ago we last tried, and I don't > remember exactly what was bottlenecked. Our application is quite > write-intensive, the ratio of writes to reads which actually reaches > the disk is about 50-200% (according to the disk stats - yes, > sometimes we write more to the disk than we read, probably due to the > relatively large RAM installed). If I remember correctly, the standby > was about the same regarding IO/CPU power as the master, but it was > not able to process the WAL files as fast as they were coming in, > which excludes at least the network as a bottleneck. What I actually > suppose happens is that the one single process applying the WAL on the > slave is not able to match the full IO the master is able to do with > all its processors. > > If you're interested, I could try to set up another try, but it would > be on 8.3.7 (that's what we still run). On 9.x would be also > interesting... Substantial performance improvements came in 8.4 with the bgwriter running during recovery. That meant that the startup process didn't need to spend time doing restartpoints and could apply changes continuously. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Training and Services
On Thu, 2010-09-23 at 16:09 -0400, Robert Haas wrote: > On Thu, Sep 23, 2010 at 3:46 PM, Simon Riggs <simon@2ndquadrant.com> wrote: > > Well, it's not at all hard to see how that could be configured, because I > > already proposed a simple way of implementing parameters that doesn't > > suffer from those problems. My proposal did not give roles to named > > standbys and is symmetrical, so switchovers won't cause a problem. > > I know you proposed a way, but my angst is all around whether it was > actually simple. I found it somewhat difficult to understand, so > possibly other people might have the same problem. Let's go back to Josh's 12 server example. This current proposal requires 12 separate and different configuration files each containing many parameters that require manual maintenance. I doubt that people looking at that objectively will decide that is the best approach. We need to arrange a clear way for people to decide for themselves. I'll work on that. > > Earlier you argued that centralizing parameters would make this nice and > > simple. Now you're pointing out that we aren't centralizing this at all, > > and it won't be simple. We'll have to have a standby.conf set up that is > > customised in advance for each standby that might become a master. Plus > > we may even need multiple standby.confs in case that we have multiple > > nodes down. This is exactly what I was seeking to avoid and exactly what > > I meant when I asked for an analysis of the failure modes. > > If you're operating on the notion that no reconfiguration will be > necessary when nodes go down, then we have very different notions of > what is realistic. I think that "copy the new standby.conf file in > place" is going to be the least of the fine admin's problems. Earlier you argued that setting parameters on each standby was difficult and we should centralize things on the master. Now you tell us that actually we do need lots of settings on each standby and that to think otherwise is not realistic. That's a contradiction. The chain of argument used to support this as being a sensible design choice is broken or contradictory in more than one place. I think we should be looking for a design using the KISS principle, while retaining sensible tuning options. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Training and Services
Tom Lane <tgl@sss.pgh.pa.us> writes: > Oh, I thought part of the objective here was to try to centralize that > stuff. If we're assuming that slaves will still have local replication > configuration files, then I think we should just add any necessary info > to those files and drop this entire conversation. We're expending a > tremendous amount of energy on something that won't make any real > difference to the overall complexity of configuring a replication setup. > AFAICS the only way you make a significant advance in usability is if > you can centralize all the configuration information in some fashion. +1, but for real usability you have to make it so that this central setup can be edited from any member of the replication group. HINT: plproxy. Regards, -- dim
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes: > If you want the behavior where the master doesn't acknowledge a commit to > the client until the standby (or all standbys, or one of them etc.) > acknowledges it, even if the standby is not currently connected, the master > needs to know what standby servers exist. *That's* why synchronous > replication needs a list of standby servers in the master. And this list can be maintained in a semi-automatic fashion: - adding to the list is done by the master as soon as a standby connects; maybe we need to add a notion of "fqdn" in the standby setup? - service level and current weight and any other knob that comes from the standby are changed on the fly by the master if that changes on the standby (default async, 1, but SIGHUP please) - the current standby position (LSN for recv, fsync and replayed) of the standby, as received in the "feedback loop", is changed on the fly by the master - removing a standby has to be done manually, using an admin function; that's the only way to sort out permanent vs transient unavailability - checking the current values in this list is done on the master by using some system view based on a SRF, as already said Regards, -- dim
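As a sketch of how that semi-automatic list could surface to the DBA, it might look something like the following; the view name, its columns, and the admin function are all invented for illustration and do not exist today:

    -- hypothetical system view, one row per known standby
    SELECT standby_name, fqdn, service_level, weight,
           received_lsn, fsynced_lsn, replayed_lsn, connected
      FROM pg_standby_list;

    -- removal stays a deliberate, manual act
    SELECT pg_drop_standby('reporting-standby-2');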
On Fri, 2010-09-24 at 11:08 +0300, Heikki Linnakangas wrote: > On 24/09/10 01:11, Simon Riggs wrote: > >> But that's not what I call synchronous replication, it doesn't give > >> you the guarantees that > >> textbook synchronous replication does. > > > > Which textbook? > > I was using that word metaphorically, but for example: > > Wikipedia > http://en.wikipedia.org/wiki/Replication_%28computer_science%29 > (includes a caveat that many commercial systems skimp on it) Yes, I read that. The example it uses shows only one standby, which does suffer from the problem/caveat it describes. Two standbys resolve that problem, yet there is no mention of multiple standbys in Wikipedia. > Oracle docs > > http://download.oracle.com/docs/cd/B10500_01/server.920/a96567/repoverview.htm > Scroll to "Synchronous Replication" That document refers to sync rep *only* in the context of multimaster replication. We aren't discussing that here and so that link is not relevant at all. Oracle Data Guard in Maximum Availability mode is roughly where I think we should be aiming: http://download.oracle.com/docs/cd/B10500_01/server.920/a96653/concepts.htm#1033871 But I disagree with consulting other companies' copyrighted material, and I definitely don't like their overcomplicated configuration. And they have not yet thought of per-transaction controls. So I believe we should learn many lessons from them, but actually ignore and surpass them. Easily. > Googling for "synchronous replication textbook" also turns up this > actual textbook: > Database Management Systems by R. Ramakrishnan & others > which uses synchronous replication with this meaning, although in the > context of multi-master replication. > > Interestingly, "Transaction Processing: Concepts and techniques" by > Gray, Reuter, chapter 12.6.3, defines three levels: > > 1-safe - what we call asynchronous > 2-safe - commit is acknowledged after the slave acknowledges it, but if > the slave is down, fall back to asynchronous mode. > 3-safe - commit is acknowledged only after slave acknowledges it. If it > is down, refuse to commit Which again is a one-standby viewpoint on the problem. Wikipedia is right that there is a problem when using just one server. "3-safe" mode is not more safe than "2-safe" mode when you have 2 standbys. If you want high availability you need N+1 redundancy. If you want a standby server, that is N=1. If you want a highly available standby configuration then N+1 = 2. Show me the textbook that describes what happens with 2 standbys. If one exists, I'm certain it would agree with my analysis. (I'll read and comment on your other points later today.) -- Simon Riggs www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Training and Services
On Fri, 2010-09-24 at 11:43 +0300, Heikki Linnakangas wrote: > > To get zero data loss *and* continuous availability, you need two > > standbys offering sync rep and reply-to-first behaviour. > > Yes, that is a good point. > > I'm starting to understand what your proposal was all about. It makes > sense when you think of a three node system configured for high > availability with zero data loss like that. > > The use case of keeping hot standby servers up to date in a cluster > where > read-only queries are distributed across all nodes seems equally > important though. What's the simplest method of configuration that > supports both use cases? That is definitely the right question. (More later) -- Simon Riggs www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Training and Services
Robert Haas <robertmhaas@gmail.com> writes: > I think maybe you missed Tom's point, or else you just didn't respond > to it. If the master is wedged because it is waiting for a standby, > then you cannot commit transactions on the master. Therefore you > cannot update the system catalog which you must update to unwedge it. > Failing over in that situation is potentially a huge nuisance and > extremely undesirable. All Wrong. You might remember that Simon's proposal begins with per-transaction synchronous replication behavior? Regards, -- dim
On 24/09/10 13:57, Simon Riggs wrote: > If you want high availability you need N+1 redundancy. If you want a > standby server that is N=1. If you want a highly available standby > configuration then N+1 = 2. Yep. Synchronous replication with one standby gives you zero data loss. When you add a 2nd standby as you described, then you have a reasonable level of high availability as well, as you can continue processing transactions in the master even if one slave dies. > Show me the textbook that describes what happens with 2 standbys. If one > exists, I'm certain it would agree with my analysis. I don't disagree with your analysis about multiple standbys and high availability. What I'm saying is that in a two standby situation, if you're willing to continue operation as usual in the master even if the standby is down, you're not doing synchronous replication. Extending that to a two standby situation, my claim is that if you're willing to continue operation as usual in the master when both standbys are down, you're not doing synchronous replication. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Fri, 2010-09-24 at 14:12 +0300, Heikki Linnakangas wrote: > What I'm saying is that in a two standby situation, if > you're willing to continue operation as usual in the master even if > the standby is down, you're not doing synchronous replication. Oracle and I disagree with you on that point, but I am more interested in behaviour than semantics. If you have two standbys and one is down, please explain how data loss has occurred. > Extending that to a two standby situation, my claim is that if you're > willing to continue operation as usual in the master when both > standbys are down, you're not doing synchronous replication. Agreed. But you still need to decide how you will act. I choose pragmatism in that case. Others have voiced that they would like the database to shut down or have all sessions hang. I personally doubt their employers would feel the same way. Arguing technical correctness would seem unlikely to allow a DBA to keep their job if they stood and watched the app become unavailable. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Training and Services
On Fri, Sep 24, 2010 at 6:37 AM, Simon Riggs <simon@2ndquadrant.com> wrote: >> > Earlier you argued that centralizing parameters would make this nice and >> > simple. Now you're pointing out that we aren't centralizing this at all, >> > and it won't be simple. We'll have to have a standby.conf set up that is >> > customised in advance for each standby that might become a master. Plus >> > we may even need multiple standby.confs in case that we have multiple >> > nodes down. This is exactly what I was seeking to avoid and exactly what >> > I meant when I asked for an analysis of the failure modes. >> >> If you're operating on the notion that no reconfiguration will be >> necessary when nodes go down, then we have very different notions of >> what is realistic. I think that "copy the new standby.conf file in >> place" is going to be the least of the fine admin's problems. > > Earlier you argued that setting parameters on each standby was difficult > and we should centralize things on the master. Now you tell us that > actually we do need lots of settings on each standby and that to think > otherwise is not realistic. That's a contradiction. You've repeatedly accused me and others of contradicting ourselves. I don't think that's helpful in advancing the debate, and I don't think it's what I'm doing. The point I'm trying to make is that when failover happens, lots of reconfiguration is going to be needed. There is just no getting around that. Let's ignore synchronous replication entirely for a moment. You're running 9.0 and you have 10 slaves. The master dies. You promote a slave. Guess what? You need to look at each slave you didn't promote and adjust primary_conninfo. You also need to check whether the slave has received an xlog record with a higher LSN than the one you promoted. If it has, you need to take a new base backup. Otherwise, you may have data corruption - very possibly silent data corruption. Do you dispute this? If so, on which point? The reason I think that we should centralize parameters on the master is because they affect *the behavior of the master*. Controlling whether the master will wait for the slave on the slave strikes me (and others) as spooky action at a distance. Configuring whether the master will retain WAL for a disconnected slave on the slave is outright byzantine. Of course, configuring these parameters on the master means that when the master changes, you're going to need a configuration (possibly the same, possibly different) for said parameters on the new master. But since you may be doing a lot of other adjustment at that point anyway (e.g. new base backups, changes in the set of synchronous slaves) that doesn't seem like a big deal. > The chain of argument used to support this as being a sensible design choice is broken or contradictory in more than one > place. I think we should be looking for a design using the KISS principle, while retaining sensible tuning options. The KISS principle is exactly what I am attempting to apply. Configuring parameters that affect the master on some machine other than the master isn't KISS, to me. You may find that broken or contradictory, but I disagree. I am attempting to disagree respectfully, but statements like the above make me feel like you're flaming, and that's getting under my skin. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
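To put the failover chore in concrete terms, the per-standby adjustment being described looks roughly like this on 9.0; the host names and paths are placeholders, but the parameters and the function are the existing ones:

    # recovery.conf on each standby that was NOT promoted:
    # repoint it at the new master and restart the standby
    standby_mode     = 'on'
    primary_conninfo = 'host=new-master.example.com port=5432 user=replication'
    trigger_file     = '/var/lib/pgsql/9.0/failover.trigger'

    -- and, per standby, check how far it had received WAL before the switch:
    SELECT pg_last_xlog_receive_location();

If that location is ahead of the point at which the promoted node took over, that standby needs a fresh base backup, which is exactly the silent-corruption risk described above.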
On Fri, Sep 24, 2010 at 7:47 AM, Simon Riggs <simon@2ndquadrant.com> wrote: > On Fri, 2010-09-24 at 14:12 +0300, Heikki Linnakangas wrote: >> What I'm saying is that in a two standby situation, if >> you're willing to continue operation as usual in the master even if >> the standby is down, you're not doing synchronous replication. > > Oracle and I disagree with you on that point, but I am more interested > in behaviour than semantics. I *think* he meant s/two standby/two server/. That's taken from the two references: *the* master, *the* slave. In that case, if the master is committing w/ no slave connected, it *isn't* replication, synchronous or not. Useful, likely, but not replication at that PIT. > If you have two standbys and one is down, please explain how data loss > has occurred. Right, of course. But I was thinking he meant 2 servers (1 standby), not 3 servers (2 standbys). But even with only 2 servers, if the standby is down and the master is up, there isn't data loss. There's *potential* for data loss. > But you still need to decide how you will act. I choose pragmatism in > that case. > > Others have voiced that they would like the database to shut down or have > all sessions hang. I personally doubt their employers would feel the > same way. Arguing technical correctness would seem unlikely to allow a > DBA to keep their job if they stood and watched the app become > unavailable. Again, it all depends on the business. Synchronous replication can give you two things: 1) High Availability (Just answer my queries, dammit!) 2) High Durability (Don't give me an answer unless you're damn well sure it's the right one) and its goal is to do that in the face of "catastrophic failure" (for some level of catastrophic). It's the trade-off between: 1) The cost of delaying/refusing transactions being greater than the potential cost of a lost transaction 2) The cost of a lost transaction being greater than the cost of delaying/refusing transactions So there are people who want to use PostgreSQL in a situation where they'd much rather not "say" they have done something unless they are sure it's safely written in 2 different systems, in 2 different locations (and yes, the distance between those two locations will be a trade-off wrt performance, and the business will need to decide on their risk levels). I understand it's not optimal, desirable, or even practical for the vast majority of cases. I don't want it to be impossible, or, if it's decided that it will be impossible, hopefully not just because you decided nobody ever needs it, but because it's not feasible due to code/implementation complexities ;-)
On 24/09/10 14:47, Simon Riggs wrote: > On Fri, 2010-09-24 at 14:12 +0300, Heikki Linnakangas wrote: >> What I'm saying is that in a two standby situation, if >> you're willing to continue operation as usual in the master even if >> the standby is down, you're not doing synchronous replication. > > Oracle and I disagree with you on that point, but I am more interested > in behaviour than semantics. > > If you have two standbys and one is down, please explain how data loss > has occurred. Sorry, that was a typo. As Aidan guessed, I meant "even in a two server situation", ie. one master and one slave. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Hi, I'm defending my ideas here so that they don't get put in the bag you're wanting to put away. We have more than 2 proposals lying around here. I'm one of the guys with a proposal and no code, but still trying to be clear. Robert Haas <robertmhaas@gmail.com> writes: > The reason I think that we should centralize parameters on the master > is because they affect *the behavior of the master*. Controlling > whether the master will wait for the slave on the slave strikes me > (and others) as spooky action at a distance. I hope it's clear that I didn't propose anything like this in the related threads. What you set up on the slave is related only to what the slave has to offer to the master. What happens on the master wrt waiting etc. is set up on the master, and is controlled per-transaction. As my ideas come in good part from understanding Simon's work and proposal, my feeling is that stating them here will help the thread. > Configuring whether the > master will retain WAL for a disconnected slave on the slave is > outright byzantine. Again, I can't remember having proposed such a thing. > Of course, configuring these parameters on the > master means that when the master changes, you're going to need a > configuration (possibly the same, possibly different) for said > parameters on the new master. But since you may be doing a lot of > other adjustment at that point anyway (e.g. new base backups, changes > in the set of synchronous slaves) that doesn't seem like a big deal. Should we take some time and define the behaviors we expect in the cluster, and the ones we want to provide for each error case we can think of, we'd be able to define the set of parameters that we need to operate the system. Then, some of us are betting that it will be possible to accommodate either a single central setup that you edit in only one place at failover time, *or* that the best way to manage the setup is to have it distributed. Granted, given how it currently works, it looks like you will have to edit the primary_conninfo on a bunch of standbys at failover time, for example. I'd like that we now follow Josh Berkus (and some other) advice now, and start a new thread to decide what we mean by synchronous replication, what kind of normal behaviour we want and what responses to errors we expect to be able to deal with in what (optional) ways. The longer we stay on this thread, the clearer it becomes that no two of us are talking about the same synchronous replication feature set. Regards, -- dim
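For what it's worth, "controlled per-transaction" means roughly the following from the application's side. The parameter name and values below are placeholders invented for illustration; the actual proposal may spell this quite differently:

    -- hypothetical per-transaction control, for illustration only
    BEGIN;
    SET LOCAL synchronous_replication = on;   -- this commit waits for a standby ack
    UPDATE accounts SET balance = balance - 100 WHERE id = 42;
    COMMIT;

    BEGIN;
    SET LOCAL synchronous_replication = off;  -- bulk work that is happy to be async
    COPY audit_log FROM '/tmp/batch.csv';
    COMMIT;

The point is that the durability level is chosen by the transaction that needs it, while the slave-side settings only describe what each standby has to offer.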
On Fri, 2010-09-24 at 16:01 +0200, Dimitri Fontaine wrote: > I'd like that we now follow Josh Berkus (and some other) advice now, and > start a new thread to decide what we mean by synchronous replication, > what kind of normal behaviour we want and what responses to errors we > expect to be able to deal with in what (optional) ways. What I intend to do from here is make a list of all desired use cases, then ask for people to propose ways of configuring those. Hopefully we don't need to discuss the meaning of the phrase "sync rep", we just need to look at the use cases. That way we will be able to directly compare the flexibility/complexity/benefits of configuration between different proposals. I think this will allow us to rapidly converge on something useful. If multiple solutions exist, we may then be able to decide/vote on a prioritisation of use cases to help resolve any difficulty. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Training and Services
On 24/09/10 17:13, Simon Riggs wrote: > On Fri, 2010-09-24 at 16:01 +0200, Dimitri Fontaine wrote: > >> I'd like that we now follow Josh Berkus (and some other) advice now, and >> start a new thread to decide what we mean by synchronous replication, >> what kind of normal behaviour we want and what responses to errors we >> expect to be able to deal with in what (optional) ways. > > What I intend to do from here is make a list of all desired use cases, > then ask for people to propose ways of configuring those. Hopefully we > don't need to discuss the meaning of the phrase "sync rep", we just need > to look at the use cases. Yes, that seems like a good way forward. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Fri, Sep 24, 2010 at 10:01 AM, Dimitri Fontaine <dfontaine@hi-media.com> wrote: >> Configuring whether the >> master will retain WAL for a disconnected slave on the slave is >> outright byzantine. > > Again, I can't remember having proposed such a thing. No one has, but I keep hearing we don't need the master to have a list of standbys and a list of properties for each standby... -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company