Re: Standalone synchronous master - Mailing list pgsql-hackers

From Alexander Björnhagen
Subject Re: Standalone synchronous master
Msg-id CAO-C5==1qyrt3gD3+h1ehoo9UMx8ZTTMcdOyULAfAB3=WafGLQ@mail.gmail.com
In response to Re: Standalone synchronous master  (Magnus Hagander <magnus@hagander.net>)
List pgsql-hackers

Hmm,

I suppose this conversation would lend itself better to a whiteboard,
or maybe to a chat over a few beers, than to e-mail ...

>>>>> Basically I like this whole idea, but I'd like to know why do you think this functionality is required?

>>>> How should a synchronous master handle the situation where all
>>>> standbys have failed ?
>>>>
>>>> Well, I think this is one of those cases where you could argue either
>>>> way. Someone caring more about high availability of the system will
>>>> want to let the master continue and just raise an alert to the
>>>> operators. Someone looking for an absolute guarantee of data
>>>> replication will say otherwise.

>>>If you don't care about the absolute guarantee of data, why not just
>>>use async replication? It's still going to replicate the data over to
>>>the client as quickly as it can - which in the end is the same level
>>>of guarantee that you get with this switch set, isn't it?

>> This setup does still guarantee that if the master fails, then you can
>> still fail over to the standby without any possible data loss because
>> all data is synchronously replicated.

>Only if you didn't have a network hitch, or if your slave was down.

>Which basically means it doesn't *guarantee* it.

True. In my two-node system, I’m willing to take that risk when my
only standby has failed.

The most likely outcome, compared to any other scenario, is that we
regain redundancy before another failure occurs.

Say each of your nodes fails once a year. Most people have a much
better track record than that with their production
machines/network/etc., but take it just as an example. Then on any
given day there is a 0.27% chance that a given node will fail
(1/365 ≈ 0.27%), right?

Then the probability of both failing on the same day is (0.27%)^2 ≈
0.00075%, or about 1 in 133,000. And since it would take only a few
hours at most to restore redundancy, the real exposure is even smaller
than that, because you would not be exposed for the entire day.
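
Spelled out (the once-a-year failure rate is of course just an assumed
figure for the sake of the example, not a measured one):

   p(a given node fails on a given day)  = 1/365       ≈ 0.27 %
   p(both nodes fail on the same day)    = (1/365)^2   = 1/133,225  ≈ 0.00075 %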

So, to be a bit blunt about it, and I hope I don't come off as rude
here, this dual-failure or creeping-doom type of scenario on a
two-node system is less a practical concern than an academic question.

>> I want to replicate data with synchronous guarantee to a disaster site
>> *when possible*. If there is any chance that commits can be
>> replicated, then I’d like to wait for that.

>There's always a chance, it's just about how long you're willing to wait ;)

Yes, exactly. When I can estimate it I’m willing to wait.

>Another thought could be to have something like a "sync_wait_timeout",
>saying "I'm willing to wait <n> seconds for the syncrep to be caught
>up. If nobody is caught up within that time, then I can back down to
>async mode/"standalone" mode". That way, data availability wouldn't
>be affected by short-time network glitches.

This was also mentioned in the previous thread I linked to,
“replication_timeout“ :

http://archives.postgresql.org/pgsql-hackers/2010-10/msg01009.php

In an HA environment you have redundant networking and bonded
interfaces on each node. The only “glitch” would really be a switch
failing over, and that’s a pretty big “if” right there.

>> If however the disaster node/site/link just plain fails and
>> replication goes down for an *indefinite* amount of time, then I want
>> the primary node to continue operating, raise an alert and deal with
>> that. Rather than have the whole system grind to a halt just because a
>> standby node failed.

>If the standby node failed and can be determined to actually be failed
>(by say a cluster manager), you can always have your cluster software
>(or DBA, of course) turn it off by editing the config setting and
>reloading. Doing it that way you can actually *verify* that the site
>is gone for an indefinite amount of time.

The system might as well do this for me when the standby gets
disconnected instead of halting the master.
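
For reference, the manual version of that already looks roughly like
this on a stock master (synchronous_standby_names is reloadable, so no
restart is needed):

   # postgresql.conf on the master
   synchronous_standby_names = ''   # drop the sync requirement; replication continues async

   $ pg_ctl reload                  # or: SELECT pg_reload_conf();

What I am proposing is essentially that the master makes this same
decision automatically when its last synchronous standby disconnects.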

>> If we were just talking about network glitches then I would be fine
>> with the current behavior because I do not believe they are
>> long-lasting anyway and they are also *quantifiable* which is a huge
>> bonus.

>But the proposed switch doesn't actually make it possible to
>differentiate between these "non-long-lasting" issues and long-lasting
>ones, does it? We might want an interface that actually does...

“replication_timeout”, where the primary disconnects the WAL sender
after a timeout, together with “synchronous_standalone_master”, which
tells the primary it can continue anyway when that happens, allows
exactly that (see the sketch below). This would then be the first step
towards such an interface, but I wanted to start out small, and I
personally think it is sufficient to draw the line at TCP disconnect
of the standby.
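
As a rough sketch of that combination (the GUC names are taken from
this patch and from the linked thread; the exact semantics, units and
defaults here are assumptions on my part, not the final interface):

   # postgresql.conf on the master
   synchronous_standby_names     = 'stby1'  # standby eligible for synchronous commit
   replication_timeout           = '30s'    # give up on an unresponsive WAL sender after 30s
   synchronous_standalone_master = on       # then keep committing instead of blocking

With both in place, a short network glitch delays commits by at most
the timeout, while a standby that is really gone degrades the master
to standalone mode (and lets your monitoring raise an alert) instead
of halting it.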

>>>>> When is the replication mode switched from "standalone" to "sync"?
>>>>
>>>> Good question. Currently that happens when a standby server has
>>>> connected and also been deemed suitable for synchronous commit by the
>>>> master ( meaning that its name matches the config variable
>>>> synchronous_standby_names ). So in a setup with both synchronous and
>>>> asynchronous standbys, the master only considers the synchronous ones
>>>> when deciding on standalone mode. The asynchronous standbys are
>>>> “useless” to a synchronous master anyway.
>
>>>But wouldn't an async standby still be a lot better than no standby at
>>>all (standalone)?
>
>> As soon as the standby comes back online, I want to wait for it to sync.

>I guess I just find this very inconsistent. You're willing to wait,
>but only sometimes. You're not willing to wait when it goes down, but
>you are willing to wait when it comes back. I don't see why this
>should be different, and I don't see how you can reliably
>differentiate between these two.

When the wait is quantifiable, I want to wait (like a connected
standby that is in the process of catching up). When it is not (like
when the remote node disappeared and the master has no way of knowing
for how long), I do not want to wait.

In both cases I want to send off alerts, get people involved and fix
the underlying problem; this is not something that should happen
often.

>>>>> The former might block the transactions for a long time until the standby has caught up with the master even though
>>>>> synchronous_standalone_master is enabled and a user wants to avoid such a downtime.
>>
>>>> If we are talking about a network “glitch”, then the standby would take
>>>> a few seconds/minutes to catch up (not hours!), which is acceptable if
>>>> you ask me.
>
>>>So it's not Ok to block the master when the standby goes away, but it
>>>is ok to block it when it comes back and catches up? The "goes away"
>>>might be the same amount of time - or even shorter, depending on
>>>exactly how the network works...
>>
>> To be honest I don’t have a very strong opinion here; we could go
>> either way. I just wanted to keep this patch as small as possible to
>> begin with. But again, network glitches aren’t my primary concern in
>> an HA system, because the amount of data that the standby lags behind
>> can be estimated and planned for.
>>
>> Typically switch convergence takes on the order of 15-30 seconds, and I
>> can thus typically assume that the restarted standby can recover that
>> gap in less than a minute. So once in a blue moon, when something
>> like that happens, commits would take up to, say, 1 minute longer. No
>> big deal IMHO.

>What about the slave rebooting, for example? That'll usually be pretty
>quick too - so you'd be ok waiting for that. But your patch doesn't
>let you wait for that - it will switch to standalone mode right away?
>But if it takes 30 seconds to reboot, and then 30 seconds to catch up,
>you are *not* willing to wait for the first 30 seconds, but you *are*
>willing to wait for the second? Just seems strange to me, I guess...

That’s exactly right. While the standby is booting, the master has no
way of knowing what is going on with that standby so then I don’t want
to wait.

When the standby has managed to boot, connect and start syncing up the
data it was lagging behind on, then I do want to wait, because I know
that it will not take too long before it has caught up.

>>>>> 1. While synchronous replication is running normally, replication
>>>>> connection is closed because of
>>>>>    network outage.
>>>>> 2. The master works standalone because of
>>>>> synchronous_standalone_master=on and some
>>>>>    new transactions are committed though their WAL records are not
>>>>> replicated to the standby.
>>>>> 3. The master crashes for some reasons, the clusterware detects it and
>>>>> triggers a failover.
>>>>> 4. The standby which doesn't have recent committed transactions
>>>>> becomes the master at a failover...

>>>>> Is this scenario acceptable?

>>>> So you have two separate failures in less time than an admin would
>>>> have time to react and manually bring up a new standby.
>
>>>Given that one is a network failure, and one is a node failure, I
>>>don't see that being strange at all. For example, an HA network
>>>environment might cause a short glitch when it's failing over to a
>>>redundant node - enough to bring down the replication connection and
>>>require it to reconnect (during which the master would be ahead of the
>>>slave).
>>>
>>>In fact, both might well be network failures - one just making the
>>>master completely inaccessible, and thus triggering the need for a
>>>failover.
>>
>> You still have two failures on a two-node system.

>Yes - but only one (or zero) of them is actually to any of the nodes :-)

It doesn’t matter from the viewpoint of our primary and standby
servers. If the link to the standby fails so that it is unreachable
from the master, then the master may consider that node as failed. It
does not matter that the component which failed was not part of that
physical machine; it still rendered the standby useless, because it is
no longer reachable.

So in the previous example, where one network link fails and then one
node fails, I see that as two separate failures. If it is possible to
take out both the primary and standby servers with only one component
failing (shared network/power/etc.), then the system is not designed
right, because there is a single point of failure, and no software in
the world will ever save you from that.

That’s why I tried to limit the discussion to the simple use case
where either the standby or the primary node fails. If both fail, then
all bets are off; you’re going to have a very bad day at the office
anyway.

> If we are talking about a setup with only two nodes (which I am), then
> I think it’s fair to limit the discussion to one failure (whatever
> that might be! node,switch,disk,site,intra-site link, power, etc ...).
>
> And in that case, there are only really three likely scenarios:
> 1)      The master fails
> 2)      The standby fails
> 3)      Both fail (due to shared network gear, power, etc)
>
> Yes, there might be a need to fail over, and yes, the standby could
> possibly have lagged behind the master, but with my sync+standalone
> mode, you reduce the risk of that compared to just async mode.
>
> So decrease the risk of data loss (case 1), increase system
> availability/uptime (case 2).
>
> That is actually a pretty good description of my goal here :)
>
> Cheers,
>
> /A

