On Sun, 2011-01-02 at 08:08 -0600, Kevin Grittner wrote:
> I think you're talking about different metrics, and you're both
> right. With two servers configured in sync rep your chance of having
> an available (running) server is 99.9992%. The chance that you know
> that you have one that is totally up to date, with no lost
> transactions is 99.9208%. The chance that you *actually* have
> up-to-date data would be higher, but you'd have no way to be sure.
> The 99.96% number is your certainty that you have a running server
> with up-to-date data if only one machine is sync rep.
>
> It's a matter of whether your shop needs five nines of availability
> or the highest probability of not losing data. You get to choose.

Thanks for those calculations.

Do you agree that requiring response from 2 sync standbys, or locking
up, gives us 94% server availability, but 99.9992% data durability? And
that adding additional async servers would not increase the server
availability of that cluster?
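
For anyone who wants to redo the arithmetic, here is a rough sketch in Python. It assumes server failures are independent and uses a hypothetical per-server availability `p`, chosen only for illustration; the per-server figure behind the percentages quoted above is not restated in this message. The function names are mine.

```python
def all_up(p, n):
    """Probability that all n independent servers are up at the same time."""
    return p ** n


def any_up(p, n):
    """Probability that at least one of n independent servers is up."""
    return 1.0 - (1.0 - p) ** n


# Hypothetical per-server availability, for illustration only.
p = 0.97

# If commits lock up unless every required server responds, the ability
# to commit follows all_up(); async standbys do not enter that product.
print(f"all 2 required servers up: {all_up(p, 2):.4%}")

# If any one surviving copy is enough, the figure follows any_up().
print(f"at least 1 of 2 up:        {any_up(p, 2):.4%}")
```

The two functions pull in opposite directions as n grows: each server that must respond lowers all_up(), while each extra copy raises any_up(). That is the trade-off being discussed here.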

Now let's look at what happens when we first start a standby: we do the
base backup, configure the standby, it connects and then <wham> we
cannot process any new transactions until the standby has caught up,
which could well be hours on a big database. So if we don't have a
processing mode that allows work to continue, how will we ever enable
synchronous replication on a 24/7 database? How will we ever allow
standbys to catch up if they drop out for a while?

We should factor that into the availability calcs as well.
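
A minimal sketch of how that could be factored in, again in Python. It assumes the master cannot commit for the whole catch-up period and that those stalls are independent of other outages. The function name, its parameters and the example inputs are all hypothetical, not measurements.

```python
HOURS_PER_YEAR = 24 * 365


def commit_availability(base, catchups_per_year, catchup_hours):
    """Scale a base availability by the fraction of the year not spent
    blocked while a standby catches up."""
    blocked_fraction = catchups_per_year * catchup_hours / HOURS_PER_YEAR
    return base * max(0.0, 1.0 - blocked_fraction)


# Hypothetical inputs: base availability 0.9409, four catch-ups a year,
# three hours each.
print(f"{commit_availability(0.9409, 4, 3.0):.4%}")
```

Even a few multi-hour catch-ups per year take a visible bite out of the result, which is why the catch-up behaviour belongs in the calculation.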

--
 Simon Riggs           http://www.2ndQuadrant.com/books/
 PostgreSQL Development, 24x7 Support, Training and Services