Thread: Statistics about Streaming Replication deployments in production

Statistics about Streaming Replication deployments in production

From

Samba

Date:

28 July 2011, 11:03:52

Hi all,

We at Avaya India have been using Postgres for a few years and are very happy with the stability and performance of the system. We would like to use the newly released streaming replication feature to build a geographically redundant setup with one master and multiple slaves. We ship our customers a product that stores its transactional data in Postgres, and the data is expected to grow to around a couple of hundred gigabytes over time. It will have a heavy read load and an average write load.

One concern raised by our management team is the relative stability and 'industrial strength' of streaming replication. Considering that this feature is just one year old, doubts have been expressed about:

  * data integrity -- cancelled long-running transactions on the primary must not be applied on the standby
  * reliability -- what if the network link breaks, or one of the pair crashes, while log segments for a huge committed transaction are being sent from master to standby?
  * guaranteed recovery (on failover) -- at any moment, one should be able to turn the standby into the active server and start using it (there should be no scenario where the master has crashed and the slave cannot be made active)

Given these concerns, we thought it would reassure our management team if we could cite a few existing production deployments and their success stories.

I think one year is sufficient time for any product or feature to be thoroughly tested for all its strengths and weaknesses; so would it be too much to ask the vast Postgres customer base about their experiences with streaming replication: the good, the bad, and perhaps the best and the ugly too? It would be great if customers could identify themselves (employer info), though that is not necessary.

Thanks and Regards,
Samba

Re: Statistics about Streaming Replication deployments in production

From

Tomas Vondra

Date:

30 July 2011, 21:58:58

On 28.7.2011 13:03, Samba wrote:
> One concern that is being raised by our management team is regarding
> the relative stability and 'industrial-strength' of streaming
> replication. Considering that this feature is just one year old, doubts
> are expressed about
>
>   * data integrity -- cancelled long running transactions on Primary
>     must not be applied on the standby

I'm not quite sure what you mean by "apply on the standby." Queries that
run on the primary and modify data (e.g. an INSERT) have to apply their
changes to the standby. That's how streaming replication works - it
maintains a binary copy of the datafiles. If a query on the primary
modifies the datafiles, the change has to be applied to the standby even
if the query is cancelled.

But those changes won't be visible, because the transaction was never
committed (just as you can't see the changes on the primary).

>   *  reliability -- what if the network link is broken or one of the
>     pair got crashed when log-segments for a huge committed transaction
>     are being sent from master to standby?

The standby can either request the changes from the primary or fetch
them from the WAL archive. So even if the network link to the primary
goes down, the standby can still get the data from the archive.

If you care about continuous backups and PITR, you should probably
enable WAL archiving anyway. See this:

http://www.postgresql.org/docs/9.0/static/continuous-archiving.html
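
As a rough illustration of those two sources, a 9.0-style setup might look
something like the sketch below. The host name, replication user, archive
directory and the numeric values are placeholders I made up, not settings
from this thread, and the cp-based commands are only the simplest possible
form.

```conf
# --- primary: postgresql.conf ---
wal_level = hot_standby        # or 'archive' if the standby need not run read-only queries
max_wal_senders = 3            # number of concurrent streaming connections allowed
wal_keep_segments = 64         # finished segments kept for standbys that fall behind
archive_mode = on
archive_command = 'cp %p /mnt/wal_archive/%f'    # placeholder archive directory

# --- standby: postgresql.conf ---
hot_standby = on               # allow read-only queries during recovery

# --- standby: recovery.conf ---
standby_mode = 'on'
primary_conninfo = 'host=primary.example.com port=5432 user=replicator'   # placeholder values
restore_command = 'cp /mnt/wal_archive/%f %p'    # used when a segment cannot be streamed
```

With both primary_conninfo and restore_command set, the standby keeps
retrying: when the streaming connection drops it goes back to the archive,
and once the archive has nothing more it tries to reconnect to the primary.
The primary also needs a "replication" entry in pg_hba.conf for the standby.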

>   *  guaranteed recovery (on failover) -- at any moment, one should be
>     able to turn the standby into active and start using it (there
>     should not be a scenario where master crashed and the slave could
>     not be turned active)

I'm not aware of any bug preventing a failover ...
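
For reference, promotion in 9.0 is normally driven by a trigger file named
in the standby's recovery.conf. A minimal sketch, with a placeholder path:

```conf
# --- standby: recovery.conf ---
standby_mode = 'on'
trigger_file = '/tmp/postgresql.trigger'   # placeholder path; the presence of this file ends recovery
```

Creating that file on the standby ends recovery, after which the server
accepts read-write connections. It is worth rehearsing this on a test pair
before relying on it.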

> On account of these, we thought it would be reassuring to our management
> team if we can cite a few existing production deployments and their
> success stories.

I'd like to see that too, but I guess it's a bit too early for that. Keep
in mind that SR is just one year old. That's not much, especially for
large projects - it takes time to develop the system, test it, prepare
the production environment etc.

> I think one year is sufficient time for any product/feature to be
> thoroughly tested for all its strengths and weaknesses; so would it be
> too much to ask the vast postgres customer base about their experiences
> with streaming replication, the good, the bad; and perhaps the best and
> the ugly too? It would be great if customers can give their identity
> (employer info) but not necessary though.

Well, yes. I believe companies have been testing it, and bugs have been
reported to pgsql-bugs and fixed. That's how it works ;-)

Tomas

Re: Statistics about Streaming Replication deployments in production

From

Simon Riggs

Date:

31 July 2011, 17:15:48

On Thu, Jul 28, 2011 at 12:03 PM, Samba <saasira@gmail.com> wrote:

> I think one year is sufficient time for any product/feature to be thoroughly
> tested for all its strengths and weaknesses; so would it be too much to ask
> the vast postgres customer base about their experiences with streaming
> replication, the good, the bad; and perhaps the best and the ugly too? It
> would be great if customers can give their identity (employer info) but not
> necessary though.

Maybe it's not clear in the documentation, but the streaming replication
feature isn't just one year old.

The core parts of it are actually 7 years old, and they are definitely
battle tested. The slightly newer parts changed the transport logic to
stream the WAL rather than ship it file by file.

The features relevant here are Point in Time Recovery (8.0), Warm
Standby (8.2), pg_standby (8.3) and Bgwriter during recovery (8.4).

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services