Core team statement on replication in PostgreSQL

From
Tom Lane
Date:
The Postgres core team met at PGCon to discuss a few issues, the largest
of which is the need for simple, built-in replication for PostgreSQL.
Historically the project policy has been to avoid putting replication
into core PostgreSQL, so as to leave room for development of competing
solutions, recognizing that there is no "one size fits all" replication
solution.  However, it is becoming clear that this policy is hindering
acceptance of PostgreSQL to too great an extent, compared to the benefit
it offers to the add-on replication projects.  Users who might consider
PostgreSQL are choosing other database systems because our existing
replication options are too complex to install and use for simple cases.
In practice, simple asynchronous single-master-multiple-slave
replication covers a respectable fraction of use cases, so we have
concluded that we should allow such a feature to be included in the core
project.  We emphasize that this is not meant to prevent continued
development of add-on replication projects that cover more complex use
cases.

We believe that the most appropriate base technology for this is
probably real-time WAL log shipping, as was demoed by NTT OSS at PGCon.
We hope that such a feature can be completed for 8.4.  Ideally this
would be coupled with the ability to execute read-only queries on the
slave servers, but we see technical difficulties that might prevent that
from being completed before 8.5 or even further out.  (The big problem
is that long-running slave-side queries might still need tuples that are
vacuumable on the master, and so replication of vacuuming actions would
cause the slave's queries to deliver wrong answers.)
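To make the failure mode concrete, here is a toy sketch (illustrative Python, not PostgreSQL internals) of why replaying the master's vacuum removes a tuple version that a long-running slave snapshot still needs:

```python
# Illustrative sketch only: tuples carry the xid that deleted them, and a
# snapshot taken at xid S must still see tuples deleted by any xid > S.

def visible(tuple_deleted_xid, snapshot_xid):
    # A tuple is visible to a snapshot if it had not yet been deleted
    # when the snapshot was taken.
    return tuple_deleted_xid is None or tuple_deleted_xid > snapshot_xid

# A slave query takes a snapshot at xid 100; a row is later deleted at xid 150.
heap = {1: {"value": "row-1", "deleted_xid": 150}}
snapshot = 100

# Correct answer: the snapshot still sees the row.
assert visible(heap[1]["deleted_xid"], snapshot)

# The master's vacuum considers xid 150 old enough and removes the tuple;
# replaying that removal on the slave loses the version the query needs.
del heap[1]
rows_seen = [k for k, t in heap.items() if visible(t["deleted_xid"], snapshot)]
# The long-running slave query now (wrongly) sees nothing.
assert rows_seen == []
```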

Again, this will not replace Slony, pgPool, Continuent, Londiste, or
other systems for many users, as it will not be highly scalable, nor
support long-distance replication, nor replicate less than an entire
installation.  But it is time to include a simple, reliable basic
replication feature in the core system.
        regards, tom lane


Re: Core team statement on replication in PostgreSQL

From
"Marko Kreen"
Date:
On 5/29/08, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> The Postgres core team met at PGCon to discuss a few issues, the largest
>  of which is the need for simple, built-in replication for PostgreSQL.
>  Historically the project policy has been to avoid putting replication
>  into core PostgreSQL, so as to leave room for development of competing
>  solutions, recognizing that there is no "one size fits all" replication
>  solution.  However, it is becoming clear that this policy is hindering
>  acceptance of PostgreSQL to too great an extent, compared to the benefit
>  it offers to the add-on replication projects.  Users who might consider
>  PostgreSQL are choosing other database systems because our existing
>  replication options are too complex to install and use for simple cases.
>  In practice, simple asynchronous single-master-multiple-slave
>  replication covers a respectable fraction of use cases, so we have
>  concluded that we should allow such a feature to be included in the core
>  project.  We emphasize that this is not meant to prevent continued
>  development of add-on replication projects that cover more complex use
>  cases.
>
>  We believe that the most appropriate base technology for this is
>  probably real-time WAL log shipping, as was demoed by NTT OSS at PGCon.
>  We hope that such a feature can be completed for 8.4.

+1

Although I would put it more briefly - we need a solution for
lossless failover servers, and such a solution needs to live in the core backend.

>  Ideally this
>  would be coupled with the ability to execute read-only queries on the
>  slave servers, but we see technical difficulties that might prevent that
>  from being completed before 8.5 or even further out.  (The big problem
>  is that long-running slave-side queries might still need tuples that are
>  vacuumable on the master, and so replication of vacuuming actions would
>  cause the slave's queries to deliver wrong answers.)

Well, both Slony-I and the upcoming Skytools 3 have the same problem when
cleaning events, and they solve it simply by having the slaves report back
their lowest position on the event stream.  I cannot see why the same
approach cannot be applied in this case too.  So each slave just needs to
report its own longest-open transaction as "open" to the master.  Yes, it
bloats the master, but there is no way around that.

The only problem could be the plan to vacuum tuples updated in between a
long-running transaction and the regular ones, but such behaviour can
simply be turned off.

We could also have an option of an "inaccessible slave", for those who
fear bloat on the master.
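The feedback idea above can be sketched in a few lines (illustrative only, not the Slony/Skytools code): each slave reports its oldest open transaction id, and the master clamps its vacuum horizon to the minimum of all reports, at the cost of some bloat:

```python
# Hypothetical sketch: vacuum may only remove tuples that are dead to the
# master AND to every reporting slave.

def vacuum_horizon(master_oldest_xid, slave_reports):
    # slave_reports maps slave name -> oldest open xid reported by that slave
    return min([master_oldest_xid] + list(slave_reports.values()))

reports = {"slave1": 90, "slave2": 140}
assert vacuum_horizon(200, reports) == 90   # held back by slave1's long query

reports["slave1"] = 180                      # the long query finished
assert vacuum_horizon(200, reports) == 140   # horizon advances to slave2
```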

-- 
marko


Re: Core team statement on replication in PostgreSQL

From
David Fetter
Date:
On Thu, May 29, 2008 at 10:12:55AM -0400, Tom Lane wrote:
> The Postgres core team met at PGCon to discuss a few issues, the
> largest of which is the need for simple, built-in replication for
> PostgreSQL.  Historically the project policy has been to avoid
> putting replication into core PostgreSQL, so as to leave room for
> development of competing solutions, recognizing that there is no
> "one size fits all" replication solution.  However, it is becoming
> clear that this policy is hindering acceptance of PostgreSQL to too
> great an extent, compared to the benefit it offers to the add-on
> replication projects.  Users who might consider PostgreSQL are
> choosing other database systems because our existing replication
> options are too complex to install and use for simple cases.  In
> practice, simple asynchronous single-master-multiple-slave
> replication covers a respectable fraction of use cases, so we have
> concluded that we should allow such a feature to be included in the
> core project.  We emphasize that this is not meant to prevent
> continued development of add-on replication projects that cover more
> complex use cases.
> 
> We believe that the most appropriate base technology for this is
> probably real-time WAL log shipping, as was demoed by NTT OSS at
> PGCon.  We hope that such a feature can be completed for 8.4.

> Ideally this would be coupled with the ability to execute read-only
> queries on the slave servers, but we see technical difficulties that
> might prevent that from being completed before 8.5 or even further
> out.  (The big problem is that long-running slave-side queries might
> still need tuples that are vacuumable on the master, and so
> replication of vacuuming actions would cause the slave's queries to
> deliver wrong answers.)

This part is a deal-killer.  It's a giant up-hill slog to sell warm
standby to those in charge of making resources available because the
warm standby machine consumes SA time, bandwidth, power, rack space,
etc., but provides no tangible benefit, and this feature would have
exactly the same problem.

IMHO, without the ability to do read-only queries on slaves, it's not
worth doing this feature at all.

Cheers,
David.
-- 
David Fetter <david@fetter.org> http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter      XMPP: david.fetter@gmail.com

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate


Re: Core team statement on replication in PostgreSQL

From
"Marko Kreen"
Date:
On 5/29/08, David Fetter <david@fetter.org> wrote:
> On Thu, May 29, 2008 at 10:12:55AM -0400, Tom Lane wrote:
>  > Ideally this would be coupled with the ability to execute read-only
>  > queries on the slave servers, but we see technical difficulties that
>  > might prevent that from being completed before 8.5 or even further
>  > out.  (The big problem is that long-running slave-side queries might
>  > still need tuples that are vacuumable on the master, and so
>  > replication of vacuuming actions would cause the slave's queries to
>  > deliver wrong answers.)
>
> This part is a deal-killer.  It's a giant up-hill slog to sell warm
>  standby to those in charge of making resources available because the
>  warm standby machine consumes SA time, bandwidth, power, rack space,
>  etc., but provides no tangible benefit, and this feature would have
>  exactly the same problem.
>
>  IMHO, without the ability to do read-only queries on slaves, it's not
>  worth doing this feature at all.

I would not be so harsh - I'd like to have the lossless standby even
without read-only slaves.

But Tom's mail gave me the impression that core wants to wait until we get
a "perfect" read-only slave implementation, so we would wait with it until
8.6, which does not seem sensible.  If we can do a slightly inefficient
(but simple) implementation right now, I see no reason to reject it; we
can always improve it later.

Especially as it can be switchable.  And we could also have a
transaction_timeout parameter on slaves so the hit on the master is limited.

-- 
marko


Re: Core team statement on replication in PostgreSQL

From
"Joshua D. Drake"
Date:

On Thu, 2008-05-29 at 08:21 -0700, David Fetter wrote:
> On Thu, May 29, 2008 at 10:12:55AM -0400, Tom Lane wrote:

> This part is a deal-killer.  It's a giant up-hill slog to sell warm
> standby to those in charge of making resources available because the
> warm standby machine consumes SA time, bandwidth, power, rack space,
> etc., but provides no tangible benefit, and this feature would have
> exactly the same problem.
> 
> IMHO, without the ability to do read-only queries on slaves, it's not
> worth doing this feature at all.

The only question I have is... what does this give us that PITR doesn't
give us?

Sincerely,

Joshua D. Drake




Re: Core team statement on replication in PostgreSQL

From
Josh Berkus
Date:
Marko,

> But Tom's mail gave me impression core wants to wait until we get "perfect"
> read-only slave implementation so we wait with it until 8.6, which does
> not seem sensible.  If we can do slightly inefficient (but simple)
> implementation right now, I see no reason to reject it, we can always
> improve it later.

That's incorrect.  We're looking for a workable solution.  If we could 
get one for 8.4, that would be brilliant but we think it's going to be 
harder than that.

Publishing the XIDs back to the master is one possibility.  We also 
looked at using "spillover segments" for vacuumed rows, but that seemed 
even less viable.

I'm also thinking, for *async replication*, that we could simply halt 
replication on the slave whenever a transaction passes minxid on the 
master.  However, the main focus will be on synchronous hot standby.

--Josh



Re: Core team statement on replication in PostgreSQL

From
David Fetter
Date:
On Thu, May 29, 2008 at 08:46:22AM -0700, Joshua D. Drake wrote:
> On Thu, 2008-05-29 at 08:21 -0700, David Fetter wrote:
> > This part is a deal-killer.  It's a giant up-hill slog to sell
> > warm standby to those in charge of making resources available
> > because the warm standby machine consumes SA time, bandwidth,
> > power, rack space, etc., but provides no tangible benefit, and
> > this feature would have exactly the same problem.
> > 
> > IMHO, without the ability to do read-only queries on slaves, it's
> > not worth doing this feature at all.
> 
> The only question I have is... what does this give us that PITR
> doesn't give us?

It looks like a wrapper for PITR to me, so the gain would be ease of
use.

Cheers,
David.
-- 
David Fetter <david@fetter.org> http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter      XMPP: david.fetter@gmail.com

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate


Re: Core team statement on replication in PostgreSQL

From
"Douglas McNaught"
Date:
On Thu, May 29, 2008 at 11:46 AM, Joshua D. Drake <jd@commandprompt.com> wrote:

> The only question I have is... what does this give us that PITR doesn't
> give us?

I think the idea is that WAL records would be shipped (possibly via
socket) and applied as they're generated, rather than on a
file-by-file basis.  At least that's what "real-time" implies to me...
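The difference is easy to see with some back-of-envelope numbers (illustrative sketch; the 16 MB default segment size is the only figure taken from PostgreSQL): file-based shipping strands everything short of a full segment on the master, while record streaming forwards each record as it is written.

```python
# Illustrative comparison of the "window of loss" under the two schemes.
SEGMENT_SIZE = 16 * 1024 * 1024  # default WAL segment size

def ship_file_based(record_sizes):
    """Bytes still stranded on the master (potentially lost on a crash)
    after writing the given records: only complete segments were shipped."""
    return sum(record_sizes) % SEGMENT_SIZE

def ship_streaming(record_sizes):
    # Every record was forwarded as it was generated.
    return 0

records = [512] * 1000            # 1000 small commits, far short of a segment
assert ship_file_based(records) == 512 * 1000   # nothing shipped yet
assert ship_streaming(records) == 0             # all already on the slave
```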

-Doug


Re: Core team statement on replication in PostgreSQL

From
Bruce Momjian
Date:
Josh Berkus wrote:
> Marko,
> 
> > But Tom's mail gave me impression core wants to wait until we get "perfect"
> > read-only slave implementation so we wait with it until 8.6, which does
> > not seem sensible.  If we can do slightly inefficient (but simple)
> > implementation right now, I see no reason to reject it, we can always
> > improve it later.
> 
> That's incorrect.  We're looking for a workable solution.  If we could 
> get one for 8.4, that would be brilliant but we think it's going to be 
> harder than that.
> 
> Publishing the XIDs back to the master is one possibility.  We also 
> looked at using "spillover segments" for vacuumed rows, but that seemed 
> even less viable.
> 
> I'm also thinking, for *async replication*, that we could simply halt 
> replication on the slave whenever a transaction passes minxid on the 
> master.  However, the main focus will be on synchronous hot standby.

Another idea I discussed with Tom is having the slave _delay_ applying
WAL files until all slave snapshots are ready.
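A minimal sketch of that idea (illustrative Python, all names hypothetical): the slave makes every record durable on receipt, so a failover loses nothing, but defers *applying* a vacuum record while any open slave snapshot still needs the tuples it would remove:

```python
# Hypothetical sketch of receive-now / apply-later on the slave.
received, applied = [], []
open_snapshot_xids = {100}          # a long-running slave query's snapshot

def receive(record):
    received.append(record)         # durable on the slave immediately

def try_apply():
    while received:
        rec = received[0]
        if rec["type"] == "vacuum" and any(
                xid < rec["horizon"] for xid in open_snapshot_xids):
            break                   # would remove tuples a snapshot needs
        applied.append(received.pop(0))

receive({"type": "vacuum", "horizon": 150})
try_apply()
assert applied == []                # held back, but safely received

open_snapshot_xids.clear()          # the long query finishes
try_apply()
assert len(applied) == 1            # now safe to replay
```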

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

 + If your life is a hard drive, Christ can be your backup. +


Re: Core team statement on replication in PostgreSQL

From
"Dave Page"
Date:
On Thu, May 29, 2008 at 4:48 PM, Douglas McNaught <doug@mcnaught.org> wrote:
> On Thu, May 29, 2008 at 11:46 AM, Joshua D. Drake <jd@commandprompt.com> wrote:
>
>> The only question I have is... what does this give us that PITR doesn't
>> give us?
>
> I think the idea is that WAL records would be shipped (possibly via
> socket) and applied as they're generated, rather than on a
> file-by-file basis.  At least that's what "real-time" implies to me...

Yes, we're talking real-time streaming (synchronous) log shipping.

-- 
Dave Page
EnterpriseDB UK: http://www.enterprisedb.com


Re: Core team statement on replication in PostgreSQL

From
Rick Vernam
Date:
On Thursday 29 May 2008 09:54:03 am Marko Kreen wrote:
> On 5/29/08, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > The Postgres core team met at PGCon to discuss a few issues, the largest
> >  of which is the need for simple, built-in replication for PostgreSQL.
> >  Historically the project policy has been to avoid putting replication
> >  into core PostgreSQL, so as to leave room for development of competing
> >  solutions, recognizing that there is no "one size fits all" replication
> >  solution.  However, it is becoming clear that this policy is hindering
> >  acceptance of PostgreSQL to too great an extent, compared to the benefit
> >  it offers to the add-on replication projects.  Users who might consider
> >  PostgreSQL are choosing other database systems because our existing
> >  replication options are too complex to install and use for simple cases.
> >  In practice, simple asynchronous single-master-multiple-slave
> >  replication covers a respectable fraction of use cases, so we have
> >  concluded that we should allow such a feature to be included in the core
> >  project.  We emphasize that this is not meant to prevent continued
> >  development of add-on replication projects that cover more complex use
> >  cases.
> >
> >  We believe that the most appropriate base technology for this is
> >  probably real-time WAL log shipping, as was demoed by NTT OSS at PGCon.
> >  We hope that such a feature can be completed for 8.4.
>
> +1
>
> Although I would explain it more shortly - we do need a solution for
> lossless failover servers and such solution needs to live in core backend.

+1 for lossless failover (ie, synchronous)


Re: Core team statement on replication in PostgreSQL

From
Aidan Van Dyk
Date:
* Josh Berkus <josh@agliodbs.com> [080529 11:52]:
> Marko,
> 
> >But Tom's mail gave me impression core wants to wait until we get "perfect"
> >read-only slave implementation so we wait with it until 8.6, which does
> >not seem sensible.  If we can do slightly inefficient (but simple)
> >implementation
> >right now, I see no reason to reject it, we can always improve it later.
> 
> That's incorrect.  We're looking for a workable solution.  If we could 
> get one for 8.4, that would be brilliant but we think it's going to be 
> harder than that.
> 
> Publishing the XIDs back to the master is one possibility.  We also 
> looked at using "spillover segments" for vacuumed rows, but that seemed 
> even less viable.
> 
> I'm also thinking, for *async replication*, that we could simply halt 
> replication on the slave whenever a transaction passes minxid on the 
> master.  However, the main focus will be on synchronous hot standby.

Or, instead of a statement timeout killing statements on the RO slave,
simply kill any "old" transactions on the RO slave.   "Old" in the sense
that the master's xmin has passed it.  And then it's just an exercise in
controlling the age of xmin on the master, which could even be done
user-side.

Doesn't fit all, but no one size does...  It would work where you're
hammering your slaves with a diverse set of high-velocity short queries
that you're trying to avoid on the master...

An option to "pause replay (making it async)" or "abort transactions
(for sync)" might make it possible to easily run an async slave for slow
reporting queries, and a sync slave for short queries.

a.

-- 
Aidan Van Dyk                                             Create like a god,
aidan@highrise.ca                                       command like a king,
http://www.highrise.ca/                                   work like a slave.

Re: Core team statement on replication in PostgreSQL

From
Josh Berkus
Date:
Joshua D. Drake wrote:
> 
> On Thu, 2008-05-29 at 08:21 -0700, David Fetter wrote:
>> On Thu, May 29, 2008 at 10:12:55AM -0400, Tom Lane wrote:
> 
>> This part is a deal-killer.  It's a giant up-hill slog to sell warm
>> standby to those in charge of making resources available because the
>> warm standby machine consumes SA time, bandwidth, power, rack space,
>> etc., but provides no tangible benefit, and this feature would have
>> exactly the same problem.
>>
>> IMHO, without the ability to do read-only queries on slaves, it's not
>> worth doing this feature at all.
> 
> The only question I have is... what does this give us that PITR doesn't
> give us?

Since people seem to be unclear on what we're proposing:

8.4 Synchronous Warm Standby: makes PostgreSQL more suitable for HA 
systems by eliminating failover data loss and cutting failover time.

8.5 (probably) Synchronous & Asynchronous Hot Standby: adds read-only 
queries on slaves to the above.

Again, if we can implement queries on slaves for 8.4, we're all for it.
However, after conversations in Core and with Simon, we all think it's
going to be too big a task to complete in 4-5 months.  We *don't* want
to end up delaying 8.4 for 5 months because we're debugging hot standby.

--Josh



Re: Core team statement on replication in PostgreSQL

From
Brian Hurt
Date:
David Fetter wrote:
> This part is a deal-killer.  It's a giant up-hill slog to sell warm
> standby to those in charge of making resources available because the
> warm standby machine consumes SA time, bandwidth, power, rack space,
> etc., but provides no tangible benefit, and this feature would have
> exactly the same problem.
>
> IMHO, without the ability to do read-only queries on slaves, it's not
> worth doing this feature at all.

I don't think I agree with this.  There are a large number of situations
where it's positive expectancy to do precisely this - it's not unlike
buying a $1 lottery ticket with a 1 chance in 100 of winning $1000 - the
vast majority of the time (99 times out of 100), you're going to lose $1.
But when you win, you win big, and make up for all the small losses you
incurred getting there and then some.  Failover machines are like that -
most of the time they're negative value, as you said - taking up SA time,
bandwidth, power, rack space, money, etc.  But every once in a (great)
while, they save you.  If the cost of having the database down for hours
or days (as you madly try to next-day replacement hardware) isn't that
great, then no, this isn't worthwhile - but in cases where the database
being down chalks up the lost money quickly, this is easy to cost-justify.

Being able to do read-only queries makes this feature more valuable in
more situations, but I disagree that it's a deal-breaker.

Brian

Re: Core team statement on replication in PostgreSQL

From
David Fetter
Date:
On Thu, May 29, 2008 at 11:58:31AM -0400, Bruce Momjian wrote:
> Josh Berkus wrote:
> > Publishing the XIDs back to the master is one possibility.  We
> > also looked at using "spillover segments" for vacuumed rows, but
> > that seemed even less viable.
> > 
> > I'm also thinking, for *async replication*, that we could simply
> > halt replication on the slave whenever a transaction passes minxid
> > on the master.  However, the main focus will be on synchronous
> > hot standby.
> 
> Another idea I discussed with Tom is having the slave _delay_
> applying WAL files until all slave snapshots are ready.

Either one of these would be great, but something that involves
machines that stay useless most of the time is just not going to work.

Cheers,
David.
-- 
David Fetter <david@fetter.org> http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter      XMPP: david.fetter@gmail.com

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate


Re: Core team statement on replication in PostgreSQL

From
Bruce Momjian
Date:
David Fetter wrote:
> On Thu, May 29, 2008 at 11:58:31AM -0400, Bruce Momjian wrote:
> > Josh Berkus wrote:
> > > Publishing the XIDs back to the master is one possibility.  We
> > > also looked at using "spillover segments" for vacuumed rows, but
> > > that seemed even less viable.
> > > 
> > > I'm also thinking, for *async replication*, that we could simply
> > > halt replication on the slave whenever a transaction passes minxid
> > > on the master.  However, the main focus will be on synchronous
> > > hot standby.
> > 
> > Another idea I discussed with Tom is having the slave _delay_
> > applying WAL files until all slave snapshots are ready.
> 
> Either one of these would be great, but something that involves
> machines that stay useless most of the time is just not going to work.

Right, the ultimate target is to have the slave be read-only, but we
need to get the streaming of WAL logs done first.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

 + If your life is a hard drive, Christ can be your backup. +


Re: Core team statement on replication in PostgreSQL

From
Aidan Van Dyk
Date:
* Dave Page <dpage@pgadmin.org> [080529 12:03]:
> On Thu, May 29, 2008 at 4:48 PM, Douglas McNaught <doug@mcnaught.org> wrote:

> > I think the idea is that WAL records would be shipped (possibly via
> > socket) and applied as they're generated, rather than on a
> > file-by-file basis.  At least that's what "real-time" implies to me...
> 
> Yes, we're talking real-time streaming (synchronous) log shipping.

But synchronous streaming doesn't mean the WAL has to be *applied* on
the slave yet.  Just that it has to be "safely" on the slave (i.e. on
disk, not just in kernel buffers).

The whole single-threaded WAL replay problem is going to rear its ugly
head here too, and mean that a slave *won't* be able to keep up with a
busy master if it's actually trying to apply all the changes in
real-time.  Well, actually, if it's synchronous, it will keep up, but it
just means that your master's IO capability is now limited to the speed
of the slave's single-threaded WAL application.

a.
-- 
Aidan Van Dyk                                             Create like a god,
aidan@highrise.ca                                       command like a king,
http://www.highrise.ca/                                   work like a slave.

Re: Core team statement on replication in PostgreSQL

From
Josh Berkus
Date:
Bruce,

> Another idea I discussed with Tom is having the slave _delay_ applying
> WAL files until all slave snapshots are ready.
> 

Well, again, that only works for async mode.  I personally think that's 
the correct solution for async.  But for synch mode, I think we need to 
push the xids back to the master; generally if a user is running in 
synch mode they're concerned about failover time and zero data loss, so 
holding back the WAL files doesn't make sense.

Also, if you did delay applying WAL files on an async slave, you'd reach 
a point (perhaps after a 6-hour query) where it'd actually be cheaper to 
rebuild the slave than to apply the pent-up WAL files.
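The break-even is simple arithmetic (all numbers below are hypothetical, purely to illustrate the shape of the trade-off):

```python
# Back-of-envelope sketch: past some backlog of pent-up WAL, re-copying the
# base from the master is cheaper than replaying the accumulated segments.
def cheaper_to_rebuild(backlog_bytes, replay_bytes_per_sec,
                       db_size_bytes, copy_bytes_per_sec):
    replay_time = backlog_bytes / replay_bytes_per_sec
    rebuild_time = db_size_bytes / copy_bytes_per_sec
    return replay_time > rebuild_time

GB = 1024 ** 3
# 180 GB of pent-up WAL replayed at 20 MB/s vs. copying a 100 GB database
# at 100 MB/s: rebuilding wins by a wide margin.
assert cheaper_to_rebuild(180 * GB, 20 * 1024 ** 2, 100 * GB, 100 * 1024 ** 2)
```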

--Josh Berkus



Re: Core team statement on replication in PostgreSQL

From
Andrew Dunstan
Date:

Dave Page wrote:
> On Thu, May 29, 2008 at 4:48 PM, Douglas McNaught <doug@mcnaught.org> wrote:
>   
>> On Thu, May 29, 2008 at 11:46 AM, Joshua D. Drake <jd@commandprompt.com> wrote:
>>
>>     
>>> The only question I have is... what does this give us that PITR doesn't
>>> give us?
>>>       
>> I think the idea is that WAL records would be shipped (possibly via
>> socket) and applied as they're generated, rather than on a
>> file-by-file basis.  At least that's what "real-time" implies to me...
>>     
>
> Yes, we're talking real-time streaming (synchronous) log shipping.
>   

That's not what Tom's email said, AIUI. "Synchronous" replication surely 
means that the master and slave always have the same set of transactions 
applied. Streaming <> synchronous. But streaming log shipping will allow 
us to get closer to synchronicity in some situations, i.e. the 
window for missing transactions will be much smaller.

Some of us were discussing this late on Friday night after PGcon. ISTM 
that we can have either 1) fairly hot failover slaves that are 
guaranteed to be almost up to date, or 2) slaves that can support 
read-only transactions but might get somewhat out of date if they run 
long transactions. The big problem is in having slaves which are both 
highly up to date and support arbitrary read-only transactions. Maybe in 
the first instance, at least, we need to make slaves choose which role 
they will play.

cheers

andrew


Re: Core team statement on replication in PostgreSQL

From
"Marko Kreen"
Date:
On 5/29/08, Joshua D. Drake <jd@commandprompt.com> wrote:
>  On Thu, 2008-05-29 at 08:21 -0700, David Fetter wrote:
>  > On Thu, May 29, 2008 at 10:12:55AM -0400, Tom Lane wrote:
> > This part is a deal-killer.  It's a giant up-hill slog to sell warm
>  > standby to those in charge of making resources available because the
>  > warm standby machine consumes SA time, bandwidth, power, rack space,
>  > etc., but provides no tangible benefit, and this feature would have
>  > exactly the same problem.
>  >
>  > IMHO, without the ability to do read-only queries on slaves, it's not
>  > worth doing this feature at all.
>
> The only question I have is... what does this give us that PITR doesn't
>  give us?

Tom is talking about synchronous WAL replication.

So you can do lossless failover.  Currently there is no good
solution for this.

And it needs to live in the core backend.  Yes, it could somehow be
implemented by filling the backend with hooks, but the question is how it
will stay synced with changes in the core backend after a couple of
releases?  The WAL writing and txid/snapshot handling receive heavy
changes in each release.

No external project that needs deep hooks has been able to keep pace with
core changes thus far - unless heavily commercially backed, which means
not open-source.  Companies can tell you the price they pay for such syncing.

The other solution would indeed be to have fixed hooks guaranteed to be
stable between releases (replica-hooks-discuss?).  But that would mean
limiting the changes we can make to the WAL-writing/snapshot handling
code, and that does not seem like an attractive solution.

By having such replication code - which ties tightly into the core code -
included in the main Postgres source, we are still free to make any
changes we like and are not bound by external API promises.

-- 
marko


Re: Core team statement on replication in PostgreSQL

From
Bruce Momjian
Date:
Josh Berkus wrote:
> Bruce,
> 
> > Another idea I discussed with Tom is having the slave _delay_ applying
> > WAL files until all slave snapshots are ready.
> > 
> 
> Well, again, that only works for async mode.  I personally think that's 
> the correct solution for async.  But for synch mode, I think we need to 
> push the xids back to the master; generally if a user is running in 
> synch mode they're concerned about failover time and zero data loss, so 
> holding back the WAL files doesn't make sense.

You send the WAL to the slave, but the slave doesn't apply them right
away --- it isn't related to async.

> Also, if you did delay applying WAL files on an async slave, you'd reach 
> a point (perhaps after a 6-hour query) where it'd actually be cheaper to 
> rebuild the slave than to apply the pent-up WAL files.

True.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

 + If your life is a hard drive, Christ can be your backup. +


Re: Core team statement on replication in PostgreSQL

From
Steve Atkins
Date:
On May 29, 2008, at 9:12 AM, David Fetter wrote:

> On Thu, May 29, 2008 at 11:58:31AM -0400, Bruce Momjian wrote:
>> Josh Berkus wrote:
>>> Publishing the XIDs back to the master is one possibility.  We
>>> also looked at using "spillover segments" for vacuumed rows, but
>>> that seemed even less viable.
>>>
>>> I'm also thinking, for *async replication*, that we could simply
>>> halt replication on the slave whenever a transaction passes minxid
>>> on the master.  However, the main focus will be on synchronous
>>> hot standby.
>>
>> Another idea I discussed with Tom is having the slave _delay_
>> applying WAL files until all slave snapshots are ready.
>
> Either one of these would be great, but something that involves
> machines that stay useless most of the time is just not going to work.

I have customers who are thinking about warm standby functionality, and
the only thing stopping them deploying it is complexity and maintenance,
not the cost of the HA hardware.  If trivial-to-deploy replication that
didn't offer read-only access to the slaves were available today, I'd bet
that most of them would be using it.

Read-only slaves would certainly be nice, but (for me) it's making it
trivial to deploy and maintain that's more interesting.

Cheers,
  Steve



Re: Core team statement on replication in PostgreSQL

From
Mathias Brossard
Date:
Tom Lane wrote:
> In practice, simple asynchronous single-master-multiple-slave
> replication covers a respectable fraction of use cases, so we have
> concluded that we should allow such a feature to be included in the core
> project.  We emphasize that this is not meant to prevent continued
> development of add-on replication projects that cover more complex use
> cases.

IMHO, this will help PostgreSQL adoption, mindshare and even boost interest in
development for the other replication use cases.

> We believe that the most appropriate base technology for this is
> probably real-time WAL log shipping, as was demoed by NTT OSS at PGCon.

The slides are up at http://www.pgcon.org/2008/schedule/events/76.en.html
From what I gather from those slides, it seems to me that the NTT solution
is synchronous, not asynchronous.  In my opinion that's even better, but I
do understand that others might prefer asynchronous.  I'm going to
speculate, but I would think it should be possible (without a substantial
rewrite) to support both modes (or even some intermediate modes, like DRBD
on Linux).

> We hope that such a feature can be completed for 8.4.  Ideally this
> would be coupled with the ability to execute read-only queries on the
> slave servers, but we see technical difficulties that might prevent that
> from being completed before 8.5 or even further out.  (The big problem
> is that long-running slave-side queries might still need tuples that are
> vacuumable on the master, and so replication of vacuuming actions would
> cause the slave's queries to deliver wrong answers.)

From the 8.4dev documentation, another problem for read-only slaves would be:
« Operations on hash indexes are not presently WAL-logged, so replay will not
update these indexes. The recommended workaround is to manually REINDEX each
such index after completing a recovery operation. »

Sincerely,
--
Mathias Brossard


Re: Core team statement on replication in PostgreSQL

From
"Joshua D. Drake"
Date:

On Thu, 2008-05-29 at 09:10 -0700, Josh Berkus wrote:
> Joshua D. Drake wrote:
> > 
> > The only question I have is... what does this give us that PITR doesn't
> > give us?
> 
> Since people seem to be unclear on what we're proposing:
> 
> 8.4 Synchronous Warm Standby: makes PostgreSQL more suitable for HA 
> systems by eliminating failover data loss and cutting failover time.
> 

What does this give us that Solaris Cluster, RedHat Cluster, DRBD, etc.
don't give us? I am not trying to be a poison pill, but I am just not
seeing the benefit over the solutions that already exist. I could
probably argue, if I had more time, that this solution doesn't do
anything but make us look like we are half baked in implementation.

If the real goal is read-only slaves with synchronous capability, then
let's implement that. If we can't do that by 8.4 it gets pushed to 8.5.
We already have a dozen different utilities to give us what is being
currently proposed.

Sincerely,

Joshua D. Drake






Re: Core team statement on replication in PostgreSQL

From
"Marko Kreen"
Date:
On 5/29/08, Aidan Van Dyk <aidan@highrise.ca> wrote:
> * Dave Page <dpage@pgadmin.org> [080529 12:03]:
> > On Thu, May 29, 2008 at 4:48 PM, Douglas McNaught <doug@mcnaught.org> wrote:
> > > I think the idea is that WAL records would be shipped (possibly via
>  > > socket) and applied as they're generated, rather than on a
>  > > file-by-file basis.  At least that's what "real-time" implies to me...
>  >
>  > Yes, we're talking real-time streaming (synchronous) log shipping.
>
> But synchronous streaming doesn't mean the WAL has to be *applied* on
>  the slave yet.  Just that it has to be "safely" on the slave (i.e. on
>  disk, not just in kernel buffers).
>
>  The whole single-threaded WAL replay problem is going to rear its ugly
>  head here too, and mean that a slave *won't* be able to keep up with a
>  busy master if it's actually trying to apply all the changes in
>  real-time.  Well, actually, if it's synchronous, it will keep up, but it
>  just means that your master's IO capability is now limited to the
>  speed of the slave's single-threaded WAL application.

I don't think that's a problem.  If the user runs their server at the
limit of write-bandwidth, that's their problem.

IOW, with synchronous replication, we _want_ the server to lag behind
slaves.

About the single-threading problem - afaik, the replay is mostly I/O bound
so threading would not buy you much.

-- 
marko


Re: Core team statement on replication in PostgreSQL

From
Aidan Van Dyk
Date:
* Marko Kreen <markokr@gmail.com> [080529 12:27]:

> I don't think that's a problem.  If the user runs their server at the
> limit of write-bandwidth, that's their problem.
> 
> IOW, with synchronous replication, we _want_ the server to lag behind
> slaves.
> 
> About the single-threading problem - afaik, the replay is mostly I/O bound
> so threading would not buy you much.

Right - the problem is that the master has N>1 backends working away,
preloading the modified heap pages into shared buffers, where they are
modified w/ WAL.  This means the kernel/controller has many read-requests
in flight at a time as the modifies/writes chug along.  The slave has to
read/modify/write every buffer, one at a time, as WAL arrives, meaning
there is only ever 1 IO request in flight at a time.

So the master has a queue of many parallel reads going on, while the
slave has a single stream of serial random reads.

a.
-- 
Aidan Van Dyk                                             Create like a god,
aidan@highrise.ca                                       command like a king,
http://www.highrise.ca/                                   work like a slave.
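[Editorial note: a back-of-the-envelope sketch of the queue-depth effect Aidan describes. All numbers here are invented for illustration; the real ratio depends on storage hardware and workload.]

```python
# A master with N backends keeps ~N random reads in flight at once; a
# single-threaded replay process keeps exactly 1.  Assuming the I/O
# subsystem completes requests roughly in proportion to the number in
# flight (a simplification), replay throughput is bounded accordingly.

seek_ms = 8.0            # assumed random-read latency per page (invented)
master_backends = 16     # assumed number of concurrent backends (invented)

def reads_per_sec(in_flight, seek_ms=seek_ms):
    # pages/second completed with `in_flight` requests outstanding
    return in_flight * 1000.0 / seek_ms

master_rate = reads_per_sec(master_backends)  # pages/s the master can dirty
slave_rate = reads_per_sec(1)                 # pages/s serial replay manages

# Under these assumptions the slave replays 16x slower than the master
# writes, so either the slave falls behind (async) or the master is
# throttled to the slave's replay rate (sync).
assert master_rate / slave_rate == master_backends
```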

Re: Core team statement on replication in PostgreSQL

From
Tom Lane
Date:
Andrew Dunstan <andrew@dunslane.net> writes:
> Dave Page wrote:
>> Yes, we're talking real-time streaming (synchronous) log shipping.

> That's not what Tom's email said, AIUI.

Sorry, I was a bit sloppy about that.  If we go with a WAL-shipping
solution it would be pretty easy to support both synchronous and
asynchronous cases (synchronous == master doesn't report commit until
the WAL is down to disk on the slaves too).  There are different
use-cases for both so it'd make sense to do both.
        regards, tom lane
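[Editorial note: a toy model of the synchronous/asynchronous distinction Tom defines above. The classes and method names are invented for illustration; this is not PostgreSQL code.]

```python
# Synchronous: commit does not return until the WAL record is durably on
# every slave's disk.  Asynchronous: commit returns immediately, leaving
# a window where the record exists only on the master.

class Slave:
    def __init__(self):
        self.wal = []                 # records durably on the slave's disk

    def receive_and_fsync(self, record):
        self.wal.append(record)       # pretend the fsync happens here
        return "ack"

class Master:
    def __init__(self, slaves, synchronous):
        self.slaves = slaves
        self.synchronous = synchronous
        self.wal = []
        self.ship_queue = []          # records not yet shipped (async mode)

    def commit(self, record):
        self.wal.append(record)       # local WAL flush
        if self.synchronous:
            # The "no loss window" guarantee: wait for every slave's ack.
            for s in self.slaves:
                assert s.receive_and_fsync(record) == "ack"
        else:
            # Shipped later; a crash now would lose this record's copy.
            self.ship_queue.append(record)
        return "committed"

slave = Slave()
sync_master = Master([slave], synchronous=True)
sync_master.commit("tx1")
assert slave.wal == ["tx1"]           # on the slave before commit returned

async_master = Master([Slave()], synchronous=False)
async_master.commit("tx2")
assert async_master.slaves[0].wal == []   # the async loss window
```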


Re: Core team statement on replication in PostgreSQL

From
Shane Ambler
Date:
David Fetter wrote:

> This part is a deal-killer.  It's a giant up-hill slog to sell warm
> standby to those in charge of making resources available because the
> warm standby machine consumes SA time, bandwidth, power, rack space,
> etc., but provides no tangible benefit, and this feature would have
> exactly the same problem.
> 
> IMHO, without the ability to do read-only queries on slaves, it's not
> worth doing this feature at all.

+1

I would think that a read-only WAL slave is more valuable than a 
real-time backup (especially as the topic is about adding slaves, not 
increasing the effectiveness of backups).


I also think that starting with a read-only WAL slave will ease the 
transition between delayed slave updating and real-time slave updating.


-- 

Shane Ambler
pgSQL (at) Sheeky (dot) Biz

Get Sheeky @ http://Sheeky.Biz


Re: Core team statement on replication in PostgreSQL

From
Tom Lane
Date:
David Fetter <david@fetter.org> writes:
> On Thu, May 29, 2008 at 08:46:22AM -0700, Joshua D. Drake wrote:
>> The only question I have is... what does this give us that PITR
>> doesn't give us?

> It looks like a wrapper for PITR to me, so the gain would be ease of
> use.

A couple of points about that:

* Yeah, ease of use is a huge concern here.  We're getting beat up
because people have to go find a separate package (and figure out
which one they want), install it, learn how to use it, etc.  It doesn't
help that the most mature package is Slony which is, um, not very
novice-friendly or low-admin-complexity.  I personally got religion
on this about two months ago when Red Hat switched their bugzilla
from Postgres to MySQL because the admins didn't want to deal with Slony
any more.  People want simple.

* The proposed approach is trying to get to "real" replication
incrementally.  Getting rid of the loss window involved in file-by-file
log shipping is step one, and I suspect that step two is going to be
fixing performance issues in WAL replay to ensure that slaves can keep
up.  After that we'd start thinking about how to let slaves run
read-only queries.  But even without read-only queries, this will be
a useful improvement for HA/backup scenarios.
        regards, tom lane


Re: Core team statement on replication in PostgreSQL

From
Greg Smith
Date:
On Thu, 29 May 2008, David Fetter wrote:

> It's a giant up-hill slog to sell warm standby to those in charge of 
> making resources available because the warm standby machine consumes SA 
> time, bandwidth, power, rack space, etc., but provides no tangible 
> benefit, and this feature would have exactly the same problem.

This is an interesting commentary on the priorities of the customers 
you're selling to, but I don't think you can extrapolate from that too 
much.  The deployments I normally deal with won't run a system unless 
there's a failover backup available, period, and the fact that such a 
feature is not integrated into the core yet is a major problem for them. 
Read-only slaves are a very nice-to-have, but by no means a prerequisite 
before core replication will be useful to some people.  Hardware/machine 
resources are only worth a tiny fraction of what the data is in some 
environments, and in some of those downtime is really, really expensive.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD


Re: Core team statement on replication in PostgreSQL

From
Jeff Davis
Date:
On Thu, 2008-05-29 at 09:18 -0700, Josh Berkus wrote:
> Bruce,
> 
> > Another idea I discussed with Tom is having the slave _delay_ applying
> > WAL files until all slave snapshots are ready.
> > 
> 
> Well, again, that only works for async mode.

It depends on what we mean by synchronous. Do we mean "the WAL record
has made it to the disk on the slave system," or "the WAL record has
been applied on the slave system"?

With this type of replication there will always be a difference for some
small window, but most people would expect that window to be very small
for synchronous replication.

Regards,
Jeff Davis
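[Editorial note: the distinction Jeff raises can be enumerated as distinct guarantee levels, loosely analogous to DRBD's protocols A/B/C. The names below are illustrative only, not a PostgreSQL API.]

```python
# Levels of "synchronous" a WAL-shipping slave could guarantee, in
# increasing order of strength.

LEVELS = ("received",   # record in the slave's memory (lost on slave crash)
          "fsynced",    # record durably on the slave's disk
          "applied")    # record replayed; visible to slave-side queries

def durable(level):
    """True once a slave crash can no longer lose the record."""
    return LEVELS.index(level) >= LEVELS.index("fsynced")

def visible_on_slave(level):
    """True only once replay has happened -- the point being made above:
    'on disk' and 'applied' can differ by a (hopefully small) window."""
    return level == "applied"

assert durable("fsynced") and not visible_on_slave("fsynced")
assert visible_on_slave("applied")
```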



Re: Core team statement on replication in PostgreSQL

From
Josh Berkus
Date:
Josh,

> What does this give us that Solaris Cluster, RedHat Cluster, DRBD etc..
> doesn't give us? 

Actually, these solutions all have some serious drawbacks, not the least 
of which is difficult administration (I speak from bitter personal 
experience).  Also, most of them require installation at the filesystem 
level, something which often isn't available in a hosted environment.

--Josh Berkus


Re: Core team statement on replication in PostgreSQL

From
Andrew Sullivan
Date:
On Thu, May 29, 2008 at 12:11:21PM -0400, Brian Hurt wrote:
>
> Being able to do read-only queries makes this feature more valuable in more 
> situations, but I disagree that it's a deal-breaker.

Your managers are apparently more enlightened than some. ;-)

A

-- 
Andrew Sullivan
ajs@commandprompt.com
+1 503 667 4564 x104
http://www.commandprompt.com/


Re: Core team statement on replication in PostgreSQL

From
Andrew Sullivan
Date:
On Thu, May 29, 2008 at 07:20:37PM +0300, Marko Kreen wrote:
> 
> So you can do lossless failover.  Currently there is no good
> solution for this.

Indeed.  Getting lossless failover would be excellent.

I understand David's worry (having had those arguments more times than
I care to admit), but if people don't want to spend the money on the
extra machine that can't be queried, they can use another solution for
the time being.

The big missing piece is lossless failover.  People are currently
doing it with DRBD, various clustering things, &c., and those are
complicated to set up and maintain.  (As I've told more than one
person looking at it, there is a risk that you'll actually make your
installation complicated enough that you'll make it _less_ reliable.
I have some bitter personal experiences with this effect, and I know
some others on this list do as well.)

A

-- 
Andrew Sullivan
ajs@commandprompt.com
+1 503 667 4564 x104
http://www.commandprompt.com/


Re: Core team statement on replication in PostgreSQL

From
David Fetter
Date:
On Thu, May 29, 2008 at 02:13:26PM -0400, Andrew Sullivan wrote:
> On Thu, May 29, 2008 at 12:11:21PM -0400, Brian Hurt wrote:
> > Being able to do read-only queries makes this feature more
> > valuable in more situations, but I disagree that it's a
> > deal-breaker.
> 
> Your managers are apparently more enlightened than some. ;-)

Than most managers, at least in my experience, and since this feature
is (IMHO rightly) based around broad adoption, it's a good thing to
bring up.

Cheers,
David.
-- 
David Fetter <david@fetter.org> http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter      XMPP: david.fetter@gmail.com

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate


Re: Core team statement on replication in PostgreSQL

From
Teodor Sigaev
Date:
> in this case too.  So each slave just needs to report its own longest
> open tx as "open" to master.  Yes, it bloats master but no way around it.

Slaves should not report it every time or for every transaction. Vacuum on the
master will ask them before doing real work.

-- 
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
  WWW: http://www.sigaev.ru/
 


Re: Core team statement on replication in PostgreSQL

From
"Merlin Moncure"
Date:
On Thu, May 29, 2008 at 12:19 PM, Andrew Dunstan <andrew@dunslane.net> wrote:
> That's not what Tom's email said, AIUI. "Synchronous" replication surely
> means that the master and slave always have the same set of transactions
> applied. Streaming <> synchronous. But streaming log shipping will allow us
> to get closer to synchronicity in some situations, i.e. the window for
> missing transactions will be much smaller.
>
> Some of us were discussing this late on Friday night after PGcon. ISTM that
> we can have either 1) fairly hot failover slaves that are guaranteed to be
> almost up to date, or 2) slaves that can support read-only transactions but
> might get somewhat out of date if they run long transactions. The big
> problem is in having slaves which are both highly up to date and support
> arbitrary read-only transactions. Maybe in the first instance, at least, we
> need to make slaves choose which role they will play.

I personally would be thrilled to have slaves be query-able in any
fashion, even if 'wrong' under certain circumstances.  Any
asynchronous solution by definition gives the wrong answer on the
slave.  Read only slave is the #1 most anticipated feature in the
circles I run with.  It would literally transform how the database
world thinks about postgres overnight.  This, coupled with easier
standby setup (a pg_archive to mirror pg_restore) would be most
welcome!

merlin


Re: Core team statement on replication in PostgreSQL

From
Robert Hodges
Date:
Hi everyone,

First of all, I'm absolutely delighted that the PG community is thinking
seriously about replication.

Second, having a solid, easy-to-use database availability solution that works
more or less out of the box would be an enormous benefit to customers.
Availability is the single biggest problem for customers in my experience and
as other people have commented the alternatives are not nice.  It's an
excellent idea to build off an existing feature -- PITR is already pretty
useful and the proposed features are solid next steps.  The fact that it does
not solve all problems is not a drawback but means it's likely to get done in
a reasonable timeframe.

Third, you can't stop with just this feature.  (This is the BUT part of the
post.)  The use cases *not* covered by this feature are actually pretty large.
Here are a few that concern me:

1.) Partial replication.
2.) WAN replication.
3.) Bi-directional replication.  (Yes, this is evil but there are problems
where it is indispensable.)
4.) Upgrade support.  Aside from database upgrade (how would this ever really
work between versions?), it would not support zero-downtime app upgrades,
which depend on bi-directional replication tricks.
5.) Heterogeneous replication.
6.) Finally, performance scaling using scale-out over large numbers of
replicas.  I think it's possible to get tunnel vision on this -- it's not a
big requirement in the PG community because people don't use PG in the first
place when they want to do this.  They use MySQL, which has very good
replication for performance scaling, though it's rather weak for availability.

As a consequence, I don't see how you can get around doing some sort of
row-based replication like all the other databases.  Now that people are
starting to get religion on this issue I would strongly advocate a parallel
effort to put in a change-set extraction API that would allow construction of
comprehensive master/slave replication.  (Another approach would be to make it
possible for third party apps to read the logs and regenerate SQL.)  There are
existing models for how to do change set extraction; we have done it several
times at my company already.  There are also research projects like GORDA that
have looked fairly comprehensively at this problem.

My company would be quite happy to participate in or even sponsor such an API.
Between the proposed WAL-based approach and change-set-based replication it's
not hard to see PG becoming the open source database of choice for a very
large number of users.

Cheers, Robert

On 5/29/08 6:37 PM, "Tom Lane" <tgl@sss.pgh.pa.us> wrote:

> David Fetter <david@fetter.org> writes:
> > On Thu, May 29, 2008 at 08:46:22AM -0700, Joshua D. Drake wrote:
> >> The only question I have is... what does this give us that PITR
> >> doesn't give us?
>
> > It looks like a wrapper for PITR to me, so the gain would be ease of
> > use.
>
> A couple of points about that:
>
> * Yeah, ease of use is a huge concern here.  We're getting beat up
> because people have to go find a separate package (and figure out
> which one they want), install it, learn how to use it, etc.  It doesn't
> help that the most mature package is Slony which is, um, not very
> novice-friendly or low-admin-complexity.  I personally got religion
> on this about two months ago when Red Hat switched their bugzilla
> from Postgres to MySQL because the admins didn't want to deal with Slony
> any more.  People want simple.
>
> * The proposed approach is trying to get to "real" replication
> incrementally.  Getting rid of the loss window involved in file-by-file
> log shipping is step one, and I suspect that step two is going to be
> fixing performance issues in WAL replay to ensure that slaves can keep
> up.  After that we'd start thinking about how to let slaves run
> read-only queries.  But even without read-only queries, this will be
> a useful improvement for HA/backup scenarios.
>
>                         regards, tom lane

-- 
Robert Hodges, CTO, Continuent, Inc.
Email:  robert.hodges@continuent.com
Mobile:  +1-510-501-3728  Skype:  hodgesrm

Re: Core team statement on replication in PostgreSQL

From
"Merlin Moncure"
Date:
On Thu, May 29, 2008 at 3:05 PM, Robert Hodges
<robert.hodges@continuent.com> wrote:
> Third, you can't stop with just this feature.  (This is the BUT part of the
> post.)  The use cases not covered by this feature area actually pretty
> large.  Here are a few that concern me:
>
> 1.) Partial replication.
> 2.) WAN replication.
> 3.) Bi-directional replication.  (Yes, this is evil but there are problems
> where it is indispensable.)
> 4.) Upgrade support.  Aside from database upgrade (how would this ever
> really work between versions?), it would not support zero-downtime app
> upgrades, which depend on bi-directional replication tricks.
> 5.) Heterogeneous replication.
> 6.) Finally, performance scaling using scale-out over large numbers of
> replicas.  I think it's possible to get tunnel vision on this—it's not a big
> requirement in the PG community because people don't use PG in the first
> place when they want to do this.  They use MySQL, which has very good
> replication for performance scaling, though it's rather weak for
> availability.

These types of things are what Slony is for.  Slony is trigger-based.
This makes it more complex than log shipping style replication, but
provides lots of functionality.

wal shipping based replication is maybe the fastest possible
solution...you are already paying the overhead so it comes virtually
for free from the point of view of the master.

mysql replication is imo nearly worthless from a backup standpoint.

merlin


Re: Core team statement on replication in PostgreSQL

From
Brian Hurt
Date:
Andrew Sullivan wrote:
> On Thu, May 29, 2008 at 12:11:21PM -0400, Brian Hurt wrote:
>> Being able to do read-only queries makes this feature more valuable in more
>> situations, but I disagree that it's a deal-breaker.
>
> Your managers are apparently more enlightened than some. ;-)

No doubt.  But defining the minimum acceptable feature set by the demands
of the dumbest manager is a no-win proposition.

Brian

Re: Core team statement on replication in PostgreSQL

From
"Marko Kreen"
Date:
On 5/29/08, Teodor Sigaev <teodor@sigaev.ru> wrote:
> > in this case too.  So each slave just needs to report its own longest
> > open tx as "open" to master.  Yes, it bloats master but no way around it.
>
>  Slaves should not report it every time or every transaction. Vacuum on
> master will ask them before doing a real work.

That is a bit too slow if we want to do some page-level cleanup with
regular transactions / bgwriter.  But a once- or twice-per-minute update
should work fine.

-- 
marko
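[Editorial note: a toy sketch of the feedback scheme Teodor and Marko are discussing: each slave periodically reports the oldest transaction its queries still need, and vacuum on the master honors the minimum. Names are hypothetical; nothing here is actual PostgreSQL code.]

```python
# Periodic slave feedback for vacuum.  Each slave reports its oldest
# open transaction id (e.g. once or twice per minute, per Marko); the
# master's vacuum may only remove tuples older than every reported xid.

class SlaveFeedback:
    def __init__(self):
        self.reported = {}               # slave name -> oldest open xid

    def report(self, slave, oldest_xid):
        self.reported[slave] = oldest_xid

    def vacuum_horizon(self, master_oldest_xid):
        # Tuples at or above the horizon must be kept, both for the
        # master's own transactions and for all reporting slaves.
        return min([master_oldest_xid, *self.reported.values()])

fb = SlaveFeedback()
fb.report("slave1", 1000)
fb.report("slave2", 950)       # a long-running read-only query on slave2
assert fb.vacuum_horizon(1200) == 950   # slave2 holds back the horizon
```

The cost, as noted upthread, is bloat on the master while any slave holds an old snapshot.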


Re: Core team statement on replication in PostgreSQL

From
Andrew Sullivan
Date:
On Thu, May 29, 2008 at 12:05:18PM -0700, Robert Hodges wrote:
> people are starting to get religion on this issue I would strongly
> advocate a parallel effort to put in a change-set extraction API
> that would allow construction of comprehensive master/slave
> replication.  

You know, I gave a talk in Ottawa just last week about how the last
effort to develop a comprehensive API for replication failed.  I had
some ideas about why, the main one of which is something like this:
"Big features with a roadmap have not historically worked, so unless
we're willing to change the way we work, we won't get that."

I don't think an API is what's needed.  What's needed is clear proposals
for particular features that can be delivered in small pieces.  That's what
the current proposal offers.  I think any kind of row-based approach
such as what you're proposing would need that kind of proposal too.

That isn't to say that I think an API is impossible or undesirable.
It is to say that the last few times we tried, it went nowhere; and
that I don't think the circumstances have changed.

A
-- 
Andrew Sullivan
ajs@commandprompt.com
+1 503 667 4564 x104
http://www.commandprompt.com/


Re: Core team statement on replication in PostgreSQL

From
"Marko Kreen"
Date:
On 5/29/08, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>  * The proposed approach is trying to get to "real" replication
>  incrementally.  Getting rid of the loss window involved in file-by-file
>  log shipping is step one, and I suspect that step two is going to be
>  fixing performance issues in WAL replay to ensure that slaves can keep
>  up.  After that we'd start thinking about how to let slaves run
>  read-only queries.  But even without read-only queries, this will be
>  a useful improvement for HA/backup scenarios.

I agree with this plan, but I think it extends also to read-only
queries - we don't need to have the perfect, no-overhead solution
as the first step; instead let's have a simple, working solution
with some overhead, then improve it.

And for the first-step solution, I think letting VACUUM keep tuples
around based on slave queries is preferable to letting slaves lag.
This is useful to more situations.

-- 
marko


Re: Core team statement on replication in PostgreSQL

From
Peter Eisentraut
Date:
David Fetter wrote:
> Either one of these would be great, but something that involves
> machines that stay useless most of the time is just not going to work.

Lots of people do use warm standby already anyway, just not based on 
mechanisms built into PostgreSQL.  So defining away this need is completely 
unrealistic based on my experience.  Even if there were a read-only slave, 
lots of applications couldn't make use of it.

Anyway, a common approach to making better use of the hardware is to put some 
other service on the otherwise-standby machine, which in turn uses your 
master database server machine as its failover target.  Unless you run *only* 
a database, there would usually be some candidate that you could set up that 
way.

Another common approach is to realize that for some the costs of a downtime 
risk are higher than buying some extra hardware.

I think the consensus in the core team was that having synchronous log 
shipping in 8.4 would already be a worthwhile feature by itself.


Re: Core team statement on replication in PostgreSQL

From
Peter Eisentraut
Date:
Jeff Davis wrote:
> It depends on what we mean by synchronous. Do we mean "the WAL record
> has made it to the disk on the slave system," or "the WAL record has
> been applied on the slave system"?

DRBD, which is a common warm standby solution for PostgreSQL at the moment, 
provides various levels of synchronicity.  I imagine we could also define, as 
need arises, various levels, some of which may be the ones you listed.


Re: Core team statement on replication in PostgreSQL

From
Peter Eisentraut
Date:
Merlin Moncure wrote:
> Read only slave is the #1 most anticipated feature in the
> circles I run with.

Do these circles not know about slony and londiste?


Re: Core team statement on replication in PostgreSQL

From
Peter Eisentraut
Date:
Joshua D. Drake wrote:
> What does this give us that Solaris Cluster, RedHat Cluster, DRBD etc..
> doesn't give us?

I personally think that DRBD is a fine solution.  But it only runs on Linux.  
And Solaris Cluster isn't the same as DRBD.


Re: Core team statement on replication in PostgreSQL

From
Tom Lane
Date:
Andrew Sullivan <ajs@commandprompt.com> writes:
> On Thu, May 29, 2008 at 12:05:18PM -0700, Robert Hodges wrote:
>> people are starting to get religion on this issue I would strongly
>> advocate a parallel effort to put in a change-set extraction API
>> that would allow construction of comprehensive master/slave
>> replication.  

> You know, I gave a talk in Ottawa just last week about how the last
> effort to develop a comprehensive API for replication failed.

Indeed, core's change of heart on this issue was largely driven by
Andrew's talk and subsequent discussion.  We had more or less been
waiting for the various external replication projects to tell us
what they wanted in this line, and it was only the realization that
no such thing was likely to happen that forced us to think seriously
about what could be done within the core project.

As I said originally, we have no expectation that the proposed features
will displace the existing replication projects for "high end"
replication problems ... and I'd characterize all of Robert's concerns
as "high end" problems.  We are happy to let those be solved outside
the core project.

About the only thing that would make me want to consider row-based
replication in core would be if we determine that read-only slave
queries are impractical atop a WAL-log-shipping implementation.
Which could happen; in fact I think that's the main risk of the
proposed development plan.  But I also think that the near-term
steps of the plan are worth doing anyway, for various other reasons,
and so we won't be out too much effort if the plan fails.
        regards, tom lane


Re: Core team statement on replication in PostgreSQL

From
Peter Eisentraut
Date:
Andrew Sullivan wrote:
> The big missing piece is lossless failover.  People are currently
> doing it with DRBD, various clustering things, &c., and those are
> complicated to set up and maintain.

Well, we'll see at the end of this (we hope) how a setup procedure of DRBD vs. 
PG warm standby works from a standing start.  I wouldn't be overly 
optimistic. ;-)


Re: Core team statement on replication in PostgreSQL

From
Peter Eisentraut
Date:
Mathias Brossard wrote:
>  From what I gather from those slides it seems to me that the NTT solution
> is synchronous not asynchronous. In my opinion it's even better, but I do
> understand that others might prefer asynchronous. I'm going to speculate,
> but I would think it should be possible (without a substantial rewrite) to
> support both modes (or even some intermediate modes, like DRBD on Linux).

Well, we already have asynchronous.  The whole point is adding synchronous.


Re: Core team statement on replication in PostgreSQL

From
Peter Eisentraut
Date:
Tom Lane wrote:
> We believe that the most appropriate base technology for this is
> probably real-time WAL log shipping, as was demoed by NTT OSS at PGCon.

Now how do we get our hands on their code?


Re: Core team statement on replication in PostgreSQL

From
"Merlin Moncure"
Date:
On Thu, May 29, 2008 at 3:59 PM, Peter Eisentraut <peter_e@gmx.net> wrote:
> Merlin Moncure wrote:
>> Read only slave is the #1 most anticipated feature in the
>> circles I run with.
>
> Do these circles not know about slony and londiste?

Sure.

For various reasons mentioned elsewhere on this thread, a PITR based
replication solution would have a lot of appeal, assuming it could be
made to work.  The main thing is that it is completely non-invasive to
the master database and all the work is handled on the standby.  It
also operates on a trivial mechanism to transfer the files.

merlin


Re: Core team statement on replication in PostgreSQL

From
Josh Berkus
Date:
Robert,

> 1.) Partial replication.
> 2.) WAN replication.
> 3.) Bi-directional replication.  (Yes, this is evil but there are 
> problems where it is indispensable.)
> 4.) Upgrade support.  Aside from database upgrade (how would this ever 
> really work between versions?), it would not support zero-downtime app 
> upgrades, which depend on bi-directional replication tricks.
> 5.) Heterogeneous replication.
> 6.) Finally, performance scaling using scale-out over large numbers of 
> replicas.  I think it’s possible to get tunnel vision on this—it’s not a 
> big requirement in the PG community because people don’t use PG in the 
> first place when they want to do this.  They use MySQL, which has very 
> good replication for performance scaling, though it’s rather weak for 
> availability.  

Let's not try to boil the ocean, hey?

From my perspective, the above use cases are what complex tools like 
Slony, Bucardo, Skytools, Continuent, pgCluster, pgPool2, etc., etc. are 
for.  Now, if you're saying that you want to develop row-based 
replication so that Continuent will work better, I'm all for it; but 
saying that we *shouldn't* implement the current spec which satisfies 
large numbers of users because it doesn't support *all* users is a 
recipe for self-defeat.  We can't satisfy all users with one 
implementation, and we shouldn't try.

I think, for that matter, that work on the common replication hooks 
supporting the external replication packages should continue.  We need 
these for precisely the reasons you state.  But ... single-master, 
single-slave, synch or asynch, whole-installation local network 
replication is a case which covers a *lot* of users' needs ... I'd argue 
the numerical majority.

--Josh



Re: Core team statement on replication in PostgreSQL

From
David Fetter
Date:
On Thu, May 29, 2008 at 09:54:03PM +0200, Peter Eisentraut wrote:
> David Fetter wrote:
> > Either one of these would be great, but something that involves
> > machines that stay useless most of the time is just not going to
> > work.
> 
> Lots of people do use warm standby already anyway, just not based on
> mechanisms built into PostgreSQL.  So defining away this need is
> completely unrealistic based on my experience.  Even if there were a
> read-only slave, lots of applications couldn't make use of it.
> 
> Anyway, a common approach to making better use of the hardware is to
> put some other service on the otherwise-standby machine, which in
> turn uses your master database server machine as its failover
> target.  Unless you run *only* a database, there would usually be
> some candidate that you could set up that way.
> 
> Another common approach is to realize that for some the costs of a
> downtime risk are higher than buying some extra hardware.

> I think the consensus in the core team was that having synchronous
> log shipping in 8.4 would already be a worthwhile feature by itself.

If that was in fact the consensus of the core team, and what I've been
seeing from several core members in this thread makes that idea
unclear, it's out of step with the stated goal of the feature.  Having
some kind of half-way, doesn't-actually-quite-work-out-of-the-box
"replication" will make things worse and not better.

Cheers,
David.
-- 
David Fetter <david@fetter.org> http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter      XMPP: david.fetter@gmail.com

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate


Re: Core team statement on replication in PostgreSQL

From
Tom Lane
Date:
David Fetter <david@fetter.org> writes:
> On Thu, May 29, 2008 at 09:54:03PM +0200, Peter Eisentraut wrote:
>> I think the consensus in the core team was that having synchronous
>> log shipping in 8.4 would already be a worthwhile feature by itself.

> If that was in fact the consensus of the core team, and what I've been
> seeing from several core members in this thread makes that idea
> unclear, it's out of step with the stated goal of the feature.  Having
> some kind of half-way, doesn't-actually-quite-work-out-of-the-box
> "replication" will make things worse and not better.

What is your justification for denigrating this plan with that?
Or are you merely complaining because we know we won't be all the
way there in 8.4?
        regards, tom lane


Re: Core team statement on replication in PostgreSQL

From
David Fetter
Date:
On Thu, May 29, 2008 at 04:44:19PM -0400, Tom Lane wrote:
> David Fetter <david@fetter.org> writes:
> > On Thu, May 29, 2008 at 09:54:03PM +0200, Peter Eisentraut wrote:
> >> I think the consensus in the core team was that having
> >> synchronous log shipping in 8.4 would already be a worthwhile
> >> feature by itself.

I'm sorry if I was unclear on this.  I don't disagree that it would be
a worthy feature.  I do disagree with the idea, given this goal, that
we can expect to ship 8.4 without a built-in replication and have that
be good enough by itself.

> > If that was in fact the consensus of the core team, and what I've
> > been seeing from several core members in this thread makes that
> > idea unclear, it's out of step with the stated goal of the
> > feature.  Having some kind of half-way,
> > doesn't-actually-quite-work-out-of-the-box "replication" will make
> > things worse and not better.
> 
> What is your justification for denigrating this plan with that?  Or
> are you merely complaining because we know we won't be all the way
> there in 8.4?

Again, just my humble opinion, but given the stated goal, which I
agree with, I'd say it's worth holding up 8.4 until some kind of
out-of-the-box replication advances that goal, where Yet Another
Toolkit Suitable For People Who Are Already Database Kernel Hackers To
Build Replication Systems On Top Of does not.

Cheers,
David.


Re: Core team statement on replication in PostgreSQL

From
Bruce Momjian
Date:
David Fetter wrote:
> > What is your justification for denigrating this plan with that?  Or
> > are you merely complaining because we know we won't be all the way
> > there in 8.4?
> 
> Again, just my humble opinion, but given the stated goal, which I
> agree with, I'd say it's worth holding up 8.4 until some kind of
> out-of-the-box replication advances that goal, where Yet Another
> Toolkit Suitable For People Who Are Already Database Kernel Hackers To
> Build Replication Systems On Top Of does not.

I am confused why we would hold 8.4 when no one said we should hold 8.3
because we didn't have built-in replication.

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +


Re: Core team statement on replication in PostgreSQL

From
Josh Berkus
Date:
David,

>> I think the consensus in the core team was that having synchronous
>> log shipping in 8.4 would already be a worthwhile feature by itself.
> 
> If that was in fact the consensus of the core team, 

It is.

> and what I've been
> seeing from several core members in this thread makes that idea
> unclear, it's out of step with the stated goal of the feature.  Having
> some kind of half-way, doesn't-actually-quite-work-out-of-the-box
> "replication" will make things worse and not better.

So, you've got a better implementation up your sleeve?

I really don't get where you're coming from on this.  Frankly, your 
e-mails seem gauged to be merely disruptive without any intent of 
constructive input.  If you're opposed to working on replication in the 
core, then just say so.  If you think that there's an easier way to 
develop M-S replication in the core than using WAL, then please present 
your solution.

--Josh



Re: Core team statement on replication in PostgreSQL

From
David Fetter
Date:
On Thu, May 29, 2008 at 04:54:04PM -0400, Bruce Momjian wrote:
> David Fetter wrote:
> > > What is your justification for denigrating this plan with that?
> > > Or are you merely complaining because we know we won't be all
> > > the way there in 8.4?
> > 
> > Again, just my humble opinion, but given the stated goal, which I
> > agree with, I'd say it's worth holding up 8.4 until some kind of
> > out-of-the-box replication advances that goal, where Yet Another
> > Toolkit Suitable For People Who Are Already Database Kernel
> > Hackers To Build Replication Systems On Top Of does not.
> 
> I am confused why we would hold 8.4 when no one said we should hold
> 8.3 because we didn't have built-in replication.

If people on core had come to the idea that we needed to build in
replication *before* 8.3 came out, they certainly didn't announce it.

Now is a great time to mention this because it gives everybody time to:

1.  Come to a consensus on what the out-of-the-box replication should
be, and 

2.  Build, test and debug whatever the consensus out-of-the-box
replication turns out to be.

Cheers,
David.


Re: Core team statement on replication in PostgreSQL

From
Tom Lane
Date:
Bruce Momjian <bruce@momjian.us> writes:
> David Fetter wrote:
>> Again, just my humble opinion, but given the stated goal, which I
>> agree with, I'd say it's worth holding up 8.4 until some kind of
>> out-of-the-box replication advances that goal, where Yet Another
>> Toolkit Suitable For People Who Are Already Database Kernel Hackers To
>> Build Replication Systems On Top Of does not.

> I am confused why we would hold 8.4 when no one said we should hold 8.3
> because we didn't have built-in replication.

Possibly I am misinterpreting what David said, but I think his opinion
is that synchronous log shipping in itself is useless, and that we
should hold 8.4 until we have that *and* slave-side query execution.
Suffice it to say that I disagree with the first position, and as for
the second, I will never again be willing to hold a release for one
single feature, no matter what it is.  We've learned the folly of
managing the project that way several times over.
        regards, tom lane


Re: Core team statement on replication in PostgreSQL

From
David Fetter
Date:
On Thu, May 29, 2008 at 01:55:42PM -0700, Josh Berkus wrote:
> David,
>
>>> I think the consensus in the core team was that having synchronous
>>> log shipping in 8.4 would already be a worthwhile feature by itself.
>>
>> If that was in fact the consensus of the core team, 
>
> It is.
>
>> and what I've been seeing from several core members in this thread
>> makes that idea unclear, it's out of step with the stated goal of
>> the feature.  Having some kind of half-way,
>> doesn't-actually-quite-work-out-of-the-box "replication" will make
>> things worse and not better.
>
> So, you've got a better implementation up your sleeve?

Nope.

> I really don't get where you're coming from on this.  Frankly, your
> e-mails seem gauged to be merely disruptive without any intent of
> constructive input.

I'm sorry they've come across that way.  That was not my intent.

> If you're opposed to working on replication in the core, then just
> say so.  If you think that there's an easier way to develop M-S
> replication in the core than using WAL, then please present your
> solution.

I think having master-slave replication in the core using WAL is a
*great* thing to do, doable, a good path to go on, etc., and I think
it's worth holding up 8.4 until we have at least one actual
out-of-the-box version of same.

People have hinted that we might be able to get both a synchronous one
and an asynchronous one based on WAL, which would be even better. :)

Cheers,
David.


Re: Core team statement on replication in PostgreSQL

From
Josh Berkus
Date:
David,

> I think having master-slave replication in the core using WAL is a
> *great* thing to do, doable, a good path to go on, etc., and I think
> it's worth holding up 8.4 until we have at least one actual
> out-of-the-box version of same.

Ah, ok.  Well, I can tell you that the core team is also united on the 
value of time-based as opposed to feature-based release cycles.  *if* we 
can develop this in time for 8.4, everybody would be overjoyed, but I 
also think we should be realistic.

> People have hinted that we might be able to get both a synchronous one
> and an asynchronous one based on WAL, which would be even better. :)

That's the idea, yes. No reason to dump asynch WAL copying when it's 
already working.

--Josh


Re: Core team statement on replication in PostgreSQL

From
Andrew Sullivan
Date:
On Thu, May 29, 2008 at 01:39:29PM -0700, David Fetter wrote:

> > I think the consensus in the core team was that having synchronous
> > log shipping in 8.4 would already be a worthwhile feature by itself.
> 
> If that was in fact the consensus of the core team, and what I've been
> seeing from several core members in this thread makes that idea
> unclear, it's out of step with the stated goal of the feature.  Having
> some kind of half-way, doesn't-actually-quite-work-out-of-the-box
> "replication" will make things worse and not better.

I don't agree with that.  It seems to me that the stated goal is
fourfold:

1.  Get something that is simple (out of the box).

2.  Get something that is generally useful, even if not universally useful.

3.  Get something that is implementable in stages, with each stage
representing an increase in functionality.

4.  Aim for 8.4, but be realistic about chances.

It seems to me that sync. replication to a write-only standby actually
meets those goals.  I can think of more than one case where I have
spent lots of time, money, or both attempting to achieve this goal
with various add-on hardware and software systems that mostly worked
except when they didn't.  These additions were complicated to set up,
painful to use, and documented by sadists who couldn't write.  

I agree that there are plenty of people unwilling to use a system that
requires additional hardware that mostly can't be used for other
things.  But those people already have alternatives (maybe even more,
and simple ones, soon).  The synchronous-needing crowd currently have
nothing.  The proposed feature would be a huge improvement.

A

-- 
Andrew Sullivan
ajs@commandprompt.com
+1 503 667 4564 x104
http://www.commandprompt.com/


Re: Core team statement on replication in PostgreSQL

From
Mathias Brossard
Date:
Peter Eisentraut wrote:
> Mathias Brossard wrote:
>>  From what I gather from those slides it seems to me that the NTT solution
>> is synchronous not asynchronous. In my opinion it's even better, but I do
>> understand that others might prefer asynchronous. I'm going to speculate,
>> but I would think it should be possible (without a substantial rewrite) to
>> support both modes (or even some intermediate modes, like DRBD on Linux).
> 
> Well, we already have asynchronous.  The whole point is adding synchronous.

I don't think the current log shipping solution matches, as Tom puts it, 
the « need for simple, built-in replication » (from a user perspective, 
I wouldn't call it "simple").

I pointed out that the NTT solution is synchronous because Tom said in 
the first part of his email that:
> In practice, simple asynchronous single-master-multiple-slave
> replication covers a respectable fraction of use cases, so we have
> concluded that we should allow such a feature to be included in the
> core project.

... and yet "the most appropriate base technology for this" is 
synchronous; what I should perhaps also have pointed out in my previous 
mail is that it doesn't support multiple slaves.

Also, as others have pointed out, there are different interpretations of 
"synchronous" depending on whether the WAL data has reached the other end 
of the network connection, a safe disk checkpoint, or the slave DB itself.
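A toy sketch of those three interpretations (illustrative Python only, 
not PostgreSQL code; all names are invented):

```python
from enum import Enum

class AckLevel(Enum):
    """Points at which a "synchronous" slave could acknowledge WAL."""
    RECEIVED = 1  # WAL bytes reached the slave's network buffer
    FSYNCED = 2   # WAL durably written to the slave's disk
    APPLIED = 3   # WAL replayed; changes visible to slave queries

def survives_master_loss(ack):
    # A commit is durable against loss of the master only once the
    # slave has the WAL on stable storage (or has already applied it);
    # a bare network acknowledgment is not enough.
    return ack in (AckLevel.FSYNCED, AckLevel.APPLIED)
```

Which of these a given solution guarantees is exactly what gets blurred 
when people just say "synchronous".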

Sincerely,
-- 
Mathias Brossard


Re: Core team statement on replication in PostgreSQL

From
Robert Treat
Date:
On Thursday 29 May 2008 12:13:20 Bruce Momjian wrote:
> David Fetter wrote:
> > On Thu, May 29, 2008 at 11:58:31AM -0400, Bruce Momjian wrote:
> > > Josh Berkus wrote:
> > > > Publishing the XIDs back to the master is one possibility.  We
> > > > also looked at using "spillover segments" for vacuumed rows, but
> > > > that seemed even less viable.
> > > >
> > > > I'm also thinking, for *async replication*, that we could simply
> > > > halt replication on the slave whenever a transaction passes minxid
> > > > on the master.  However, the main focus will be on synchronous
> > > > hot standby.
> > >
> > > Another idea I discussed with Tom is having the slave _delay_
> > > applying WAL files until all slave snapshots are ready.
> >
> > Either one of these would be great, but something that involves
> > machines that stay useless most of the time is just not going to work.
>
> Right, the ultimate target is to have the slave be read-only, but we
> need to get the streaming of WAL logs done first.
>

I would have thought the read only piece would have been more important than 
the synchronous piece. In my experience readable slaves is the big selling 
point in both Oracle and MySQL's implementations, and people are not nearly 
as concerned if there is a small asynchronous window.  

It would also be easier to implement on some level; we have already solved the 
asynchronous WAL shipping problem, so we would just need to solve the 
read-only bits. For synchronous hot standby, you have to solve both the 
synchronous shipping and the read-only bits. Seems like more work with less 
upside than read-only slaves vs. the PITR warm standby we have now. 

Interesting that core views this differently. 

-- 
Robert Treat
Build A Brighter LAMP :: Linux Apache {middleware} PostgreSQL


Re: Core team statement on replication in PostgreSQL

From
"Joshua D. Drake"
Date:

On Thu, 2008-05-29 at 17:42 -0400, Robert Treat wrote:

> 
> I would have thought the read only piece would have been more important than 
> the synchronous piece. In my experience readable slaves is the big selling 
> point in both Oracle and MySQL's implementations, and people are not nearly 
> as concerned if there is a small asynchronous window.  

The read only piece is the more important piece from a market
perspective. 

Joshua D. Drake




Re: Core team statement on replication in PostgreSQL

From
Tom Lane
Date:
Robert Treat <xzilla@users.sourceforge.net> writes:
> I would have thought the read only piece would have been more important than 
> the synchronous piece. In my experience readable slaves is the big selling 
> point in both Oracle and MySQL's implementations, and people are not nearly 
> as concerned if there is a small asynchronous window.  

Actually, it's the "easy to use" piece that's near the front of my own
priority list.  In any case, the expectation that we can get the log
shipping fixed sooner than we can get read-only slaves has nothing to do
with the relative desirability of the pieces; it's a technical judgment
based on what we can see of the problems' difficulty.
        regards, tom lane


Re: Core team statement on replication in PostgreSQL

From
Dimitri Fontaine
Date:

Hi,

I'd first want to applaud core decision: having bare PostgreSQL
propose a reliable and simple to set-up synchronous replication
solution is an excellent perspective! ...

On 29 May 08 at 23:42, Robert Treat wrote:
> I would have thought the read only piece would have been more
> important than the synchronous piece. In my experience readable
> slaves is the big selling point in both Oracle and MySQL's
> implementations, and people are not nearly as concerned if there
> is a small asynchronous window.

... Even more so when you're confronted with this exact problem.
A fellow PG user ended up having both the WAL and the data replicated
by DRBD (protocol C) and some heartbeat scripts to do the automatic
failover. This wasn't easy to set up, and to some extent we're still
concerned about the reliability part of it. As for the "easy to use"
part: we didn't get it.

While at it, would it be possible for the "simple" part of the core
team statement to include automatic failover?
That would mean having the current master, when it is about to stop
on a fatal error, tell the slave to warm up. Of course, in case of a
more severe crash the slave would have to be started by other means,
but covering the fatal-error path and having the master restart as a
slave would only add to the reliability... wouldn't it?

> It would also be easier to implement on some level; we have already
> solved the asynchronous WAL shipping problem, so we would just need
> to solve the read-only bits. For synchronous hot standby, you have
> to solve both the synchronous shipping and the read-only bits.
> Seems like more work with less upside than read-only slaves vs. the
> PITR warm standby we have now.
>
> Interesting that core views this differently.

core seems to think a read-only slave is more complex than a synchronous
slave, because of, for example, long read-only transactions on the slave
versus vacuums on the master.
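A toy model of that conflict (illustrative Python only, not PostgreSQL
internals; the names and the xid scheme are invented):

```python
def can_apply_cleanup(cutoff_xid, open_snapshot_xmins):
    """A slave may replay a vacuum cleanup record, which removes tuples
    deleted by transactions older than cutoff_xid, only if no running
    slave query might still need to see those tuples.  Otherwise the
    replay must be delayed (or the queries cancelled)."""
    return all(xmin >= cutoff_xid for xmin in open_snapshot_xmins)
```

A long-running slave query (small xmin) therefore either holds up WAL
apply or has to be cancelled, which is the hard part core alluded to.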

Regards,
-- 
dim


Re: Core team statement on replication in PostgreSQL

From
Tom Lane
Date:
Dimitri Fontaine <dim@hi-media.com> writes:
> While at it, would it be possible for the "simple" part of the core  
> team statement to include automatic failover?

No, I think it would be a useless expenditure of energy.  Failover
includes a lot of things that are not within our purview: switching
IP addresses to point to the new server, some kind of STONITH solution
to keep the original master from coming back to life, etc.  Moreover
there are already projects/products concerned with those issues.

It might be useful to document where to find solutions to that problem,
but we can't take it on as part of core Postgres.
        regards, tom lane


Re: Core team statement on replication in PostgreSQL

From
Andrew Dunstan
Date:

Joshua D. Drake wrote:
> On Thu, 2008-05-29 at 17:42 -0400, Robert Treat wrote:
>
>   
>> I would have thought the read only piece would have been more important than 
>> the synchronous piece. In my experience readable slaves is the big selling 
>> point in both Oracle and MySQL's implementations, and people are not nearly 
>> as concerned if there is a small asynchronous window.  
>>     
>
> The read only piece is the more important piece from a market
> perspective. 
>
>
>   

You must be gauging a different market from the one I'm in. I have just 
come back from a meeting with a (quite technically savvy) customer who 
was quite excited by the news and saw the possibility of read-only 
slaves as a nice to have extra rather than a must-have-or-it's-not-worth 
anything feature.

I'm really quite astounded and rather saddened by the waves of 
negativity I have seen today.

cheers

andrew




Re: Core team statement on replication in PostgreSQL

From
"Joshua D. Drake"
Date:

On Thu, 2008-05-29 at 18:39 -0400, Andrew Dunstan wrote:
> 
> Joshua D. Drake wrote:
> > On Thu, 2008-05-29 at 17:42 -0400, Robert Treat wrote:
> > 
> 
> You must be gauging a different market from the one I'm in. I have just 
> come back from a meeting with a (quite technically savvy) customer who 

One customer does not make a hundred. I am not saying that the shipping
isn't valid, just that those that I talk to are more interested in the
read only slave. Consider that we have any number of ways to solve the
problem we are considering implementing now. DRBD being just one of
them. Especially considering we aren't talking about failover, etc.

> was quite excited by the news and saw the possibility of read-only 
> slaves as a nice to have extra rather than a must-have-or-it's-not-worth 
> anything feature.
> 
> I'm really quite astounded and rather saddened by the waves of 
> negativity I have seen today.

I have seen much more positive than negative. I think most are just
trying to figure out exactly what we are talking about.

I for example am not really against the feature at all. I am suspect of
the idea of only shipping 2 components of a 3 component feature though.

Sincerely,

Joshua D. Drake





Re: Core team statement on replication in PostgreSQL

From
Tom Lane
Date:
"Joshua D. Drake" <jd@commandprompt.com> writes:
> One customer does not make a hundred. I am not saying that the shipping
> isn't valid, just that those that I talk to are more interested in the
> read only slave. Consider that we have any number of ways to solve the
> problem we are considering implementing now. DRBD being just one of
> them.

In principle, sure, but there's a big gap between theory and practice
here.  What you are arguing for is the "give the users a toolkit and
let them figure it out" approach that David was lambasting upthread.
People want the bits to go from point A to point B; they don't want
to have to research, design, test, and administer their own solution
for moving the bits.  Furthermore, as several people have pointed out
from their own bitter experiences, getting a reliable and easy-to-use
solution is not as easy as it sounds.  I think we have nontrivial
work in front of us to build a simple, reliable, community-tested
log shipping solution; and it's not very sexy work either.  But it
needs to get done, and it really needs to get done first.  There's
no point in having read-only slave queries if you don't have a
trustworthy method of getting the data to them.
        regards, tom lane


Re: Core team statement on replication in PostgreSQL

From
"Joshua D. Drake"
Date:

On Thu, 2008-05-29 at 19:02 -0400, Tom Lane wrote:
> "Joshua D. Drake" <jd@commandprompt.com> writes:
> > One customer does not make a hundred. I am not saying that the shipping
> > isn't valid, just that those that I talk to are more interested in the
> > read only slave. Consider that we have any number of ways to solve the
> > problem we are considering implementing now. DRBD being just one of
> > them.
> 
> In principle, sure, but there's a big gap between theory and practice
> here.  What you are arguing for is the "give the users a toolkit and
> let them figure it out" approach that David was lambasting upthread.

I think maybe my actual argument isn't coming through. What I am arguing
for is not shipping XY without Z. That is all. (and no, I don't think we
should hold up 8.4).

In short. Let's do the feature but don't ship half of it. Ship the whole
thing. If that means it doesn't get into 8.4, then it doesn't get into
8.4.

> People want the bits to go from point A to point B; they don't want
> to have to research, design, test, and administer their own solution
> for moving the bits.

You don't have to convince me. I agree with you.

>   I think we have nontrivial
> work in front of us to build a simple, reliable, community-tested
> log shipping solution; and it's not very sexy work either.  But it
> needs to get done, and it really needs to get done first.  There's
> no point in having read-only slave queries if you don't have a
> trustworthy method of getting the data to them.

O.k. I was with you until here. Log shipping ala pg_standby works fine
now sans read-only slave. No, it isn't out of the box which I can see an
argument for but it is certainly trustworthy. Or do you mean the
synchronous part?

Sincerely,

Joshua D. Drake




Re: Core team statement on replication in PostgreSQL

From
Chris Browne
Date:
tgl@sss.pgh.pa.us (Tom Lane) writes:
> As I said originally, we have no expectation that the proposed features
> will displace the existing replication projects for "high end"
> replication problems ... and I'd characterize all of Robert's concerns
> as "high end" problems.  We are happy to let those be solved outside
> the core project.

I expect that one of the effects of having an "improved PITR" as a
'core' replication solution will be that Slony-I will become less
popular amongst "naive users," and that is *not* a bad thing.  I am
quite sure that there are a lot of cases today where users would be
better served by what falls out of this.

It will mean that those that are left will, despite being fewer, be
more cognizant that they are getting what they require.  

We should see somewhat fewer "expectation gaps," and that should allow
more people to be more satisfied with what they are getting.  
-- 
let name="cbbrowne" and tld="cbbrowne.com" in name ^ "@" ^ tld;;
http://linuxdatabases.info/info/linux.html
Lisp stoppped itself
FEP Command:


Re: Core team statement on replication in PostgreSQL

From
Tom Lane
Date:
"Joshua D. Drake" <jd@commandprompt.com> writes:
> I think maybe my actual argument isn't coming through. What I am arguing
> for is not shipping XY without Z. That is all. (and no, I don't think we
> should hold up 8.4).

So we should keep all the work out of the tree until every part is done?
No thanks; especially not when there is a perfectly respectable use-case
for parts X and Y alone (whether it suits *your* uses or not).

>> There's no point in having read-only slave queries if you don't have a
>> trustworthy method of getting the data to them.

> O.k. I was with you until here. Log shipping ala pg_standby works fine
> now sans read-only slave. No, it isn't out of the box which I can see an
> argument for but it is certainly trustworthy. Or do you mean the
> synchronous part?

How much testing has pg_standby really gotten?  Some, sure, but it's a
contrib module that wasn't even there before 8.3.  Even ignoring the lag
issue, I wouldn't trust it a whole lot if I were a DBA responsible for
valuable data.  As much as some folk would like to think that contrib
is mainstream, it's not really in the same league as far as testing
coverage goes.
        regards, tom lane


Re: Core team statement on replication in PostgreSQL

From
"Merlin Moncure"
Date:
On Thu, May 29, 2008 at 7:12 PM, Joshua D. Drake <jd@commandprompt.com> wrote:
> On Thu, 2008-05-29 at 19:02 -0400, Tom Lane wrote:
>>   I think we have nontrivial
>> work in front of us to build a simple, reliable, community-tested
>> log shipping solution; and it's not very sexy work either.  But it
>> needs to get done, and it really needs to get done first.  There's
>> no point in having read-only slave queries if you don't have a
>> trustworthy method of getting the data to them.
>
> O.k. I was with you until here. Log shipping ala pg_standby works fine
> now sans read-only slave. No, it isn't out of the box which I can see an
> argument for but it is certainly trustworthy. Or do you mean the
> synchronous part?

I disagree...setting up pg_standby is more complex than it really has
to be.   There are several examples in the archives of people getting
their standby solutions busted with partial wal files, etc.  I helped
beta test pg_standby and there are a few 'gotchas' in getting it set
up properly.

pg_standby is not the problem (although there are some odd things
about it), it's getting files from point a to point b.  It would be
nice to have 'pg_archive' which mirrors pg_standby and handles the
work on the client side for example.

While some of us can work magic with rsync, etc., it would be nice to
get things running with a few .conf settings and no external
dependencies.
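For illustration, the sort of minimal 8.3-era warm-standby setup being
discussed looks roughly like this (a sketch only; the hostnames and
paths are made up):

```
# On the master (postgresql.conf): ship each completed WAL segment out.
archive_mode = on
archive_command = 'rsync -a %p standby:/var/lib/pgsql/archive/%f'

# On the standby (recovery.conf): pg_standby waits for each segment to
# appear in the archive directory and feeds it to recovery, keeping the
# server in continuous standby.
restore_command = 'pg_standby /var/lib/pgsql/archive %f %p %r'
```

Two conf settings plus an external file-transfer tool; the external
dependency is exactly the piece being complained about here.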

merlin


Re: Core team statement on replication in PostgreSQL

From
Greg Smith
Date:
On Thu, 29 May 2008, Tom Lane wrote:

> There's no point in having read-only slave queries if you don't have a 
> trustworthy method of getting the data to them.

This is a key statement that highlights the difference in how you're 
thinking about this compared to some other people here.  As far as some 
are concerned, the already working log shipping *is* a trustworthy method 
of getting data to the read-only slaves.  There are plenty of applications 
(web oriented ones in particular) where if you could direct read-only 
queries against a slave, the resulting combination would be a giant 
improvement over the status quo even if that slave was as much as 
archive_timeout behind the master.  That quantity of lag is perfectly fine 
for a lot of the same apps that have read scalability issues.

If you're someone who falls into that camp, the idea of putting the sync 
replication job before the read-only slave one seems really backwards.

I fully accept that it may be the case that it doesn't make technical 
sense to tackle them in any order besides sync->read-only slaves because 
of dependencies in the implementation between the two.  If that's the 
case, it would be nice to explicitly spell out what that was to deflect 
criticism of the planned prioritization.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD


Re: Core team statement on replication in PostgreSQL

From
Aidan Van Dyk
Date:
* Tom Lane <tgl@sss.pgh.pa.us> [080529 20:22]:
> "Joshua D. Drake" <jd@commandprompt.com> writes:
> > I think maybe my actual argument isn't coming through. What I am arguing
> > for is not shipping XY without Z. That is all. (and no, I don't think we
> > should hold up 8.4).
> 
> So we should keep all the work out of the tree until every part is done?
> No thanks; especially not when there is a perfectly respectable use-case
> for parts X and Y alone (whether it suits *your* uses or not).

Thank you.

I would be really disappointed if some sort of synchronous WAL shipping
patch was done and considered good, but just not applied because some
semi-related "read only queries while WAL applying" patch wasn't yet
considered done and good...

Then again, the opposite is also true.  If someone sat down and did the
hard work and made read-only queries runnable on a slave applying WAL
and the patch was considered done and good, I would be disappointed if it
wasn't applied because the synchronous WAL shipping wasn't done.

Both features are worthy on their own.  But the first done should not be
held up because the other is not ready.

But because core specifically mentioned "ease of use" as one of the
goals, I really think the "built-in WAL shipping" should be *their*
focus first.  Having read-only queries that no one can use (well, sure,
some of us could set it up reliably) isn't that useful to the "general
PostgreSQL" community that I think they were thinking of when deciding
on this.  But their focus doesn't force anybody interested in other
features to not work on their features.  It's an OSS community...

And the fact that NTT had a nice presentation on it means it's
probably actually doable for 8.4:
http://www.pgcon.org/2008/schedule/attachments/61_Synchronous%20Log%20Shipping%20Replication.pdf

a.

-- 
Aidan Van Dyk                                             Create like a god,
aidan@highrise.ca                                       command like a king,
http://www.highrise.ca/                                   work like a slave.

Re: Core team statement on replication in PostgreSQL

From
"Joshua D. Drake"
Date:
Tom Lane wrote:

>>> There's no point in having read-only slave queries if you don't have a
>>> trustworthy method of getting the data to them.
> 
>> O.k. I was with you until here. Log shipping ala pg_standby works fine
>> now sans read-only slave. No, it isn't out of the box which I can see an
>> argument for but it is certainly trustworthy. Or do you mean the
>> synchronous part?
> 
> How much testing has pg_standby really gotten?  Some, sure, but it's a
> contrib module that wasn't even there before 8.3.  Even ignoring the lag
> issue, I wouldn't trust it a whole lot if I were a DBA responsible for
> valuable data.  As much as some folk would like to think that contrib
> is mainstream, it's not really in the same league as far as testing
> coverage goes.

This is a pretty telling statement.

Sincerely,

Joshua D. Drake



Re: Core team statement on replication in PostgreSQL

From
Josh Berkus
Date:
Greg,

> I fully accept that it may be the case that it doesn't make technical
> sense to tackle them in any order besides sync->read-only slaves because
> of dependencies in the implementation between the two.  If that's the
> case, it would be nice to explicitly spell out what that was to deflect
> criticism of the planned prioritization.

There's a very simple reason to prioritize the synchronous log shipping first; 
NTT may open source their solution and we'll get it a lot sooner than the 
other components.  

That is, we expect that synch log shipping is *easier* than read-only slaves 
and will get done sooner.  Since there are quite a number of users who could 
use this, whether or not they can run queries on the slaves, why not ship 
that feature as soon as it's done?

There's also a number of issues with using the current log shipping method 
for replication.  In addition to the previously mentioned setup pains, there's 
the 16MB chunk size for shipping log segments, which is fine for data 
warehouses but kind of sucks for a web application with a 3GB database which 
may take 2 hours to go through 16MB.  So we have to change the shipping method 
anyway, and if we're doing that, why not work on synch?

Mind you, if someone wanted to get started on read-only slaves *right now* I 
can't imagine anyone would object.  There's a number of problems to solve 
with recovery mode, table locking etc. that can use some work even before we 
deal with changes to log shipping, or XID writeback or any of the other 
issues.  So, volunteers?

-- 
Josh Berkus
PostgreSQL @ Sun
San Francisco


Re: Core team statement on replication in PostgreSQL

From
Andrew Sullivan
Date:
On Thu, May 29, 2008 at 07:02:56PM -0400, Tom Lane wrote:

> People want the bits to go from point A to point B; they don't want
> to have to research, design, test, and administer their own solution
> for moving the bits.  

I agree with this.  I think I probably know as well as most people --
perhaps less well than maybe two people in the world -- where most of
the bodies in Slony are hidden, and I still find it a pain to
administer.  Other systems are only somewhat less troublesome; and I
really do know about the concepts involved.  I'm not tripping on
important things.  It's just some work to set up.

Other systems hide that work.

Given that (for instance) psql is really very easy to use once you
know a few things, the ongoing pain of simple replication in Postgres
is a big wart.

A

-- 
Andrew Sullivan
ajs@commandprompt.com
+1 503 667 4564 x104
http://www.commandprompt.com/


Re: Core team statement on replication in PostgreSQL

From
"Merlin Moncure"
Date:
On Thu, May 29, 2008 at 9:26 PM, Josh Berkus <josh@agliodbs.com> wrote:
>> I fully accept that it may be the case that it doesn't make technical
>> sense to tackle them in any order besides sync->read-only slaves because
>> of dependencies in the implementation between the two.  If that's the
>> case, it would be nice to explicitly spell out what that was to deflect
>> criticism of the planned prioritization.
>
> There's a very simple reason to prioritize the synchronous log shipping first;
> NTT may open source their solution and we'll get it a lot sooner than the
> other components.

That's a good argument.  I just read the NTT document and the stuff
looks fantastic.  You've convinced me... it just doesn't seem prudent
to forge ahead with hot standby without dealing with all the
synchronous changes to WAL logging first.  I just want you guys to
understand how important hot standby is to a lot of people.  Sync
logging maybe less so, but having a proof-of-concept implementation
significantly alters the bang/buck ratio.

> That is, we expect that synch log shipping is *easier* than read-only slaves
> and will get done sooner.  Since there are quite a number of users who could
> use this, whether or not they can run queries on the slaves, why not ship
> that feature as soon as its done?

I think what dfetter, etc. were saying is that we should elevate the
hot standby stuff to a requirement, or at least a future requirement.
IOW, we should try and avoid doing anything which would make it harder
than it already is.  Please understand that I don't think people on
the list were trying to be negative... the failure of hot standby to
materialize in the 8.3 cycle was a bitter pill for many people.  I
personally see this new thinking as a hugely positive development.

> There's also a number of issues with using the currently log shipping method
> for replication.  In additon to the previously mentioned setup pains, there's
> the 16MB chunk size for shipping log segments, which is fine for data
> warehouses but kind of sucks for a web application with a 3GB database which
> may take 2 hours to go though 16MB.  So we have to change the shipping method
> anyway, and if we're doing that, why not work on synch?

well, there is the archive_timeout setting...but point taken.  A big
use case for hot standby is OLTP environments where you get to combine
HA and reporting server into a single box.

> Mind you, if someone wanted to get started on read-only slaves *right now* I
> can't imagine anyone would object.  There's a number of problems to solve
> with recovery mode, table locking etc. that can use some work even before we
> deal with changes to log shipping, or XID writeback or any of the other
> issues.  So, volunteers?

As I see it, sync logging, hot standby, and improved setup features
are all mostly orthogonal.  Florian took some pretty decent notes
during his analysis and outlined the problem areas pretty well.  That
would be a starting point.  It just strikes me that for all this stuff
to have even a remote chance of making 8.4, the work needs to be
divided up into teams and conquered separately.

merlin


Re: Core team statement on replication in PostgreSQL

From
Andrew Dunstan
Date:

Josh Berkus wrote:
> Greg,
>
>   
>> I fully accept that it may be the case that it doesn't make technical
>> sense to tackle them in any order besides sync->read-only slaves because
>> of dependencies in the implementation between the two.  If that's the
>> case, it would be nice to explicitly spell out what that was to deflect
>> criticism of the planned prioritization.
>>     
>
> There's a very simple reason to prioritize the synchronous log shipping first; 
> NTT may open source their solution and we'll get it a lot sooner than the 
> other components.  
>   

I have been reading the slides from the NTT presentation, and I now 
really regret not having gone to that talk.

It does seem quite heavy, though, including new background processes, 
heartbeat etc.
> That is, we expect that synch log shipping is *easier* than read-only slaves 
> and will get done sooner.  Since there are quite a number of users who could 
> use this, whether or not they can run queries on the slaves, why not ship 
> that feature as soon as its done?
>   

Indeed.

> There's also a number of issues with using the currently log shipping method 
> for replication.  In additon to the previously mentioned setup pains, there's 
> the 16MB chunk size for shipping log segments, which is fine for data 
> warehouses but kind of sucks for a web application with a 3GB database which 
> may take 2 hours to go though 16MB.  So we have to change the shipping method 
> anyway, and if we're doing that, why not work on synch?
>   

Well, yes, but you do know about archive_timeout, right? No need to wait 
2 hours.


cheers

andrew



Re: Core team statement on replication in PostgreSQL

From
Tom Lane
Date:
Greg Smith <gsmith@gregsmith.com> writes:
> I fully accept that it may be the case that it doesn't make technical 
> sense to tackle them in any order besides sync->read-only slaves because 
> of dependencies in the implementation between the two.

Well, it's certainly not been my intention to suggest that no one should
start work on read-only-slaves before we finish the other part.  The
point is that I expect the log shipping issues will be done first
because they're easier, and it would be pointless to not release that
feature if we had it.

But since you mention it: one of the plausible answers for fixing the
vacuum problem for read-only slaves is to have the slaves push an xmin
back upstream to the master to prevent premature vacuuming.  The current
design of pg_standby is utterly incapable of handling that requirement.
So there might be an implementation dependency there, depending on how
we want to solve that problem.
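The xmin-feedback idea can be sketched abstractly (a toy model, not PostgreSQL internals; the function and variable names here are invented for illustration):

```python
# Toy model of slave->master xmin feedback: the master may only vacuum
# tuple versions that are dead to *every* open transaction, including
# long-running read-only queries on the slaves.  Each slave reports the
# xmin of its oldest running query back upstream.

def vacuum_horizon(master_oldest_xmin, slave_xmins):
    """Return the oldest xid that must stay visible; tuples deleted by
    transactions older than this can safely be vacuumed."""
    return min([master_oldest_xmin, *slave_xmins])

# Master alone could vacuum up to xid 500...
assert vacuum_horizon(500, []) == 500
# ...but a slave still running a query with xmin 120 holds the horizon
# back, so a vacuum record replayed on that slave can never remove
# tuples the slave's query still needs.
assert vacuum_horizon(500, [120, 480]) == 120
```

This is the sense in which slave-side load "impacts the master": the feedback delays vacuum, at the cost of some bloat, in exchange for never replaying a tuple removal the slave cannot tolerate.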
        regards, tom lane


Re: Core team statement on replication in PostgreSQL

From
"Gurjeet Singh"
Date:
On Fri, May 30, 2008 at 10:40 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> But since you mention it: one of the plausible answers for fixing the
> vacuum problem for read-only slaves is to have the slaves push an xmin
> back upstream to the master to prevent premature vacuuming.  The current
> design of pg_standby is utterly incapable of handling that requirement.
> So there might be an implementation dependency there, depending on how
> we want to solve that problem.

I think it would be best to not make the slave interfere with the
master's operations; that's only going to increase the operational
complexity of such a solution.

There could be multiple slaves following a master, some serving
data-warehousing queries, some for load-balancing reads, some others
just for disaster recovery, and then some just to mitigate human errors
by re-applying the logs with a delay.

I don't think any one installation would see all of the above mentioned
scenarios, but we need to take care of multiple slaves operating off of
a single master; something similar to cascaded Slony-I.

My two cents.

Best regards,
--
gurjeet[.singh]@EnterpriseDB.com
singh.gurjeet@{ gmail | hotmail | indiatimes | yahoo }.com

EnterpriseDB http://www.enterprisedb.com

Mail sent from my BlackLaptop device

Re: Core team statement on replication in PostgreSQL

From
Andrew Sullivan
Date:
On Thu, May 29, 2008 at 01:58:34PM -0700, David Fetter wrote:
> If people on core had come to the idea that we needed to build in
> replication *before* 8.3 came out, they certainly didn't announce it.
> 
> Now is a great time to mention this because it gives everybody time to:
> 
> 1.  Come to a consensus on what the out-of-the-box replication should
> be, and 
> 
> 2.  Build, test and debug whatever the consensus out-of-the-box
> replication turns out to be.

None of that is an argument for why this has to go in 8.4.

I argued in Ottawa that the idea that you have to plan a feature for
_the next release_ is getting less tenable with each release.  This is
because major new features for Postgres are now often big and
complicated.  The days of big gains from single victories are mostly
over (though there are exceptions, like HOT).  Postgres is already
mature.  As with a middle-aged person carrying a mortgage, longer-term
planning is simply a necessary part of life now.

There are two possibilities here.  One is to have huge releases on
much longer timetables.  I think this is unsustainable in a free
project, because people will get bored and go away if they don't get
to use the results of their work in a reasonably short time frame.
The other is to accept that sometimes, planning and development for
new features will have to start a long time before actual release --
maybe planning and some coding for 2 releases out.  That allows large
features like the one we're discussing to be developed responsibly
without making everything else wait for it.

A

-- 
Andrew Sullivan
ajs@commandprompt.com
+1 503 667 4564 x104
http://www.commandprompt.com/


Re: Core team statement on replication in PostgreSQL

From
"Marko Kreen"
Date:
On 5/30/08, Gurjeet Singh <singh.gurjeet@gmail.com> wrote:
> On Fri, May 30, 2008 at 10:40 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > But since you mention it: one of the plausible answers for fixing the
> > vacuum problem for read-only slaves is to have the slaves push an xmin
> > back upstream to the master to prevent premature vacuuming.  The current
> > design of pg_standby is utterly incapable of handling that requirement.
> > So there might be an implementation dependency there, depending on how
> > we want to solve that problem.
>
> I think it would be best to not make the slave interfere with the master's
> operations; that's only going to increase the operational complexity of such
> a solution.

I disagree - it's better to consider synchronized WAL slaves
as equal to the master, so having queries there affect the master is OK.

You need to remember this solution does not try to replace 100-node Slony-I
setups.  You can run sanity checks on slaves or use them to load-balance
read-only OLTP queries, but not random stuff.

> There could be multiple slaves following a master, some serving
> data-warehousing queries, some for load-balancing reads, some others just
> for disaster recovery, and then some just to mitigate human errors by
> re-applying the logs with a delay.

To run warehousing queries you'd better use Slony-I / Londiste.  For
warehousing you want different / more indexes on tables anyway,
so I think it's quite OK to say "don't do it" for complex queries
on WAL slaves.

> I don't think any one installation would see all of the above mentioned
> scenarios, but we need to take care of multiple slaves operating off of a
> single master; something similar to cascaded Slony-I.

Again, the synchronized WAL replication is not a generic solution.
Use Slony/Londiste if you want to get totally independent slaves.

Thankfully the -core has set concrete and limited goals,
that means it is possible to see working code in reasonable time.
I think that should apply to read-only slaves too.

If we try to make it handle any load, it will not be finished in any time.

Now if we limit the scope, I've seen 2 variants thus far:

1) Keep slave max in sync, let the load there affect master (xmin).
   - Slave can be used to load-balance OLTP load.
   - Slave should not be used for complex queries.

2) If long query is running, let slave lag (avoid applying WAL data).
   - Slave cannot be used to load-balance OLTP load.
   - Slave can be used for complex queries (although no new indexes
     or temp tables can be created).

I think 1) is the more important (and more easily implementable) case.

For 2) we already have solutions (Slony/Londiste/Bucardo, etc.),
so there is no point making an effort to solve it here.

-- 
marko


Re: Core team statement on replication in PostgreSQL

From
Andrew Dunstan
Date:

Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
>   
>> On Fri, 2008-05-30 at 12:31 +0530, Gurjeet Singh wrote:
>>     
>>> On Fri, May 30, 2008 at 10:40 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>>       
>>>> But since you mention it: one of the plausible answers for fixing the
>>>> vacuum problem for read-only slaves is to have the slaves push an xmin
>>>> back upstream to the master to prevent premature vacuuming.
>>>>         
>>> I think it would be best to not make the slave interfere with the
>>> master's operations; that's only going to increase the operational
>>> complexity of such a solution.
>>>       
>
>   
>> We ruled that out as the-only-solution a while back. It does have the
>> beauty of simplicity, so it may exist as an option or possibly the only
>> way, for 8.4.
>>     
>
> Yeah.  The point is that it's fairly clear that we could make that work.
> A solution that doesn't impact the master at all would be nicer, but
> it's not at all clear to me that one is possible, unless we abandon
> WAL-shipping as the base technology.
>
>             
>   

Quite. Before we start ruling things out let's know what we think we can 
actually do.

I hope that NTT will release their code ASAP so we will have a better 
idea of what we have and what we need.

cheers

andrew


Re: Core team statement on replication in PostgreSQL

From
Robert Hodges
Date:
Hi Tom,

Thanks for the reasoned reply.  As you saw from point #2 in my comments, I
think you should do this feature.  I hope this answers Josh Berkus' concern
about my comments.

You make a very interesting comment which seems to go to the heart of this
design approach:

> About the only thing that would make me want to consider row-based
> replication in core would be if we determine that read-only slave
> queries are impractical atop a WAL-log-shipping implementation.

It's possible I'm misunderstanding some of the implementation issues, but it
is striking that the detailed responses to your proposal list a number of
low-level dependencies between master and slave states when replicating WAL
records.  It appears that you are designing a replication mechanism that
works effectively between a master and a relatively small number of "nearby"
slaves.  This is clearly an important use case but it also seems clear that
the WAL approach is not a general-purpose approach to replication.  In other
words, you'll incrementally get to that limited end point I describe.  This
will still leave a lot to be desired on read scaling, not to mention many
other cases.

Hence my original comments.  However, rather than harp on that further I
will open up a separate thread to describe a relatively small set of
extensions to PostgreSQL that would be enabling for a wide range of
replication applications.  Contrary to popular opinion these extensions are
actually well understood at the theory level and have been implemented as
prototypes as well as in commercial patches multiple times in different
databases.  Those of us who are deeply involved in replication deserve just
condemnation for not stepping up and getting our thoughts out on the table.

Meanwhile, I would be interested in your reaction to these thoughts on the
scope of the real-time WAL approach.  There's obviously tremendous interest
in this feature.  A general description that goes beyond the NTT slides
would be most helpful for further discussions.

Cheers, Robert

P.s., The NTT slides were really great.  Takahiro and Masao deserve
congratulations on an absolutely first-rate presentation.

On 5/29/08 9:09 PM, "Tom Lane" <tgl@sss.pgh.pa.us> wrote:

> Andrew Sullivan <ajs@commandprompt.com> writes:
>> On Thu, May 29, 2008 at 12:05:18PM -0700, Robert Hodges wrote:
>>> people are starting to get religion on this issue I would strongly
>>> advocate a parallel effort to put in a change-set extraction API
>>> that would allow construction of comprehensive master/slave
>>> replication.
>
>> You know, I gave a talk in Ottawa just last week about how the last
>> effort to develop a comprehensive API for replication failed.
>
> Indeed, core's change of heart on this issue was largely driven by
> Andrew's talk and subsequent discussion.  We had more or less been
> waiting for the various external replication projects to tell us
> what they wanted in this line, and it was only the realization that
> no such thing was likely to happen that forced us to think seriously
> about what could be done within the core project.
>
> As I said originally, we have no expectation that the proposed features
> will displace the existing replication projects for "high end"
> replication problems ... and I'd characterize all of Robert's concerns
> as "high end" problems.  We are happy to let those be solved outside
> the core project.
>
> About the only thing that would make me want to consider row-based
> replication in core would be if we determine that read-only slave
> queries are impractical atop a WAL-log-shipping implementation.
> Which could happen; in fact I think that's the main risk of the
> proposed development plan.  But I also think that the near-term
> steps of the plan are worth doing anyway, for various other reasons,
> and so we won't be out too much effort if the plan fails.
>
>                         regards, tom lane
>



Re: Core team statement on replication in PostgreSQL

From
Tom Lane
Date:
Simon Riggs <simon@2ndquadrant.com> writes:
> On Fri, 2008-05-30 at 12:31 +0530, Gurjeet Singh wrote:
>> On Fri, May 30, 2008 at 10:40 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> But since you mention it: one of the plausible answers for fixing the
>>> vacuum problem for read-only slaves is to have the slaves push an xmin
>>> back upstream to the master to prevent premature vacuuming.
>>
>> I think it would be best to not make the slave interfere with the
>> master's operations; that's only going to increase the operational
>> complexity of such a solution.

> We ruled that out as the-only-solution a while back. It does have the
> beauty of simplicity, so it may exist as an option or possibly the only
> way, for 8.4.

Yeah.  The point is that it's fairly clear that we could make that work.
A solution that doesn't impact the master at all would be nicer, but
it's not at all clear to me that one is possible, unless we abandon
WAL-shipping as the base technology.
        regards, tom lane


Re: Core team statement on replication in PostgreSQL

From
"Merlin Moncure"
Date:
On Fri, May 30, 2008 at 9:31 AM, Marko Kreen <markokr@gmail.com> wrote:
> On 5/30/08, Gurjeet Singh <singh.gurjeet@gmail.com> wrote:
>>
>> I think it would be best to not make the slave interfere with the master's
>> operations; that's only going to increase the operational complexity of such
>> a solution.
>
> I disagree - it's better to consider syncronized WAL-slaves
> as equal to master, so having queries there affect master is ok.
>
> You need to remeber this solution tries not to replace 100-node Slony-I
> setups.  You can run sanity checks on slaves or use them to load-balance
> read-only OLTP queries, but not random stuff.
>
>> There could be multiple slaves following a master, some serving
>> data-warehousing queries, some for load-balancing reads, some others just
>> for disaster recovery, and then some just to mitigate human errors by
>> re-applying the logs with a delay.
>
> To run warehousing queries you better use Slony-I / Londiste.  For
> warehousring you want different / more indexes on tables anyway,
> so I think it's quite ok to say "don't do it" for complex queries
> on WAL-slaves.
>
>> I don't think any one installation would see all of the above mentioned
>> scenarios, but we need to take care of multiple slaves operating off of a
>> single master; something similar to cascaded Slony-I.
>
> Again, the synchronized WAL replication is not generic solution.
> Use Slony/Londiste if you want to get totally independent slaves.

I strongly agree with Gurjeet.  The warm standby replication mechanism
is pretty simple and wonderfully flexible, with the one big
requirement that your clusters have to be mirrors of each other.

Synchronous WAL replication obviously needs some communication channel
from the slave back to the master.  Hopefully, it will be possible to
avoid this for asynchronous shipping.

merlin


Re: Core team statement on replication in PostgreSQL

From
Simon Riggs
Date:
On Fri, 2008-05-30 at 11:30 -0400, Andrew Dunstan wrote:
> 
> Tom Lane wrote:
> > Simon Riggs <simon@2ndquadrant.com> writes:
> >   
> >> On Fri, 2008-05-30 at 12:31 +0530, Gurjeet Singh wrote:
> >>     
> >>> On Fri, May 30, 2008 at 10:40 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >>>       
> >>>> But since you mention it: one of the plausible answers for fixing the
> >>>> vacuum problem for read-only slaves is to have the slaves push an xmin
> >>>> back upstream to the master to prevent premature vacuuming.
> >>>>         
> >>> I think it would be best to not make the slave interfere with the
> >>> master's operations; that's only going to increase the operational
> >>> complexity of such a solution.
> >>>       
> >> We ruled that out as the-only-solution a while back. It does have the
> >> beauty of simplicity, so it may exist as an option or possibly the only
> >> way, for 8.4.
> >>     
> > Yeah.  The point is that it's fairly clear that we could make that work.
> > A solution that doesn't impact the master at all would be nicer, but
> > it's not at all clear to me that one is possible, unless we abandon
> > WAL-shipping as the base technology.
> >
> Quite. Before we start ruling things out let's know what we think we can 
> actually do.

Let me re-phrase: I'm aware of that possibility and believe we can and
could do it for 8.4. My assessment is that people won't find it
sufficient and I am looking at other alternatives also. There may be a
better one possible for 8.4, there may not. Hence I've said "something
in 8.4, something better later". There is no need to decide that is the
only way forward, yet.

I hope and expect to put some of these ideas into a more concrete form,
but this has not yet happened. Nothing has slipped, not having any
trouble getting on with it, just that my plans were to not start it yet.
I think having a detailed design ready for review by September commit
fest is credible.

> I hope that NTT will release their code ASAP so we will have a better 
> idea of what we have and what we need.

That has very little to do with Hot Standby, though there could be patch
conflicts, which is why I'm aiming to get WAL streaming done first.

--
Simon Riggs           www.2ndQuadrant.com
PostgreSQL Training, Services and Support



Re: Core team statement on replication in PostgreSQL

From
Andreas 'ads' Scherbaum
Date:
On Thu, 29 May 2008 09:22:26 -0700 Steve Atkins wrote:
> On May 29, 2008, at 9:12 AM, David Fetter wrote:
> 
> > Either one of these would be great, but something that involves
> > machines that stay useless most of the time is just not going to work.
> 
> I have customers who are thinking about warm standby functionality, and
> the only thing stopping them deploying it is complexity and maintenance,
> not the cost of the HA hardware. If trivial-to-deploy replication that  
> didn't offer read-only access of the slaves were available today I'd bet
> that most of them would be using it.

Sure, I have a similar customer.  They are right now using a set of
Perl scripts which ship the logfiles to the slave, take care of the
status, apply the logfiles, validate checksums, etc.  The whole thing
works very well in combination with RedHat cluster software, but it
took several weeks to implement the current solution.

Not everyone wants to spend the time and the manpower to implement a
"simple" replication.


Kind regards

--
Andreas 'ads' Scherbaum
German PostgreSQL User Group


Re: Core team statement on replication in PostgreSQL

From
Simon Riggs
Date:
On Fri, 2008-05-30 at 12:31 +0530, Gurjeet Singh wrote:
> On Fri, May 30, 2008 at 10:40 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>         But since you mention it: one of the plausible answers for
>         fixing the
>         vacuum problem for read-only slaves is to have the slaves push
>         an xmin
>         back upstream to the master to prevent premature vacuuming.
>          The current
>         design of pg_standby is utterly incapable of handling that
>         requirement.
>         So there might be an implementation dependency there,
>         depending on how
>         we want to solve that problem.
> 
> I think it would be best to not make the slave interfere with the
> master's operations; that's only going to increase the operational
> complexity of such a solution.
> 
> There could be multiple slaves following a master, some serving
> data-warehousing queries, some for load-balancing reads, some others
> just for disaster recovery, and then some just to mitigate human
> errors by re-applying the logs with a delay.

Agreed.

We ruled that out as the-only-solution a while back. It does have the
beauty of simplicity, so it may exist as an option or possibly the only
way, for 8.4.

--
Simon Riggs           www.2ndQuadrant.com
PostgreSQL Training, Services and Support



Re: Core team statement on replication in PostgreSQL

From
Andreas 'ads' Scherbaum
Date:
On Thu, 29 May 2008 18:29:01 -0400 Tom Lane wrote:

> Dimitri Fontaine <dim@hi-media.com> writes:
> > While at it, would it be possible for the "simple" part of the core  
> > team statement to include automatic failover?
> 
> No, I think it would be a useless expenditure of energy.  Failover
> includes a lot of things that are not within our purview: switching
> IP addresses to point to the new server, some kind of STONITH solution
> to keep the original master from coming back to life, etc.  Moreover
> there are already projects/products concerned with those issues.

True words. Failover is not and should not be part of PostgreSQL.

But PG can help the failover solution; as an example, an easy-to-use
interface reporting the current slave status comes to mind.  Other
ideas might also be possible.


> It might be useful to document where to find solutions to that problem,
> but we can't take it on as part of core Postgres.

Ack


Kind regards

--             Andreas 'ads' Scherbaum
German PostgreSQL User Group


Re: Core team statement on replication in PostgreSQL

From
Andreas 'ads' Scherbaum
Date:
On Thu, 29 May 2008 23:02:56 -0400 Andrew Dunstan wrote:

> Well, yes, but you do know about archive_timeout, right? No need to wait 
> 2 hours.

Then you ship 16 MB of binary data every 30 seconds or every minute, but
you only have a few kbytes of real data in the logfile. This must be taken
into account, especially if you ship the logfile over the internet
(meaning: no high-speed connection, maybe even pay-per-traffic) to the
slave.


Kind regards

--             Andreas 'ads' Scherbaum
German PostgreSQL User Group


Re: Core team statement on replication in PostgreSQL

From
"Gurjeet Singh"
Date:
On Thu, May 29, 2008 at 7:42 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> The big problem
> is that long-running slave-side queries might still need tuples that are
> vacuumable on the master, and so replication of vacuuming actions would
> cause the slave's queries to deliver wrong answers.

Another issue with read-only slaves just popped up in my head.

How do we block the readers on the slave while it is replaying an ALTER
TABLE or similar command that requires an exclusive lock and potentially
alters the table's structure? Or does the WAL replay already take an
x-lock on such a table?

Best regards,
--
gurjeet[.singh]@EnterpriseDB.com
singh.gurjeet@{ gmail | hotmail | indiatimes | yahoo }.com

EnterpriseDB http://www.enterprisedb.com

Mail sent from my BlackLaptop device

Re: Core team statement on replication in PostgreSQL

From
Greg Smith
Date:
On Fri, 30 May 2008, Andreas 'ads' Scherbaum wrote:

> Then you ship 16 MB binary stuff every 30 second or every minute but
> you only have some kbyte real data in the logfile.

Not if you use pg_clearxlogtail ( 
http://www.2ndquadrant.com/replication.htm ), which got lost in the giant 
March commitfest queue but should probably wander into contrib as part of 
8.4.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD


Re: Core team statement on replication in PostgreSQL

From
Simon Riggs
Date:
On Fri, 2008-05-30 at 11:12 -0700, Robert Hodges wrote:
> This is clearly an important use case but it also seems clear that
> the WAL approach is not a general-purpose approach to replication.

I think we cannot make such a statement yet, if ever.

I would note that log-based replication is now the mainstay of
commercial database replication techniques for loosely-coupled groups of
servers. It would seem strange to assume that it should not be good for
us too, simply because we know it to be difficult.

IMHO the project has a pretty good track record of delivering
functionality that looked hard at first glance.

-- Simon Riggs           www.2ndQuadrant.com
   PostgreSQL Training, Services and Support



Re: Core team statement on replication in PostgreSQL

From
Robert Treat
Date:
On Friday 30 May 2008 01:10:20 Tom Lane wrote:
> Greg Smith <gsmith@gregsmith.com> writes:
> > I fully accept that it may be the case that it doesn't make technical
> > sense to tackle them in any order besides sync->read-only slaves because
> > of dependencies in the implementation between the two.
>
> Well, it's certainly not been my intention to suggest that no one should
> start work on read-only-slaves before we finish the other part.  The
> point is that I expect the log shipping issues will be done first
> because they're easier, and it would be pointless to not release that
> feature if we had it.
>
> But since you mention it: one of the plausible answers for fixing the
> vacuum problem for read-only slaves is to have the slaves push an xmin
> back upstream to the master to prevent premature vacuuming.  The current
> design of pg_standby is utterly incapable of handling that requirement.
> So there might be an implementation dependency there, depending on how
> we want to solve that problem.
>

Sure, but who's to say that after synchronous WAL shipping is "finished" it 
won't need a serious re-write due to new needs from the hot standby feature. I 
think going either way carries some risk. 

-- 
Robert Treat
Build A Brighter LAMP :: Linux Apache {middleware} PostgreSQL


Re: Core team statement on replication in PostgreSQL

From
Robert Treat
Date:
On Thursday 29 May 2008 22:59:21 Merlin Moncure wrote:
> On Thu, May 29, 2008 at 9:26 PM, Josh Berkus <josh@agliodbs.com> wrote:
> >> I fully accept that it may be the case that it doesn't make technical
> >> sense to tackle them in any order besides sync->read-only slaves because
> >> of dependencies in the implementation between the two.  If that's the
> >> case, it would be nice to explicitly spell out what that was to deflect
> >> criticism of the planned prioritization.
> >
> > There's a very simple reason to prioritize the synchronous log shipping
> > first; NTT may open source their solution and we'll get it a lot sooner
> > than the other components.
>
> That's a good argument.  I just read the NTT document and the stuff
> looks fantastic.  You've convinced me...

It would be a better argument if the NTT guys hadn't said that they estimated 
6 months' time before the code would be released, which puts us beyond 8.4. 
Now it is possible that the time frame could be sooner, but unless someone 
already has the patch, this reminds me a little too much of the arguments for 
including Windows support in a single release because we already had a working 
port/patch set to go from. 

-- 
Robert Treat
Build A Brighter LAMP :: Linux Apache {middleware} PostgreSQL


Re: Core team statement on replication in PostgreSQL

From
Robert Treat
Date:
On Thursday 29 May 2008 20:31:31 Greg Smith wrote:
> On Thu, 29 May 2008, Tom Lane wrote:
> > There's no point in having read-only slave queries if you don't have a
> > trustworthy method of getting the data to them.
>
> This is a key statement that highlights the difference in how you're
> thinking about this compared to some other people here.  As far as some
> are concerned, the already working log shipping *is* a trustworthy method
> of getting data to the read-only slaves.  There are plenty of applications
> (web oriented ones in particular) where if you could direct read-only
> queries against a slave, the resulting combination would be a giant
> improvement over the status quo even if that slave was as much as
> archive_timeout behind the master.  That quantity of lag is perfectly fine
> for a lot of the same apps that have read scalability issues.
>
> If you're someone who falls into that camp, the idea of putting the sync
> replication job before the read-only slave one seems really backwards.
>

Just looking at it from an overall market perspective, synchronous log 
shipping pretty much only addresses failover needs, where as read-only slaves 
address both failover and scaling issues. (Note I say address, not solve). 

-- 
Robert Treat
Build A Brighter LAMP :: Linux Apache {middleware} PostgreSQL


Re: Core team statement on replication in PostgreSQL

From
Andrew Dunstan
Date:

Andreas 'ads' Scherbaum wrote:
> On Thu, 29 May 2008 23:02:56 -0400 Andrew Dunstan wrote:
>
>   
>> Well, yes, but you do know about archive_timeout, right? No need to wait 
>> 2 hours.
>>     
>
> Then you ship 16 MB binary stuff every 30 second or every minute but
> you only have some kbyte real data in the logfile. This must be taken
> into account, especially if you ship the logfile over the internet
> (means: no high-speed connection, maybe even pay-per-traffic) to the
> slave.

Sure there's a price to pay. But that doesn't mean the facility doesn't 
exist. And I rather suspect that most of Josh's customers aren't too 
concerned about traffic charges or affected by such bandwidth 
restrictions. Certainly, none of my clients are, and they aren't in the 
giant class. Shipping a 16MB file, particularly if compressed, every 
minute or so, is not such a huge problem for a great many commercial 
users, and even many domestic users.

cheers

andrew


Re: Core team statement on replication in PostgreSQL

From
"Gurjeet Singh"
Date:
On Sat, May 31, 2008 at 1:52 AM, Greg Smith <gsmith@gregsmith.com> wrote:
On Fri, 30 May 2008, Andreas 'ads' Scherbaum wrote:

Then you ship 16 MB binary stuff every 30 second or every minute but
you only have some kbyte real data in the logfile.

Not if you use pg_clearxlogtail ( http://www.2ndquadrant.com/replication.htm ), which got lost in the giant March commitfest queue but should probably wander into contrib as part of 8.4.

This means we need to modify pg_standby to not check for filesize when reading XLogs.

Best regards,

--
gurjeet[.singh]@EnterpriseDB.com
singh.gurjeet@{ gmail | hotmail | indiatimes | yahoo }.com

EnterpriseDB http://www.enterprisedb.com

Mail sent from my BlackLaptop device

Re: Core team statement on replication in PostgreSQL

From
Tatsuo Ishii
Date:
> Andreas 'ads' Scherbaum wrote:
> > On Thu, 29 May 2008 23:02:56 -0400 Andrew Dunstan wrote:
> >
> >   
> >> Well, yes, but you do know about archive_timeout, right? No need to wait 
> >> 2 hours.
> >>     
> >
> > Then you ship 16 MB binary stuff every 30 second or every minute but
> > you only have some kbyte real data in the logfile. This must be taken
> > into account, especially if you ship the logfile over the internet
> > (means: no high-speed connection, maybe even pay-per-traffic) to the
> > slave.
> 
> Sure there's a price to pay. But that doesn't mean the facility doesn't 
> exist. And I rather suspect that most of Josh's customers aren't too 
> concerned about traffic charges or affected by such bandwidth 
> restrictions. Certainly, none of my clients are, and they aren't in the 
> giant class. Shipping a 16Mb file, particularly if compressed, every 
> minute or so, is not such a huge problem for a great many commercial 
> users, and even many domestic users.

Sumitomo Electric Co., Ltd., a Japanese company with 20 billion dollars
in sales (parent company of Sumitomo Electric Information Systems Co.,
Ltd., one of the companies supporting the Recursive SQL development),
uses 100 PostgreSQL servers. They do backups by log shipping to another
data center and have problems with the amount of log data being
transferred. They said this is one of the big problems they have with
PostgreSQL and hope it will be solved in the near future.
--
Tatsuo Ishii
SRA OSS, Inc. Japan


Re: Core team statement on replication in PostgreSQL

From
"Joshua D. Drake"
Date:

On Sat, 2008-05-31 at 02:48 +0530, Gurjeet Singh wrote:

>         
>         Not if you use pg_clearxlogtail
>         ( http://www.2ndquadrant.com/replication.htm ), which got lost
>         in the giant March commitfest queue but should probably wander
>         into contrib as part of 8.4.
> 
> This means we need to modify pg_standby to not check for filesize when
> reading XLogs.
> 
> Best regards,
> 

It does.

Joshua D. Drake

> 



Re: Core team statement on replication in PostgreSQL

From
Greg Smith
Date:
On Sat, 31 May 2008, Gurjeet Singh wrote:

>> Not if you use pg_clearxlogtail
>
> This means we need to modify pg_standby to not check for filesize when
> reading XLogs.

No, the idea is that you run the segments through pg_clearxlogtail | gzip, 
which then compresses lightly used segments massively because all the 
unused bytes are 0.  File comes out the same size at the other side, but 
you didn't ship a full 16MB if there was only a few KB used.
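
A quick way to see the effect (a throwaway sketch, not pg_clearxlogtail itself; it just simulates a segment whose unused tail has been zeroed):

```shell
# Build two fake 16MB "WAL segments": one with 64kB of real data followed
# by zeros (roughly what pg_clearxlogtail leaves behind), and one full of
# incompressible leftover binary data from earlier use of the segment.
tmp=$(mktemp -d)
dd if=/dev/urandom of="$tmp/cleared.seg" bs=1024 count=64 2>/dev/null
dd if=/dev/zero bs=1024 count=$((16*1024 - 64)) 2>/dev/null >> "$tmp/cleared.seg"
dd if=/dev/urandom of="$tmp/dirty.seg" bs=1024 count=$((16*1024)) 2>/dev/null
# Compress both: the zero-tailed file shrinks to a few dozen kB, while the
# "dirty" one stays essentially 16MB.
gzip -c "$tmp/cleared.seg" > "$tmp/cleared.seg.gz"
gzip -c "$tmp/dirty.seg" > "$tmp/dirty.seg.gz"
ls -l "$tmp"/*.gz
```
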

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD


Re: Core team statement on replication in PostgreSQL

From
Simon Riggs
Date:
On Fri, 2008-05-30 at 01:10 -0400, Tom Lane wrote:
> Greg Smith <gsmith@gregsmith.com> writes:
> > I fully accept that it may be the case that it doesn't make technical 
> > sense to tackle them in any order besides sync->read-only slaves because 
> > of dependencies in the implementation between the two.
> 
> Well, it's certainly not been my intention to suggest that no one should
> start work on read-only-slaves before we finish the other part.  The
> point is that I expect the log shipping issues will be done first
> because they're easier, and it would be pointless to not release that
> feature if we had it.

Agreed.

I'm arriving late to a thread that seems to have grown out of all
proportion.

AFAICS streaming WAL and hot standby are completely orthogonal features.
Streaming WAL is easier and if NTT can release their code to open source
we may get this in the Sept commit fest. Hot Standby is harder and it
was my viewpoint at PGCon that we may not have a perfect working version
of this by the end of 8.4. We are very likely to have something working,
but maybe not the whole feature set as we might wish to have. I expect
to be actively working on this "soon". I definitely do want to see WAL
streaming going in as early as possible and before end of 8.4, otherwise
code conflicts and other difficulties are likely to push out the 8.4
date and/or Hot Standby.

So as I see it, Tom has only passed on my comments on this, not added or
removed anything. The main part of the announcement was really about
bringing the WAL streaming into core and effectively favouring it over a
range of other projects.

Can we all back off a little on this for now? Various concerns have been
validly expressed, but it will all come good AFAICS.

-- Simon Riggs           www.2ndQuadrant.com
   PostgreSQL Training, Services and Support



Re: Core team statement on replication in PostgreSQL

From
Dimitri Fontaine
Date:
Le vendredi 30 mai 2008, Tom Lane a écrit :
> No, I think it would be a useless expenditure of energy.  Failover
> includes a lot of things that are not within our purview: switching
> IP addresses to point to the new server, some kind of STONITH solution
> to keep the original master from coming back to life, etc.  Moreover
> there are already projects/products concerned with those issues.

Well, I forgot that there's in fact no active plan to put pgbouncer features
into core at the moment (I'm sure I've read something about this on the lists
though). If it were the case, the slave could proxy queries to the master, and
stop proxying but serve them itself if the master tells it it's dying.
This way, there's no need to switch IP addresses: the clients just connect as
usual and get results back, without having to know whether the host they're
querying against is a slave or a master. This level of smartness would live in
-core.

The STONITH part in case of a known (fatal) failure does not seem that hard
either, as the master at fatal time could record somewhere that it's now a
slave and use this at next startup (recovery.conf?). If it can't even do that,
well, we're back to a crash situation with no automatic failover solution
provided. Unhandled failure cases will obviously continue to exist.

I'm not asking for "all cases managed in -core please", just for "some level
of effort" on the topic. Of course, I'm just the one asking questions and
trying to raise ideas, so I'm perfectly fine with your current answer
(useless expenditure of energy) even if somewhat disagreeing on the useless
part of it :)

As for the integrated pgbouncer daemon part, I'm thinking this would allow the
infamous part 3 of the proposal (read-only slave) to get pretty simple to
setup when ready: the slave knows who its master is, and as soon as an XID is
needed the transaction queries are forwarded/proxied to it. Thanks again
Florian !

> It might be useful to document where to find solutions to that problem,
> but we can't take it on as part of core Postgres.

Even the part where it makes sense (provided it does, and I'm not completely
off track here)?

Regards,
--
dim

Re: Core team statement on replication in PostgreSQL

From
Andreas 'ads' Scherbaum
Date:
On Fri, 30 May 2008 16:22:41 -0400 (EDT) Greg Smith wrote:

> On Fri, 30 May 2008, Andreas 'ads' Scherbaum wrote:
> 
> > Then you ship 16 MB binary stuff every 30 second or every minute but
> > you only have some kbyte real data in the logfile.
> 
> Not if you use pg_clearxlogtail ( 
> http://www.2ndquadrant.com/replication.htm ), which got lost in the giant 
> March commitfest queue but should probably wander into contrib as part of 
> 8.4.

Yes, this topic was discussed several times in the past, but solving it
needs a patch/solution that is integrated into PG itself, not contrib.


Kind regards

--             Andreas 'ads' Scherbaum
German PostgreSQL User Group


Re: Core team statement on replication in PostgreSQL

From
Simon Riggs
Date:
On Thu, 2008-05-29 at 10:12 -0400, Tom Lane wrote:
> The Postgres core team met at PGCon to discuss a few issues, the largest
> of which is the need for simple, built-in replication for PostgreSQL.
> Historically the project policy has been to avoid putting replication
> into core PostgreSQL, so as to leave room for development of competing
> solutions, recognizing that there is no "one size fits all" replication
> solution.  However, it is becoming clear that this policy is hindering
> acceptance of PostgreSQL to too great an extent, compared to the benefit
> it offers to the add-on replication projects.  Users who might consider
> PostgreSQL are choosing other database systems because our existing
> replication options are too complex to install and use for simple cases.
> In practice, simple asynchronous single-master-multiple-slave
> replication covers a respectable fraction of use cases, so we have
> concluded that we should allow such a feature to be included in the core
> project.  We emphasize that this is not meant to prevent continued
> development of add-on replication projects that cover more complex use
> cases.
> 
> We believe that the most appropriate base technology for this is
> probably real-time WAL log shipping, as was demoed by NTT OSS at PGCon.
> We hope that such a feature can be completed for 8.4.  Ideally this
> would be coupled with the ability to execute read-only queries on the
> slave servers, but we see technical difficulties that might prevent that
> from being completed before 8.5 or even further out.  (The big problem
> is that long-running slave-side queries might still need tuples that are
> vacuumable on the master, and so replication of vacuuming actions would
> cause the slave's queries to deliver wrong answers.)
> 
> Again, this will not replace Slony, pgPool, Continuent, Londiste, or
> other systems for many users, as it will be not be highly scalable nor
> support long-distance replication nor replicating less than an entire
> installation.  But it is time to include a simple, reliable basic
> replication feature in the core system.

I'm in full support of this and commend the work of the NTT team.

The goals and timescales are realistic, and setting a timetable in this
way will help planning for many users.

I'm expecting to lead the charge on the Hot Standby project. The problem
mentioned is just one of the issues, though overall I'm now optimistic
about our eventual success in that area. I'm discussing this now with a
couple of sponsors and would welcome serious financial commitments to
this goal. Please contact me off-list if you agree also.

-- Simon Riggs           www.2ndQuadrant.com
   PostgreSQL Training, Services and Support



Re: Core team statement on replication in PostgreSQL

From
Josh Berkus
Date:
Andrew,

> Sure there's a price to pay. But that doesn't mean the facility doesn't
> exist. And I rather suspect that most of Josh's customers aren't too
> concerned about traffic charges or affected by such bandwidth
> restrictions. Certainly, none of my clients are, and they aren't in the
> giant class. Shipping a 16Mb file, particularly if compressed, every
> minute or so, is not such a huge problem for a great many commercial
> users, and even many domestic users.

The issue is that when you're talking about telecommunications companies 
(and similar) once a minute isn't adequate.  Those folks want at least 
every second, or even better synchronous.

Anyway, this is a pretty pointless discussion given that we want both 
capabilities, and stuff will get implemented in the order it makes 
technical sense.

-- 
--Josh

Josh Berkus
PostgreSQL @ Sun
San Francisco


Re: Core team statement on replication in PostgreSQL

From
Andreas 'ads' Scherbaum
Date:
On Fri, 30 May 2008 17:05:57 -0400 Andrew Dunstan wrote:
> Andreas 'ads' Scherbaum wrote:
> > On Thu, 29 May 2008 23:02:56 -0400 Andrew Dunstan wrote:
> >
> >> Well, yes, but you do know about archive_timeout, right? No need to wait 
> >> 2 hours.
> >
> > Then you ship 16 MB binary stuff every 30 second or every minute but
> > you only have some kbyte real data in the logfile. This must be taken
> > into account, especially if you ship the logfile over the internet
> > (means: no high-speed connection, maybe even pay-per-traffic) to the
> > slave.
> 
> Sure there's a price to pay. But that doesn't mean the facility doesn't 
> exist. And I rather suspect that most of Josh's customers aren't too 
> concerned about traffic charges or affected by such bandwidth 
> restrictions. Certainly, none of my clients are, and they aren't in the 
> giant class. Shipping a 16Mb file, particularly if compressed, every 
> minute or so, is not such a huge problem for a great many commercial 
> users, and even many domestic users.

The real problem is not the 16 MB; the problem is that you can't compress
this file well. If the logfile is rotated it still contains all the old
binary data, which is not a good starting point for compression.

So you may have only a few kB of changes in the WAL logfile every minute,
but you still copy 16 MB of data. Sure, it's not so much - but if you
rotate a logfile every minute this still transfers 16*60*24 MB = ~23 GB a
day.
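
That arithmetic is easy to check (a throwaway shell calculation, nothing more):

```shell
# 16 MB per forced segment switch, one switch per minute, all day long
mb_per_day=$((16 * 60 * 24))
echo "${mb_per_day} MB/day, roughly $((mb_per_day / 1024)) GB"
# prints: 23040 MB/day, roughly 22 GB
```
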


Kind regards

--             Andreas 'ads' Scherbaum
German PostgreSQL User Group


Re: Core team statement on replication in PostgreSQL

From
Andrew Dunstan
Date:

Tatsuo Ishii wrote:
>> Andreas 'ads' Scherbaum wrote:
>>     
>>> On Thu, 29 May 2008 23:02:56 -0400 Andrew Dunstan wrote:
>>>
>>>   
>>>       
>>>> Well, yes, but you do know about archive_timeout, right? No need to wait 
>>>> 2 hours.
>>>>     
>>>>         
>>> Then you ship 16 MB binary stuff every 30 second or every minute but
>>> you only have some kbyte real data in the logfile. This must be taken
>>> into account, especially if you ship the logfile over the internet
>>> (means: no high-speed connection, maybe even pay-per-traffic) to the
>>> slave.
>> Sure there's a price to pay. But that doesn't mean the facility doesn't 
>> exist. And I rather suspect that most of Josh's customers aren't too 
>> concerned about traffic charges or affected by such bandwidth 
>> restrictions. Certainly, none of my clients are, and they aren't in the 
>> giant class. Shipping a 16Mb file, particularly if compressed, every 
>> minute or so, is not such a huge problem for a great many commercial 
>> users, and even many domestic users.
>>     
>
> Sumitomo Electric Co., Ltd., a 20 billion dollars selling company in
> Japan (parent company of Sumitomo Electric Information Systems Co.,
> Ltd., which is one of the Recursive SQL development support company)
> uses 100 PostgreSQL servers. They are doing backups by using log
> shipping to another data center and have problems with the amount of
> the transferring log data. They said this is one of the big problems
> they have with PostgreSQL and hope it will be solved in the near
> future.
>
>   

Excellent data point. Now, what I'd like to know is whether they are 
getting into trouble simply because of the volume of log data generated 
or because they have a short archive_timeout set. If it's the former 
(which seems more likely) then none of the ideas I have seen so far in 
this discussion seemed likely to help, and that would indeed be a major 
issue we should look at. Another question is this: are they being 
overwhelmed by the amount of network traffic generated, or by difficulty 
in postgres producers and/or consumers to keep up? If it's network 
traffic, then perhaps compression would help us.

Maybe we need to set some goals for the level of log volume we expect 
to be able to create/send/consume.

cheers

andrew


Re: Core team statement on replication in PostgreSQL

From
"Gurjeet Singh"
Date:
On Sat, May 31, 2008 at 3:41 AM, Greg Smith <gsmith@gregsmith.com> wrote:
On Sat, 31 May 2008, Gurjeet Singh wrote:

Not if you use pg_clearxlogtail

This means we need to modify pg_standby to not check for filesize when
reading XLogs.

No, the idea is that you run the segments through pg_clearxlogtail | gzip, which then compresses lightly used segments massively because all the unused bytes are 0.  File comes out the same size at the other side, but you didn't ship a full 16MB if there was only a few KB used.

Got it. I remember reading about pg_clearxlogtail in these mailing lists; but somehow forgot how it actually worked!

--
gurjeet[.singh]@EnterpriseDB.com
singh.gurjeet@{ gmail | hotmail | indiatimes | yahoo }.com

EnterpriseDB http://www.enterprisedb.com

Mail sent from my BlackLaptop device

Re: Core team statement on replication in PostgreSQL

From
Dimitri Fontaine
Date:
Le vendredi 30 mai 2008, Dimitri Fontaine a écrit :
> This way, no need to switch IP addresses, the clients just connect as usual
> and get results back and do not have to know whether the host they're
> querying against is a slave or a master. This level of smartness is into
> -core.

Oh, and if you want clients to connect to a single IP and hit either the
master or the slave with some weights to choose one or the other, and a way
to remove from pool on failure etc, I think using haproxy in TCP mode would
do it. HaProxy is really nice for this purpose. http://haproxy.1wt.eu/

Regards,
--
dim

Re: Core team statement on replication in PostgreSQL

From
"Mike Rylander"
Date:
On Fri, May 30, 2008 at 6:47 PM, Andreas 'ads' Scherbaum
<adsmail@wars-nicht.de> wrote:
> On Fri, 30 May 2008 17:05:57 -0400 Andrew Dunstan wrote:
>> Andreas 'ads' Scherbaum wrote:
>> > On Thu, 29 May 2008 23:02:56 -0400 Andrew Dunstan wrote:
>> >
>> >> Well, yes, but you do know about archive_timeout, right? No need to wait
>> >> 2 hours.
>> >
>> > Then you ship 16 MB binary stuff every 30 second or every minute but
>> > you only have some kbyte real data in the logfile. This must be taken
>> > into account, especially if you ship the logfile over the internet
>> > (means: no high-speed connection, maybe even pay-per-traffic) to the
>> > slave.
>>
>> Sure there's a price to pay. But that doesn't mean the facility doesn't
>> exist. And I rather suspect that most of Josh's customers aren't too
>> concerned about traffic charges or affected by such bandwidth
>> restrictions. Certainly, none of my clients are, and they aren't in the
>> giant class. Shipping a 16Mb file, particularly if compressed, every
>> minute or so, is not such a huge problem for a great many commercial
>> users, and even many domestic users.
>
> The real problem is not the 16 MB, the problem is: you can't compress
> this file. If the logfile is rotated it still contains all the
> old binary data which is not a good starter for compression.

Using bzip2 in my archive_command script, my WAL files are normally
compressed to between 2MB and 5MB, depending on the write load
(larger, and more of them, in the middle of the day).  bzip2
compression is more expensive and rotated WAL files are not
particularly compressable to be sure, but due to (and given) the
nature of the data bzip2 works pretty well, and much better than gzip.
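
For anyone wanting to try the same, a minimal wrapper along these lines would do it (the function name and paths here are illustrative, not Mike's actual script):

```shell
#!/bin/sh
# Hypothetical WAL archiving helper: bzip2-compress each finished segment
# into an archive directory. postgresql.conf would then invoke it via
# something like:
#   archive_command = '/usr/local/bin/archive_wal.sh %p %f'
archive_wal() {
    src="$1"    # full path to the WAL segment (%p)
    name="$2"   # segment file name only (%f)
    dest_dir="${ARCHIVE_DIR:-/var/lib/pgsql/wal_archive}"
    bzip2 -c "$src" > "$dest_dir/$name.bz2"
}
```

On restore, the recovery side would need to decompress (e.g. bzcat) before handing segments to the server.
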

>
> So you may have some kB changes in the wal logfile every minute but you
> still copy 16 MB data. Sure, it's not so much - but if you rotate a
> logfile every minute this still transfers 16*60*24 = ~23 GB a day.
>

I archived 1965 logs yesterday on one instance of my app, totalling
8.5GB ... not too bad, really.

-- 
Mike Rylander | VP, Research and Design | Equinox Software, Inc. / The Evergreen Experts
phone: 1-877-OPEN-ILS (673-6457) | email: miker@esilibrary.com | web: http://www.esilibrary.com
 


Re: Core team statement on replication in PostgreSQL

From
"Merlin Moncure"
Date:
On Sat, May 31, 2008 at 2:18 AM, Mike Rylander <mrylander@gmail.com> wrote:
> On Fri, May 30, 2008 at 6:47 PM, Andreas 'ads' Scherbaum
> <adsmail@wars-nicht.de> wrote:
>> On Fri, 30 May 2008 17:05:57 -0400 Andrew Dunstan wrote:
>>> Andreas 'ads' Scherbaum wrote:
>>> > On Thu, 29 May 2008 23:02:56 -0400 Andrew Dunstan wrote:
>>> >
>>> >> Well, yes, but you do know about archive_timeout, right? No need to wait
>>> >> 2 hours.
>>> >
>>> > Then you ship 16 MB binary stuff every 30 second or every minute but
>>> > you only have some kbyte real data in the logfile. This must be taken
>>> > into account, especially if you ship the logfile over the internet
>>> > (means: no high-speed connection, maybe even pay-per-traffic) to the
>>> > slave.
>>>
>>> Sure there's a price to pay. But that doesn't mean the facility doesn't
>>> exist. And I rather suspect that most of Josh's customers aren't too
>>> concerned about traffic charges or affected by such bandwidth
>>> restrictions. Certainly, none of my clients are, and they aren't in the
>>> giant class. Shipping a 16Mb file, particularly if compressed, every
>>> minute or so, is not such a huge problem for a great many commercial
>>> users, and even many domestic users.
>>
>> The real problem is not the 16 MB, the problem is: you can't compress
>> this file. If the logfile is rotated it still contains all the
>> old binary data which is not a good starter for compression.
>
> Using bzip2 in my archive_command script, my WAL files are normally
> compressed to between 2MB and 5MB, depending on the write load
> (larger, and more of them, in the middle of the day).  bzip2
> compression is more expensive and rotated WAL files are not
> particularly compressable to be sure, but due to (and given) the
> nature of the data bzip2 works pretty well, and much better than gzip.

Compression especially is going to negate one of the big advantages of
WAL shipping, namely that it is a cheap investment in terms of load on
the master.  A gigabit link can ship a lot of log files, you can always
bond links, and 10GigE is coming.  IMO the key trick is to make sure you
don't send the log file more than once from the same source, i.e. a
cascading relay.

merlin


Re: Core team statement on replication in PostgreSQL

From
"Joshua D. Drake"
Date:
Merlin Moncure wrote:
> On Sat, May 31, 2008 at 2:18 AM, Mike Rylander <mrylander@gmail.com> wrote:
>> On Fri, May 30, 2008 at 6:47 PM, Andreas 'ads' Scherbaum

> Compression especially is going to negate one of the big advantages of
> wal shipping, namely that it is cheap investment in terms of load to
> the main.  A gigabit link can ship a lot of log files, you can always

Who has a gigabit link between Dallas and Atlanta? That is the actual 
problem here. Switch-to-switch compression is a waste of time (if you 
aren't running GigE, what are you doing???).

Sincerely,

Joshua D. Drake


Re: Core team statement on replication in PostgreSQL

From
Robert Hodges
Date:
Hi Merlin,

My point here is that with reasonably small extensions to the core you can
build products that are a lot better than SLONY.  Triggers do not cover
DDL, among other issues, and it's debatable whether they are the best way
to implement quorum policies like Google's semi-synchronous replication.
As I mentioned separately this topic deserves another thread which I
promise to start.

It is of course possible to meet some of these needs with an appropriate
client interface to WAL shipping.  There's no a-priori reason why built-in
PostgreSQL slaves need to be the only client.  I would put a vote in for
covering this possibility in the initial replication design.  We are using
a very similar approach in our own master/slave replication product.

Thanks, Robert

P.S., No offense intended to Jan Wieck et al.  There are some pretty cool
things in SLONY.

On 5/29/08 8:16 PM, "Merlin Moncure" <mmoncure@gmail.com> wrote:

> On Thu, May 29, 2008 at 3:05 PM, Robert Hodges
> <robert.hodges@continuent.com> wrote:
>> Third, you can't stop with just this feature.  (This is the BUT part of the
>> post.)  The use cases not covered by this feature are actually pretty
>> large.  Here are a few that concern me:
>>
>> 1.) Partial replication.
>> 2.) WAN replication.
>> 3.) Bi-directional replication.  (Yes, this is evil but there are problems
>> where it is indispensable.)
>> 4.) Upgrade support.  Aside from database upgrade (how would this ever
>> really work between versions?), it would not support zero-downtime app
>> upgrades, which depend on bi-directional replication tricks.
>> 5.) Heterogeneous replication.
>> 6.) Finally, performance scaling using scale-out over large numbers of
>> replicas.  I think it's possible to get tunnel vision on this -- it's not
>> a big requirement in the PG community because people don't use PG in the
>> first place when they want to do this.  They use MySQL, which has very
>> good replication for performance scaling, though it's rather weak for
>> availability.
>
> These type of things are what Slony is for.  Slony is trigger based.
> This makes it more complex than log shipping style replication, but
> provides lots of functionality.
>
> wal shipping based replication is maybe the fastest possible
> solution...you are already paying the overhead so it comes virtually
> for free from the point of view of the master.
>
> mysql replication is imo nearly worthless from backup standpoint.
>
> merlin

-- 
Robert Hodges, CTO, Continuent, Inc.
Email:  robert.hodges@continuent.com
Mobile:  +1-510-501-3728  Skype:  hodgesrm

Re: Core team statement on replication in PostgreSQL

From
"Merlin Moncure"
Date:
On Sun, Jun 1, 2008 at 11:58 AM, Robert Hodges
<robert.hodges@continuent.com> wrote:
> Hi Merlin,
>
> My point here is that with reasonably small extensions to the core you can
> build products that are a lot better than SLONY.   Triggers do not cover
> DDL, among other issues, and it's debatable whether they are the best way to
> implement quorum policies like Google's semi-synchronous replication.  As I
> mentioned separately this topic deserves another thread which I promise to
> start.

These issues are much discussed and well understood.  At this point,
the outstanding points of discussion are technical...how to make this
thing work.

merlin


Re: Core team statement on replication in PostgreSQL

From
Tom Lane
Date:
"Merlin Moncure" <mmoncure@gmail.com> writes:
> On Sun, Jun 1, 2008 at 11:58 AM, Robert Hodges
> <robert.hodges@continuent.com> wrote:
>> My point here is that with reasonably small extensions to the core you can
>> build products that are a lot better than SLONY.

> These issues are much discussed and well understood.

Well, what we know is that previous attempts to define replication hooks
to be added to the core have died for lack of interest.  Maybe Robert
can start a new discussion that will actually get somewhere; if so, more
power to him.  (Is the replica-hooks-discuss list still working?)  But
that is entirely orthogonal to what is proposed in this thread, which
is to upgrade the existing PITR support into a reasonably useful
replication feature.
        regards, tom lane


Re: Core team statement on replication in PostgreSQL

From
"Dawid Kuroczko"
Date:
On Thu, May 29, 2008 at 4:12 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> The Postgres core team met at PGCon to discuss a few issues, the largest
> of which is the need for simple, built-in replication for PostgreSQL.
[...]
> We believe that the most appropriate base technology for this is
> probably real-time WAL log shipping, as was demoed by NTT OSS at PGCon.
> We hope that such a feature can be completed for 8.4.  Ideally this
> would be coupled with the ability to execute read-only queries on the
> slave servers, but we see technical difficulties that might prevent that
> from being completed before 8.5 or even further out.  (The big problem
> is that long-running slave-side queries might still need tuples that are
> vacuumable on the master, and so replication of vacuuming actions would
> cause the slave's queries to deliver wrong answers.)
>
> Again, this will not replace Slony, pgPool, Continuent, Londiste, or
> other systems for many users, as it will not be highly scalable nor
> support long-distance replication nor replicating less than an entire
> installation.  But it is time to include a simple, reliable basic
> replication feature in the core system.

Hello!

I thought I would share a few thoughts of my own about the issue.
I have hands-on experience with Oracle and MySQL apart from
PostgreSQL, so I hope it will be a bit interesting.

The former has a feature called "physical standby", which looks
quite like our WAL-shipping based replication.  Simply put, archived
logs are replayed on the standby database.  The primary database
and standby database are connected, and can stream the logs
directly.  They either copy the log when it's finished (as we do now)
or can do it in a continuous manner (as I hope we will be able to).

It is possible to have synchronous replication (where "COMMIT"
on the primary database succeeds only when the data is safely stored
on the standby database).  I think such a feature would be a great
advantage for PostgreSQL (where you cannot afford to lose
any transactions).

Their standby database is not accessible.  It can be opened read-only,
but during that time replication stops.  So PostgreSQL having
read-only and still replicating standby database would be great.

The other method is "logical standby" which works by dissecting
WAL-logs and recreating DDLs/DMLs from it.  Never seen anyone
use it. ;-)

Then we have MySQL replication -- done by replaying actual DDLs/DMLs
on the slaves.  This approach has issues, most notably when slaves are
highly loaded and lag behind the master -- so you end up with infrastructure
to monitor lag and turn off slaves which lag too much.  Also it is painful
to set up -- you have to stop, copy, configure and run.

* Back to PostgreSQL world

As for PostgreSQL solutions we have Slony-I, which is great as long as
you don't have too many people managing the database and/or your
schema doesn't change too frequently.  Perhaps it would be more easily
maintainable if there were a way to get at DDLs (via DDL triggers or
similar).  Its main advantages for me are the ability to prepare
complex setups and easily add new slaves.  The pgpool solution is
quite nice, but then again adding a new slave is not so easy.  And
being a filtering layer between client and server it feels a bit
"fragile" (I know it is not, but then again it is harder to convince
someone that "yes it will work 100% right all the time").

* How I would like PostgreSQL WAL-replication to evolve:

First of all it would be great if a slave/standby would contact the master
and maintain its state with it (tell it its xmin, request a log to stream,
go online-streaming).  Especially I hope that it should be possible
to make a switchover (where the two databases exchange roles),
and here the direct connection between the two should help.

In detail, I think it should go like this:
* A slave database starts up, checks that it works as a "replica"
(hopefully it would not be a postgresql.conf constant, but rather
some file maintained by the database).
* It would connect to the master database, tell where in the WAL
it is now, and request a log N.
* If log N is not available, request the log from an externally
supplied script (so that it could be fetched from a log archive
repository somewhere, recovered from a backup tape, etc).
* Continue asking, until we get to the logs which are available
at master database.
* Continue replaying until we get within max_allowed_replication_lag
time, and open our slave for read-only queries.
* If we start lagging too much perhaps close the read-only access
to the database (perhaps configurable?).
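The catch-up sequence above can be expressed as a small decision routine.  This is only a sketch of the proposal, not anything PostgreSQL provides: the segment numbering, the "archive_script" source, and max_allowed_replication_lag are all hypothetical names taken from the post.

```python
# Sketch of the proposed slave catch-up logic.  All names here are
# hypothetical illustrations of the post's proposal.

def catch_up_plan(slave_pos, master_oldest_kept, master_current):
    """Return (segment, source) pairs describing where the slave would
    fetch each missing WAL segment from: the externally supplied
    archive script for segments the master has already recycled, and
    the master itself for the rest."""
    plan = []
    for seg in range(slave_pos + 1, master_current + 1):
        source = "archive_script" if seg < master_oldest_kept else "master"
        plan.append((seg, source))
    return plan

def may_open_read_only(replication_lag_seconds, max_allowed_replication_lag):
    """Open the slave for read-only queries only once replay is within
    the configured lag window; close it again if we fall behind."""
    return replication_lag_seconds <= max_allowed_replication_lag
```

For example, a slave at segment 10 contacting a master that has recycled everything below segment 14 and is currently writing segment 17 would fetch 11-13 from the archive and 14-17 directly.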

I think that replication should be easy to set up.  I think our
archive_command is quite easy, but many a person comes with a lot of
misconceptions about how it works (and it takes time to explain how
it actually works, especially what archive_command is for, and that
pg_start_backup() doesn't actually _do_ a backup, but just tells
PostgreSQL that a backup is being done).

Easy to setup and easy to switchover (change the direction
of replication) and failover.

Also, it would be interesting to be able to have a shared-storage
solution where a single database would modify data and multiple
read-only slaves would access the data.  The WAL-shipping would
be used then to "push" modified pages into read-only slaves' memory.
And each checkpoint would allow read-only slaves to release these
pages from memory.  Kind of replication without actually doing any
writes.

* high available  libpq?

Another thing worth considering is perhaps adding an HA capability
to libpq.  Currently I use a setup where I have a machine with the
database and pg_bouncer (for connection pooling at the database
level) and on client machines I have a pgpool on localhost
(for transparent failover).  I think some level of replication
awareness could be built into libpq.  It would allow simpler,
but still HA, setups.

Perhaps ability to specify a standby in connect string, something like:
 "host=foo1.bar;db=pgdb;failover=yes;stbyhost=foo2.bar"

...a failover="yes" would allow libpq to try connecting to stbyhost
if host is not available AND succeed if stbyhost announces itself
as a new master.

This would make switchover easier (clients would be told, that
we are closing, and that they should connect to a new master).

Also some way of supplying multiple hosts, where client should
pick a host at random, would be quite useful, say:
 "host=foo1.bar,foo2.bar,foo3.bar;db=pgdb;loadbalance=random"

Why in libpq and not in [language/framework/application]?  Well
it would help with not reinventing the wheel.  And if we want
a painless switchover, client should be aware that there is
a standby ready.  Also, being optional it wouldn't change anything
for those who don't want it.
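To make the proposal concrete, a client library might interpret such connect strings roughly as follows.  This is a sketch of the suggested (non-existent) syntax only -- none of these connect-string options are real libpq parameters:

```python
# Sketch of how a client library might interpret the hypothetical
# failover/load-balance connect-string extensions proposed above.

import random

def candidate_hosts(conninfo):
    """Parse the proposed extensions, e.g.
    'host=foo1.bar;db=pgdb;failover=yes;stbyhost=foo2.bar' or
    'host=foo1.bar,foo2.bar;db=pgdb;loadbalance=random', and return
    the ordered list of hosts the client should try to connect to."""
    opts = dict(kv.split("=", 1) for kv in conninfo.split(";"))
    hosts = opts.get("host", "").split(",")
    if opts.get("failover") == "yes" and "stbyhost" in opts:
        hosts.append(opts["stbyhost"])  # try the standby last
    if opts.get("loadbalance") == "random":
        random.shuffle(hosts)           # pick a host at random
    return hosts
```

The client would then try each host in order, accepting a standby only if it announces itself as the new master.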
Regards,
   Dawid

PS: As for pgpool/pg_bouncer -- I wish there would be a way for
them to tell the database that they are proxying for someone
else (so that I could check in pg_stat_activity who is on the far
end of the database connection, not "localhost" ;)).
-- 
Solving [site load issues] with [more database replication] is a lot
like solving your own personal problems with heroin - at first it
sorta works, but after a while things just get out of hand.


Re: Core team statement on replication in PostgreSQL

From
James Mansion
Date:
David Fetter wrote:
> This part is a deal-killer.  It's a giant up-hill slog to sell warm
> standby to those in charge of making resources available because the
> warm standby machine consumes SA time, bandwidth, power, rack space,
> etc., but provides no tangible benefit, and this feature would have
> exactly the same problem.
>
> IMHO, without the ability to do read-only queries on slaves, it's not
> worth doing this feature at all.
>   
That's not something that squares with my experience *at all*, which
admittedly is entirely in investment banks.  Business continuity is
king, and in some places the warm standby rep from the database vendor
is trusted more than block-level rep from the SAN vendor (though that
may be changing to some extent in favour of the SAN).

James



Re: Core team statement on replication in PostgreSQL

From
James Mansion
Date:
Aidan Van Dyk wrote:
> The whole single-threaded WAL replay problem is going to rear its ugly
> head here too, and mean that a slave *won't* be able to keep up with a
> busy master if it's actually trying to apply all the changes in
> real-time.
Is there a reason to commit at the same points that the master 
committed?  Wouldn't relaxing
that mean that at least you would get 'big' commits and some economy of 
scale?  It might
not be too bad.  All I can say is that Sybase warm standby is useful, 
even though the rep
for an update that changes a hundred rows is a hundred updates keyed on 
primary key,
which is pretty sucky in terms of T-SQL performance.



Re: Core team statement on replication in PostgreSQL

From
Hannu Krosing
Date:
On Fri, 2008-05-30 at 15:16 -0400, Robert Treat wrote:
> On Friday 30 May 2008 01:10:20 Tom Lane wrote:
> > Greg Smith <gsmith@gregsmith.com> writes:
> > > I fully accept that it may be the case that it doesn't make technical
> > > sense to tackle them in any order besides sync->read-only slaves because
> > > of dependencies in the implementation between the two.
> >
> > Well, it's certainly not been my intention to suggest that no one should
> > start work on read-only-slaves before we finish the other part.  The
> > point is that I expect the log shipping issues will be done first
> > because they're easier, and it would be pointless to not release that
> > feature if we had it.
> >
> > But since you mention it: one of the plausible answers for fixing the
> > vacuum problem for read-only slaves is to have the slaves push an xmin
> > back upstream to the master to prevent premature vacuuming.  The current
> > design of pg_standby is utterly incapable of handling that requirement.
> > So there might be an implementation dependency there, depending on how
> > we want to solve that problem.
> >
> 
> Sure, but who's to say that after synchronous wal shipping is "finished" it 
> won't need a serious re-write due to new needs from the hot standby feature.  I 
> think going either way carries some risk. 

The simplest form of synchronous wal shipping would not even need
postgresql running on slave, just a small daemon which reports when wal
blocks are a) received and b) synced to disk. 
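The core of such a daemon is only a few lines.  The sketch below is an illustration of Hannu's idea, not anything that ships with PostgreSQL; the two ack points correspond to the "received" and "synced to disk" reports he describes (and to DRBD's protocol B and C semantics discussed later in the thread):

```python
# Minimal sketch of the slave-side WAL-receiving daemon described
# above: no PostgreSQL on the slave, just receive, store, and ack.
# The ack values and calling convention are invented for illustration.

import os

RECEIVED, SYNCED = b"R", b"S"

def handle_wal_block(wal_file_fd, block, send_ack):
    """Acknowledge receipt immediately, then acknowledge again once
    the block is durably on disk."""
    send_ack(RECEIVED)            # (a) block has arrived over the network
    os.write(wal_file_fd, block)
    os.fsync(wal_file_fd)         # force the block to stable storage
    send_ack(SYNCED)              # (b) block is safely synced to disk
```

The master then decides, per its configured guarantee level, whether COMMIT waits for the first ack, the second, or neither.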

This setup would just guarantee "no data loss on single machine
failure".  From there on you could add various features, including
support for both switchover and failover, async replication to multiple
slaves, etc.

The only thing that needs anything additional from the slave
wal-receiving daemon is when you want the kind of wal-sync which would
guarantee that a read-only query on the slave, issued after commit
returns on the master, sees the latest data.  For these kinds of
guarantees you need at least feedback about wal-replay, but possibly
also shared transaction numbers and shared snapshots, to be sure that
OLTP type queries see the latest data and OLAP queries are not denied
tuples already VACUUMed on the master.

--------------
Hannu




Re: Core team statement on replication in PostgreSQL

From
Hannu Krosing
Date:
On Thu, 2008-05-29 at 13:37 -0400, Tom Lane wrote:
> David Fetter <david@fetter.org> writes:
> > On Thu, May 29, 2008 at 08:46:22AM -0700, Joshua D. Drake wrote:
> >> The only question I have is... what does this give us that PITR
> >> doesn't give us?
> 
> > It looks like a wrapper for PITR to me, so the gain would be ease of
> > use.
> 
> A couple of points about that:
> 
> * Yeah, ease of use is a huge concern here.  We're getting beat up
> because people have to go find a separate package (and figure out
> which one they want), install it, learn how to use it, etc.  It doesn't
> help that the most mature package is Slony which is, um, not very
> novice-friendly or low-admin-complexity.  I personally got religion
> on this about two months ago when Red Hat switched their bugzilla
> from Postgres to MySQL because the admins didn't want to deal with Slony
> any more.  People want simple.
> 
> * The proposed approach is trying to get to "real" replication
> incrementally.  Getting rid of the loss window involved in file-by-file
> log shipping is step one, 

Actually we can already do better than file-by-file by using
pg_xlogfile_name_offset() which was added sometime in 2006. SkyTools for
example does this to get no more than a few seconds failure window.

Doing this synchronously would of course be better.

Probably we should use the same "modes/protocols" as DRBD when
determining when a "sync" WAL write is "done".

quote from 
http://www.slackworks.com/~dkrovich/DRBD/usingdrbdsetup.html#AEN76


Table 1. DRBD Protocols

Protocol   Description
--------   ----------------------------------------------------
A          A write operation is complete as soon as the data is
           written to disk and sent to the network.
B          A write operation is complete as soon as a reception
           acknowledgement arrives.
C          A write operation is complete as soon as a write
           acknowledgement arrives.

There are also additional parameters you can pass to the disk and net
options.  See the drbdsetup man page for additional information.

/end quote

> and I suspect that step two is going to be
> fixing performance issues in WAL replay to ensure that slaves can keep
> up.  After that we'd start thinking about how to let slaves run
> read-only queries.  But even without read-only queries, this will be
> a useful improvement for HA/backup scenarios.
> 
>             regards, tom lane
> 



Re: Core team statement on replication in PostgreSQL

From
Hannu Krosing
Date:
On Thu, 2008-05-29 at 12:05 -0700, Robert Hodges wrote:
> Hi everyone, 
> 
> First of all, I’m absolutely delighted that the PG community is
> thinking seriously about replication.  
> 
> Second, having a solid, easy-to-use database availability solution
> that works more or less out of the box would be an enormous benefit to
> customers.  Availability is the single biggest problem for customers
> in my experience and as other people have commented the alternatives
> are not nice.  It’s an excellent idea to build off an existing feature
> —PITR is already pretty useful and the proposed features are solid
> next steps.  The fact that it does not solve all problems is not a
> drawback but means it’s likely to get done in a reasonable timeframe. 
> 
> Third, you can’t stop with just this feature.  (This is the BUT part
> of the post.)  The use cases not covered by this feature area actually
> pretty large.  Here are a few that concern me: 
> 
> 1.) Partial replication. 
> 2.) WAN replication. 

1.) & 2.) are better done async, the domain of Slony-I/Londiste.

> 3.) Bi-directional replication.  (Yes, this is evil but there are
> problems where it is indispensable.) 

Sure, but it is also a lot harder and always has several dimensions
(performance/availability/locking) which play against each other.

> 4.) Upgrade support.  Aside from database upgrade (how would this ever
> really work between versions?), it would not support zero-downtime app
> upgrades, which depend on bi-directional replication tricks. 

Or you could use zero-downtime  app upgrades, which don't depend on
this :P

> 5.) Heterogeneous replication. 
> 6.) Finally, performance scaling using scale-out over large numbers of
> replicas.  I think it’s possible to get tunnel vision on this—it’s not
> a big requirement in the PG community because people don’t use PG in
> the first place when they want to do this.  They use MySQL, which has
> very good replication for performance scaling, though it’s rather weak
> for availability.  

Again, doing scale-out over a large number of replicas should either be
async, or for sync use some broadcast channel to all slaves (and still
be a performance problem on the master, as it has to wait for the
slowest slave).

> As a consequence, I don’t see how you can get around doing some sort
> of row-based replication like all the other databases. 

Isn't WAL-based replication "some sort of row-based replication"?

>  Now that people are starting to get religion on this issue I would
> strongly advocate a parallel effort to put in a change-set extraction
> API that would allow construction of comprehensive master/slave
> replication. 

Triggers.  See pgQ's logtrigga()/logutrigga(), and the slides for Marko
Kreen's presentation at PGCon08.

>  (Another approach would be to make it possible for third party apps
> to read the logs and regenerate SQL.) 

Which logs?  WAL or SQL command logs?

> There are existing models for how to do change set extraction; we have
> done it several times at my company already.  There are also research
> projects like GORDA that have looked fairly comprehensively at this
> problem.

pgQ with its triggers does a pretty good job of change-set extraction.

------------------
Hannu




Re: Core team statement on replication in PostgreSQL

From
Robert Hodges
Date:
Hi Hannu,


On 6/1/08 2:14 PM, "Hannu Krosing" <hannu@krosing.net> wrote:

>
>> As a consequence, I don't see how you can get around doing some sort
>> of row-based replication like all the other databases.
>
> Isn't WAL-based replication "some sort of row-based replication"?
>
Yes, in theory.  However, there's a big difference between replicating
physical WAL records and doing logical replication with SQL statements.
Logical replication requires extra information to reconstruct primary keys.
(Somebody tell me if this is already in the WAL; I'm learning the code as
fast as possible but assuming for now it's not.)

>
>>  Now that people are starting to get religion on this issue I would
>> strongly advocate a parallel effort to put in a change-set extraction
>> API that would allow construction of comprehensive master/slave
>> replication.
>
> Triggers. see pgQ's logtrigga()/logutrigga(). See slides for Marko
> Kreen's presentation at pgCon08.
>
>
Thanks very much for the pointer.  The slides look interesting.

Robert



Re: Core team statement on replication in PostgreSQL

From
Chris Browne
Date:
adsmail@wars-nicht.de ("Andreas 'ads' Scherbaum") writes:
> On Thu, 29 May 2008 23:02:56 -0400 Andrew Dunstan wrote:
>
>> Well, yes, but you do know about archive_timeout, right? No need to wait 
>> 2 hours.
>
> Then you ship 16 MB binary stuff every 30 second or every minute but
> you only have some kbyte real data in the logfile. This must be taken
> into account, especially if you ship the logfile over the internet
> (means: no high-speed connection, maybe even pay-per-traffic) to the
> slave.

If you have that kind of scenario, then you have painted yourself into
a corner, and there isn't anything that can be done to extract you
from it.

Consider: If you have so much update traffic that it is too much to
replicate via WAL-copying, why should we expect that other mechanisms
*wouldn't* also overflow the connection?

If you haven't got enough network bandwidth to use this feature, then
nobody is requiring that you use it.  It seems like a perfectly
reasonable prerequisite to say "this requires that you have enough
bandwidth."
-- 
(reverse (concatenate 'string "ofni.secnanifxunil" "@" "enworbbc"))
http://www3.sympatico.ca/cbbrowne/
"There's nothing worse than having only one drunk head."
-- Zaphod Beeblebrox


Re: Core team statement on replication in PostgreSQL

From
Hannu Krosing
Date:
On Thu, 2008-05-29 at 23:37 +0200, Mathias Brossard wrote:

> I pointed out that the NTT solution is synchronous because Tom said in 
> the first part of his email that:
> 
>  > In practice, simple asynchronous single-master-multiple-slave
>  > replication covers a respectable fraction of use cases, so we have
>  > concluded that we should allow such a feature to be included in the
>  > core project.
> 
> ... and yet "the most appropriate base technology for this" is 
> synchronous and maybe I should have also pointed out in my previous mail 
> is that it doesn't support multiple slaves.

I don't think that you need too many slaves in sync mode.

Probably having the 1st slave sync and the others async from there on
will be good enough.

> Also, as others have pointed out there are different interpretations of 
> "synchronous" depending on whether the WAL data has reached the other end 
> of the network connection, a safe disk checkpoint or the slave DB itself.

Probably all of DRBD's levels (A: data sent to network, B: data
received, C: data written to disk) should be supported, plus C1: data
replayed in the slave DB -- C1 meaning that it can be done in parallel
with C.

Then each DBA can set it up depending on what he trusts - the network,
the slave's power supply or the slave's disk.

Also, the case of slave failure should be addressed. I don't think that
the best solution is halting all ops on master if slave/network fails.

Maybe we should also allow a setup with 2-3 slaves, where operations
can continue as long as at least 1 slave is "syncing"?

--------------
Hannu



Re: Core team statement on replication in PostgreSQL

From
Andreas 'ads' Scherbaum
Date:
On Mon, 02 Jun 2008 11:52:05 -0400 Chris Browne wrote:

> adsmail@wars-nicht.de ("Andreas 'ads' Scherbaum") writes:
> > On Thu, 29 May 2008 23:02:56 -0400 Andrew Dunstan wrote:
> >
> >> Well, yes, but you do know about archive_timeout, right? No need to wait 
> >> 2 hours.
> >
> > Then you ship 16 MB binary stuff every 30 second or every minute but
> > you only have some kbyte real data in the logfile. This must be taken
> > into account, especially if you ship the logfile over the internet
> > (means: no high-speed connection, maybe even pay-per-traffic) to the
> > slave.
> 
> If you have that kind of scenario, then you have painted yourself into
> a corner, and there isn't anything that can be done to extract you
> from it.

You are misunderstanding something. It's perfectly possible that you
have a low-traffic database with changes every now and then. But you
have to copy a full 16 MB logfile every 30 seconds or every minute just
to have the slave up-to-date.


> Consider: If you have so much update traffic that it is too much to
> replicate via WAL-copying, why should we expect that other mechanisms
> *wouldn't* also overflow the connection?

For a few MB of real data you copy several GB of logfiles per day -
that's a lot of overhead, isn't it?


> If you haven't got enough network bandwidth to use this feature, then
> nobody is requiring that you use it.  It seems like a perfectly
> reasonable prerequisite to say "this requires that you have enough
> bandwidth."

If you have a high-traffic database, then of course you need a
different connection than if you only have a low-traffic or a mostly
read-only database.  But that's not the point.  Copying an almost
unused 16 MB WAL logfile is just overhead - especially because the
logfile cannot be compressed very much, due to all the leftovers from
earlier use.


Kind regards

-- 
               Andreas 'ads' Scherbaum
German PostgreSQL User Group


Re: Core team statement on replication in PostgreSQL

From
Hannu Krosing
Date:
On Mon, 2008-06-02 at 22:40 +0200, Andreas 'ads' Scherbaum wrote:
> On Mon, 02 Jun 2008 11:52:05 -0400 Chris Browne wrote:
> 
> > adsmail@wars-nicht.de ("Andreas 'ads' Scherbaum") writes:
> > > On Thu, 29 May 2008 23:02:56 -0400 Andrew Dunstan wrote:
> > >
> > >> Well, yes, but you do know about archive_timeout, right? No need to wait 
> > >> 2 hours.
> > >
> > > Then you ship 16 MB binary stuff every 30 second or every minute but
> > > you only have some kbyte real data in the logfile. This must be taken
> > > into account, especially if you ship the logfile over the internet
> > > (means: no high-speed connection, maybe even pay-per-traffic) to the
> > > slave.
> > 
> > If you have that kind of scenario, then you have painted yourself into
> > a corner, and there isn't anything that can be done to extract you
> > from it.
> 
> You are misunderstanding something. It's perfectly possible that you
> have a low-traffic database with changes every now and then. But you
> have to copy a full 16 MB logfile every 30 seconds or every minute just
> to have the slave up-to-date.

To repeat my other post in this thread:

Actually we can already do better than file-by-file by using
pg_xlogfile_name_offset() which was added sometime in 2006. walmgr.py
from SkyTools package for example does this to get no more than a few
seconds failure window and it copies just the changed part of WAL to
slave.

pg_xlogfile_name_offset() was added just for this purpose - to enable
WAL shipping scripts to query, where inside the logfile current write
pointer is.

It is not synchronous, but it can be made very close, within subsecond
if you poll it frequently enough.
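The shipping arithmetic this implies is simple.  The sketch below is an illustration of the technique (walmgr.py does something similar, but this is not its actual code): WAL segments default to 16 MB, and successive polls of pg_xlogfile_name_offset() give a (file, offset) position, from which the shipper copies only the newly written tail.

```python
# Sketch of incremental WAL shipping: given positions from two
# successive polls of pg_xlogfile_name_offset(), compute the byte
# ranges to copy.  Segments are identified here by a plain number
# for illustration; real WAL file names encode timeline/log/segment.

WAL_SEG_SIZE = 16 * 1024 * 1024  # default WAL segment size

def bytes_to_ship(prev, cur):
    """Return (segment_no, start, end) tuples covering only the WAL
    bytes written between the two polls -- never a whole 16 MB file
    unless a whole file was actually written."""
    prev_seg, prev_off = prev
    cur_seg, cur_off = cur
    ranges = []
    for seg in range(prev_seg, cur_seg + 1):
        start = prev_off if seg == prev_seg else 0
        end = cur_off if seg == cur_seg else WAL_SEG_SIZE
        if end > start:
            ranges.append((seg, start, end))
    return ranges
```

Polling this every second or so is what gives the sub-second failure window: a quiet database ships a few bytes, not a 16 MB segment.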

-------------------
Hannu




Re: Core team statement on replication in PostgreSQL

From
Andrew Sullivan
Date:
On Sun, Jun 01, 2008 at 01:43:22PM -0400, Tom Lane wrote:
> power to him.  (Is the replica-hooks-discuss list still working?)  But

Yes.  And silent as ever. :-)

A

-- 
Andrew Sullivan
ajs@commandprompt.com
+1 503 667 4564 x104
http://www.commandprompt.com/


Re: Core team statement on replication in PostgreSQL

From
"Stephen Denne"
Date:
Hannu Krosing wrote:
> The simplest form of synchronous wal shipping would not even need
> postgresql running on slave, just a small daemon which
> reports when wal
> blocks are a) received and b) synced to disk.

While that does sound simple, I'd presume that most people would want
the guarantee of the same version of postgresql installed wherever the
logs are ending up, with the log receiver speaking the same protocol
version as the log sender.  I imagine that would be most easily
achieved through using something like the continuously restoring
startup mode of current postgresql.

However, variations on this kind of daemon can be used to perform
testing, configuring it to work well, go slow, pause, not respond,
disconnect, or fail in particular ways, emulating disk full, etc.

Regards,
Stephen Denne.




Re: Core team statement on replication in PostgreSQL

From
Markus Schiltknecht
Date:
Hello Andrew,

Andrew Sullivan wrote:
> Yes.  And silent as ever. :-)

Are the slides of your PgCon talk available for download somewhere?

BTW: up until recently, there was yet another mailing list: 
pgreplication-general@gborg.postgresql.org. It was less focused on hooks 
and got at least some traffic. :-) Are those mails still archived somewhere?

Regards

Markus





Re: Core team statement on replication in PostgreSQL

From
"Heikki Linnakangas"
Date:
Stephen Denne wrote:
> Hannu Krosing wrote:
>> The simplest form of synchronous wal shipping would not even need
>> postgresql running on slave, just a small daemon which 
>> reports when wal
>> blocks are a) received and b) synced to disk. 
> 
> While that does sound simple, I'd presume that most people would want
> the guarantee of the same version of postgresql installed wherever the
> logs are ending up, with the log receiver speaking the same protocol
> version as the log sender.  I imagine that would be most easily
> achieved through using something like the continuously restoring
> startup mode of current postgresql.

Hmm, WAL version compatibility is an interesting question. Most minor 
releases haven't changed the WAL format, and it would be nice to allow 
running different minor versions on the master and slave in those cases. 
But it's certainly not unheard of to change the WAL format. Perhaps we 
should introduce a WAL version number, similar to the catalog version?
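A check along these lines could be as small as validating a version/magic field in every WAL page header before replay. The sketch below is purely illustrative: the field layout and constant are invented (PostgreSQL's real page header magic, xlp_magic, is the "WAL page ID field" Tom mentions later in the thread).

```python
import struct

# Hypothetical page header layout: 2-byte magic/version, 2-byte info flags.
# The real XLogPageHeaderData is larger; this sketch only shows the check.
WAL_PAGE_MAGIC = 0xD063  # bumped whenever the WAL format changes

def make_page_header(magic=WAL_PAGE_MAGIC, info=0):
    return struct.pack("<HH", magic, info)

def check_page_version(header_bytes):
    """Refuse to replay a WAL page written by an incompatible binary."""
    magic, _info = struct.unpack("<HH", header_bytes[:4])
    if magic != WAL_PAGE_MAGIC:
        raise ValueError(
            "WAL page magic %#06x does not match expected %#06x; "
            "master and slave are incompatible" % (magic, WAL_PAGE_MAGIC))
    return True
```

A slave built from a release that bumped the magic would then fail loudly at the first page rather than misinterpreting records.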

-- 
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com


Re: Core team statement on replication in PostgreSQL

From
Teodor Sigaev
Date:
> 
> Hmm, WAL version compatibility is an interesting question. Most minor 
> releases hasn't changed the WAL format, and it would be nice to allow 
As I remember, a higher minor version should be able to read all WAL from 
lower ones, but the opposite isn't true, nor does it hold across different 
major versions.

> running different minor versions in the master and slave in those cases. 
> But it's certainly not unheard of to change the WAL format. Perhaps we 
> should introduce a WAL version number, similar to catalog version?

Agreed. Right now this only affects warm-standby servers, but introducing 
simple log shipping, and replication based on it, will cause a lot of 
non-obvious errors/bugs. Is it possible to use the catalog version number 
as the WAL version?

-- 
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
  WWW: http://www.sigaev.ru/
 


Re: Core team statement on replication in PostgreSQL

From
Csaba Nagy
Date:
On Wed, 2008-06-04 at 11:13 +0300, Heikki Linnakangas wrote:
> Hmm, WAL version compatibility is an interesting question. Most minor 
> releases hasn't changed the WAL format, and it would be nice to allow 
> running different minor versions in the master and slave in those cases. 
> But it's certainly not unheard of to change the WAL format. Perhaps we 
> should introduce a WAL version number, similar to catalog version?

Would that also cover possible differences in page size, 32bit OS vs.
64bit OS, different timestamp flavour, etc. issues ? AFAIR, all these
things can have an influence on how the data is written and possibly
make the WAL incompatible with other postgres instances, even if the
exact same version...

Cheers,
Csaba.




Re: Core team statement on replication in PostgreSQL

From
"Heikki Linnakangas"
Date:
Csaba Nagy wrote:
> On Wed, 2008-06-04 at 11:13 +0300, Heikki Linnakangas wrote:
>> Hmm, WAL version compatibility is an interesting question. Most minor 
>> releases hasn't changed the WAL format, and it would be nice to allow 
>> running different minor versions in the master and slave in those cases. 
>> But it's certainly not unheard of to change the WAL format. Perhaps we 
>> should introduce a WAL version number, similar to catalog version?
> 
> Would that also cover possible differences in page size, 32bit OS vs.
> 64bit OS, different timestamp flavour, etc. issues ? AFAIR, all these
> things can have an influence on how the data is written and possibly
> make the WAL incompatible with other postgres instances, even if the
> exact same version...

These are already covered by the information in pg_control.
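As an illustration of what "covered by the information in pg_control" means in practice: on startup, a slave can compare the compatibility fields of its own build against the master's control data and refuse to proceed on any mismatch. The field subset and values below are abbreviated and made up for the example; the real ControlFileData carries more (segment sizes, locale settings, float-passing conventions, ...).

```python
# Abbreviated, illustrative subset of the compatibility fields stored
# in pg_control (values invented for the example).
def control_fields(blcksz=8192, xlog_blcksz=8192, maxalign=8,
                   integer_datetimes=True, catalog_version_no=200806040):
    return {
        "blcksz": blcksz,                        # data page size
        "xlog_blcksz": xlog_blcksz,              # WAL page size
        "maxalign": maxalign,                    # 32- vs 64-bit alignment
        "integer_datetimes": integer_datetimes,  # timestamp flavour
        "catalog_version_no": catalog_version_no,
    }

def incompatible_fields(master, slave):
    """Return the names of fields on which master and slave disagree."""
    return sorted(k for k in master if master[k] != slave.get(k))
```

Identical builds yield an empty list; a 32-bit slave paired with a 64-bit master, or one built with a different page size or timestamp flavour, shows up as a named mismatch.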

-- 
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com


Re: Core team statement on replication in PostgreSQL

From
"Heikki Linnakangas"
Date:
Teodor Sigaev wrote:
> Is it possible to use catalog version number as WAL version?

No, because we don't change the catalog version number in minor 
releases, even though we might change WAL format.

-- 
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com


Re: Core team statement on replication in PostgreSQL

From
Andrew Sullivan
Date:
On Wed, Jun 04, 2008 at 09:24:20AM +0200, Markus Schiltknecht wrote:
>
> Are the slides of your PgCon talk available for download somewhere?

There weren't any slides, really (there were 4 that I put up in case
the cases I was discussing needed back-references, but they didn't).
Joshua tells me that I'm supposed to make the paper readable and put
it up on Command Prompt's website, so I will soon.

> BTW: up until recently, there was yet another mailing list: 
> pgreplication-general@gborg.postgresql.org. It was less focused on hooks 
> and got at least some traffic. :-) Are those mails still archived 
> somewhere?

Unless whoever was operating that list moved it to pgfoundry, I doubt
it (except on backups somewhere).

A

-- 
Andrew Sullivan
ajs@commandprompt.com
+1 503 667 4564 x104
http://www.commandprompt.com/


Re: Core team statement on replication in PostgreSQL

From
Tom Lane
Date:
"Heikki Linnakangas" <heikki@enterprisedb.com> writes:
> Hmm, WAL version compatibility is an interesting question. Most minor 
> releases hasn't changed the WAL format, and it would be nice to allow 
> running different minor versions in the master and slave in those cases. 
> But it's certainly not unheard of to change the WAL format. Perhaps we 
> should introduce a WAL version number, similar to catalog version?

Yeah, perhaps.  In the past we've changed the WAL page ID field for
this; I'm not sure if that's enough or not.  It does seem like a good
idea to have a way to check that the slaves aren't trying to read a
WAL version they don't understand.  Also, it's possible that the WAL
format doesn't change across a major update, but you still couldn't
work with say an 8.4 master and an 8.3 slave, so maybe we need the
catalog version ID in there too.
        regards, tom lane


Re: Core team statement on replication in PostgreSQL

From
Hannu Krosing
Date:
On Wed, 2008-06-04 at 10:40 -0400, Tom Lane wrote:
> "Heikki Linnakangas" <heikki@enterprisedb.com> writes:
> > Hmm, WAL version compatibility is an interesting question. Most minor 
> > releases hasn't changed the WAL format, and it would be nice to allow 
> > running different minor versions in the master and slave in those cases. 
> > But it's certainly not unheard of to change the WAL format. Perhaps we 
> > should introduce a WAL version number, similar to catalog version?
> 
> Yeah, perhaps.  In the past we've changed the WAL page ID field for
> this; I'm not sure if that's enough or not.  It does seem like a good
> idea to have a way to check that the slaves aren't trying to read a
> WAL version they don't understand.  Also, it's possible that the WAL
> format doesn't change across a major update, but you still couldn't
> work with say an 8.4 master and an 8.3 slave, so maybe we need the
> catalog version ID in there too.

And something dependent on whether datetimes are integers.

We probably won't need to encode the presence of user-defined types,
like PostGIS, will we?

-----
Hannu




Re: Core team statement on replication in PostgreSQL

From
Tom Lane
Date:
Hannu Krosing <hannu@krosing.net> writes:
> On Wed, 2008-06-04 at 10:40 -0400, Tom Lane wrote:
>> "Heikki Linnakangas" <heikki@enterprisedb.com> writes:
> Hmm, WAL version compatibility is an interesting question. Most minor 
> releases hasn't changed the WAL format, and it would be nice to allow 
> running different minor versions in the master and slave in those cases. 
> But it's certainly not unheard of to change the WAL format. Perhaps we 
> should introduce a WAL version number, similar to catalog version?
>> 
>> Yeah, perhaps.  In the past we've changed the WAL page ID field for
>> this; I'm not sure if that's enough or not.  It does seem like a good
>> idea to have a way to check that the slaves aren't trying to read a
>> WAL version they don't understand.  Also, it's possible that the WAL
>> format doesn't change across a major update, but you still couldn't
>> work with say an 8.4 master and an 8.3 slave, so maybe we need the
>> catalog version ID in there too.

> And something dependent on datetime being integer.

This thread is getting out of hand, actually.

Heikki's earlier comment about pg_control reminded me that we already
have a unique system identifier stored in pg_control and check that
against WAL headers.  So I think we already have enough certainty that
the master and slaves have the same pg_control and hence are the same
for everything checked by pg_control.

However, since by definition pg_control doesn't change in a minor
upgrade, there isn't any easy way to enforce a rule like "slaves must be
same or newer minor version as the master".  I'm not sure that we
actually *want* to enforce such a rule, though.  Most of the time, the
other way around would work fine.
        regards, tom lane


Re: Core team statement on replication in PostgreSQL

From
Simon Riggs
Date:
On Wed, 2008-06-04 at 11:37 -0400, Tom Lane wrote:

> This thread is getting out of hand, actually.

Agreed. We should start new threads for specific things. Please.

> However, since by definition pg_control doesn't change in a minor
> upgrade, there isn't any easy way to enforce a rule like "slaves must be
> same or newer minor version as the master".  I'm not sure that we
> actually *want* to enforce such a rule, though. 

Definitely don't want to prevent minor version mismatches. We want to be
able to upgrade a standby, have it catch up with the master then
switchover to the new version. Otherwise we'd have to take whole
replicated system down to do minor upgrades/backouts. Ugh!

-- 
  Simon Riggs           www.2ndQuadrant.com
  PostgreSQL Training, Services and Support



Re: Core team statement on replication in PostgreSQL

From
Jeff Davis
Date:
On Wed, 2008-06-04 at 14:17 +0300, Heikki Linnakangas wrote:
> > Would that also cover possible differences in page size, 32bit OS vs.
> > 64bit OS, different timestamp flavour, etc. issues ? AFAIR, all these
> > things can have an influence on how the data is written and possibly
> > make the WAL incompatible with other postgres instances, even if the
> > exact same version...
> 
> These are already covered by the information in pg_control.

Another thing that can change between systems is the collation behavior,
which can corrupt indexes (and other bad things).

Regards,
Jeff Davis



Re: Core team statement on replication in PostgreSQL

From
Tom Lane
Date:
Jeff Davis <pgsql@j-davis.com> writes:
> On Wed, 2008-06-04 at 14:17 +0300, Heikki Linnakangas wrote:
>> These are already covered by the information in pg_control.

> Another thing that can change between systems is the collation behavior,
> which can corrupt indexes (and other bad things).

That is covered by pg_control, at least to the extent of forcing the
same value of LC_COLLATE.
        regards, tom lane


Re: Core team statement on replication in PostgreSQL

From
Jeff Davis
Date:
On Wed, 2008-06-04 at 14:23 -0400, Tom Lane wrote:
> That is covered by pg_control, at least to the extent of forcing the
> same value of LC_COLLATE.

But the same LC_COLLATE means different things on different systems.
Even "en_US" means something different on Mac versus Linux.

Regards,
Jeff Davis



Re: Core team statement on replication in PostgreSQL

From
"Koichi Suzuki"
Date:
Well, the WAL format doesn't only depend on WAL itself, but also depends
on each resource manager.  If we introduce WAL format version
identification, ISTM that we have to take care of matching the resource
managers on the master and the slave as well.

2008/6/4 Heikki Linnakangas <heikki@enterprisedb.com>:
> Stephen Denne wrote:
>>
>> Hannu Krosing wrote:
>>>
>>> The simplest form of synchronous wal shipping would not even need
>>> postgresql running on slave, just a small daemon which reports when wal
>>> blocks are a) received and b) synced to disk.
>>
>> While that does sound simple, I'd presume that most people would want the
>> guarantee of the same version of postgresql installed wherever the logs are
>> ending up, with the log receiver speaking the same protocol version as the
>> log sender. I imagine that would be most easily achieved through using
>> something like the continuously restoring startup mode of current
>> postgresql.
>
> Hmm, WAL version compatibility is an interesting question. Most minor
> releases hasn't changed the WAL format, and it would be nice to allow
> running different minor versions in the master and slave in those cases. But
> it's certainly not unheard of to change the WAL format. Perhaps we should
> introduce a WAL version number, similar to catalog version?
>
> --
>  Heikki Linnakangas
>  EnterpriseDB   http://www.enterprisedb.com
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers
>



-- 
------
Koichi Suzuki


Re: Core team statement on replication in PostgreSQL

From
Tom Lane
Date:
"Koichi Suzuki" <koichi.szk@gmail.com> writes:
> Well, WAL format doesn't only depend on WAL itself, but also depend on
> each resource manager.   If we introduce WAL format version
> identification, ISTM that we have to take care of the matching of
> resource manager in the master and the slave as well.

That seems a bit overdesigned.  What are the prospects that two builds
of the same Postgres version are going to have different sets of
resource managers in them?
        regards, tom lane


Re: Core team statement on replication in PostgreSQL

From
"Koichi Suzuki"
Date:
If the versions of the master and the slave are different and we'd still
like to allow log-shipping replication, we need to negotiate whether the
WAL formats of the two are compatible.  I hope this is not in our scope
and I'm worrying too much.

2008/6/5 Tom Lane <tgl@sss.pgh.pa.us>:
> "Koichi Suzuki" <koichi.szk@gmail.com> writes:
>> Well, WAL format doesn't only depend on WAL itself, but also depend on
>> each resource manager.   If we introduce WAL format version
>> identification, ISTM that we have to take care of the matching of
>> resource manager in the master and the slave as well.
>
> That seems a bit overdesigned.  What are the prospects that two builds
> of the same Postgres version are going to have different sets of
> resource managers in them?
>
>                        regards, tom lane
>



-- 
------
Koichi Suzuki


Re: Core team statement on replication in PostgreSQL

From
Gregory Stark
Date:
"Jeff Davis" <pgsql@j-davis.com> writes:

> On Wed, 2008-06-04 at 14:17 +0300, Heikki Linnakangas wrote:
>> > Would that also cover possible differences in page size, 32bit OS vs.
>> > 64bit OS, different timestamp flavour, etc. issues ? AFAIR, all these
>> > things can have an influence on how the data is written and possibly
>> > make the WAL incompatible with other postgres instances, even if the
>> > exact same version...
>> 
>> These are already covered by the information in pg_control.
>
> Another thing that can change between systems is the collation behavior,
> which can corrupt indexes (and other bad things).

Well, yes and no. It's entirely possible, for example, for a minor release of
an OS to tweak the collation rules for a collation without changing the name.
For the sake of argument they might just be fixing a bug in the collation
rules. From the point of view of the OS that's a minor bug fix that they might
not foresee causing data corruption problems.

Pegging pg_control to a particular release of the OS would be pretty terrible
though. I don't really see an out for this. But it's another roadblock to
consider akin to "not-really-immutable index expressions" for any proposal
which depends on re-finding index pointers :(

-- 
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com
  Ask me about EnterpriseDB's Slony Replication support!


Re: Core team statement on replication in PostgreSQL

From
Bruce Momjian
Date:
Gurjeet Singh wrote:
> On Fri, May 30, 2008 at 10:40 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> 
> > But since you mention it: one of the plausible answers for fixing the
> > vacuum problem for read-only slaves is to have the slaves push an xmin
> > back upstream to the master to prevent premature vacuuming.  The current
> > design of pg_standby is utterly incapable of handling that requirement.
> > So there might be an implementation dependency there, depending on how
> > we want to solve that problem.
> >
> 
> I think it would be best to not make the slave interfere with the master's
> operations; that's only going to increase the operational complexity of such
> a solution.
> 
> There could be multiple slaves following a master, some serving

For the slave to not interfere with the master at all, we would need to
delay application of WAL files on each slave until visibility on that
slave allows the WAL to be applied, but in that case we would have
long-running transactions delay data visibility of all slave sessions.
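The alternative Tom raised earlier in the thread — slaves pushing an xmin back upstream — can be sketched in a few lines. XIDs are simplified to plain integers here (real ones compare modulo 2^32), and the function name is invented:

```python
def vacuum_horizon(master_oldest_xmin, slave_xmins):
    """Clamp the master's vacuum cutoff so that no tuple still visible
    to a query running on some slave can be removed.  slave_xmins is
    whatever oldest-transaction value each slave last reported; it may
    be empty if no slave sends feedback."""
    return min([master_oldest_xmin] + list(slave_xmins))
```

For example, `vacuum_horizon(1000, [950, 990])` is 950: a single long-running query on one slave holds back vacuuming on the master, which is exactly the interference with the master's operations that Gurjeet wants to avoid.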

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +


Re: Core team statement on replication in PostgreSQL

From
Bruce Momjian
Date:
Andreas 'ads' Scherbaum wrote:
> On Fri, 30 May 2008 16:22:41 -0400 (EDT) Greg Smith wrote:
> 
> > On Fri, 30 May 2008, Andreas 'ads' Scherbaum wrote:
> > 
> > > Then you ship 16 MB binary stuff every 30 second or every minute but
> > > you only have some kbyte real data in the logfile.
> > 
> > Not if you use pg_clearxlogtail ( 
> > http://www.2ndquadrant.com/replication.htm ), which got lost in the giant 
> > March commitfest queue but should probably wander into contrib as part of 
> > 8.4.
> 
> Yes, this topic was discussed several times in the past but to
> solve this it needs a patch/solution which is integrated into PG
> itself, not contrib.

Agreed.  I realize why we are not zeroing those bytes (for performance),
but can't we have the archiver zero those bytes before calling the
'archive_command'?

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +


Re: Core team statement on replication in PostgreSQL

From
Alvaro Herrera
Date:
Bruce Momjian wrote:

> Agreed.  I realize why we are not zeroing those bytes (for performance),
> but can't we have the archiver zero those bytes before calling the
> 'archive_command'?

Perhaps make the zeroing user-settable.

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.


Re: Core team statement on replication in PostgreSQL

From
Tom Lane
Date:
Bruce Momjian <bruce@momjian.us> writes:
> Gurjeet Singh wrote:
>> There could be multiple slaves following a master, some serving

> For the slave to not interfere with the master at all, we would need to
> delay application of WAL files on each slave until visibility on that
> slave allows the WAL to be applied, but in that case we would have
> long-running transactions delay data visibility of all slave sessions.

Right, but you could segregate out long-running queries to one slave
server that could be further behind than the others.
        regards, tom lane


Re: Core team statement on replication in PostgreSQL

From
Tom Lane
Date:
Bruce Momjian <bruce@momjian.us> writes:
> Agreed.  I realize why we are not zeroing those bytes (for performance),
> but can't we have the archiver zero those bytes before calling the
> 'archive_command'?

The archiver doesn't know any more about where the end-of-data is than
the archive_command does.  Moreover, the archiver doesn't know whether
the archive_command cares.  I think the separate module is a fine
solution.

It should also be pointed out that the whole thing becomes uninteresting
if we get real-time log shipping implemented.  So I see absolutely no
point in spending time integrating pg_clearxlogtail now.
        regards, tom lane


Re: Core team statement on replication in PostgreSQL

From
Greg Smith
Date:
On Mon, 9 Jun 2008, Tom Lane wrote:

> It should also be pointed out that the whole thing becomes uninteresting
> if we get real-time log shipping implemented.  So I see absolutely no
> point in spending time integrating pg_clearxlogtail now.

There are remote replication scenarios over a WAN (mainly aimed at 
disaster recovery) that want to keep a fairly updated database without 
putting too much traffic over the link.  People in that category really 
want zeroed tail+compressed archives, but probably not the extra overhead 
that comes with shipping smaller packets in a real-time implementation.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD


Re: Core team statement on replication in PostgreSQL

From
"Koichi Suzuki"
Date:
Just for information.

In terms of archive compression, I have archive log compression which
will be found in http://pgfoundry.org/projects/pglesslog/

This feature is also included in NTT's synchronized log shipping
replication presented in the last PGCon.

2008/6/10 Greg Smith <gsmith@gregsmith.com>:
> On Mon, 9 Jun 2008, Tom Lane wrote:
>
>> It should also be pointed out that the whole thing becomes uninteresting
>> if we get real-time log shipping implemented.  So I see absolutely no
>> point in spending time integrating pg_clearxlogtail now.
>
> There are remote replication scenarios over a WAN (mainly aimed at disaster
> recovery) that want to keep a fairly updated database without putting too
> much traffic over the link.  People in that category really want zeroed
> tail+compressed archives, but probably not the extra overhead that comes
> with shipping smaller packets in a real-time implementation.
>
> --
> * Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers
>



-- 
------
Koichi Suzuki


Re: Core team statement on replication in PostgreSQL

From
Gregory Stark
Date:
"Greg Smith" <gsmith@gregsmith.com> writes:

> On Mon, 9 Jun 2008, Tom Lane wrote:
>
>> It should also be pointed out that the whole thing becomes uninteresting
>> if we get real-time log shipping implemented.  So I see absolutely no
>> point in spending time integrating pg_clearxlogtail now.
>
> There are remote replication scenarios over a WAN (mainly aimed at disaster
> recovery) that want to keep a fairly updated database without putting too much
> traffic over the link.  People in that category really want zeroed
> tail+compressed archives, but probably not the extra overhead that comes with
> shipping smaller packets in a real-time implementation.

Instead of zeroing bytes and depending on compression why not just pass an
extra parameter to the archive command with the offset to the logical end of
data. The archive_command could just copy from the start to that point and not
bother transferring the rest.

-- 
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com
  Ask me about EnterpriseDB's RemoteDBA services!


Re: Core team statement on replication in PostgreSQL

From
"Heikki Linnakangas"
Date:
Gregory Stark wrote:
> Instead of zeroing bytes and depending on compression why not just pass an
> extra parameter to the archive command with the offset to the logical end of
> data.

Because the archiver process doesn't have that information.

-- 
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com


Re: Core team statement on replication in PostgreSQL

From
Josh Berkus
Date:
All,

> > For the slave to not interfere with the master at all, we would need to
> > delay application of WAL files on each slave until visibility on that
> > slave allows the WAL to be applied, but in that case we would have
> > long-running transactions delay data visibility of all slave sessions.
>
> Right, but you could segregate out long-running queries to one slave
> server that could be further behind than the others.

I still see having 2 different settings:

Synchronous: XID visibility is pushed to the master.  Maintains synchronous 
failover, and users are expected to run *1* master to *1* slave for most 
installations.

Asynchronous: replication stops on the slave whenever minxid gets out of 
synch.  Could have multiple slaves, but noticeable lag between master and 
slave.

-- 
Josh Berkus
PostgreSQL @ Sun
San Francisco


Re: Core team statement on replication in PostgreSQL

From
ITAGAKI Takahiro
Date:
Josh Berkus <josh@agliodbs.com> wrote:

> I still see having 2 different settings:
> 
> Synchronous: XID visibility is pushed to the master.  Maintains synchronous 
> failover, and users are expected to run *1* master to *1* slave for most 
> installations.
> 
> Asynchronous: replication stops on the slave whenever minxid gets out of 
> synch.  Could have multiple slaves, but noticeable lag between master and 
> slave.

I agree with you that we should have a sync/async option in log shipping.
Also, we could have another setting: synchronous shipping with
asynchronous flushing.  We wouldn't lose transactions unless both servers
went down at once, and we could avoid the delay of flushing WAL files to
the primary's disks.

As for multiple slaves, we could have a cascading configuration:
the WAL receiver also delivers WAL records to other servers.
I think it is simplest for the postgres core to have only one-to-one
replication, with multiple slaves supported by 3rd-party WAL receivers.
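The cascading arrangement can be pictured as a relay that stores each WAL block locally and then fans it out to its own downstream peers, so the master itself only ever streams to one receiver. A toy in-memory model (all names invented):

```python
class WalRelay:
    """Toy model of a cascading WAL receiver: keeps (here: stores in a
    list) each block locally, then forwards it to downstream relays."""
    def __init__(self):
        self.blocks = []      # stand-in for the local WAL copy
        self.downstream = []  # other WalRelay instances to feed

    def receive(self, block):
        self.blocks.append(block)     # secure the local copy first
        for peer in self.downstream:  # then fan out
            peer.receive(block)
```

Chaining master -> relay -> leaves means every node ends up with the same block sequence while the master pays the cost of a single connection.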

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center




Re: Core team statement on replication in PostgreSQL

From
"Kevin Grittner"
Date:
>>> On Mon, Jun 9, 2008 at 9:48 PM, in message
<Pine.GSO.4.64.0806092243080.11286@westnet.com>, Greg Smith
<gsmith@gregsmith.com> wrote: 
> On Mon, 9 Jun 2008, Tom Lane wrote:
> 
>> It should also be pointed out that the whole thing becomes uninteresting
>> if we get real-time log shipping implemented.  So I see absolutely no
>> point in spending time integrating pg_clearxlogtail now.
> 
> There are remote replication scenarios over a WAN (mainly aimed at
> disaster recovery) that want to keep a fairly updated database without
> putting too much traffic over the link.  People in that category really
> want zeroed tail+compressed archives, but probably not the extra overhead
> that comes with shipping smaller packets in a real-time implementation.

We ship the WAL files over a (relatively) slow WAN for disaster
recovery purposes, and we would be fine with replacing our current
techniques with real-time log shipping as long as:

(1)  We can do it asynchronously.  (i.e., we don't have to wait for
WAN latency to commit transactions.)

(2)  It can ship to multiple targets.  (Management dictates that we
have backups at the site of origin as well as our central site.  A
failure to replicate to one must not delay the other.)

(3)  It doesn't consume substantially more WAN bandwidth overall.

A solution which fails to cover any of these leaves pg_clearxlogtail
interesting to us.

-Kevin


Follow-up on replication hooks for PostgreSQL

From
Robert Hodges
Date:
Hi everyone,

This is a quick update on a promise I made early in June to suggest requirements as well as ways to add replication
hooks that would support logical replication, as opposed to the physical replication work currently underway based on
NTT's code.

Well, June was a pretty busy month, so it has taken a while to get back to this.  However, we are now beginning to
examine options for PostgreSQL logical replication.  To make a long story short we are willing to commit resources to
this problem or fund other people to do it for us.  If you are interested please contact me directly.  Meanwhile, we
are quite serious about this problem and intend to work on helpful additions to PostgreSQL in this area.  I will post
more as we make progress.

Thanks, Robert

-- 
Robert Hodges, CTO, Continuent, Inc.
Email:  robert.hodges@continuent.com

P.s., Happy 12th birthday everyone!

Re: Follow-up on replication hooks for PostgreSQL

From
"Marko Kreen"
Date:
On 7/10/08, Robert Hodges <robert.hodges@continuent.com> wrote:
>  This is a quick update on a promise I made early in June to suggest
> requirements as well as ways to add replication hooks that would support
> logical replication, as opposed to the physical replication work currently
> underway based on NTT's code.
>
>  Well, June was a pretty busy month, so it has taken a while to get back to
> this.  However, we are now beginning to examine options for PostgreSQL
> logical replication.  To make a long story short we are willing to commit
> resources to this problem or fund other people to do it for us.  If you are
> interested please contact me directly.  Meanwhile, we are quite serious
> about this problem and intend to work on helpful additions to PostgreSQL in
> this area.  I will post more as we make progress.

Well, I'm not exactly sure what you are planning.  It's OK to do
draft design privately, but before actually starting coding, the
design should be discussed in -hackers.

And I'm not exactly sure what you mean by "logical replication".
A way to log DDL statements?  Do you want to log DML also?

FWIW, here's a very rough draft design for functionality that could be
used to make current Slony-I/Londiste-like solutions replicate DDL also.

1. A CREATE DDL TRIGGER statement that allows calling a function
   for all DDL statements.

   The only filtering that makes sense here is filtering by area:
   tables/functions/views/etc.

   It must be possible to do an AFTER trigger.  Whether a BEFORE
   trigger for DDL makes sense or not, I'm not sure.

2. When the function is called, the following information is given:

   - Object type the event was for (table/view/function)
   - Array of object names.
   - SQL statement as text.

   The trigger function can filter further based on object names,
   deciding whether it wants to log the event or not.

Trying to make the trigger run on only a subset of events is complex,
and parsing the SQL into pieces for the trigger to understand it better
is also complex, and neither is needed.  Unless there is some common
situation where such a simple design fails to work, I would not make
the scheme more complex.

Also, the design should be based on the assumption that the target side
is exactly in sync.  E.g. DROP CASCADE should be replicated as DROP
CASCADE.  We should not make the scheme more complex to survive cases
where the target is not in sync.  That way madness lies.  The effect
should be as if the same SQL statements were applied to the target by
hand, no more, no less.
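As a mock-up of that interface: the server hands the trigger function the object type, the name array, and the raw statement text, and the function alone decides what to record. Every name here is a stand-in for the proposed CREATE DDL TRIGGER machinery, not an existing API:

```python
def ddl_trigger(event_log, obj_type, obj_names, sql):
    """AFTER-style DDL trigger body: filter by area, then by object
    name, and record the statement text verbatim for replay on the
    target -- no parsing of the SQL required."""
    if obj_type != "table":                          # area filter
        return
    if any(n.startswith("pg_") for n in obj_names):  # skip system objects
        return
    event_log.append(sql)                            # replicate as-is
```

Applying the logged statements in order on a target that is exactly in sync then reproduces the DDL, per the "no more, no less" rule.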

-- 
marko


Re: Follow-up on replication hooks for PostgreSQL

From
Robert Hodges
Date:
Hi Marko,

No fear, we definitely will discuss on pgsql-hackers.  I just wanted to
make sure that people understood we are still committed to solving this
problem and will one way or another commit resources to help.

Just to be clear, by logical replication I mean replication based on
sending SQL or near-SQL (e.g., generic DML events) between servers.
Physical replication, on the other hand, uses internal formats to
replicate changes without intervening conversion to SQL, for example by
shipping WAL records.  There are advantages to each for different
applications.  BTW, I heard this nomenclature from Simon Riggs.  It
seems quite helpful.

The DDL trigger proposal is interesting and would be a very useful
feature addition to PostgreSQL.  To execute correctly it may also be
necessary to know which database you were using at the time the SQL
was issued.

For our part we are looking for ways to replicate most or all data on a
server as efficiently as possible.  Generic call-outs at commit time or
reading the log directly are attractive approaches.  Depending on the
implementation you can avoid double writes of replicated data on the
master host.  Also, it avoids the management headache of ensuring that
triggers are correctly installed.  It seems as if one of these generic
approaches could hook into WAL record transport.

Cheers, Robert

On 7/10/08 4:56 AM, "Marko Kreen" <markokr@gmail.com> wrote:

> On 7/10/08, Robert Hodges <robert.hodges@continuent.com> wrote:
> >  This is a quick update on a promise I made early in June to suggest
> > requirements as well as ways to add replication hooks that would support
> > logical replication, as opposed to the physical replication work currently
> > underway based on NTT's code.
> >
> >  Well, June was a pretty busy month, so it has taken a while to get back to
> > this.  However, we are now beginning to examine options for PostgreSQL
> > logical replication.  To make a long story short, we are willing to commit
> > resources to this problem or fund other people to do it for us.  If you are
> > interested please contact me directly.  Meanwhile, we are quite serious
> > about this problem and intend to work on helpful additions to PostgreSQL in
> > this area.  I will post more as we make progress.
>
> Well, I'm not exactly sure what you are planning.  It's OK to do a
> draft design privately, but before actually starting coding, the
> design should be discussed on -hackers.
>
> And I'm not exactly sure what you mean by "logical replication".
> A way to log DDL statements?  Do you want to log DML also?
>
> FWIW, here's a very rough draft of a design for functionality that
> could be used to make current Slony-I/Londiste-like solutions
> replicate DDL also.
>
> 1. A CREATE DDL TRIGGER statement that allows calling a function
>    for all DDL statements.
>
>    The only filtering that makes sense here is filtering by area:
>    tables/functions/views/etc.
>
>    It must be possible to do an AFTER trigger.  Whether a BEFORE
>    trigger for DDL makes sense or not, I'm not sure.
>
> 2. When the function is called, the following information is given:
>
>    - Object type the event was for (table/view/function)
>    - Array of object names.
>    - SQL statement as text.
>
>    The trigger function can filter further based on object names,
>    deciding whether it wants to log the event or not.
>
> Trying to make the trigger run on only a subset of events is complex,
> and parsing the SQL to pieces for the trigger to understand it better
> is also complex, and neither is needed.  Unless there are some common
> situations where such a simple design fails to work, I would not make
> the scheme more complex.
>
> Also the design should be based on the assumption that the target side
> is exactly in sync.  E.g. DROP CASCADE should be replicated as DROP
> CASCADE.  We should not make the scheme more complex to survive cases
> where the target is not in sync.  That way madness lies.  The effect
> should be as if the same SQL statements were applied to the target by
> hand, no more, no less.
>
> --
> marko

-- 
Robert Hodges, CTO, Continuent, Inc.
Email: robert.hodges@continuent.com
Mobile: +1-510-501-3728  Skype: hodgesrm
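
[Editor's sketch: Marko's two-point draft, a CREATE DDL TRIGGER statement
plus a trigger function receiving object type, object names, and statement
text, might look something like the following.  None of this syntax exists
in PostgreSQL; CREATE DDL TRIGGER, ddl_object_type, ddl_object_names,
ddl_statement, and both table names are invented purely for illustration.]

```sql
-- Hypothetical only: neither CREATE DDL TRIGGER nor the ddl_* variables
-- below are real PostgreSQL syntax; they illustrate the draft design.
CREATE FUNCTION log_ddl() RETURNS trigger AS $$
BEGIN
    -- Per point 2: filter by object name, then record the raw
    -- statement text so the target can replay it verbatim.
    IF ddl_object_names[1] NOT LIKE 'pg_temp%' THEN
        INSERT INTO replicated_ddl (obj_type, obj_names, stmt)
            VALUES (ddl_object_type, ddl_object_names, ddl_statement);
    END IF;
    RETURN NULL;
END;
$$ LANGUAGE plpgsql;

-- Per point 1: an AFTER trigger on all DDL, filtered only by area.
CREATE DDL TRIGGER replicate_ddl
    AFTER DDL ON TABLES, VIEWS, FUNCTIONS
    EXECUTE PROCEDURE log_ddl();
```

The replay side would then simply execute the stored statements in order,
matching Marko's "same SQL statements applied to the target by hand" rule.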

Re: Follow-up on replication hooks for PostgreSQL

From
chris
Date:
markokr@gmail.com ("Marko Kreen") writes:
> Also the design should be based on the assumption that the target side
> is exactly in sync.  E.g. DROP CASCADE should be replicated as DROP
> CASCADE.  We should not make the scheme more complex to survive cases
> where the target is not in sync.  That way madness lies.  The effect
> should be as if the same SQL statements were applied to the target by
> hand, no more, no less.

We already have, in 8.4, handling of triggers for TRUNCATE; the
reason support hasn't made it into Slony-I yet relates quite
exactly to this...

The trouble comes in if you do TRUNCATE CASCADE: I'm not quite sure
how to collect together the multiple trigger firings that would be
recorded as a result; for it all to work safely on the remote node,
we'd need to apply all of those truncates at once.
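
[Editor's note: the 8.4 TRUNCATE trigger support chris refers to uses real
syntax, shown below; the table, the function, and the log table are made
up for illustration.  A TRUNCATE ... CASCADE fires such a statement-level
trigger once for each table the cascade reaches, which is the
multiple-firings problem described above.]

```sql
-- Real 8.4 syntax: a statement-level trigger fired on TRUNCATE.
-- The table name and logging function are illustrative.
CREATE FUNCTION log_truncate() RETURNS trigger AS $$
BEGIN
    INSERT INTO truncate_log (relname, fired_at)
        VALUES (TG_TABLE_NAME, now());
    RETURN NULL;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER capture_truncate
    BEFORE TRUNCATE ON orders
    FOR EACH STATEMENT
    EXECUTE PROCEDURE log_truncate();
```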

Note also that there is an issue with coordination of schemas; Slony-I
shuts off the RI triggers on subscribers, so that the target is fairly
certain to not be *entirely* in sync, by express intent.

Those are legitimate differences between source and target.
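
[Editor's note: one way to get the subscriber-side effect chris describes
is ALTER TABLE's trigger-disabling clause, which is real syntax in this
era's PostgreSQL; the table name is made up, and Slony-I's actual
mechanism differs across versions.]

```sql
-- On a subscriber, disable all triggers (including the RI triggers) so
-- replicated rows are applied without re-firing them.  Disabling system
-- triggers requires superuser; the table name is illustrative.
ALTER TABLE orders DISABLE TRIGGER ALL;

-- If the node is later promoted to origin, turn them back on:
ALTER TABLE orders ENABLE TRIGGER ALL;
```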
-- 
select 'cbbrowne' || '@' || 'linuxfinances.info';
http://cbbrowne.com/info/lsf.html
Rules  of the  Evil Overlord  #145. "My  dungeon cell  decor  will not
feature exposed pipes.  While they add to the  gloomy atmosphere, they
are good  conductors of vibrations and  a lot of  prisoners know Morse
code." <http://www.eviloverlord.com/>


Re: Core team statement on replication in PostgreSQL

From
Bruce Momjian
Date:
Tom Lane wrote:
> Bruce Momjian <bruce@momjian.us> writes:
> > Agreed.  I realize why we are not zeroing those bytes (for performance),
> > but can't we have the archiver zero those bytes before calling the
> > 'archive_command'?
> 
> The archiver doesn't know any more about where the end-of-data is than
> the archive_command does.  Moreover, the archiver doesn't know whether
> the archive_command cares.  I think the separate module is a fine
> solution.
> 
> It should also be pointed out that the whole thing becomes uninteresting
> if we get real-time log shipping implemented.  So I see absolutely no
> point in spending time integrating pg_clearxlogtail now.

People doing PITR are still going to be saving these files, and for a
long time, so I think this is still something we should try to address.

Added to TODO:
        o Reduce PITR WAL file size by removing full page writes and
          by removing trailing bytes to improve compression
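
[Editor's sketch: the "trailing bytes" half of that item is easy to
demonstrate with a quick simulation.  The 1 MB fill level is arbitrary,
and random bytes stand in for both live WAL records and the stale
contents of a recycled segment.]

```python
import os
import zlib

SEGMENT_SIZE = 16 * 1024 * 1024   # default WAL segment size

# Simulate a partially filled, recycled WAL segment: 1 MB of live
# records followed by leftover bytes from the segment's previous use.
live = os.urandom(1024 * 1024)
as_is = live + os.urandom(SEGMENT_SIZE - len(live))   # stale tail kept
zeroed = live + b"\x00" * (SEGMENT_SIZE - len(live))  # tail cleared first

print(len(zlib.compress(as_is)))   # roughly the full 16 MB
print(len(zlib.compress(zeroed)))  # roughly just the 1 MB of live data
```

Zeroing the tail before the archive_command compresses the file is
exactly what pg_clearxlogtail, mentioned upthread, is for.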

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +


Re: Core team statement on replication in PostgreSQL

From
Alvaro Herrera
Date:
Bruce Momjian wrote:

> Added to TODO:
> 
>         o Reduce PITR WAL file size by removing full page writes and
>           by removing trailing bytes to improve compression

If we remove full page writes, how does hint bit setting get propagated
to the slave?

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.


Re: Core team statement on replication in PostgreSQL

From
Bruce Momjian
Date:
Alvaro Herrera wrote:
> Bruce Momjian wrote:
> 
> > Added to TODO:
> > 
> >         o Reduce PITR WAL file size by removing full page writes and
> >           by removing trailing bytes to improve compression
> 
> If we remove full page writes, how does hint bit setting get propagated
> to the slave?

We would remove full page writes that are needed for crash recovery, but
perhaps keep other full pages.

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +


Re: Core team statement on replication in PostgreSQL

From
Alvaro Herrera
Date:
Bruce Momjian wrote:
> Alvaro Herrera wrote:
> > Bruce Momjian wrote:
> > 
> > > Added to TODO:
> > > 
> > >         o Reduce PITR WAL file size by removing full page writes and
> > >           by removing trailing bytes to improve compression
> > 
> > If we remove full page writes, how does hint bit setting get propagated
> > to the slave?
> 
> We would remove full page writes that are needed for crash recovery, but
> perhaps keep other full pages.

How do you tell which is which?

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: Core team statement on replication in PostgreSQL

From
Bruce Momjian
Date:
Alvaro Herrera wrote:
> Bruce Momjian wrote:
> > Alvaro Herrera wrote:
> > > Bruce Momjian wrote:
> > > 
> > > > Added to TODO:
> > > > 
> > > >         o Reduce PITR WAL file size by removing full page writes and
> > > >           by removing trailing bytes to improve compression
> > > 
> > > If we remove full page writes, how does hint bit setting get propagated
> > > to the slave?
> > 
> > We would remove full page writes that are needed for crash recovery, but
> > perhaps keep other full pages.
> 
> How do you tell which is which?

The WAL format would have to be modified to indicate which entries can
be discarded.
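
[Editor's sketch: one conceivable shape for such a modification, with
invented names.  XLR_REMOVABLE_FPW is not part of any real WAL format; the
idea is a flag bit in each record header marking full-page images present
only for crash safety, which an archiving filter could drop while keeping
images that carry hint-bit changes.]

```python
# Hypothetical: XLR_REMOVABLE_FPW is an invented flag, not real WAL format.
XLR_REMOVABLE_FPW = 0x01

def retained_bytes(records):
    """Total bytes kept after discarding removable full-page writes.

    Each record is a (length, info_flags) pair.
    """
    return sum(length for length, info in records
               if not info & XLR_REMOVABLE_FPW)

records = [
    (100, 0),                   # ordinary record: keep
    (8192, XLR_REMOVABLE_FPW),  # crash-safety-only page image: drop
    (8192, 0),                  # page image carrying hint bits: keep
]
print(retained_bytes(records))  # 8292
```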

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +