Thread: Core team statement on replication in PostgreSQL
The Postgres core team met at PGCon to discuss a few issues, the largest of which is the need for simple, built-in replication for PostgreSQL. Historically the project policy has been to avoid putting replication into core PostgreSQL, so as to leave room for development of competing solutions, recognizing that there is no "one size fits all" replication solution. However, it is becoming clear that this policy is hindering acceptance of PostgreSQL to too great an extent, compared to the benefit it offers to the add-on replication projects. Users who might consider PostgreSQL are choosing other database systems because our existing replication options are too complex to install and use for simple cases. In practice, simple asynchronous single-master-multiple-slave replication covers a respectable fraction of use cases, so we have concluded that we should allow such a feature to be included in the core project. We emphasize that this is not meant to prevent continued development of add-on replication projects that cover more complex use cases.

We believe that the most appropriate base technology for this is probably real-time WAL log shipping, as was demoed by NTT OSS at PGCon. We hope that such a feature can be completed for 8.4. Ideally this would be coupled with the ability to execute read-only queries on the slave servers, but we see technical difficulties that might prevent that from being completed before 8.5 or even further out. (The big problem is that long-running slave-side queries might still need tuples that are vacuumable on the master, and so replication of vacuuming actions would cause the slave's queries to deliver wrong answers.)

Again, this will not replace Slony, pgPool, Continuent, Londiste, or other systems for many users, as it will not be highly scalable nor support long-distance replication nor replicating less than an entire installation. But it is time to include a simple, reliable basic replication feature in the core system.

regards, tom lane
On 5/29/08, Tom Lane <tgl@sss.pgh.pa.us> wrote: > The Postgres core team met at PGCon to discuss a few issues, the largest > of which is the need for simple, built-in replication for PostgreSQL. > Historically the project policy has been to avoid putting replication > into core PostgreSQL, so as to leave room for development of competing > solutions, recognizing that there is no "one size fits all" replication > solution. However, it is becoming clear that this policy is hindering > acceptance of PostgreSQL to too great an extent, compared to the benefit > it offers to the add-on replication projects. Users who might consider > PostgreSQL are choosing other database systems because our existing > replication options are too complex to install and use for simple cases. > In practice, simple asynchronous single-master-multiple-slave > replication covers a respectable fraction of use cases, so we have > concluded that we should allow such a feature to be included in the core > project. We emphasize that this is not meant to prevent continued > development of add-on replication projects that cover more complex use > cases. > > We believe that the most appropriate base technology for this is > probably real-time WAL log shipping, as was demoed by NTT OSS at PGCon. > We hope that such a feature can be completed for 8.4. +1 Although I would put it more briefly - we do need a solution for lossless failover servers, and such a solution needs to live in the core backend. > Ideally this > would be coupled with the ability to execute read-only queries on the > slave servers, but we see technical difficulties that might prevent that > from being completed before 8.5 or even further out. (The big problem > is that long-running slave-side queries might still need tuples that are > vacuumable on the master, and so replication of vacuuming actions would > cause the slave's queries to deliver wrong answers.) Well, both Slony-I and the upcoming Skytools 3 have the same problem when cleaning events, and they solve it simply by having slaves report back their lowest position on the event stream. I cannot see why the same approach cannot be applied in this case too. So each slave just needs to report its own oldest open tx as "open" to the master. Yes, it bloats the master, but there is no way around it. The only problem could be the plan to vacuum tuples updated in between a long-running tx and the regular ones, but such behaviour can simply be turned off. We could also have an option of an "inaccessible slave", for those who fear bloat on the master. -- marko
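(A minimal standalone sketch of the feedback idea above, for illustration only: the names and the fixed-size array are assumptions, and this is not PostgreSQL source code. The master's vacuum cutoff simply becomes the minimum of its own oldest running xid and whatever each slave has reported back; xid wraparound is ignored for brevity.)

    /* cutoff_sketch.c - illustrative only, not PostgreSQL code */
    #include <stdio.h>
    #include <stdint.h>

    typedef uint32_t TransactionId;     /* 0 is treated as "not reported yet" */

    static TransactionId
    effective_vacuum_cutoff(TransactionId local_oldest_xmin,
                            const TransactionId *slave_oldest, int nslaves)
    {
        TransactionId cutoff = local_oldest_xmin;

        for (int i = 0; i < nslaves; i++)
        {
            /* hold vacuum back for any slave still running something older
             * (real xids wrap around; plain "<" is a simplification here) */
            if (slave_oldest[i] != 0 && slave_oldest[i] < cutoff)
                cutoff = slave_oldest[i];
        }
        return cutoff;
    }

    int main(void)
    {
        TransactionId reported[] = { 950, 0, 990 };  /* oldest open tx per slave */

        printf("vacuum cutoff = %u\n",
               (unsigned) effective_vacuum_cutoff(1000, reported, 3));
        return 0;
    }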
On Thu, May 29, 2008 at 10:12:55AM -0400, Tom Lane wrote: > The Postgres core team met at PGCon to discuss a few issues, the > largest of which is the need for simple, built-in replication for > PostgreSQL. Historically the project policy has been to avoid > putting replication into core PostgreSQL, so as to leave room for > development of competing solutions, recognizing that there is no > "one size fits all" replication solution. However, it is becoming > clear that this policy is hindering acceptance of PostgreSQL to too > great an extent, compared to the benefit it offers to the add-on > replication projects. Users who might consider PostgreSQL are > choosing other database systems because our existing replication > options are too complex to install and use for simple cases. In > practice, simple asynchronous single-master-multiple-slave > replication covers a respectable fraction of use cases, so we have > concluded that we should allow such a feature to be included in the > core project. We emphasize that this is not meant to prevent > continued development of add-on replication projects that cover more > complex use cases. > > We believe that the most appropriate base technology for this is > probably real-time WAL log shipping, as was demoed by NTT OSS at > PGCon. We hope that such a feature can be completed for 8.4. > Ideally this would be coupled with the ability to execute read-only > queries on the slave servers, but we see technical difficulties that > might prevent that from being completed before 8.5 or even further > out. (The big problem is that long-running slave-side queries might > still need tuples that are vacuumable on the master, and so > replication of vacuuming actions would cause the slave's queries to > deliver wrong answers.) This part is a deal-killer. It's a giant up-hill slog to sell warm standby to those in charge of making resources available because the warm standby machine consumes SA time, bandwidth, power, rack space, etc., but provides no tangible benefit, and this feature would have exactly the same problem. IMHO, without the ability to do read-only queries on slaves, it's not worth doing this feature at all. Cheers, David. -- David Fetter <david@fetter.org> http://fetter.org/ Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter Skype: davidfetter XMPP: david.fetter@gmail.com Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate
On 5/29/08, David Fetter <david@fetter.org> wrote: > On Thu, May 29, 2008 at 10:12:55AM -0400, Tom Lane wrote: > > Ideally this would be coupled with the ability to execute read-only > > queries on the slave servers, but we see technical difficulties that > > might prevent that from being completed before 8.5 or even further > > out. (The big problem is that long-running slave-side queries might > > still need tuples that are vacuumable on the master, and so > > replication of vacuuming actions would cause the slave's queries to > > deliver wrong answers.) > > This part is a deal-killer. It's a giant up-hill slog to sell warm > standby to those in charge of making resources available because the > warm standby machine consumes SA time, bandwidth, power, rack space, > etc., but provides no tangible benefit, and this feature would have > exactly the same problem. > > IMHO, without the ability to do read-only queries on slaves, it's not > worth doing this feature at all. I would not be so harsh - I'd like to have the lossless standby even without read-only slaves. But Tom's mail gave me the impression that core wants to wait until we get a "perfect" read-only slave implementation, so we would wait for it until 8.6, which does not seem sensible. If we can do a slightly inefficient (but simple) implementation right now, I see no reason to reject it; we can always improve it later. Especially as it can be switchable. And we could also have a transaction_timeout parameter on slaves so the hit on the master is limited. -- marko
On Thu, 2008-05-29 at 08:21 -0700, David Fetter wrote: > On Thu, May 29, 2008 at 10:12:55AM -0400, Tom Lane wrote: > This part is a deal-killer. It's a giant up-hill slog to sell warm > standby to those in charge of making resources available because the > warm standby machine consumes SA time, bandwidth, power, rack space, > etc., but provides no tangible benefit, and this feature would have > exactly the same problem. > > IMHO, without the ability to do read-only queries on slaves, it's not > worth doing this feature at all. The only question I have is... what does this give us that PITR doesn't give us? Sincerely, Joshua D. Drake
Marko, > But Tom's mail gave me impression core wants to wait until we get "perfect" > read-only slave implementation so we wait with it until 8.6, which does > not seem sensible. If we can do slightly inefficient (but simple) > implementation > right now, I see no reason to reject it, we can always improve it later. That's incorrect. We're looking for a workable solution. If we could get one for 8.4, that would be brilliant, but we think it's going to be harder than that. Publishing the XIDs back to the master is one possibility. We also looked at using "spillover segments" for vacuumed rows, but that seemed even less viable. I'm also thinking, for *async replication*, that we could simply halt replication on the slave whenever a transaction passes minxid on the master. However, the main focus will be on synchronous hot standby. --Josh
On Thu, May 29, 2008 at 08:46:22AM -0700, Joshua D. Drake wrote: > On Thu, 2008-05-29 at 08:21 -0700, David Fetter wrote: > > This part is a deal-killer. It's a giant up-hill slog to sell > > warm standby to those in charge of making resources available > > because the warm standby machine consumes SA time, bandwidth, > > power, rack space, etc., but provides no tangible benefit, and > > this feature would have exactly the same problem. > > > > IMHO, without the ability to do read-only queries on slaves, it's > > not worth doing this feature at all. > > The only question I have is... what does this give us that PITR > doesn't give us? It looks like a wrapper for PITR to me, so the gain would be ease of use. Cheers, David. -- David Fetter <david@fetter.org> http://fetter.org/ Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter Skype: davidfetter XMPP: david.fetter@gmail.com Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate
On Thu, May 29, 2008 at 11:46 AM, Joshua D. Drake <jd@commandprompt.com> wrote: > The only question I have is... what does this give us that PITR doesn't > give us? I think the idea is that WAL records would be shipped (possibly via socket) and applied as they're generated, rather than on a file-by-file basis. At least that's what "real-time" implies to me... -Doug
Josh Berkus wrote: > Marko, > > > But Tom's mail gave me impression core wants to wait until we get "perfect" > > read-only slave implementation so we wait with it until 8.6, which does > > not seem sensible. If we can do slightly inefficient (but simple) > > implementation > > right now, I see no reason to reject it, we can always improve it later. > > That's incorrect. We're looking for a workable solution. If we could > get one for 8.4, that would be brilliant but we think it's going to be > harder than that. > > Publishing the XIDs back to the master is one possibility. We also > looked at using "spillover segments" for vacuumed rows, but that seemed > even less viable. > > I'm also thinking, for *async replication*, that we could simply halt > replication on the slave whenever a transaction passes minxid on the > master. However, the main focus will be on synchrounous hot standby. Another idea I discussed with Tom is having the slave _delay_ applying WAL files until all slave snapshots are ready. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
On Thu, May 29, 2008 at 4:48 PM, Douglas McNaught <doug@mcnaught.org> wrote: > On Thu, May 29, 2008 at 11:46 AM, Joshua D. Drake <jd@commandprompt.com> wrote: > >> The only question I have is... what does this give us that PITR doesn't >> give us? > > I think the idea is that WAL records would be shipped (possibly via > socket) and applied as they're generated, rather than on a > file-by-file basis. At least that's what "real-time" implies to me... Yes, we're talking real-time streaming (synchronous) log shipping. -- Dave Page EnterpriseDB UK: http://www.enterprisedb.com
On Thursday 29 May 2008 09:54:03 am Marko Kreen wrote: > On 5/29/08, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > The Postgres core team met at PGCon to discuss a few issues, the largest > > of which is the need for simple, built-in replication for PostgreSQL. > > Historically the project policy has been to avoid putting replication > > into core PostgreSQL, so as to leave room for development of competing > > solutions, recognizing that there is no "one size fits all" replication > > solution. However, it is becoming clear that this policy is hindering > > acceptance of PostgreSQL to too great an extent, compared to the benefit > > it offers to the add-on replication projects. Users who might consider > > PostgreSQL are choosing other database systems because our existing > > replication options are too complex to install and use for simple cases. > > In practice, simple asynchronous single-master-multiple-slave > > replication covers a respectable fraction of use cases, so we have > > concluded that we should allow such a feature to be included in the core > > project. We emphasize that this is not meant to prevent continued > > development of add-on replication projects that cover more complex use > > cases. > > > > We believe that the most appropriate base technology for this is > > probably real-time WAL log shipping, as was demoed by NTT OSS at PGCon. > > We hope that such a feature can be completed for 8.4. > > +1 > > Although I would explain it more shortly - we do need a solution for > lossless failover servers and such solution needs to live in core backend. +1 for lossless failover (ie, synchronous)
* Josh Berkus <josh@agliodbs.com> [080529 11:52]: > Marko, > > >But Tom's mail gave me impression core wants to wait until we get "perfect" > >read-only slave implementation so we wait with it until 8.6, which does > >not seem sensible. If we can do slightly inefficient (but simple) > >implementation > >right now, I see no reason to reject it, we can always improve it later. > > That's incorrect. We're looking for a workable solution. If we could > get one for 8.4, that would be brilliant but we think it's going to be > harder than that. > > Publishing the XIDs back to the master is one possibility. We also > looked at using "spillover segments" for vacuumed rows, but that seemed > even less viable. > > I'm also thinking, for *async replication*, that we could simply halt > replication on the slave whenever a transaction passes minxid on the > master. However, the main focus will be on synchrounous hot standby. Or, instead of statement timeout killing statements on the RO slave, simply kill any "old" transactions on the RO slave. "Old" in the sense that the master's xmin has passed it. And it's just an exercise in controlling the age of xmin on the master, which could even be done user-side. Doesn't fit all, but no one size does... It would work where you're hammering your slaves with a diverse set of high-velocity short queries that you're trying to avoid on the master... An option to "pause replay (making it async)" or "abort transactions (for sync)" might make it possible to easily run an async slave for slow reporting queries, and a sync slave for short queries. a. -- Aidan Van Dyk Create like a god, aidan@highrise.ca command like a king, http://www.highrise.ca/ work like a slave.
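(A standalone sketch of the two policies just described, purely for illustration: the names are invented and this is not PostgreSQL code. When an incoming cleanup record's xid horizon has passed some running read-only query's snapshot, the slave either pauses replay or cancels the conflicting queries; xid wraparound is ignored for brevity.)

    /* conflict_policy_sketch.c - illustrative only */
    #include <stdio.h>
    #include <stdint.h>
    #include <stdbool.h>

    typedef uint32_t TransactionId;

    typedef enum { PAUSE_REPLAY, CANCEL_OLD_QUERIES } ConflictPolicy;

    /* Returns true if replay of this cleanup record may proceed right now. */
    static bool
    handle_cleanup_record(TransactionId record_xmin_horizon,
                          TransactionId oldest_slave_snapshot,
                          ConflictPolicy policy)
    {
        if (oldest_slave_snapshot >= record_xmin_horizon)
            return true;            /* nothing old enough is running: no conflict */

        if (policy == PAUSE_REPLAY)
        {
            printf("pausing replay until old snapshots finish (slave falls behind)\n");
            return false;
        }

        printf("cancelling queries with snapshots older than %u (replay keeps up)\n",
               (unsigned) record_xmin_horizon);
        return true;
    }

    int main(void)
    {
        handle_cleanup_record(1200, 1100, PAUSE_REPLAY);
        handle_cleanup_record(1200, 1100, CANCEL_OLD_QUERIES);
        return 0;
    }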
Joshua D. Drake wrote: > > On Thu, 2008-05-29 at 08:21 -0700, David Fetter wrote: >> On Thu, May 29, 2008 at 10:12:55AM -0400, Tom Lane wrote: > >> This part is a deal-killer. It's a giant up-hill slog to sell warm >> standby to those in charge of making resources available because the >> warm standby machine consumes SA time, bandwidth, power, rack space, >> etc., but provides no tangible benefit, and this feature would have >> exactly the same problem. >> >> IMHO, without the ability to do read-only queries on slaves, it's not >> worth doing this feature at all. > > The only question I have is... what does this give us that PITR doesn't > give us? Since people seem to be unclear on what we're proposing: 8.4 Synchronous Warm Standby: makes PostgreSQL more suitable for HA systems by eliminating failover data loss and cutting failover time. 8.5 (probably) Synchronous & Asynchronous Hot Standby: adds read-only queries on slaves to the above. Again, if we can implement queries on slaves for 8.4, we're all for it. However, after conversations in Core and with Simon, we all think it's going to be too big a task to complete in 4-5 months. We *don't* want to end up delaying 8.4 for 5 months because we're debugging hot standby. --Josh
David Fetter wrote: > This part is a deal-killer. It's a giant up-hill slog to sell warm > standby to those in charge of making resources available because the > warm standby machine consumes SA time, bandwidth, power, rack space, > etc., but provides no tangible benefit, and this feature would have > exactly the same problem. > > IMHO, without the ability to do read-only queries on slaves, it's not > worth doing this feature at all. I don't think I agree with this. There are a large number of situations where it's positive expectancy to do precisely this - it's not unlike buying a $1 lottery ticket with a 1 chance in 100 of winning $1000 - the vast majority of the time (99 times out of 100), you're going to lose $1. But when you win, you win big, and make up for all the small losses you incurred getting there and then some. Failover machines are like that - most of the time they're negative value, as you said - taking up SA time, bandwidth, power, rack space, money, etc. But every once in a (great) while, they save you. If the cost of having the database down for hours or days (as you madly try to next-day replacement hardware) isn't that great, then no, this isn't worthwhile - but in cases where the database being down chalks up the lost money quickly, this is easy to cost-justify. Being able to do read-only queries makes this feature more valuable in more situations, but I disagree that it's a deal-breaker. Brian
On Thu, May 29, 2008 at 11:58:31AM -0400, Bruce Momjian wrote: > Josh Berkus wrote: > > Publishing the XIDs back to the master is one possibility. We > > also looked at using "spillover segments" for vacuumed rows, but > > that seemed even less viable. > > > > I'm also thinking, for *async replication*, that we could simply > > halt replication on the slave whenever a transaction passes minxid > > on the master. However, the main focus will be on synchrounous > > hot standby. > > Another idea I discussed with Tom is having the slave _delay_ > applying WAL files until all slave snapshots are ready. Either one of these would be great, but something that involves machines that stay useless most of the time is just not going to work. Cheers, David. -- David Fetter <david@fetter.org> http://fetter.org/ Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter Skype: davidfetter XMPP: david.fetter@gmail.com Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate
David Fetter wrote: > On Thu, May 29, 2008 at 11:58:31AM -0400, Bruce Momjian wrote: > > Josh Berkus wrote: > > > Publishing the XIDs back to the master is one possibility. We > > > also looked at using "spillover segments" for vacuumed rows, but > > > that seemed even less viable. > > > > > > I'm also thinking, for *async replication*, that we could simply > > > halt replication on the slave whenever a transaction passes minxid > > > on the master. However, the main focus will be on synchrounous > > > hot standby. > > > > Another idea I discussed with Tom is having the slave _delay_ > > applying WAL files until all slave snapshots are ready. > > Either one of these would be great, but something that involves > machines that stay useless most of the time is just not going to work. Right, the ultimate target is to have the slave be read-only, but we need to get the streaming of WAL logs done first. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
* Dave Page <dpage@pgadmin.org> [080529 12:03]: > On Thu, May 29, 2008 at 4:48 PM, Douglas McNaught <doug@mcnaught.org> wrote: > > I think the idea is that WAL records would be shipped (possibly via > > socket) and applied as they're generated, rather than on a > > file-by-file basis. At least that's what "real-time" implies to me... > > Yes, we're talking real-time streaming (synchronous) log shipping. But synchronous streaming doesn't mean the WAL has to be *applied* on the slave yet. Just that it has to be "safely" on the slave (i.e. on disk, not just in kernel buffers). The whole single-threaded WAL replay problem is going to rear its ugly head here too, and mean that a slave *won't* be able to keep up with a busy master if it's actually trying to apply all the changes in real-time. Well, actually, if it's synchronous, it will keep up, but it just means that your master's IO capabilities are now limited to the speed of the slave's single-threaded WAL application. a. -- Aidan Van Dyk Create like a god, aidan@highrise.ca command like a king, http://www.highrise.ca/ work like a slave.
Bruce, > Another idea I discussed with Tom is having the slave _delay_ applying > WAL files until all slave snapshots are ready. > Well, again, that only works for async mode. I personally think that's the correct solution for async. But for synch mode, I think we need to push the xids back to the master; generally if a user is running in synch mode they're concerned about failover time and zero data loss, so holding back the WAL files doesn't make sense. Also, if you did delay applying WAL files on an async slave, you'd reach a point (perhaps after a 6-hour query) where it'd actually be cheaper to rebuild the slave than to apply the pent-up WAL files. --Josh Berkus
Dave Page wrote: > On Thu, May 29, 2008 at 4:48 PM, Douglas McNaught <doug@mcnaught.org> wrote: > >> On Thu, May 29, 2008 at 11:46 AM, Joshua D. Drake <jd@commandprompt.com> wrote: >> >> >>> The only question I have is... what does this give us that PITR doesn't >>> give us? >>> >> I think the idea is that WAL records would be shipped (possibly via >> socket) and applied as they're generated, rather than on a >> file-by-file basis. At least that's what "real-time" implies to me... >> > > Yes, we're talking real-time streaming (synchronous) log shipping. > That's not what Tom's email said, AIUI. "Synchronous" replication surely means that the master and slave always have the same set of transactions applied. Streaming <> synchronous. But streaming log shipping will allow us to get closer to synchronicity in some situations, i.e. the window for missing transactions will be much smaller. Some of us were discussing this late on Friday night after PGCon. ISTM that we can have either 1) fairly hot failover slaves that are guaranteed to be almost up to date, or 2) slaves that can support read-only transactions but might get somewhat out of date if they run long transactions. The big problem is in having slaves which are both highly up to date and support arbitrary read-only transactions. Maybe in the first instance, at least, we need to make slaves choose which role they will play. cheers andrew
On 5/29/08, Joshua D. Drake <jd@commandprompt.com> wrote: > On Thu, 2008-05-29 at 08:21 -0700, David Fetter wrote: > > On Thu, May 29, 2008 at 10:12:55AM -0400, Tom Lane wrote: > > This part is a deal-killer. It's a giant up-hill slog to sell warm > > standby to those in charge of making resources available because the > > warm standby machine consumes SA time, bandwidth, power, rack space, > > etc., but provides no tangible benefit, and this feature would have > > exactly the same problem. > > > > IMHO, without the ability to do read-only queries on slaves, it's not > > worth doing this feature at all. > > The only question I have is... what does this give us that PITR doesn't > give us? Tom is talking about synchronous WAL replication, so you can do lossless failover. Currently there is no good solution for this, and it needs to live in the core backend. Yes, it could somehow be implemented by filling the backend with hooks, but the question is how it would stay synced with changes in the core backend after a couple of releases. The WAL writing and txid/snapshot handling receive heavy changes in each release. No external project that needs deep hooks has been able to keep pace with core changes thus far, unless heavily commercially backed, which means not open-source. Companies can tell you the price they pay for such syncing. Another solution would indeed be to have fixed hooks guaranteed to be stable between releases (replica-hooks-discuss?). But that would mean limiting the changes we can do to the WAL-writing/snapshot handling code, and that does not seem like an attractive solution. By having such replication code, which ties tightly into core code, included in the main Postgres source, we are still free to make any changes we feel like and are not tied to external API promises. -- marko
Josh Berkus wrote: > Bruce, > > > Another idea I discussed with Tom is having the slave _delay_ applying > > WAL files until all slave snapshots are ready. > > > > Well, again, that only works for async mode. I personally think that's > the correct solution for async. But for synch mode, I think we need to > push the xids back to the master; generally if a user is running in > synch mode they're concerned about failover time and zero data loss, so > holding back the WAL files doesn't make sense. You send the WAL to the slave, but the slave doesn't apply them right away --- it isn't related to async. > Also, if you did delay applying WAL files on an async slave, you'd reach > a point (perhaps after a 6-hour query) where it'd actually be cheaper to > rebuild the slave than to apply the pent-up WAL files. True. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
On May 29, 2008, at 9:12 AM, David Fetter wrote: > On Thu, May 29, 2008 at 11:58:31AM -0400, Bruce Momjian wrote: >> Josh Berkus wrote: >>> Publishing the XIDs back to the master is one possibility. We >>> also looked at using "spillover segments" for vacuumed rows, but >>> that seemed even less viable. >>> >>> I'm also thinking, for *async replication*, that we could simply >>> halt replication on the slave whenever a transaction passes minxid >>> on the master. However, the main focus will be on synchrounous >>> hot standby. >> >> Another idea I discussed with Tom is having the slave _delay_ >> applying WAL files until all slave snapshots are ready. > > Either one of these would be great, but something that involves > machines that stay useless most of the time is just not going to work. I have customers who are thinking about warm standby functionality, and the only thing stopping them deploying it is complexity and maintenance, not the cost of the HA hardware. If trivial-to-deploy replication that didn't offer read-only access of the slaves were available today I'd bet that most of them would be using it. Read-only slaves would certainly be nice, but (for me) it's making it trivial to deploy and maintain that's more interesting. Cheers, Steve
Tom Lane wrote: > In practice, simple asynchronous single-master-multiple-slave > replication covers a respectable fraction of use cases, so we have > concluded that we should allow such a feature to be included in the core > project. We emphasize that this is not meant to prevent continued > development of add-on replication projects that cover more complex use > cases. IMHO, this will help PostgreSQL adoption and mindshare, and even boost interest in development for the other replication use cases. > We believe that the most appropriate base technology for this is > probably real-time WAL log shipping, as was demoed by NTT OSS at PGCon. The slides are up at http://www.pgcon.org/2008/schedule/events/76.en.html From what I gather from those slides, it seems to me that the NTT solution is synchronous, not asynchronous. In my opinion it's even better, but I do understand that others might prefer asynchronous. I'm going to speculate, but I would think it should be possible (without a substantial rewrite) to support both modes (or even some intermediate modes, like DRBD on Linux). > We hope that such a feature can be completed for 8.4. Ideally this > would be coupled with the ability to execute read-only queries on the > slave servers, but we see technical difficulties that might prevent that > from being completed before 8.5 or even further out. (The big problem > is that long-running slave-side queries might still need tuples that are > vacuumable on the master, and so replication of vacuuming actions would > cause the slave's queries to deliver wrong answers.) From the 8.4dev documentation, another problem for read-only slaves would be: « Operations on hash indexes are not presently WAL-logged, so replay will not update these indexes. The recommended workaround is to manually REINDEX each such index after completing a recovery operation. » Sincerely, -- Mathias Brossard
On Thu, 2008-05-29 at 09:10 -0700, Josh Berkus wrote: > Joshua D. Drake wrote: > > > > The only question I have is... what does this give us that PITR doesn't > > give us? > > Since people seem to be unclear on what we're proposing: > > 8.4 Synchronous Warm Standby: makes PostgreSQL more suitable for HA > systems by eliminating failover data loss and cutting failover time. > What does this give us that Solaris Cluster, RedHat Cluster, DRBD, etc. don't give us? I am not trying to be a poison pill, but I am just not seeing the benefit over solutions that already exist. I could probably argue, if I had more time, that this solution doesn't do anything but make us look like we are half-baked in implementation. If the real goal is read-only slaves with synchronous capability, then let's implement that. If we can't do that by 8.4, it gets pushed to 8.5. We already have a dozen different utilities to give us what is being currently proposed. Sincerely, Joshua D. Drake
On 5/29/08, Aidan Van Dyk <aidan@highrise.ca> wrote: > * Dave Page <dpage@pgadmin.org> [080529 12:03]: > > On Thu, May 29, 2008 at 4:48 PM, Douglas McNaught <doug@mcnaught.org> wrote: > > > I think the idea is that WAL records would be shipped (possibly via > > > socket) and applied as they're generated, rather than on a > > > file-by-file basis. At least that's what "real-time" implies to me... > > > > Yes, we're talking real-time streaming (synchronous) log shipping. > > But synchronous streaming doesn't mean the WAL has to be *applied* on > the salve yet. Just that it has to be "safely" on the slave (i.e on > disk, not just in kernel buffers). > > The whole single-threaded WAL replay problem is going to rear it's ugly > head here too, and mean that a slave *won't* be able to keep up with a > busy master if it's actually trying to apply all the changes in > real-time. Well, actually, if it's synchronous, it will keep up, but it > just means that now your master is IO capabilities is limited to the > speed of the slaves single-threaded WAL application. I don't think that's a problem. If the user runs their server at the limit of write-bandwidth, that's their problem. IOW, with synchronous replication, we _want_ the master to be held back by the slaves. About the single-threading problem - afaik, the replay is mostly I/O bound, so threading would not buy you much. -- marko
* Marko Kreen <markokr@gmail.com> [080529 12:27]: > I don't think thats a problem. If the user runs its server at the > limit of write-bandwidth, thats its problem. > > IOW, with synchronous replication, we _want_ the server to lag behind > slaves. > > About the single-threading problem - afaik, the replay is mostly I/O bound > so threading would not buy you much. Right - the problem is that the master has N>1 backends working away, preloading the modified heap pages into shared buffers, where they are modified w/ WAL. This means the kernel/controller has many read-requests in flight at a time as the modifies/writes chug along. The slave has to read/modify/write every buffer, one at a time, as WAL arrives, meaning there is only ever 1 IO request in flight at a time. So the master has a queue of many parallel reads going on; the slave has a string of single random reads going on, one after another. a. -- Aidan Van Dyk Create like a god, aidan@highrise.ca command like a king, http://www.highrise.ca/ work like a slave.
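(A purely hypothetical standalone sketch, not something proposed above and not PostgreSQL code: one way a replaying slave could get more reads in flight is to scan ahead in the incoming WAL and hint the kernel about the heap blocks the serial apply loop is about to touch. The file paths, record structure, and all names here are invented for the example.)

    /* prefetch_sketch.c - illustrative only */
    #define _POSIX_C_SOURCE 200112L
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    #define BLCKSZ 8192

    struct wal_ref { const char *path; long blockno; };  /* block touched by one WAL record */

    static void
    prefetch_upcoming_blocks(const struct wal_ref *refs, int nrefs)
    {
        for (int i = 0; i < nrefs; i++)
        {
            int fd = open(refs[i].path, O_RDONLY);
            if (fd < 0)
                continue;
            /* ask the kernel to start reading this block now; apply happens later */
            posix_fadvise(fd, (off_t) refs[i].blockno * BLCKSZ, BLCKSZ,
                          POSIX_FADV_WILLNEED);
            close(fd);
        }
    }

    int main(void)
    {
        struct wal_ref upcoming[] = { { "base/16384/16385", 10 },
                                      { "base/16384/16402", 7 } };

        prefetch_upcoming_blocks(upcoming, 2);
        puts("hints issued; single-threaded apply would follow");
        return 0;
    }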
Andrew Dunstan <andrew@dunslane.net> writes: > Dave Page wrote: >> Yes, we're talking real-time streaming (synchronous) log shipping. > That's not what Tom's email said, AIUI. Sorry, I was a bit sloppy about that. If we go with a WAL-shipping solution it would be pretty easy to support both synchronous and asynchronous cases (synchronous == master doesn't report commit until the WAL is down to disk on the slaves too). There are different use-cases for both so it'd make sense to do both. regards, tom lane
David Fetter wrote: > This part is a deal-killer. It's a giant up-hill slog to sell warm > standby to those in charge of making resources available because the > warm standby machine consumes SA time, bandwidth, power, rack space, > etc., but provides no tangible benefit, and this feature would have > exactly the same problem. > > IMHO, without the ability to do read-only queries on slaves, it's not > worth doing this feature at all. +1 I would think that a read-only WAL slave is more valuable than a real-time backup. (especially as the topic is about adding slaves not increasing the effectiveness of backups) I also think that starting with a read-only WAL slave will ease the transition between delayed slave updating and real-time slave updating. -- Shane Ambler pgSQL (at) Sheeky (dot) Biz Get Sheeky @ http://Sheeky.Biz
David Fetter <david@fetter.org> writes: > On Thu, May 29, 2008 at 08:46:22AM -0700, Joshua D. Drake wrote: >> The only question I have is... what does this give us that PITR >> doesn't give us? > It looks like a wrapper for PITR to me, so the gain would be ease of > use. A couple of points about that: * Yeah, ease of use is a huge concern here. We're getting beat up because people have to go find a separate package (and figure out which one they want), install it, learn how to use it, etc. It doesn't help that the most mature package is Slony which is, um, not very novice-friendly or low-admin-complexity. I personally got religion on this about two months ago when Red Hat switched their bugzilla from Postgres to MySQL because the admins didn't want to deal with Slony any more. People want simple. * The proposed approach is trying to get to "real" replication incrementally. Getting rid of the loss window involved in file-by-file log shipping is step one, and I suspect that step two is going to be fixing performance issues in WAL replay to ensure that slaves can keep up. After that we'd start thinking about how to let slaves run read-only queries. But even without read-only queries, this will be a useful improvement for HA/backup scenarios. regards, tom lane
On Thu, 29 May 2008, David Fetter wrote: > It's a giant up-hill slog to sell warm standby to those in charge of > making resources available because the warm standby machine consumes SA > time, bandwidth, power, rack space, etc., but provides no tangible > benefit, and this feature would have exactly the same problem. This is an interesting commentary on the priorities of the customers you're selling to, but I don't think you can extrapolate from that too much. The deployments I normally deal with won't run a system unless there's a failover backup available, period, and the fact that such a feature is not integrated into the core yet is a major problem for them. Read-only slaves is a very nice to have, but by no means a prerequisite before core replication will be useful to some people. Hardware/machine resources are only worth a tiny fraction of what the data is in some environments, and in some of those downtime is really, really expensive. -- * Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
On Thu, 2008-05-29 at 09:18 -0700, Josh Berkus wrote: > Bruce, > > > Another idea I discussed with Tom is having the slave _delay_ applying > > WAL files until all slave snapshots are ready. > > > > Well, again, that only works for async mode. It depends on what we mean by synchronous. Do we mean "the WAL record has made it to the disk on the slave system," or "the WAL record has been applied on the slave system"? With this type of replication there will always be a difference for some small window, but most people would expect that window to be very small for synchronous replication. Regards, Jeff Davis
Josh, > What does this give us that Solaris Cluster, RedHat Cluster, DRBD etc.. > doesn't give us? Actually, these solutions all have some serious drawbacks, not the least of which is difficult administration (I speak from bitter personal experience). Also, most of them require installation at the filesystem level, something which often isn't available in a hosted environment. --Josh Berkus
On Thu, May 29, 2008 at 12:11:21PM -0400, Brian Hurt wrote: > > Being able to do read-only queries makes this feature more valuable in more > situations, but I disagree that it's a deal-breaker. Your managers are apparently more enlightened than some. ;-) A -- Andrew Sullivan ajs@commandprompt.com +1 503 667 4564 x104 http://www.commandprompt.com/
On Thu, May 29, 2008 at 07:20:37PM +0300, Marko Kreen wrote: > > So you can do lossless failover. Currently there is no good > solution for this. Indeed. Getting lossless failover would be excellent. I understand David's worry (having had those arguments more times than I care to admit), but if people don't want to spend the money on the extra machine that can't be queried, they can use another solution for the time being. The big missing piece is lossless failover. People are currently doing it with DRBD, various clustering things, &c., and those are complicated to set up and maintain. (As I've told more than one person looking at it, there is a risk that you'll actually make your installation complicated enough that you'll make it _less_ reliable. I have some bitter personal experiences with this effect, and I know some others on this list do as well.) A -- Andrew Sullivan ajs@commandprompt.com +1 503 667 4564 x104 http://www.commandprompt.com/
On Thu, May 29, 2008 at 02:13:26PM -0400, Andrew Sullivan wrote: > On Thu, May 29, 2008 at 12:11:21PM -0400, Brian Hurt wrote: > > Being able to do read-only queries makes this feature more > > valuable in more situations, but I disagree that it's a > > deal-breaker. > > Your managers are apparently more enlightened than some. ;-) Than most managers, at least in my experience, and since this feature is (IMHO rightly) based around broad adoption, it's a good thing to bring up. Cheers, David. -- David Fetter <david@fetter.org> http://fetter.org/ Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter Skype: davidfetter XMPP: david.fetter@gmail.com Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate
> in this case too. So each slave just needs to report its own longest > open tx as "open" to master. Yes, it bloats master but no way around it. Slaves should not report it every time or for every transaction. Vacuum on the master will ask them before doing any real work. -- Teodor Sigaev E-mail: teodor@sigaev.ru WWW: http://www.sigaev.ru/
On Thu, May 29, 2008 at 12:19 PM, Andrew Dunstan <andrew@dunslane.net> wrote: > That's not what Tom's email said, AIUI. "Synchronous" replication surely > means that the master and slave always have the same set of transactions > applied. Streaming <> synchronous. But streaming log shipping will allow us > to get get closer to synchronicity in some situations, i.e. the window for > missing transactions will be much smaller. > > Some of us were discussing this late on Friday night after PGcon. ISTM that > we can have either 1) fairly hot failover slaves that are guaranteed to be > almost up to date, or 2) slaves that can support read-only transactions but > might get somewhat out of date if they run long transactions. The big > problem is in having slaves which are both highly up to date and support > arbitrary read-only transactions. Maybe in the first instance, at least, we > need to make slaves choose which role they will play. I personally would be thrilled to have slaves be query-able in any fashion, even if 'wrong' under certain circumstances. Any asynchronous solution by definition gives the wrong answer on the slave. Read only slave is the #1 most anticipated feature in the circles I run with. It would literally transform how the database world thinks about postgres overnight. This, coupled with easier standby setup (a pg_archive to mirror pg_restore) would be most welcome! merlin
<font face="Verdana, Helvetica, Arial"><span style="font-size:12.0px">Hi everyone, <br /><br /> First of all, I’m absolutelydelighted that the PG community is thinking seriously about replication. <br /><br /> Second, having a solid,easy-to-use database availability solution that works more or less out of the box would be an enormous benefit to customers. Availability is the single biggest problem for customers in my experience and as other people have commented thealternatives are not nice. It’s an excellent idea to build off an existing feature—PITR is already pretty useful andthe proposed features are solid next steps. The fact that it does not solve all problems is not a drawback but meansit’s likely to get done in a reasonable timeframe. <br /><br /> Third, you can’t stop with just this feature. (Thisis the BUT part of the post.) The use cases <b>not</b> covered by this feature area actually pretty large. Here area few that concern me: <br /><br /> 1.) Partial replication. <br /> 2.) WAN replication. <br /> 3.) Bi-directional replication. (Yes, this is evil but there are problems where it is indispensable.) <br /> 4.) Upgrade support. Aside fromdatabase upgrade (how would this ever really work between versions?), it would not support zero-downtime app upgrades,which depend on bi-directional replication tricks. <br /> 5.) Heterogeneous replication. <br /> 6.) Finally, performancescaling using scale-out over large numbers of replicas. I think it’s possible to get tunnel vision on this—it’snot a big requirement in the PG community because people don’t use PG in the first place when they want to do this. They use MySQL, which has very good replication for performance scaling, though it’s rather weak for availability. <br /><br /> As a consequence, I don’t see how you can get around doing some sort of row-based replicationlike all the other databases. Now that people are starting to get religion on this issue I would strongly advocatea parallel effort to put in a change-set extraction API that would allow construction of comprehensive master/slavereplication. (Another approach would be to make it possible for third party apps to read the logs and regenerateSQL.) There are existing models for how to do change set extraction; we have done it several times at my companyalready. There are also research projects like GORDA that have looked fairly comprehensively at this problem. <br/><br /> My company would be quite happy to participate in or even sponsor such an API. Between the proposed WAL-basedapproach and change-set-based replication it’s not hard to see PG becoming the open source database of choice fora very large number of users. <br /><br /> Cheers, Robert<br /><br /> On 5/29/08 6:37 PM, "Tom Lane" <tgl@sss.pgh.pa.us>wrote:<br /><br /></span></font><blockquote><font face="Verdana, Helvetica, Arial"><span style="font-size:12.0px">DavidFetter <david@fetter.org> writes:<br /> > On Thu, May 29, 2008 at 08:46:22AM -0700,Joshua D. Drake wrote:<br /> >> The only question I have is... what does this give us that PITR<br /> >>doesn't give us?<br /><br /> > It looks like a wrapper for PITR to me, so the gain would be ease of<br /> >use.<br /><br /> A couple of points about that:<br /><br /> * Yeah, ease of use is a huge concern here. We're gettingbeat up<br /> because people have to go find a separate package (and figure out<br /> which one they want), installit, learn how to use it, etc. 
It doesn't<br /> help that the most mature package is Slony which is, um, not very<br/> novice-friendly or low-admin-complexity. I personally got religion<br /> on this about two months ago when RedHat switched their bugzilla<br /> from Postgres to MySQL because the admins didn't want to deal with Slony<br /> any more. People want simple.<br /><br /> * The proposed approach is trying to get to "real" replication<br /> incrementally. Getting rid of the loss window involved in file-by-file<br /> log shipping is step one, and I suspect thatstep two is going to be<br /> fixing performance issues in WAL replay to ensure that slaves can keep<br /> up. Afterthat we'd start thinking about how to let slaves run<br /> read-only queries. But even without read-only queries,this will be<br /> a useful improvement for HA/backup scenarios.<br /><br /> regards, tomlane<br /><br /> --<br /> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)<br /> To make changes toyour subscription:<br /><a href="http://www.postgresql.org/mailpref/pgsql-hackers">http://www.postgresql.org/mailpref/pgsql-hackers</a><br/><br /></span></font></blockquote><fontface="Verdana, Helvetica, Arial"><span style="font-size:12.0px"><br /><br /> -- <br />Robert Hodges, CTO, Continuent, Inc.<br /> Email: robert.hodges@continuent.com<br /> Mobile: +1-510-501-3728 Skype: hodgesrm<br /></span></font>
On Thu, May 29, 2008 at 3:05 PM, Robert Hodges <robert.hodges@continuent.com> wrote: > Third, you can't stop with just this feature. (This is the BUT part of the > post.) The use cases not covered by this feature area actually pretty > large. Here are a few that concern me: > > 1.) Partial replication. > 2.) WAN replication. > 3.) Bi-directional replication. (Yes, this is evil but there are problems > where it is indispensable.) > 4.) Upgrade support. Aside from database upgrade (how would this ever > really work between versions?), it would not support zero-downtime app > upgrades, which depend on bi-directional replication tricks. > 5.) Heterogeneous replication. > 6.) Finally, performance scaling using scale-out over large numbers of > replicas. I think it's possible to get tunnel vision on this—it's not a big > requirement in the PG community because people don't use PG in the first > place when they want to do this. They use MySQL, which has very good > replication for performance scaling, though it's rather weak for > availability. These types of things are what Slony is for. Slony is trigger based, which makes it more complex than log-shipping-style replication, but it provides lots of functionality. WAL-shipping-based replication is maybe the fastest possible solution... you are already paying the overhead, so it comes virtually for free from the point of view of the master. MySQL replication is IMO nearly worthless from a backup standpoint. merlin
Andrew Sullivan wrote: > On Thu, May 29, 2008 at 12:11:21PM -0400, Brian Hurt wrote: >> Being able to do read-only queries makes this feature more valuable in more >> situations, but I disagree that it's a deal-breaker. > Your managers are apparently more enlightened than some. ;-) No doubt. But defining the minimum acceptable feature set by the demands of the dumbest manager is a no-win proposition. Brian
On 5/29/08, Teodor Sigaev <teodor@sigaev.ru> wrote: > > in this case too. So each slave just needs to report its own longest > > open tx as "open" to master. Yes, it bloats master but no way around it. > > Slaves should not report it every time or every transaction. Vacuum on > master will ask them before doing a real work. That is a bit too slow if we want to do some page-level cleanup with regular transactions / bgwriter. But an update once or twice per minute should work fine. -- marko
On Thu, May 29, 2008 at 12:05:18PM -0700, Robert Hodges wrote: > people are starting to get religion on this issue I would strongly > advocate a parallel effort to put in a change-set extraction API > that would allow construction of comprehensive master/slave > replication. You know, I gave a talk in Ottawa just last week about how the last effort to develop a comprehensive API for replication failed. I had some ideas about why, the main one of which is something like this: "Big features with a roadmap have not historically worked, so unless we're willing to change the way we work, we won't get that." I don't think an API is what's needed; what's needed is clear proposals for particular features that can be delivered in small pieces. That's what the current proposal offers. I think any kind of row-based approach such as what you're proposing would need that kind of proposal too. That isn't to say that I think an API is impossible or undesirable. It is to say that the last few times we tried, it went nowhere, and I don't think the circumstances have changed. A -- Andrew Sullivan ajs@commandprompt.com +1 503 667 4564 x104 http://www.commandprompt.com/
On 5/29/08, Tom Lane <tgl@sss.pgh.pa.us> wrote: > * The proposed approach is trying to get to "real" replication > incrementally. Getting rid of the loss window involved in file-by-file > log shipping is step one, and I suspect that step two is going to be > fixing performance issues in WAL replay to ensure that slaves can keep > up. After that we'd start thinking about how to let slaves run > read-only queries. But even without read-only queries, this will be > a useful improvement for HA/backup scenarios. I agree with this plan, but I think it extends to read-only queries as well - we don't need to have the perfect, no-overhead solution as the first step; instead let's have a simple and working solution with some overhead, then improve on it. And for the first-step solution, I think letting VACUUM keep tuples around based on slave queries is preferable to letting slaves lag. That is useful in more situations. -- marko
David Fetter wrote: > Either one of these would be great, but something that involves > machines that stay useless most of the time is just not going to work. Lots of people do use warm standby already anyway, just not based on mechanisms built into PostgreSQL. So defining away this need is completely unrealistic based on my experience. Even if there were a read-only slave, lots of applications couldn't make use of it. Anyway, a common approach to making better use of the hardware is to put some other service on the otherwise-standby machine, which in turn uses your master database server machine as its failover target. Unless you run *only* a database, there would usually be some candidate that you could set up that way. Another common approach is to realize that for some the costs of a downtime risk are higher than buying some extra hardware. I think the consensus in the core team was that having synchronous log shipping in 8.4 would already be a worthwhile feature by itself.
Jeff Davis wrote: > It depends on what we mean by synchronous. Do we mean "the WAL record > has made it to the disk on the slave system," or "the WAL record has > been applied on the slave system"? DRBD, which is a common warm standby solution for PostgreSQL at the moment, provides various levels of synchronicity. I imagine we could also define, as need arises, various levels, some of which may be the ones you listed.
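(A standalone sketch, with assumed names, of the kind of per-transaction "levels of synchronicity" being discussed here - roughly the DRBD-style distinction between WAL received, WAL on the standby's disk, and WAL applied. It is not PostgreSQL or DRBD code; it only shows how long a master's commit would wait on the standby's acknowledgements before returning.)

    /* sync_levels_sketch.c - illustrative only */
    #include <stdio.h>

    typedef enum
    {
        REPL_ASYNC,           /* don't wait for the standby at all          */
        REPL_REMOTE_RECEIVE,  /* wait until WAL is in the standby's buffers */
        REPL_REMOTE_FSYNC,    /* wait until WAL is on the standby's disk    */
        REPL_REMOTE_APPLY     /* wait until WAL has been replayed           */
    } ReplicationLevel;

    /* What the standby last acknowledged, by WAL byte position. */
    struct standby_ack { long received; long fsynced; long applied; };

    static int
    commit_may_return(long commit_wal_pos, ReplicationLevel level,
                      const struct standby_ack *ack)
    {
        switch (level)
        {
            case REPL_ASYNC:          return 1;
            case REPL_REMOTE_RECEIVE: return ack->received >= commit_wal_pos;
            case REPL_REMOTE_FSYNC:   return ack->fsynced  >= commit_wal_pos;
            case REPL_REMOTE_APPLY:   return ack->applied  >= commit_wal_pos;
        }
        return 0;
    }

    int main(void)
    {
        struct standby_ack ack = { 5000, 4000, 3000 };

        printf("fsync level ok? %d\n", commit_may_return(4500, REPL_REMOTE_FSYNC, &ack));
        printf("apply level ok? %d\n", commit_may_return(4500, REPL_REMOTE_APPLY, &ack));
        return 0;
    }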
Merlin Moncure wrote: > Read only slave is the #1 most anticipated feature in the > circles I run with. Do these circles not know about slony and londiste?
Joshua D. Drake wrote: > What does this give us that Solaris Cluster, RedHat Cluster, DRBD etc.. > doesn't give us? I personally think that DRBD is a fine solution. But it only runs on Linux. And Solaris Cluster isn't the same as DRBD.
Andrew Sullivan <ajs@commandprompt.com> writes: > On Thu, May 29, 2008 at 12:05:18PM -0700, Robert Hodges wrote: >> people are starting to get religion on this issue I would strongly >> advocate a parallel effort to put in a change-set extraction API >> that would allow construction of comprehensive master/slave >> replication. > You know, I gave a talk in Ottawa just last week about how the last > effort to develop a comprehensive API for replication failed. Indeed, core's change of heart on this issue was largely driven by Andrew's talk and subsequent discussion. We had more or less been waiting for the various external replication projects to tell us what they wanted in this line, and it was only the realization that no such thing was likely to happen that forced us to think seriously about what could be done within the core project. As I said originally, we have no expectation that the proposed features will displace the existing replication projects for "high end" replication problems ... and I'd characterize all of Robert's concerns as "high end" problems. We are happy to let those be solved outside the core project. About the only thing that would make me want to consider row-based replication in core would be if we determine that read-only slave queries are impractical atop a WAL-log-shipping implementation. Which could happen; in fact I think that's the main risk of the proposed development plan. But I also think that the near-term steps of the plan are worth doing anyway, for various other reasons, and so we won't be out too much effort if the plan fails. regards, tom lane
Andrew Sullivan wrote: > The big missing piece is lossless failover. People are currently > doing it with DRBD, various clustering things, &c., and those are > complicated to set up and maintain. Well, we'll see at the end of this (we hope) how a setup procedure of DRBD vs. PG warm standby works from a standing start. I wouldn't be overly optimistic. ;-)
Mathias Brossard wrote: > From what I gather from those slides it seems to me that the NTT solution > is synchronous not asynchronous. In my opinion it's even better, but I do > understand that others might prefer asynchronous. I'm going to speculate, > but I would think it should be possible (without a substancial rewrite) to > support both modes (or even some intermediate modes, like DRBD on Linux). Well, we already have asynchronous. The whole point is adding synchronous.
Tom Lane wrote: > We believe that the most appropriate base technology for this is > probably real-time WAL log shipping, as was demoed by NTT OSS at PGCon. Now how do we get our hands on their code?
On Thu, May 29, 2008 at 3:59 PM, Peter Eisentraut <peter_e@gmx.net> wrote: > Merlin Moncure wrote: >> Read only slave is the #1 most anticipated feature in the >> circles I run with. > > Do these circles not know about slony and londiste? Sure. For various reasons mentioned elsewhere on this thread, a PITR based replication solution would have a lot of appeal, assuming it could be made to work. The main thing is that it is completely non-invasive to the master database and all the work is handled on the standby. It also operates on a trivial mechanism to transfer the files. merlin
Robert, > 1.) Partial replication. > 2.) WAN replication. > 3.) Bi-directional replication. (Yes, this is evil but there are > problems where it is indispensable.) > 4.) Upgrade support. Aside from database upgrade (how would this ever > really work between versions?), it would not support zero-downtime app > upgrades, which depend on bi-directional replication tricks. > 5.) Heterogeneous replication. > 6.) Finally, performance scaling using scale-out over large numbers of > replicas. I think it’s possible to get tunnel vision on this—it’s not a > big requirement in the PG community because people don’t use PG in the > first place when they want to do this. They use MySQL, which has very > good replication for performance scaling, though it’s rather weak for > availability. Let's not try to boil the ocean, hey? From my perspective, the above use cases are what complex tools like Slony, Bucardo, Skytools, Continuent, pgCluster, pgPool2, etc., etc. are for. Now, if you're saying that you want to develop row-based replication so that Continuent will work better, I'm all for it; but saying that we *shouldn't* implement the current spec which satisfies large numbers of users because it doesn't support *all* users is a recipe for self-defeat. We can't satisfy all users with one implementation, and we shouldn't try. I think, for that matter, that work on the common replication hooks supporting the external replication packages should continue. We need these for precisely the reasons you state. But ... single-master, single-slave, synch or asynch, whole-installation local network replication is a case which covers a *lot* of users' needs ... I'd argue the numerical majority. --Josh
On Thu, May 29, 2008 at 09:54:03PM +0200, Peter Eisentraut wrote: > David Fetter wrote: > > Either one of these would be great, but something that involves > > machines that stay useless most of the time is just not going to > > work. > > Lots of people do use warm standby already anyway, just not based on > mechanisms built into PostgreSQL. So defining away this need is > completely unrealistic based on my experience. Even if there were a > read-only slave, lots of applications couldn't make use of it. > > Anyway, a common approach to making better use of the hardware is to > put some other service on the otherwise-standby machine, which in > turn uses your master database server machine as its failover > target. Unless you run *only* a database, there would usually be > some candidate that you could set up that way. > > Another common approach is to realize that for some the costs of a > downtime risk are higher than buying some extra hardware. > I think the consensus in the core team was that having synchronous > log shipping in 8.4 would already be a worthwhile feature by itself. If that was in fact the consensus of the core team, and what I've been seeing from several core members in this thread makes that idea unclear, it's out of step with the stated goal of the feature. Having some kind of half-way, doesn't-actually-quite-work-out-of-the-box "replication" will make things worse and not better. Cheers, David. -- David Fetter <david@fetter.org> http://fetter.org/ Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter Skype: davidfetter XMPP: david.fetter@gmail.com Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate
David Fetter <david@fetter.org> writes: > On Thu, May 29, 2008 at 09:54:03PM +0200, Peter Eisentraut wrote: >> I think the consensus in the core team was that having synchronous >> log shipping in 8.4 would already be a worthwhile feature by itself. > If that was in fact the consensus of the core team, and what I've been > seeing from several core members in this thread makes that idea > unclear, it's out of step with the stated goal of the feature. Having > some kind of half-way, doesn't-actually-quite-work-out-of-the-box > "replication" will make things worse and not better. What is your justification for denigrating this plan with that? Or are you merely complaining because we know we won't be all the way there in 8.4? regards, tom lane
On Thu, May 29, 2008 at 04:44:19PM -0400, Tom Lane wrote: > David Fetter <david@fetter.org> writes: > > On Thu, May 29, 2008 at 09:54:03PM +0200, Peter Eisentraut wrote: > >> I think the consensus in the core team was that having > >> synchronous log shipping in 8.4 would already be a worthwhile > >> feature by itself. I'm sorry if I was unclear on this. I don't disagree that it would be a worthy feature. I do disagree with the idea, given this goal, that we can expect to ship 8.4 without a built-in replication and have that be good enough by itself. > > If that was in fact the consensus of the core team, and what I've > > been seeing from several core members in this thread makes that > > idea unclear, it's out of step with the stated goal of the > > feature. Having some kind of half-way, > > doesn't-actually-quite-work-out-of-the-box "replication" will make > > things worse and not better. > > What is your justification for denigrating this plan with that? Or > are you merely complaining because we know we won't be all the way > there in 8.4? Again, just my humble opinion, but given the stated goal, which I agree with, I'd say it's worth holding up 8.4 until some kind of out-of-the-box replication advances that goal, where Yet Another Toolkit Suitable For People Who Are Already Database Kernel Hackers To Build Replication Systems On Top Of does not. Cheers, David. -- David Fetter <david@fetter.org> http://fetter.org/ Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter Skype: davidfetter XMPP: david.fetter@gmail.com Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate
David Fetter wrote: > > What is your justification for denigrating this plan with that? Or > > are you merely complaining because we know we won't be all the way > > there in 8.4? > > Again, just my humble opinion, but given the stated goal, which I > agree with, I'd say it's worth holding up 8.4 until some kind of > out-of-the-box replication advances that goal, where Yet Another > Toolkit Suitable For People Who Are Already Database Kernel Hackers To > Build Replication Systems On Top Of does not. I am confused why we would hold 8.4 when no one said we should hold 8.3 because we didn't have built-in replication. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
David, >> I think the consensus in the core team was that having synchronous >> log shipping in 8.4 would already be a worthwhile feature by itself. > > If that was in fact the consensus of the core team, It is. > and what I've been > seeing from several core members in this thread makes that idea > unclear, it's out of step with the stated goal of the feature. Having > some kind of half-way, doesn't-actually-quite-work-out-of-the-box > "replication" will make things worse and not better. So, you've got a better implementation up your sleeve? I really don't get where you're coming from on this. Frankly, your e-mails seem gauged to be merely disruptive without any intent of constructive input. If you're opposed to working on replication in the core, then just say so. If you think that there's an easier way to develop M-S replication in the core than using WAL, then please present your solution. --Josh
On Thu, May 29, 2008 at 04:54:04PM -0400, Bruce Momjian wrote: > David Fetter wrote: > > > What is your justification for denigrating this plan with that? > > > Or are you merely complaining because we know we won't be all > > > the way there in 8.4? > > > > Again, just my humble opinion, but given the stated goal, which I > > agree with, I'd say it's worth holding up 8.4 until some kind of > > out-of-the-box replication advances that goal, where Yet Another > > Toolkit Suitable For People Who Are Already Database Kernel > > Hackers To Build Replication Systems On Top Of does not. > > I am confused why we would hold 8.4 when no one said we should hold > 8.3 because we didn't have built-in replication. If people on core had come to the idea that we needed to build in replication *before* 8.3 came out, they certainly didn't announce it. Now is a great time to mention this because it gives everybody time to: 1. Come to a consensus on what the out-of-the-box replication should be, and 2. Build, test and debug whatever the consensus out-of-the-box replication turns out to be. Cheers, David. -- David Fetter <david@fetter.org> http://fetter.org/ Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter Skype: davidfetter XMPP: david.fetter@gmail.com Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate
Bruce Momjian <bruce@momjian.us> writes: > David Fetter wrote: >> Again, just my humble opinion, but given the stated goal, which I >> agree with, I'd say it's worth holding up 8.4 until some kind of >> out-of-the-box replication advances that goal, where Yet Another >> Toolkit Suitable For People Who Are Already Database Kernel Hackers To >> Build Replication Systems On Top Of does not. > I am confused why we would hold 8.4 when no one said we should hold 8.3 > because we didn't have built-in replication. Possibly I am misinterpreting what David said, but I think his opinion is that synchronous log shipping in itself is useless, and that we should hold 8.4 until we have that *and* slave-side query execution. Suffice it to say that I disagree with the first position, and as for the second, I will never again be willing to hold a release for one single feature, no matter what it is. We've learned the folly of managing the project that way several times over. regards, tom lane
On Thu, May 29, 2008 at 01:55:42PM -0700, Josh Berkus wrote: > David, > >>> I think the consensus in the core team was that having synchronous >>> log shipping in 8.4 would already be a worthwhile feature by itself. >> >> If that was in fact the consensus of the core team, > > It is. > >> and what I've been seeing from several core members in this thread >> makes that idea unclear, it's out of step with the stated goal of >> the feature. Having some kind of half-way, >> doesn't-actually-quite-work-out-of-the-box "replication" will make >> things worse and not better. > > So, you've got a better implementation up your sleeve? Nope. > I really don't get where you're coming from on this. Frankly, your > e-mails seem gauged to be merely disruptive without any intent of > constructive input. I'm sorry they've come across that way. That was not my intent. > If you're opposed to working on replication in the core, then just > say so. If you think that there's an easier way to develop M-S > replication in the core than using WAL, then please present your > solution. I think having master-slave replication in the core using WAL is a *great* thing to do, doable, a good path to go on, etc., and I think it's worth holding up 8.4 until we have at least one actual out-of-the-box version of same. People have hinted that we might be able to get both a synchronous one and an asynchronous one based on WAL, which would be even better. :) Cheers, David. -- David Fetter <david@fetter.org> http://fetter.org/ Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter Skype: davidfetter XMPP: david.fetter@gmail.com Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate
David, > I think having master-slave replication in the core using WAL is a > *great* thing to do, doable, a good path to go on, etc., and I think > it's worth holding up 8.4 until we have at least one actual > out-of-the-box version of same. Ah, ok. Well, I can tell you that the core team is also united on the value of time-based as opposed to feature-based release cycles. *if* we can develop this in time for 8.4, everybody would be overjoyed, but I also think we should be realistic. > People have hinted that we might be able to get both a synchronous one > and an asynchronous one based on WAL, which would be even better. :) That's the idea, yes. No reason to dump asynch WAL copying when it's already working. --Josh
On Thu, May 29, 2008 at 01:39:29PM -0700, David Fetter wrote: > > I think the consensus in the core team was that having synchronous > > log shipping in 8.4 would already be a worthwhile feature by itself. > > If that was in fact the consensus of the core team, and what I've been > seeing from several core members in this thread makes that idea > unclear, it's out of step with the stated goal of the feature. Having > some kind of half-way, doesn't-actually-quite-work-out-of-the-box > "replication" will make things worse and not better. I don't agree with that. It seems to me that the stated goal is fourfold: 1. Get something that is simple (out of the box). 2. Get something that is generally useful, even if not universally useful. 3. Get something that is implementable in stages, with each stage representing an increase in functionality. 4. Aim for 8.4, but be realistic about chances. It seems to me that sync. replication to a write-only standby actually meets those goals. I can think of more than one case where I have spent lots of time, money, or both attempting to achieve this goal with various add-on hardware and software systems that mostly worked except when they didn't. These additions were complicated to set up, painful to use, and documented by sadists who couldn't write. I agree that there are plenty of people unwilling to use a system that requires additional hardware that mostly can't be used for other things. But those people already have alternatives (maybe even more, and simple ones, soon). The synchronous-needing crowd currently have nothing. The proposed feature would be a huge improvement. A -- Andrew Sullivan ajs@commandprompt.com +1 503 667 4564 x104 http://www.commandprompt.com/
Peter Eisentraut wrote: > Mathias Brossard wrote: >> From what I gather from those slides it seems to me that the NTT solution >> is synchronous not asynchronous. In my opinion it's even better, but I do >> understand that others might prefer asynchronous. I'm going to speculate, >> but I would think it should be possible (without a substancial rewrite) to >> support both modes (or even some intermediate modes, like DRBD on Linux). > > Well, we already have asynchronous. The whole point is adding synchronous. I don't think the current log shipping solution matches, as Tom puts it, the « need for simple, built-in replication » (from a user perspective, I wouldn't call it "simple"). I pointed out that the NTT solution is synchronous because Tom said in the first part of his email that: > In practice, simple asynchronous single-master-multiple-slave > replication covers a respectable fraction of use cases, so we have > concluded that we should allow such a feature to be included in the > core project. ... and yet "the most appropriate base technology for this" is synchronous, and maybe what I should also have pointed out in my previous mail is that it doesn't support multiple slaves. Also, as others have pointed out, there are different interpretations of "synchronous" depending on whether the WAL data has reached the other end of the network connection, a safe disk checkpoint or the slave DB itself. Sincerely, -- Mathias Brossard
On Thursday 29 May 2008 12:13:20 Bruce Momjian wrote: > David Fetter wrote: > > On Thu, May 29, 2008 at 11:58:31AM -0400, Bruce Momjian wrote: > > > Josh Berkus wrote: > > > > Publishing the XIDs back to the master is one possibility. We > > > > also looked at using "spillover segments" for vacuumed rows, but > > > > that seemed even less viable. > > > > > > > > I'm also thinking, for *async replication*, that we could simply > > > > halt replication on the slave whenever a transaction passes minxid > > > > on the master. However, the main focus will be on synchrounous > > > > hot standby. > > > > > > Another idea I discussed with Tom is having the slave _delay_ > > > applying WAL files until all slave snapshots are ready. > > > > Either one of these would be great, but something that involves > > machines that stay useless most of the time is just not going to work. > > Right, the ultimate target is to have the slave be read-only, but we > need to get the streaming of WAL logs done first. > I would have thought the read only piece would have been more important than the synchronous piece. In my experience readable slaves is the big selling point in both Oracle and MySQL's implementations, and people are not nearly as concerned if there is a small asynchronous window. It would also be easier to implement on some level; we have already solved the asynchronous WAL shipping problem, so we would just need to solve the read-only bits. For synchronous hot standby, you have to solve both the synchronous shipping and the read-only bits. Seems like more work with less upside than read-only slaves vs. the PITR warm standby we have now. Interesting that core views this differently. -- Robert Treat Build A Brighter LAMP :: Linux Apache {middleware} PostgreSQL
On Thu, 2008-05-29 at 17:42 -0400, Robert Treat wrote: > > I would have thought the read only piece would have been more important than > the synchronous piece. In my experience readable slaves is the big selling > point in both Oracle and MySQL's implementations, and people are not nearly > as concerned if there is a small asynchronous window. The read only piece is the more important piece from a market perspective. Joshua D. Drake
Robert Treat <xzilla@users.sourceforge.net> writes: > I would have thought the read only piece would have been more important than > the synchronous piece. In my experience readable slaves is the big selling > point in both Oracle and MySQL's implementations, and people are not nearly > as concerned if there is a small asynchronous window. Actually, it's the "easy to use" piece that's near the front of my own priority list. In any case, the expectation that we can get the log shipping fixed sooner than we can get read-only slaves has nothing to do with the relative desirability of the pieces; it's a technical judgment based on what we can see of the problems' difficulty. regards, tom lane
Hi, I'd first want to applaud the core team's decision: having bare PostgreSQL offer a reliable and simple-to-set-up synchronous replication solution is an excellent prospect! ... On 29 May 08, at 23:42, Robert Treat wrote: > I would have thought the read only piece would have been more > important than > the synchronous piece. In my experience readable slaves is the big > selling > point in both Oracle and MySQL's implementations, and people are not > nearly > as concerned if there is a small asynchronous window. ... Even more so when you're confronted with this exact problem. A fellow PG user ended up having both the WAL and the data replicated by DRBD (protocol C) and some heartbeat scripts to do the automatic failover. This wasn't easy to set up, and to some extent we're still concerned about the reliability part of it. We know about the "easy to use" part of it: we didn't get it. While at it, would it be possible for the "simple" part of the core team statement to include automatic failover? That would mean having the current master, when it's going to stop on a fatal error, tell the slave to warm up. Of course in case of a more severe crash the slave would have to be started by other means, but covering the fatal error path and having the master restart as a slave would only add to the reliability... wouldn't it? > It would also be easier to implement on some level; we have already > solved the > asynchronous WAL shipping problem, so we would just need to solve the > read-only bits. For synchronous hot standby, you have to solve both the > synchronous shipping and the read-only bits. Seems like more work > with less > upside than read-only slaves vs. the PITR warm standby we have now. > > Interesting that core views this differently. core seems to think the read-only slave is more complex than the synchronous slave, in terms of long-running read-only transactions on the slave vs. vacuums on the master, for example. Regards, -- dim
Dimitri Fontaine <dim@hi-media.com> writes: > While at it, would it be possible for the "simple" part of the core > team statement to include automatic failover? No, I think it would be a useless expenditure of energy. Failover includes a lot of things that are not within our purview: switching IP addresses to point to the new server, some kind of STONITH solution to keep the original master from coming back to life, etc. Moreover there are already projects/products concerned with those issues. It might be useful to document where to find solutions to that problem, but we can't take it on as part of core Postgres. regards, tom lane
Joshua D. Drake wrote: > On Thu, 2008-05-29 at 17:42 -0400, Robert Treat wrote: > > >> I would have thought the read only piece would have been more important than >> the synchronous piece. In my experience readable slaves is the big selling >> point in both Oracle and MySQL's implementations, and people are not nearly >> as concerned if there is a small asynchronous window. >> > > The read only piece is the more important piece from a market > perspective. > > > You must be gauging a different market from the one I'm in. I have just come back from a meeting with a (quite technically savvy) customer who was quite excited by the news and saw the possibility of read-only slaves as a nice-to-have extra rather than a must-have-or-it's-not-worth-anything feature. I'm really quite astounded and rather saddened by the waves of negativity I have seen today. cheers andrew
On Thu, 2008-05-29 at 18:39 -0400, Andrew Dunstan wrote: > > Joshua D. Drake wrote: > > On Thu, 2008-05-29 at 17:42 -0400, Robert Treat wrote: > > > > You must be gauging a different market from the one I'm in. I have just > come back from a meeting with a (quite technically savvy) customer who One customer does not make a hundred. I am not saying that the shipping isn't valid, just that those that I talk to are more interested in the read only slave. Consider that we have any number of ways to solve the problem we are considering implementing now. DRBD being just one of them. Especially considering we aren't talking about failover etc.. > was quite excited by the news and saw the possibility of read-only > slaves as a nice to have extra rather than a must-have-or-it's-not-worth > anything feature. > > I'm really quite astounded and rather saddened by the waves of > negativity I have seen today. I have seen much more positive than negative. I think most are just trying to figure out exactly what we are talking about. I for example am not really against the feature at all. I am suspect of the idea of only shipping 2 components of a 3 component feature though. Sincerely, Joshua D. Drake
"Joshua D. Drake" <jd@commandprompt.com> writes: > One customer does not make a hundred. I am not saying that the shipping > isn't valid, just that those that I talk to are more interested in the > read only slave. Consider that we have any number of ways to solve the > problem we are considering implementing now. DRBD being just one of > them. In principle, sure, but there's a big gap between theory and practice here. What you are arguing for is the "give the users a toolkit and let them figure it out" approach that David was lambasting upthread. People want the bits to go from point A to point B; they don't want to have to research, design, test, and administer their own solution for moving the bits. Furthermore, as several people have pointed out from their own bitter experiences, getting a reliable and easy-to-use solution is not as easy as it sounds. I think we have nontrivial work in front of us to build a simple, reliable, community-tested log shipping solution; and it's not very sexy work either. But it needs to get done, and it really needs to get done first. There's no point in having read-only slave queries if you don't have a trustworthy method of getting the data to them. regards, tom lane
On Thu, 2008-05-29 at 19:02 -0400, Tom Lane wrote: > "Joshua D. Drake" <jd@commandprompt.com> writes: > > One customer does not make a hundred. I am not saying that the shipping > > isn't valid, just that those that I talk to are more interested in the > > read only slave. Consider that we have any number of ways to solve the > > problem we are considering implementing now. DRBD being just one of > > them. > > In principle, sure, but there's a big gap between theory and practice > here. What you are arguing for is the "give the users a toolkit and > let them figure it out" approach that David was lambasting upthread. I think maybe my actual argument isn't coming through. What I am arguing for is not shipping XY without Z. That is all. (and no, I don't think we should hold up 8.4). In short. Let's do the feature but don't ship half of it. Ship the whole thing. If that means it doesn't get into 8.4, then it doesn't get into 8.4. > People want the bits to go from point A to point B; they don't want > to have to research, design, test, and administer their own solution > for moving the bits. You don't have to convince me. I agree with you. > I think we have nontrivial > work in front of us to build a simple, reliable, community-tested > log shipping solution; and it's not very sexy work either. But it > needs to get done, and it really needs to get done first. There's > no point in having read-only slave queries if you don't have a > trustworthy method of getting the data to them. O.k. I was with you until here. Log shipping ala pg_standby works fine now sans read-only slave. No, it isn't out of the box which I can see an argument for but it is certainly trustworthy. Or do you mean the synchronous part? Sincerely, Joshua D. Drake
tgl@sss.pgh.pa.us (Tom Lane) writes: > As I said originally, we have no expectation that the proposed features > will displace the existing replication projects for "high end" > replication problems ... and I'd characterize all of Robert's concerns > as "high end" problems. We are happy to let those be solved outside > the core project. I expect that one of the effects of having an "improved PITR" as a 'core' replication solution will be that Slony-I will become less popular amongst "naive users," and that is *not* a bad thing. I am quite sure that there are a lot of cases today where users would be better served by what falls out of this. It will mean that those that are left will, despite being fewer, be more cognizant that they are getting what they require. We should see somewhat fewer "expectation gaps," and that should allow more people to be more satisfied with what they are getting. -- let name="cbbrowne" and tld="cbbrowne.com" in name ^ "@" ^ tld;; http://linuxdatabases.info/info/linux.html Lisp stoppped itself FEP Command:
"Joshua D. Drake" <jd@commandprompt.com> writes: > I think maybe my actual argument isn't coming through. What I am arguing > for is not shipping XY without Z. That is all. (and no, I don't think we > should hold up 8.4). So we should keep all the work out of the tree until every part is done? No thanks; especially not when there is a perfectly respectable use-case for parts X and Y alone (whether it suits *your* uses or not). >> There's no point in having read-only slave queries if you don't have a >> trustworthy method of getting the data to them. > O.k. I was with you until here. Log shipping ala pg_standby works fine > now sans read-only slave. No, it isn't out of the box which I can see an > argument for but it is certainly trustworthy. Or do you mean the > synchronous part? How much testing has pg_standby really gotten? Some, sure, but it's a contrib module that wasn't even there before 8.3. Even ignoring the lag issue, I wouldn't trust it a whole lot if I were a DBA responsible for valuable data. As much as some folk would like to think that contrib is mainstream, it's not really in the same league as far as testing coverage goes. regards, tom lane
On Thu, May 29, 2008 at 7:12 PM, Joshua D. Drake <jd@commandprompt.com> wrote: > On Thu, 2008-05-29 at 19:02 -0400, Tom Lane wrote: >> I think we have nontrivial >> work in front of us to build a simple, reliable, community-tested >> log shipping solution; and it's not very sexy work either. But it >> needs to get done, and it really needs to get done first. There's >> no point in having read-only slave queries if you don't have a >> trustworthy method of getting the data to them. > > O.k. I was with you until here. Log shipping ala pg_standby works fine > now sans read-only slave. No, it isn't out of the box which I can see an > argument for but it is certainly trustworthy. Or do you mean the > synchronous part? I disagree...setting up pg_standby is more complex than it really has to be. There are several examples in the archives of people getting their standby solutions busted with partial WAL files, etc. I helped beta test pg_standby and there are a few 'gotchas' in getting it set up properly. pg_standby is not the problem (although there are some odd things about it), it's getting files from point A to point B. It would be nice to have 'pg_archive' which mirrors pg_standby and handles the work on the client side, for example. While some of us can work magic with rsync, etc., it would be nice to get things running with a few .conf settings and no external dependencies. merlin
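For context on the standby-side "gotchas" Merlin mentions, this is roughly what the receiving end looks like today with contrib/pg_standby; the archive path and trigger-file location below are invented for the sketch, not a recommendation:

    # recovery.conf on the standby -- minimal warm-standby sketch using contrib/pg_standby.
    # The server stays in recovery, restoring segments as they appear in the archive,
    # until someone creates the trigger file to promote it.
    restore_command = 'pg_standby -t /tmp/pgsql.trigger /mnt/wal_archive %f %p %r'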
On Thu, 29 May 2008, Tom Lane wrote: > There's no point in having read-only slave queries if you don't have a > trustworthy method of getting the data to them. This is a key statement that highlights the difference in how you're thinking about this compared to some other people here. As far as some are concerned, the already working log shipping *is* a trustworthy method of getting data to the read-only slaves. There are plenty of applications (web oriented ones in particular) where if you could direct read-only queries against a slave, the resulting combination would be a giant improvement over the status quo even if that slave was as much as archive_timeout behind the master. That quantity of lag is perfectly fine for a lot of the same apps that have read scalability issues. If you're someone who falls into that camp, the idea of putting the sync replication job before the read-only slave one seems really backwards. I fully accept that it may be the case that it doesn't make technical sense to tackle them in any order besides sync->read-only slaves because of dependencies in the implementation between the two. If that's the case, it would be nice to explicitly spell out what that was to deflect criticism of the planned prioritization. -- * Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
* Tom Lane <tgl@sss.pgh.pa.us> [080529 20:22]: > "Joshua D. Drake" <jd@commandprompt.com> writes: > > I think maybe my actual argument isn't coming through. What I am arguing > > for is not shipping XY without Z. That is all. (and no, I don't think we > > should hold up 8.4). > > So we should keep all the work out of the tree until every part is done? > No thanks; especially not when there is a perfectly respectable use-case > for parts X and Y alone (whether it suits *your* uses or not). Thank you. I would be really disappointed if some sort of synchronous WAL shipping patch was done and considered good, but just not applied because some semi-related "read only queries while WAL applying" patch wasn't yet considered done and good... Then again, the opposite is also true. If someone sat down and did the hard work and made read-only queries runnable on a slave applying WAL and the patch was considered done and good, I would be disappointed if it wasn't applied because the synchronous WAL shipping wasn't done. Both features are worthy on their own. But the first done should not be held up because the other is not ready. But because core specifically mentioned "ease of use" as one of the goals, I really think the "built-in WAL shipping" should be *their* focus first. Having read-only queries that no-one can use (well, sure, some of us could set it up reliably) isn't that useful to the "general PostgreSQL" community that I think they were thinking of when deciding on this. But their focus doesn't force anybody interested in other features to not work on their features. It's an OSS community... And the fact that NTT had a nice presentation on it means it's probably actually doable for 8.4: http://www.pgcon.org/2008/schedule/attachments/61_Synchronous%20Log%20Shipping%20Replication.pdf a. -- Aidan Van Dyk Create like a god, aidan@highrise.ca command like a king, http://www.highrise.ca/ work like a slave.
Tom Lane wrote: >>> There's no point in having read-only slave queries if you don't have a >>> trustworthy method of getting the data to them. > >> O.k. I was with you until here. Log shipping ala pg_standby works fine >> now sans read-only slave. No, it isn't out of the box which I can see an >> argument for but it is certainly trustworthy. Or do you mean the >> synchronous part? > > How much testing has pg_standby really gotten? Some, sure, but it's a > contrib module that wasn't even there before 8.3. Even ignoring the lag > issue, I wouldn't trust it a whole lot if I were a DBA responsible for > valuable data. As much as some folk would like to think that contrib > is mainstream, it's not really in the same league as far as testing > coverage goes. This is a pretty telling statement. Sincerely, Joshua D. Drake
Greg, > I fully accept that it may be the case that it doesn't make technical > sense to tackle them in any order besides sync->read-only slaves because > of dependencies in the implementation between the two. If that's the > case, it would be nice to explicitly spell out what that was to deflect > criticism of the planned prioritization. There's a very simple reason to prioritize the synchronous log shipping first; NTT may open source their solution and we'll get it a lot sooner than the other components. That is, we expect that synch log shipping is *easier* than read-only slaves and will get done sooner. Since there are quite a number of users who could use this, whether or not they can run queries on the slaves, why not ship that feature as soon as it's done? There's also a number of issues with using the current log shipping method for replication. In addition to the previously mentioned setup pains, there's the 16MB chunk size for shipping log segments, which is fine for data warehouses but kind of sucks for a web application with a 3GB database which may take 2 hours to go through 16MB. So we have to change the shipping method anyway, and if we're doing that, why not work on synch? Mind you, if someone wanted to get started on read-only slaves *right now* I can't imagine anyone would object. There's a number of problems to solve with recovery mode, table locking etc. that can use some work even before we deal with changes to log shipping, or XID writeback or any of the other issues. So, volunteers? -- Josh Berkus PostgreSQL @ Sun San Francisco
On Thu, May 29, 2008 at 07:02:56PM -0400, Tom Lane wrote: > People want the bits to go from point A to point B; they don't want > to have to research, design, test, and administer their own solution > for moving the bits. I agree with this. I think I probably know as well as most people -- perhaps less well than maybe two people in the world -- where most of the bodies in Slony are hidden, and I still find it a pain to administer. Other systems are only somewhat less troublesome; and I really do know about the concepts involved. I'm not tripping on important things. It's just some work to set up. Other systems hide that work. Given that (for instance) psql is really very easy to use once you know a few things, the ongoing pain of simple replication in Postgres is a big wart. A -- Andrew Sullivan ajs@commandprompt.com +1 503 667 4564 x104 http://www.commandprompt.com/
On Thu, May 29, 2008 at 9:26 PM, Josh Berkus <josh@agliodbs.com> wrote: >> I fully accept that it may be the case that it doesn't make technical >> sense to tackle them in any order besides sync->read-only slaves because >> of dependencies in the implementation between the two. If that's the >> case, it would be nice to explicitly spell out what that was to deflect >> criticism of the planned prioritization. > > There's a very simple reason to prioritize the synchronous log shipping first; > NTT may open source their solution and we'll get it a lot sooner than the > other components. That's a good argument. I just read the NTT document and the stuff looks fantastic. You've convinced me...it just doesn't seem prudent to forge ahead with hot standby without dealing with all the synchronous changes to WAL logging first. I just want you guys to understand how important hot standby is to a lot of people. Sync logging maybe less so, but having a proof of concept implementation significantly alters the bang/buck ratio. > That is, we expect that synch log shipping is *easier* than read-only slaves > and will get done sooner. Since there are quite a number of users who could > use this, whether or not they can run queries on the slaves, why not ship > that feature as soon as it's done? I think what dfetter, etc. were saying is that we should elevate the hot standby stuff to a requirement, or at least a future requirement. IOW, we should try and avoid doing anything which would make it harder than it already is. Please understand that I don't think people on the list were trying to be negative...the failure of hot standby to materialize in the 8.3 cycle was a bitter pill for many people. I personally see this new thinking as a hugely positive development. > There's also a number of issues with using the current log shipping method > for replication. In addition to the previously mentioned setup pains, there's > the 16MB chunk size for shipping log segments, which is fine for data > warehouses but kind of sucks for a web application with a 3GB database which > may take 2 hours to go through 16MB. So we have to change the shipping method > anyway, and if we're doing that, why not work on synch? Well, there is the archive_timeout setting...but point taken. A big use case for hot standby is OLTP environments where you get to combine HA and reporting server into a single box. > Mind you, if someone wanted to get started on read-only slaves *right now* I > can't imagine anyone would object. There's a number of problems to solve > with recovery mode, table locking etc. that can use some work even before we > deal with changes to log shipping, or XID writeback or any of the other > issues. So, volunteers? As I see it, sync logging, hot standby, and improved setup features are all mostly orthogonal. Florian took some pretty decent notes during his analysis and outlined the problem areas pretty well. That would be a starting point. It just strikes me that for all this stuff to have even a remote chance of making 8.4, the work needs to be divided up into teams and conquered separately. merlin
Josh Berkus wrote: > Greg, > > >> I fully accept that it may be the case that it doesn't make technical >> sense to tackle them in any order besides sync->read-only slaves because >> of dependencies in the implementation between the two. If that's the >> case, it would be nice to explicitly spell out what that was to deflect >> criticism of the planned prioritization. >> > > There's a very simple reason to prioritize the synchronous log shipping first; > NTT may open source their solution and we'll get it a lot sooner than the > other components. > I have been reading the slides from the NTT presentation, and I now really regret not having gone to that talk. It does seem quite heavy, though, including new background processes, heartbeat etc. > That is, we expect that synch log shipping is *easier* than read-only slaves > and will get done sooner. Since there are quite a number of users who could > use this, whether or not they can run queries on the slaves, why not ship > that feature as soon as it's done? > Indeed. > There's also a number of issues with using the current log shipping method > for replication. In addition to the previously mentioned setup pains, there's > the 16MB chunk size for shipping log segments, which is fine for data > warehouses but kind of sucks for a web application with a 3GB database which > may take 2 hours to go through 16MB. So we have to change the shipping method > anyway, and if we're doing that, why not work on synch? > Well, yes, but you do know about archive_timeout, right? No need to wait 2 hours. cheers andrew
Greg Smith <gsmith@gregsmith.com> writes: > I fully accept that it may be the case that it doesn't make technical > sense to tackle them in any order besides sync->read-only slaves because > of dependencies in the implementation between the two. Well, it's certainly not been my intention to suggest that no one should start work on read-only-slaves before we finish the other part. The point is that I expect the log shipping issues will be done first because they're easier, and it would be pointless to not release that feature if we had it. But since you mention it: one of the plausible answers for fixing the vacuum problem for read-only slaves is to have the slaves push an xmin back upstream to the master to prevent premature vacuuming. The current design of pg_standby is utterly incapable of handling that requirement. So there might be an implementation dependency there, depending on how we want to solve that problem. regards, tom lane
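A rough single-server analogue of the xmin-feedback idea, for readers following along: VACUUM already declines to remove rows that are still visible to the oldest registered snapshot, and feeding the standby's oldest query xmin back to the master would simply make standby queries count toward that same limit. The table and column names below are invented for the illustration:

    -- Session 1: stands in for a long-running query on the slave, whose
    -- snapshot would be reported back to the master as its xmin.
    BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE;
    SELECT count(*) FROM orders;        -- takes a snapshot and keeps it until commit

    -- Session 2: normal activity plus vacuum on the master.
    DELETE FROM orders WHERE shipped;   -- creates dead row versions
    VACUUM VERBOSE orders;              -- reports "N dead row versions cannot be removed yet"
                                        -- because session 1's snapshot still needs them

    -- Session 1:
    COMMIT;                             -- once the old snapshot is gone, a later VACUUM can reclaim the rows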
On Fri, May 30, 2008 at 10:40 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > But since you mention it: one of the plausible answers for fixing the > vacuum problem for read-only slaves is to have the slaves push an xmin > back upstream to the master to prevent premature vacuuming. The current > design of pg_standby is utterly incapable of handling that requirement. > So there might be an implementation dependency there, depending on how > we want to solve that problem. I think it would be best to not make the slave interfere with the master's operations; that's only going to increase the operational complexity of such a solution. There could be multiple slaves following a master, some serving data-warehousing queries, some for load-balancing reads, some others just for disaster recovery, and then some just to mitigate human errors by re-applying the logs with a delay. I don't think any one installation would see all of the above mentioned scenarios, but we need to take care of multiple slaves operating off of a single master; something similar to cascaded Slony-I. My two cents. Best regards, -- gurjeet[.singh]@EnterpriseDB.com singh.gurjeet@{ gmail | hotmail | indiatimes | yahoo }.com EnterpriseDB http://www.enterprisedb.com Mail sent from my BlackLaptop device
On Thu, May 29, 2008 at 01:58:34PM -0700, David Fetter wrote: > If people on core had come to the idea that we needed to build in > replication *before* 8.3 came out, they certainly didn't announce it. > > Now is a great time to mention this because it gives everybody time to: > > 1. Come to a consensus on what the out-of-the-box replication should > be, and > > 2. Build, test and debug whatever the consensus out-of-the-box > replication turns out to be. None of that is an argument for why this has to go in 8.4. I argued in Ottawa that the idea that you have to plan a feature for _the next release_ is getting less tenable with each release. This is because major new features for Postgres are now often big and complicated. The days of big gains from single victories are mostly over (though there are exceptions, like HOT). Postgres is already mature. As with a middle-aged person with a mortgage, longer-term planning is simply a necessary part of life now. There are two possibilities here. One is to have huge releases on much longer timetables. I think this is unsustainable in a free project, because people will get bored and go away if they don't get to use the results of their work in a reasonably short time frame. The other is to accept that sometimes, planning and development for new features will have to start a long time before actual release -- maybe planning and some coding for 2 releases out. That allows large features like the one we're discussing to be developed responsibly without making everything else wait for it. A -- Andrew Sullivan ajs@commandprompt.com +1 503 667 4564 x104 http://www.commandprompt.com/
On 5/30/08, Gurjeet Singh <singh.gurjeet@gmail.com> wrote: > On Fri, May 30, 2008 at 10:40 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > But since you mention it: one of the plausible answers for fixing the > > vacuum problem for read-only slaves is to have the slaves push an xmin > > back upstream to the master to prevent premature vacuuming. The current > > design of pg_standby is utterly incapable of handling that requirement. > > So there might be an implementation dependency there, depending on how > > we want to solve that problem. > > I think it would be best to not make the slave interfere with the master's > operations; that's only going to increase the operational complexity of such > a solution. I disagree - it's better to consider synchronized WAL-slaves as equal to master, so having queries there affect master is ok. You need to remember this solution tries not to replace 100-node Slony-I setups. You can run sanity checks on slaves or use them to load-balance read-only OLTP queries, but not random stuff. > There could be multiple slaves following a master, some serving > data-warehousing queries, some for load-balancing reads, some others just > for disaster recovery, and then some just to mitigate human errors by > re-applying the logs with a delay. To run warehousing queries you better use Slony-I / Londiste. For warehousing you want different / more indexes on tables anyway, so I think it's quite ok to say "don't do it" for complex queries on WAL-slaves. > I don't think any one installation would see all of the above mentioned > scenarios, but we need to take care of multiple slaves operating off of a > single master; something similar to cascaded Slony-I. Again, the synchronized WAL replication is not a generic solution. Use Slony/Londiste if you want to get totally independent slaves. Thankfully the -core has set concrete and limited goals, which means it is possible to see working code in a reasonable time. I think that should apply to read-only slaves too. If we try to make it handle any load, it will not be finished any time soon. Now if we limit the scope I've seen 2 variants thus far: 1) Keep slave max in sync, let the load there affect master (xmin). - Slave can be used to load-balance OLTP load. - Slave should not be used for complex queries. 2) If long query is running, let slave lag (avoid applying WAL data). - Slave cannot be used to load-balance OLTP load. - Slave can be used for complex queries (although no new indexes or temp tables can be created). I think 1) is the more important (and more easily implementable) case. For 2) we already have solutions (Slony/Londiste/Bucardo, etc) so there is no point in making an effort to solve this here. -- marko
Tom Lane wrote: > Simon Riggs <simon@2ndquadrant.com> writes: > >> On Fri, 2008-05-30 at 12:31 +0530, Gurjeet Singh wrote: >> >>> On Fri, May 30, 2008 at 10:40 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>> >>>> But since you mention it: one of the plausible answers for fixing the >>>> vacuum problem for read-only slaves is to have the slaves push an xmin >>>> back upstream to the master to prevent premature vacuuming. >>>> >>> I think it would be best to not make the slave interfere with the >>> master's operations; that's only going to increase the operational >>> complexity of such a solution. >>> > > >> We ruled that out as the-only-solution a while back. It does have the >> beauty of simplicity, so it may exist as an option or possibly the only >> way, for 8.4. >> > > Yeah. The point is that it's fairly clear that we could make that work. > A solution that doesn't impact the master at all would be nicer, but > it's not at all clear to me that one is possible, unless we abandon > WAL-shipping as the base technology. > > > Quite. Before we start ruling things out let's know what we think we can actually do. I hope that NTT will release their code ASAP so we will have a better idea of what we have and what we need. cheers andrew
Hi Tom, Thanks for the reasoned reply. As you saw from point #2 in my comments, I think you should do this feature. I hope this answers Josh Berkus' concern about my comments. You make a very interesting comment which seems to go to the heart of this design approach: > About the only thing that would make me want to consider row-based > replication in core would be if we determine that read-only slave > queries are impractical atop a WAL-log-shipping implementation. It's possible I'm misunderstanding some of the implementation issues, but it is striking that the detailed responses to your proposal list a number of low-level dependencies between master and slave states when replicating WAL records. It appears that you are designing a replication mechanism that works effectively between a master and a relatively small number of "nearby" slaves. This is clearly an important use case but it also seems clear that the WAL approach is not a general-purpose approach to replication. In other words, you'll incrementally get to that limited end point I describe. This will still leave a lot to be desired on read scaling, not to mention many other cases. Hence my original comments. However, rather than harp on that further I will open up a separate thread to describe a relatively small set of extensions to PostgreSQL that would be enabling for a wide range of replication applications. Contrary to popular opinion these extensions are actually well understood at the theory level and have been implemented as prototypes as well as in commercial patches multiple times in different databases. Those of us who are deeply involved in replication deserve just condemnation for not stepping up and getting our thoughts out on the table. Meanwhile, I would be interested in your reaction to these thoughts on the scope of the real-time WAL approach. There's obviously tremendous interest in this feature. A general description that goes beyond the NTT slides would be most helpful for further discussions. Cheers, Robert P.s., The NTT slides were really great. Takahiro and Masao deserve congratulations on an absolutely first-rate presentation. On 5/29/08 9:09 PM, "Tom Lane" <tgl@sss.pgh.pa.us> wrote: > Andrew Sullivan <ajs@commandprompt.com> writes: >> On Thu, May 29, 2008 at 12:05:18PM -0700, Robert Hodges wrote: >>> people are starting to get religion on this issue I would strongly >>> advocate a parallel effort to put in a change-set extraction API >>> that would allow construction of comprehensive master/slave >>> replication. > >> You know, I gave a talk in Ottawa just last week about how the last >> effort to develop a comprehensive API for replication failed. > > Indeed, core's change of heart on this issue was largely driven by > Andrew's talk and subsequent discussion. We had more or less been > waiting for the various external replication projects to tell us > what they wanted in this line, and it was only the realization that > no such thing was likely to happen that forced us to think seriously > about what could be done within the core project. > > As I said originally, we have no expectation that the proposed features > will displace the existing replication projects for "high end" > replication problems ... and I'd characterize all of Robert's concerns > as "high end" problems. We are happy to let those be solved outside > the core project. 
Simon Riggs <simon@2ndquadrant.com> writes: > On Fri, 2008-05-30 at 12:31 +0530, Gurjeet Singh wrote: >> On Fri, May 30, 2008 at 10:40 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>> But since you mention it: one of the plausible answers for fixing the >>> vacuum problem for read-only slaves is to have the slaves push an xmin >>> back upstream to the master to prevent premature vacuuming. >> >> I think it would be best to not make the slave interfere with the >> master's operations; that's only going to increase the operational >> complexity of such a solution. > We ruled that out as the-only-solution a while back. It does have the > beauty of simplicity, so it may exist as an option or possibly the only > way, for 8.4. Yeah. The point is that it's fairly clear that we could make that work. A solution that doesn't impact the master at all would be nicer, but it's not at all clear to me that one is possible, unless we abandon WAL-shipping as the base technology. regards, tom lane
On Fri, May 30, 2008 at 9:31 AM, Marko Kreen <markokr@gmail.com> wrote: > On 5/30/08, Gurjeet Singh <singh.gurjeet@gmail.com> wrote: >> >> I think it would be best to not make the slave interfere with the master's >> operations; that's only going to increase the operational complexity of such >> a solution. > > I disagree - it's better to consider synchronized WAL-slaves > as equal to master, so having queries there affect master is ok. > > You need to remember this solution tries not to replace 100-node Slony-I > setups. You can run sanity checks on slaves or use them to load-balance > read-only OLTP queries, but not random stuff. > >> There could be multiple slaves following a master, some serving >> data-warehousing queries, some for load-balancing reads, some others just >> for disaster recovery, and then some just to mitigate human errors by >> re-applying the logs with a delay. > > To run warehousing queries you better use Slony-I / Londiste. For > warehousing you want different / more indexes on tables anyway, > so I think it's quite ok to say "don't do it" for complex queries > on WAL-slaves. > >> I don't think any one installation would see all of the above mentioned >> scenarios, but we need to take care of multiple slaves operating off of a >> single master; something similar to cascaded Slony-I. > > Again, the synchronized WAL replication is not a generic solution. > Use Slony/Londiste if you want to get totally independent slaves. I strongly agree with Gurjeet. The warm standby replication mechanism is pretty simple and is wonderfully flexible with the one big requirement that your clusters have to be mirrors of each other. Synchronous WAL replication obviously needs some communication channel from the slave back to the master. Hopefully, it will be possible to avoid this for asynchronous shipping. merlin
On Fri, 2008-05-30 at 11:30 -0400, Andrew Dunstan wrote: > > Tom Lane wrote: > > Simon Riggs <simon@2ndquadrant.com> writes: > > > >> On Fri, 2008-05-30 at 12:31 +0530, Gurjeet Singh wrote: > >> > >>> On Fri, May 30, 2008 at 10:40 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > >>> > >>>> But since you mention it: one of the plausible answers for fixing the > >>>> vacuum problem for read-only slaves is to have the slaves push an xmin > >>>> back upstream to the master to prevent premature vacuuming. > >>>> > >>> I think it would be best to not make the slave interfere with the > >>> master's operations; that's only going to increase the operational > >>> complexity of such a solution. > >>> > >> We ruled that out as the-only-solution a while back. It does have the > >> beauty of simplicity, so it may exist as an option or possibly the only > >> way, for 8.4. > >> > > Yeah. The point is that it's fairly clear that we could make that work. > > A solution that doesn't impact the master at all would be nicer, but > > it's not at all clear to me that one is possible, unless we abandon > > WAL-shipping as the base technology. > > > Quite. Before we start ruling things out let's know what we think we can > actually do. Let me re-phrase: I'm aware of that possibility and believe we can and could do it for 8.4. My assessment is that people won't find it sufficient and I am looking at other alternatives also. There may be a better one possible for 8.4, there may not. Hence I've said "something in 8.4, something better later". There is no need to decide that is the only way forward, yet. I hope and expect to put some of these ideas into a more concrete form, but this has not yet happened. Nothing has slipped, not having any trouble getting on with it, just that my plans were to not start it yet. I think having a detailed design ready for review by September commit fest is credible. > I hope that NTT will release their code ASAP so we will have a better > idea of what we have and what we need. That has very little to do with Hot Standby, though there could be patch conflicts, which is why I'm aiming to get WAL streaming done first. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
On Thu, 29 May 2008 09:22:26 -0700 Steve Atkins wrote: > On May 29, 2008, at 9:12 AM, David Fetter wrote: > > > Either one of these would be great, but something that involves > > machines that stay useless most of the time is just not going to work. > > I have customers who are thinking about warm standby functionality, and > the only thing stopping them deploying it is complexity and maintenance, > not the cost of the HA hardware. If trivial-to-deploy replication that > didn't offer read-only access of the slaves were available today I'd bet > that most of them would be using it. Sure, I have a similar customer. They are right now using a set of Perl scripts which ship the logfiles to the slave, take care of the status, apply the logfiles, validate checksums, etc. The whole thing works very well in combination with RedHat cluster software, but it took several weeks to implement the current solution. Not everyone wants to spend the time and the manpower to implement a "simple" replication. Kind regards -- Andreas 'ads' Scherbaum German PostgreSQL User Group
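A minimal sketch of the kind of hand-rolled glue being described -- an archive_command wrapper that ships a segment to the standby and verifies a checksum before reporting success. The host name, paths and script name are assumptions for illustration only, not an existing tool:

    #!/bin/sh
    # ship_wal.sh -- hypothetical archive_command wrapper sketching the
    # "copy plus checksum" shipping that people currently script by hand.
    # postgresql.conf would reference it as:
    #   archive_command = '/usr/local/bin/ship_wal.sh %p %f'
    set -e

    WAL_PATH="$1"                      # %p: path of the segment to archive
    WAL_FILE="$2"                      # %f: file name only
    STANDBY="standby.example.com"      # assumed standby host
    DEST="/var/lib/pgsql/wal_archive"  # assumed archive directory on the standby

    # Copy under a temporary name so the standby never sees a half-written segment.
    scp -q "$WAL_PATH" "$STANDBY:$DEST/$WAL_FILE.tmp"

    # Compare checksums; a non-zero exit makes the server retry the command later.
    LOCAL_SUM=`md5sum "$WAL_PATH" | cut -d' ' -f1`
    REMOTE_SUM=`ssh "$STANDBY" "md5sum '$DEST/$WAL_FILE.tmp'" | cut -d' ' -f1`
    test "$LOCAL_SUM" = "$REMOTE_SUM"

    # Atomically expose the verified segment to the standby's restore_command.
    ssh "$STANDBY" "mv '$DEST/$WAL_FILE.tmp' '$DEST/$WAL_FILE'"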
On Fri, 2008-05-30 at 12:31 +0530, Gurjeet Singh wrote: > On Fri, May 30, 2008 at 10:40 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > But since you mention it: one of the plausible answers for > fixing the > vacuum problem for read-only slaves is to have the slaves push > an xmin > back upstream to the master to prevent premature vacuuming. > The current > design of pg_standby is utterly incapable of handling that > requirement. > So there might be an implementation dependency there, > depending on how > we want to solve that problem. > > I think it would be best to not make the slave interfere with the > master's operations; that's only going to increase the operational > complexity of such a solution. > > There could be multiple slaves following a master, some serving > data-warehousing queries, some for load-balancing reads, some others > just for disaster recovery, and then some just to mitigate human > errors by re-applying the logs with a delay. Agreed. We ruled that out as the-only-solution a while back. It does have the beauty of simplicity, so it may exist as an option or possibly the only way, for 8.4. -- Simon Riggs www.2ndQuadrant.comPostgreSQL Training, Services and Support
On Thu, 29 May 2008 18:29:01 -0400 Tom Lane wrote: > Dimitri Fontaine <dim@hi-media.com> writes: > > While at it, would it be possible for the "simple" part of the core > > team statement to include automatic failover? > > No, I think it would be a useless expenditure of energy. Failover > includes a lot of things that are not within our purview: switching > IP addresses to point to the new server, some kind of STONITH solution > to keep the original master from coming back to life, etc. Moreover > there are already projects/products concerned with those issues. True words. Failover is not and should not be part of PostgreSQL. But PG can help the failover solution, as example: an easy-to-use interface about the current slave status comes into my mind. Other ideas might also be possible. > It might be useful to document where to find solutions to that problem, > but we can't take it on as part of core Postgres. Ack Kind regards -- Andreas 'ads' Scherbaum German PostgreSQL User Group
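As a concrete example of the kind of help Andreas mentions, pg_standby already exposes a trigger file that an external failover manager can use to promote the standby; the IP takeover and STONITH parts stay in the clusterware, in line with Tom's point. A sketch, with made-up paths:

  # recovery.conf on the standby: keep replaying WAL until the trigger file appears
  restore_command = 'pg_standby -t /tmp/pgsql.trigger /walarchive %f %p %r'

  # the cluster manager (heartbeat, RHCS, ...) promotes the standby by creating the file
  touch /tmp/pgsql.trigger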
On Thu, 29 May 2008 23:02:56 -0400 Andrew Dunstan wrote: > Well, yes, but you do know about archive_timeout, right? No need to wait > 2 hours. Then you ship 16 MB binary stuff every 30 second or every minute but you only have some kbyte real data in the logfile. This must be taken into account, especially if you ship the logfile over the internet (means: no high-speed connection, maybe even pay-per-traffic) to the slave. Kind regards -- Andreas 'ads' Scherbaum German PostgreSQL User Group
On Thu, May 29, 2008 at 7:42 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > The big problem > is that long-running slave-side queries might still need tuples that are > vacuumable on the master, and so replication of vacuuming actions would > cause the slave's queries to deliver wrong answers. Another issue with read-only slaves just popped up in my head. How do we block the readers on the slave while it is replaying an ALTER TABLE or similar command that requires an exclusive lock and potentially alters the table's structure? Or does the WAL replay already take an x-lock on such a table? Best regards, -- gurjeet[.singh]@EnterpriseDB.com singh.gurjeet@{ gmail | hotmail | indiatimes | yahoo }.com EnterpriseDB http://www.enterprisedb.com Mail sent from my BlackLaptop device
On Fri, 30 May 2008, Andreas 'ads' Scherbaum wrote: > Then you ship 16 MB binary stuff every 30 second or every minute but > you only have some kbyte real data in the logfile. Not if you use pg_clearxlogtail ( http://www.2ndquadrant.com/replication.htm ), which got lost in the giant March commitfest queue but should probably wander into contrib as part of 8.4. -- * Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
On Fri, 2008-05-30 at 11:12 -0700, Robert Hodges wrote: > This is clearly an important use case but it also seems clear that > the WAL approach is not a general-purpose approach to replication. I think we cannot make such a statement yet, if ever. I would note that log-based replication is now the mainstay of commercial database replication techniques for loosely-coupled groups of servers. It would seem strange to assume that it should not be good for us too, simply because we know it to be difficult. IMHO the project has a pretty good track record of delivering functionality that looked hard at first glance. -- Simon Riggs www.2ndQuadrant.comPostgreSQL Training, Services and Support
On Friday 30 May 2008 01:10:20 Tom Lane wrote: > Greg Smith <gsmith@gregsmith.com> writes: > > I fully accept that it may be the case that it doesn't make technical > > sense to tackle them in any order besides sync->read-only slaves because > > of dependencies in the implementation between the two. > > Well, it's certainly not been my intention to suggest that no one should > start work on read-only-slaves before we finish the other part. The > point is that I expect the log shipping issues will be done first > because they're easier, and it would be pointless to not release that > feature if we had it. > > But since you mention it: one of the plausible answers for fixing the > vacuum problem for read-only slaves is to have the slaves push an xmin > back upstream to the master to prevent premature vacuuming. The current > design of pg_standby is utterly incapable of handling that requirement. > So there might be an implementation dependency there, depending on how > we want to solve that problem. > Sure, but who's to say that after synchronous WAL shipping is "finished" it won't need a serious re-write due to new needs from the hot standby feature. I think going either way carries some risk. -- Robert Treat Build A Brighter LAMP :: Linux Apache {middleware} PostgreSQL
On Thursday 29 May 2008 22:59:21 Merlin Moncure wrote: > On Thu, May 29, 2008 at 9:26 PM, Josh Berkus <josh@agliodbs.com> wrote: > >> I fully accept that it may be the case that it doesn't make technical > >> sense to tackle them in any order besides sync->read-only slaves because > >> of dependencies in the implementation between the two. If that's the > >> case, it would be nice to explicitly spell out what that was to deflect > >> criticism of the planned prioritization. > > > > There's a very simple reason to prioritize the synchronous log shipping > > first; NTT may open source their solution and we'll get it a lot sooner > > than the other components. > > That's a good argument. I just read the NTT document and the stuff > looks fantastic. You've convinced me... It would be a better argument if the NTT guys hadn't said that they estimated 6 months time before the code would be released, which puts us beyond 8.4. Now it is possible that the time frame could be sooner, but unless someone already has the patch, this reminds me a little too much of the arguments for including windows support in a single release because we already had a work port/patch set to go from. -- Robert Treat Build A Brighter LAMP :: Linux Apache {middleware} PostgreSQL
On Thursday 29 May 2008 20:31:31 Greg Smith wrote: > On Thu, 29 May 2008, Tom Lane wrote: > > There's no point in having read-only slave queries if you don't have a > > trustworthy method of getting the data to them. > > This is a key statement that highlights the difference in how you're > thinking about this compared to some other people here. As far as some > are concerned, the already working log shipping *is* a trustworthy method > of getting data to the read-only slaves. There are plenty of applications > (web oriented ones in particular) where if you could direct read-only > queries against a slave, the resulting combination would be a giant > improvement over the status quo even if that slave was as much as > archive_timeout behind the master. That quantity of lag is perfectly fine > for a lot of the same apps that have read scalability issues. > > If you're someone who falls into that camp, the idea of putting the sync > replication job before the read-only slave one seems really backwards. > Just looking at it from an overall market perspective, synchronous log shipping pretty much only addresses failover needs, where as read-only slaves address both failover and scaling issues. (Note I say address, not solve). -- Robert Treat Build A Brighter LAMP :: Linux Apache {middleware} PostgreSQL
Andreas 'ads' Scherbaum wrote: > On Thu, 29 May 2008 23:02:56 -0400 Andrew Dunstan wrote: > > >> Well, yes, but you do know about archive_timeout, right? No need to wait >> 2 hours. >> > > Then you ship 16 MB binary stuff every 30 second or every minute but > you only have some kbyte real data in the logfile. This must be taken > into account, especially if you ship the logfile over the internet > (means: no high-speed connection, maybe even pay-per-traffic) to the > slave. > > > > Sure there's a price to pay. But that doesn't mean the facility doesn't exist. And I rather suspect that most of Josh's customers aren't too concerned about traffic charges or affected by such bandwidth restrictions. Certainly, none of my clients are, and they aren't in the giant class. Shipping a 16Mb file, particularly if compressed, every minute or so, is not such a huge problem for a great many commercial users, and even many domestic users. cheers andrew
On Sat, May 31, 2008 at 1:52 AM, Greg Smith <gsmith@gregsmith.com> wrote:
> On Fri, 30 May 2008, Andreas 'ads' Scherbaum wrote:
>> Then you ship 16 MB binary stuff every 30 second or every minute but
>> you only have some kbyte real data in the logfile.
>
> Not if you use pg_clearxlogtail ( http://www.2ndquadrant.com/replication.htm ),
> which got lost in the giant March commitfest queue but should probably wander
> into contrib as part of 8.4.

This means we need to modify pg_standby to not check for filesize when reading XLogs.

Best regards,
--
gurjeet[.singh]@EnterpriseDB.com
singh.gurjeet@{ gmail | hotmail | indiatimes | yahoo }.com
EnterpriseDB http://www.enterprisedb.com
Mail sent from my BlackLaptop device
> Andreas 'ads' Scherbaum wrote: > > On Thu, 29 May 2008 23:02:56 -0400 Andrew Dunstan wrote: > > > > > >> Well, yes, but you do know about archive_timeout, right? No need to wait > >> 2 hours. > >> > > > > Then you ship 16 MB binary stuff every 30 second or every minute but > > you only have some kbyte real data in the logfile. This must be taken > > into account, especially if you ship the logfile over the internet > > (means: no high-speed connection, maybe even pay-per-traffic) to the > > slave. > > > > > > > > > > Sure there's a price to pay. But that doesn't mean the facility doesn't > exist. And I rather suspect that most of Josh's customers aren't too > concerned about traffic charges or affected by such bandwidth > restrictions. Certainly, none of my clients are, and they aren't in the > giant class. Shipping a 16Mb file, particularly if compressed, every > minute or so, is not such a huge problem for a great many commercial > users, and even many domestic users. Sumitomo Electric Co., Ltd., a Japanese company with 20 billion dollars in annual sales (and parent company of Sumitomo Electric Information Systems Co., Ltd., one of the companies supporting the Recursive SQL development), uses 100 PostgreSQL servers. They are doing backups using log shipping to another data center and have problems with the amount of log data being transferred. They said this is one of the big problems they have with PostgreSQL and hope it will be solved in the near future. -- Tatsuo Ishii SRA OSS, Inc. Japan
On Sat, 2008-05-31 at 02:48 +0530, Gurjeet Singh wrote: > > Not if you use pg_clearxlogtail > ( http://www.2ndquadrant.com/replication.htm ), which got lost > in the giant March commitfest queue but should probably wander > into contrib as part of 8.4. > > This means we need to modify pg_standby to not check for filesize when > reading XLogs. > > Best regards, > It does. Joshua D. Drake >
On Sat, 31 May 2008, Gurjeet Singh wrote: >> Not if you use pg_clearxlogtail > > This means we need to modify pg_standby to not check for filesize when > reading XLogs. No, the idea is that you run the segments through pg_clearxlogtail | gzip, which then compresses lightly used segments massively because all the unused bytes are 0. File comes out the same size at the other side, but you didn't ship a full 16MB if there was only a few KB used. -- * Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
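For reference, a sketch of what that pipeline could look like in the archive/restore hooks, assuming pg_clearxlogtail filters a segment from stdin to stdout as the pipeline above implies (paths are placeholders):

  # master: zero the unused tail of each segment, then compress it before shipping
  archive_command = 'pg_clearxlogtail < %p | gzip > /walarchive/%f.gz'

  # standby: decompress back into a full-size segment for replay
  restore_command = 'gunzip < /walarchive/%f.gz > %p'

A mostly idle segment compresses to almost nothing this way, while the standby still sees the full 16MB file it expects.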
On Fri, 2008-05-30 at 01:10 -0400, Tom Lane wrote: > Greg Smith <gsmith@gregsmith.com> writes: > > I fully accept that it may be the case that it doesn't make technical > > sense to tackle them in any order besides sync->read-only slaves because > > of dependencies in the implementation between the two. > > Well, it's certainly not been my intention to suggest that no one should > start work on read-only-slaves before we finish the other part. The > point is that I expect the log shipping issues will be done first > because they're easier, and it would be pointless to not release that > feature if we had it. Agreed. I'm arriving late to a thread that seems to have grown out of all proportion. AFAICS streaming WAL and hot standby are completely orthogonal features. Streaming WAL is easier and if NTT can release their code to open source we may get this in the Sept commit fest. Hot Standby is harder and it was my viewpoint at PGCon that we may not have a perfect working version of this by the end of 8.4. We are very likely to have something working, but maybe not the whole feature set as we might wish to have. I expect to be actively working on this "soon". I definitely do want to see WAL streaming going in as early as possible and before end of 8.4, otherwise code conflicts and other difficulties are likely to push out the 8.4 date and/or Hot Standby. So as I see it, Tom has only passed on my comments on this, not added or removed anything. The main part of the announcement was really about bringing the WAL streaming into core and effectively favouring it over a range of other projects. Can we all back off a little on this for now? Various concerns have been validly expressed, but it will all come good AFAICS. -- Simon Riggs www.2ndQuadrant.comPostgreSQL Training, Services and Support
On Friday, May 30, 2008, Tom Lane wrote: > No, I think it would be a useless expenditure of energy. Failover > includes a lot of things that are not within our purview: switching > IP addresses to point to the new server, some kind of STONITH solution > to keep the original master from coming back to life, etc. Moreover > there are already projects/products concerned with those issues. Well, I forgot that there's in fact no active plan to put pgbouncer features into core at the moment (I'm sure I've read something about this on the lists though). If that were the case, the slave could proxy queries to the master, and stop proxying but serve them itself if the master tells it it's dying. This way, there is no need to switch IP addresses: the clients just connect as usual and get results back, and do not have to know whether the host they're querying against is a slave or a master. This level of smartness belongs in -core. The STONITH part in case of known (fatal) failure does not seem that hard either, as the master at fatal time could write somewhere that it's now a slave and use this at next startup time (recovery.conf?). If it can't even do that, well, we're back to a crash situation with no automatic failover solution provided. Failure cases that are not handled will obviously continue to exist. I'm not asking for "all cases managed in -core please", just for "some level of effort" on the topic. Of course, I'm just the one asking questions and trying to raise ideas, so I'm perfectly fine with your current answer (useless expenditure of energy) even if somewhat disagreeing on the useless part of it :) As for the integrated pgbouncer daemon part, I'm thinking this would allow the infamous part 3 of the proposal (read-only slave) to be pretty simple to set up when ready: the slave knows who its master is, and as soon as an XID is needed the transaction queries are forwarded/proxied to it. Thanks again Florian! > It might be useful to document where to find solutions to that problem, > but we can't take it on as part of core Postgres. Even the part where it makes sense (provided it does and I'm not completely off track here)? Regards, -- dim
On Fri, 30 May 2008 16:22:41 -0400 (EDT) Greg Smith wrote: > On Fri, 30 May 2008, Andreas 'ads' Scherbaum wrote: > > > Then you ship 16 MB binary stuff every 30 second or every minute but > > you only have some kbyte real data in the logfile. > > Not if you use pg_clearxlogtail ( > http://www.2ndquadrant.com/replication.htm ), which got lost in the giant > March commitfest queue but should probably wander into contrib as part of > 8.4. Yes, this topic was discussed several times in the past but to solve this it needs a patch/solution which is integrated into PG itself, not contrib. Kind regards -- Andreas 'ads' Scherbaum German PostgreSQL User Group
On Thu, 2008-05-29 at 10:12 -0400, Tom Lane wrote: > The Postgres core team met at PGCon to discuss a few issues, the largest > of which is the need for simple, built-in replication for PostgreSQL. > Historically the project policy has been to avoid putting replication > into core PostgreSQL, so as to leave room for development of competing > solutions, recognizing that there is no "one size fits all" replication > solution. However, it is becoming clear that this policy is hindering > acceptance of PostgreSQL to too great an extent, compared to the benefit > it offers to the add-on replication projects. Users who might consider > PostgreSQL are choosing other database systems because our existing > replication options are too complex to install and use for simple cases. > In practice, simple asynchronous single-master-multiple-slave > replication covers a respectable fraction of use cases, so we have > concluded that we should allow such a feature to be included in the core > project. We emphasize that this is not meant to prevent continued > development of add-on replication projects that cover more complex use > cases. > > We believe that the most appropriate base technology for this is > probably real-time WAL log shipping, as was demoed by NTT OSS at PGCon. > We hope that such a feature can be completed for 8.4. Ideally this > would be coupled with the ability to execute read-only queries on the > slave servers, but we see technical difficulties that might prevent that > from being completed before 8.5 or even further out. (The big problem > is that long-running slave-side queries might still need tuples that are > vacuumable on the master, and so replication of vacuuming actions would > cause the slave's queries to deliver wrong answers.) > > Again, this will not replace Slony, pgPool, Continuent, Londiste, or > other systems for many users, as it will be not be highly scalable nor > support long-distance replication nor replicating less than an entire > installation. But it is time to include a simple, reliable basic > replication feature in the core system. I'm in full support of this and commend the work of the NTT team. The goals and timescales are realistic and setting a timetable in this way will help planning for many users, I'm expecting to lead the charge on the Hot Standby project. The problem mentioned is just one of the issues, though overall I'm now optimistic about our eventual success in that area. I'm discussing this now with a couple of sponsors and would welcome serious financial commitments to this goal. Please contact me off-list if you agree also. -- Simon Riggs www.2ndQuadrant.comPostgreSQL Training, Services and Support
Andrew, > Sure there's a price to pay. But that doesn't mean the facility doesn't > exist. And I rather suspect that most of Josh's customers aren't too > concerned about traffic charges or affected by such bandwidth > restrictions. Certainly, none of my clients are, and they aren't in the > giant class. Shipping a 16Mb file, particularly if compressed, every > minute or so, is not such a huge problem for a great many commercial > users, and even many domestic users. The issue is that when you're talking about telecommunications companies (and similar) once a minute isn't adequate. Those folks want at least every second, or even better synchronous. Anyway, this is a pretty pointless discussion given that we want both capabilities, and stuff will get implemented in the order it makes technical sense. -- --Josh Josh Berkus PostgreSQL @ Sun San Francisco
On Fri, 30 May 2008 17:05:57 -0400 Andrew Dunstan wrote: > Andreas 'ads' Scherbaum wrote: > > On Thu, 29 May 2008 23:02:56 -0400 Andrew Dunstan wrote: > > > >> Well, yes, but you do know about archive_timeout, right? No need to wait > >> 2 hours. > > > > Then you ship 16 MB binary stuff every 30 second or every minute but > > you only have some kbyte real data in the logfile. This must be taken > > into account, especially if you ship the logfile over the internet > > (means: no high-speed connection, maybe even pay-per-traffic) to the > > slave. > > Sure there's a price to pay. But that doesn't mean the facility doesn't > exist. And I rather suspect that most of Josh's customers aren't too > concerned about traffic charges or affected by such bandwidth > restrictions. Certainly, none of my clients are, and they aren't in the > giant class. Shipping a 16Mb file, particularly if compressed, every > minute or so, is not such a huge problem for a great many commercial > users, and even many domestic users. The real problem is not the 16 MB, the problem is: you can't compress this file. If the logfile is rotated it still contains all the old binary data which is not a good starter for compression. So you may have some kB changes in the wal logfile every minute but you still copy 16 MB data. Sure, it's not so much - but if you rotate a logfile every minute this still transfers 16*60*24 = ~23 GB a day. Kind regards -- Andreas 'ads' Scherbaum German PostgreSQL User Group
Tatsuo Ishii wrote: >> Andreas 'ads' Scherbaum wrote: >> >>> On Thu, 29 May 2008 23:02:56 -0400 Andrew Dunstan wrote: >>> >>> >>> >>>> Well, yes, but you do know about archive_timeout, right? No need to wait >>>> 2 hours. >>>> >>>> >>> Then you ship 16 MB binary stuff every 30 second or every minute but >>> you only have some kbyte real data in the logfile. This must be taken >>> into account, especially if you ship the logfile over the internet >>> (means: no high-speed connection, maybe even pay-per-traffic) to the >>> slave. >>> >>> >>> >>> >>> >> Sure there's a price to pay. But that doesn't mean the facility doesn't >> exist. And I rather suspect that most of Josh's customers aren't too >> concerned about traffic charges or affected by such bandwidth >> restrictions. Certainly, none of my clients are, and they aren't in the >> giant class. Shipping a 16Mb file, particularly if compressed, every >> minute or so, is not such a huge problem for a great many commercial >> users, and even many domestic users. >> > > Sumitomo Electric Co., Ltd., a 20 billion dollars selling company in > Japan (parent company of Sumitomo Electric Information Systems Co., > Ltd., which is one of the Recursive SQL development support company) > uses 100 PostgreSQL servers. They are doing backups by using log > shipping to another data center and have problems with the amount of > the transferring log data. They said this is one of the big problems > they have with PostgreSQL and hope it will be solved in the near > future. > > Excellent data point. Now, what I'd like to know is whether they are getting into trouble simply because of the volume of log data generated or because they have a short archive_timeout set. If it's the former (which seems more likely) then none of the ideas I have seen so far in this discussion seemed likely to help, and that would indeed be a major issue we should look at. Another question is this: are they being overwhelmed by the amount of network traffic generated, or by difficulty in postgres producers and/or consumers to keep up? If it's network traffic, then perhaps compression would help us. Maybe we need to set some goals for the level of log volumes we expect to be able to create/send/comsume. cheers andrew
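One way to start answering Andrew's questions is simply to measure on the archive side; a rough sketch, assuming the segments land in /walarchive:

  # how many segments arrived in the last 24 hours, and how much space they use
  find /walarchive -name '0000*' -mmin -1440 | wc -l
  du -sh /walarchive

  # how compressible one segment is under the current write load
  SEG=$(ls /walarchive/0000* | head -n 1)
  gzip -c "$SEG" | wc -c

Comparing those numbers against archive_timeout and the available network budget should show whether the pain comes from raw WAL volume, from the padding of mostly empty segments, or from the shipping frequency.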
On Sat, May 31, 2008 at 3:41 AM, Greg Smith <gsmith@gregsmith.com> wrote:
> On Sat, 31 May 2008, Gurjeet Singh wrote:
>>> Not if you use pg_clearxlogtail
>>
>> This means we need to modify pg_standby to not check for filesize when
>> reading XLogs.
>
> No, the idea is that you run the segments through pg_clearxlogtail | gzip,
> which then compresses lightly used segments massively because all the unused
> bytes are 0. File comes out the same size at the other side, but you didn't
> ship a full 16MB if there was only a few KB used.

Got it. I remember reading about pg_clearxlogtail in these mailing lists; but somehow forgot how it actually worked!

--
gurjeet[.singh]@EnterpriseDB.com
singh.gurjeet@{ gmail | hotmail | indiatimes | yahoo }.com
EnterpriseDB http://www.enterprisedb.com
Mail sent from my BlackLaptop device
On Friday, May 30, 2008, Dimitri Fontaine wrote: > This way, there is no need to switch IP addresses: the clients just connect as usual > and get results back, and do not have to know whether the host they're > querying against is a slave or a master. This level of smartness belongs in > -core. Oh, and if you want clients to connect to a single IP and hit either the master or the slave with some weights to choose one or the other, and a way to remove a node from the pool on failure etc, I think using haproxy in TCP mode would do it. HaProxy is really nice for this purpose. http://haproxy.1wt.eu/ Regards, -- dim
On Fri, May 30, 2008 at 6:47 PM, Andreas 'ads' Scherbaum <adsmail@wars-nicht.de> wrote: > On Fri, 30 May 2008 17:05:57 -0400 Andrew Dunstan wrote: >> Andreas 'ads' Scherbaum wrote: >> > On Thu, 29 May 2008 23:02:56 -0400 Andrew Dunstan wrote: >> > >> >> Well, yes, but you do know about archive_timeout, right? No need to wait >> >> 2 hours. >> > >> > Then you ship 16 MB binary stuff every 30 second or every minute but >> > you only have some kbyte real data in the logfile. This must be taken >> > into account, especially if you ship the logfile over the internet >> > (means: no high-speed connection, maybe even pay-per-traffic) to the >> > slave. >> >> Sure there's a price to pay. But that doesn't mean the facility doesn't >> exist. And I rather suspect that most of Josh's customers aren't too >> concerned about traffic charges or affected by such bandwidth >> restrictions. Certainly, none of my clients are, and they aren't in the >> giant class. Shipping a 16Mb file, particularly if compressed, every >> minute or so, is not such a huge problem for a great many commercial >> users, and even many domestic users. > > The real problem is not the 16 MB, the problem is: you can't compress > this file. If the logfile is rotated it still contains all the > old binary data which is not a good starter for compression. Using bzip2 in my archive_command script, my WAL files are normally compressed to between 2MB and 5MB, depending on the write load (larger, and more of them, in the middle of the day). bzip2 compression is more expensive and rotated WAL files are not particularly compressable to be sure, but due to (and given) the nature of the data bzip2 works pretty well, and much better than gzip. > > So you may have some kB changes in the wal logfile every minute but you > still copy 16 MB data. Sure, it's not so much - but if you rotate a > logfile every minute this still transfers 16*60*24 = ~23 GB a day. > I archived 1965 logs yesterday on one instance of my app totalling 8.5GB ... not to bad, really. -- Mike Rylander| VP, Research and Design| Equinox Software, Inc. / The Evergreen Experts| phone: 1-877-OPEN-ILS (673-6457)|email: miker@esilibrary.com| web: http://www.esilibrary.com
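A stripped-down version of that kind of archive script, with every path a placeholder; writing to a temporary name and renaming keeps the standby from ever picking up a half-written file:

  #!/bin/sh
  # called as: archive_command = '/usr/local/bin/archive_wal.sh %p %f'
  WAL_PATH="$1"   # full path of the finished segment
  WAL_FILE="$2"   # bare segment file name
  bzip2 -c "$WAL_PATH" > "/walarchive/$WAL_FILE.bz2.tmp" &&
    mv "/walarchive/$WAL_FILE.bz2.tmp" "/walarchive/$WAL_FILE.bz2"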
On Sat, May 31, 2008 at 2:18 AM, Mike Rylander <mrylander@gmail.com> wrote: > On Fri, May 30, 2008 at 6:47 PM, Andreas 'ads' Scherbaum > <adsmail@wars-nicht.de> wrote: >> On Fri, 30 May 2008 17:05:57 -0400 Andrew Dunstan wrote: >>> Andreas 'ads' Scherbaum wrote: >>> > On Thu, 29 May 2008 23:02:56 -0400 Andrew Dunstan wrote: >>> > >>> >> Well, yes, but you do know about archive_timeout, right? No need to wait >>> >> 2 hours. >>> > >>> > Then you ship 16 MB binary stuff every 30 second or every minute but >>> > you only have some kbyte real data in the logfile. This must be taken >>> > into account, especially if you ship the logfile over the internet >>> > (means: no high-speed connection, maybe even pay-per-traffic) to the >>> > slave. >>> >>> Sure there's a price to pay. But that doesn't mean the facility doesn't >>> exist. And I rather suspect that most of Josh's customers aren't too >>> concerned about traffic charges or affected by such bandwidth >>> restrictions. Certainly, none of my clients are, and they aren't in the >>> giant class. Shipping a 16Mb file, particularly if compressed, every >>> minute or so, is not such a huge problem for a great many commercial >>> users, and even many domestic users. >> >> The real problem is not the 16 MB, the problem is: you can't compress >> this file. If the logfile is rotated it still contains all the >> old binary data which is not a good starter for compression. > > Using bzip2 in my archive_command script, my WAL files are normally > compressed to between 2MB and 5MB, depending on the write load > (larger, and more of them, in the middle of the day). bzip2 > compression is more expensive and rotated WAL files are not > particularly compressable to be sure, but due to (and given) the > nature of the data bzip2 works pretty well, and much better than gzip. Compression especially is going to negate one of the big advantages of wal shipping, namely that it is cheap investment in terms of load to the main. A gigabit link can ship a lot of log files, you can always bond and 10gige is coming. IMO the key trick is to make sure you don't send the log file more than once from the same source...i.e cascading relay. merlin
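The cascading-relay idea amounts to paying the master's bandwidth exactly once and letting every standby pull from a relay box instead; a sketch with hypothetical host names (a production setup would wrap the pull side in pg_standby-style waiting rather than a bare restore_command):

  # master: ship each finished segment once, to the relay
  archive_command = 'rsync -a %p relay:/walarchive/%f'

  # each standby: fetch segments from the relay, never from the master
  restore_command = 'rsync -a relay:/walarchive/%f %p'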
Merlin Moncure wrote: > On Sat, May 31, 2008 at 2:18 AM, Mike Rylander <mrylander@gmail.com> wrote: >> On Fri, May 30, 2008 at 6:47 PM, Andreas 'ads' Scherbaum > Compression especially is going to negate one of the big advantages of > wal shipping, namely that it is cheap investment in terms of load to > the main. A gigabit link can ship a lot of log files, you can always Who has a gigabit link between Dallas and Atlanta? That is the actual problem here. Switch to Switch compression is a waste of time (if you aren't running GiGE, what are you doing???). Sincerely, Joshua D. Drake
Hi Merlin,

My point here is that with reasonably small extensions to the core you can build products that are a lot better than SLONY. Triggers do not cover DDL, among other issues, and it's debatable whether they are the best way to implement quorum policies like Google's semi-synchronous replication. As I mentioned separately, this topic deserves another thread which I promise to start.

It is of course possible to meet some of these needs with an appropriate client interface to WAL shipping. There's no a-priori reason why built-in PostgreSQL slaves need to be the only client. I would put a vote in for covering this possibility in the initial replication design. We are using a very similar approach in our own master/slave replication product.

Thanks, Robert

P.S., No offense intended to Jan Wieck et al. There are some pretty cool things in SLONY.

On 5/29/08 8:16 PM, "Merlin Moncure" <mmoncure@gmail.com> wrote:

> On Thu, May 29, 2008 at 3:05 PM, Robert Hodges
> <robert.hodges@continuent.com> wrote:
> > Third, you can't stop with just this feature. (This is the BUT part of the
> > post.) The use cases not covered by this feature are actually pretty
> > large. Here are a few that concern me:
> >
> > 1.) Partial replication.
> > 2.) WAN replication.
> > 3.) Bi-directional replication. (Yes, this is evil but there are problems
> > where it is indispensable.)
> > 4.) Upgrade support. Aside from database upgrade (how would this ever
> > really work between versions?), it would not support zero-downtime app
> > upgrades, which depend on bi-directional replication tricks.
> > 5.) Heterogeneous replication.
> > 6.) Finally, performance scaling using scale-out over large numbers of
> > replicas. I think it's possible to get tunnel vision on this -- it's not a big
> > requirement in the PG community because people don't use PG in the first
> > place when they want to do this. They use MySQL, which has very good
> > replication for performance scaling, though it's rather weak for
> > availability.
>
> These type of things are what Slony is for. Slony is trigger based.
> This makes it more complex than log shipping style replication, but
> provides lots of functionality.
>
> wal shipping based replication is maybe the fastest possible
> solution...you are already paying the overhead so it comes virtually
> for free from the point of view of the master.
>
> mysql replication is imo nearly worthless from backup standpoint.
>
> merlin

--
Robert Hodges, CTO, Continuent, Inc.
Email: robert.hodges@continuent.com
Mobile: +1-510-501-3728 Skype: hodgesrm
On Sun, Jun 1, 2008 at 11:58 AM, Robert Hodges <robert.hodges@continuent.com> wrote: > Hi Merlin, > > My point here is that with reasonably small extensions to the core you can > build products that are a lot better than SLONY. Triggers do not cover > DDL, among other issues, and it's debatable whether they are the best way to > implement quorum policies like Google's semi-synchronous replication. As I > mentioned separately this topic deserves another thread which I promise to > start. These issues are much discussed and well understood. At this point, the outstanding points of discussion are technical...how to make this thing work. merlin
"Merlin Moncure" <mmoncure@gmail.com> writes: > On Sun, Jun 1, 2008 at 11:58 AM, Robert Hodges > <robert.hodges@continuent.com> wrote: >> My point here is that with reasonably small extensions to the core you can >> build products that are a lot better than SLONY. > These issues are much discussed and well understood. Well, what we know is that previous attempts to define replication hooks to be added to the core have died for lack of interest. Maybe Robert can start a new discussion that will actually get somewhere; if so, more power to him. (Is the replica-hooks-discuss list still working?) But that is entirely orthogonal to what is proposed in this thread, which is to upgrade the existing PITR support into a reasonably useful replication feature. regards, tom lane
On Thu, May 29, 2008 at 4:12 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > The Postgres core team met at PGCon to discuss a few issues, the largest > of which is the need for simple, built-in replication for PostgreSQL. [...] > We believe that the most appropriate base technology for this is 1> probably real-time WAL log shipping, as was demoed by NTT OSS at PGCon. > We hope that such a feature can be completed for 8.4. Ideally this > would be coupled with the ability to execute read-only queries on the > slave servers, but we see technical difficulties that might prevent that > from being completed before 8.5 or even further out. (The big problem > is that long-running slave-side queries might still need tuples that are > vacuumable on the master, and so replication of vacuuming actions would > cause the slave's queries to deliver wrong answers.) > > Again, this will not replace Slony, pgPool, Continuent, Londiste, or > other systems for many users, as it will be not be highly scalable nor > support long-distance replication nor replicating less than an entire > installation. But it is time to include a simple, reliable basic > replication feature in the core system. Hello! I thought I would share a few thoughts of my own about the issue. I have a hands-on experience with Oracle and MySQL apart from PostgreSQL so I hope it will be a bit interesting. The former has a feature called "physical standby", which looks quite like our WAL-shipping based replication. Simply archived logs are replayed on the standby database. A primary database and standby database are connected, and can stream the logs directly. They either copy the log when its finished (as we do now) or can do it in coninuous manner (as I hope we will be able to). It is possible to have a synchronous replication (where "COMMIT" on primary database succeeds when the data is safely stored on the standby database). I think such a feature would be a great advantage for PostgreSQL (where you cannot afford to loose any transactions). Their standby database is not accessible. It can be opened read-only, but during that time replication stops. So PostgreSQL having read-only and still replicating standby database would be great. The other method is "logical standby" which works by dissecting WAL-logs and recreating DDLs/DMLs from it. Never seen anyone use it. ;-) Then we have a mysql replication -- done by replaying actual DDLs/DMLs on the slaves. This approach has issues, most notably when slaves are highly loaded and lag behind the master -- so you end up with infrastructure to monitor lags and turn off slaves which lag too much. Also it is painful to setup -- you have to stop, copy, configure and run. * Back to PostgreSQL world As for PostgreSQL solutions we have a slony-I, which is great as long as you don't have too many people managing the database and/or your schema doesn't change too frequently. Perhaps it would be maintainable more easily if there would be to get DDLs (as DDL triggers or similar). Its main advantages for me is ability to prepare complex setups and easily add new slaves). The pgpool solution is quite nice but then again adding a new slave is not so easy. And being a filtering layer between client and server it feels a bit "fragile" (I know it is not, but then again it is harder to convince someone that "yes it will work 100% right all the time"). 
* How I would like PostgreSQL WAL-replication to evolve: First of all it would be great if a slave/standby would contact the master and maintain the state with it (tell it its xmin, request a log to stream, go online-streaming). Especially I hope that it should be possible to make a switchover (where the two databases exchange roles), and in this the direct connection between the two should help. In detail, I think it should go like this: * A slave database starts up, checks that it works as a "replica" (hopefully it would not be a postgresql.conf constant, but rather some file maintained by the database). * It would connect to the master database, tell where in the WAL it is now, and request a log N. * If log N is not available, request a log from external supplied script (so that it could be fetched from log archive repository somewhere, recovered from a backup tape, etc). * Continue asking, until we get to the logs which are available at master database. * Continue replaying until we get within max_allowed_replication_lag time, and open our slave for read-only queries. * If we start lagging too much perhaps close the read-only access to the database (perhaps configurable?). I think that replication should be easy to set up. I think our archive_command is quite easy, but many a person come with a lot of misconceptions how it works (and it takes time to explain them how it actually work, especially what is archive_command for, and that pg_start_backup() doesn't actually _do_ backup, but just tells PostgreSQL that backup is being done). Easy to setup and easy to switchover (change the direction of replication) and failover. Also, it would be interesting to be able to have a shared-storage solution where a single database would modify data and multiple read-only slaves would access the data. The WAL-shipping would be used then to "push" modified pages into read-only slaves' memory. And each checkpoint would allow read-only slaves to release these pages from memory. Kind of replication without actually doing any writes. * high available libpq? Other thing worth to consider is perhaps adding a HA-capability to libpq. Currently I use a setup where I have machine with database and pg_bouncer (for connection pooling at database level) and on client machines I have a pgpool on localhost (for transparent failover). I think some level of replication awareness could be built-in into libpq. It would allow to make simpler, but still HA, setups. Perhaps ability to specify a standby in connect string, something like: "host=foo1.bar;db=pgdb;failover=yes;stbyhost=foo2.bar" ...a failover="yes" would allow libpq to try connecting to stbyhost if host is not available AND succeed if stbyhost announces itself as a new master. This would make switchover easier (clients would be told, that we are closing, and that they should connect to a new master). Also some way of supplying multiple hosts, where client should pick a host at random, would be quite useful, say: "host=foo1.bar,foo2.bar,foo3.bar;db=pgdb;loadbalance=random" Why in libpq and not in [language/framework/application]? Well it would help with not reinventing the wheel. And if we want a painless switchover, client should be aware that there is a standby ready. Also, being optional it wouldn't change anything for those who don't want it. 
Regards, Dawid PS: As for pgpool/pg_bouncer -- I wish there would be a way for them to tell the database that they are proxying for someone else (so that I could check in pg_stat_activity who is on the far end of the database connection, not "localhost" ;)). -- Solving [site load issues] with [more database replication] is a lot like solving your own personal problems with heroin - at first it sorta works, but after a while things just get out of hand.
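Until libpq grows something like the failover connect string proposed above, a thin wrapper around connection setup can approximate it from the client side; a sketch reusing the hypothetical host names from the proposal:

  #!/bin/sh
  # try the candidate hosts in order and keep the first one that answers
  for H in foo1.bar foo2.bar; do
      if psql -h "$H" -d pgdb -At -c 'SELECT 1' >/dev/null 2>&1; then
          PGHOST="$H"; export PGHOST
          break
      fi
  done

What such a wrapper cannot do is tell a freshly promoted master from a standby that merely accepts connections, which is exactly why the proposal wants the server itself to announce its role.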
David Fetter wrote: > This part is a deal-killer. It's a giant up-hill slog to sell warm > standby to those in charge of making resources available because the > warm standby machine consumes SA time, bandwidth, power, rack space, > etc., but provides no tangible benefit, and this feature would have > exactly the same problem. > > IMHO, without the ability to do read-only queries on slaves, it's not > worth doing this feature at all. > That's not something that squares with my experience *at all*, which admittedly is entirely in investment banks. Business continuity is king, and in some places the warm standby rep from the database vendor is trusted more than block-level rep from the SAN vendor (though that may be changing to some extent in favour of the SAN). James
Aidan Van Dyk wrote: > The whole single-threaded WAL replay problem is going to rear its ugly > head here too, and mean that a slave *won't* be able to keep up with a > busy master if it's actually trying to apply all the changes in > real-time. Is there a reason to commit at the same points that the master committed? Wouldn't relaxing that mean that at least you would get 'big' commits and some economy of scale? It might not be too bad. All I can say is that Sybase warm standby is useful, even though the rep for an update that changes a hundred rows is a hundred updates keyed on primary key, which is pretty sucky in terms of T-SQL performance.
On Fri, 2008-05-30 at 15:16 -0400, Robert Treat wrote: > On Friday 30 May 2008 01:10:20 Tom Lane wrote: > > Greg Smith <gsmith@gregsmith.com> writes: > > > I fully accept that it may be the case that it doesn't make technical > > > sense to tackle them in any order besides sync->read-only slaves because > > > of dependencies in the implementation between the two. > > > > Well, it's certainly not been my intention to suggest that no one should > > start work on read-only-slaves before we finish the other part. The > > point is that I expect the log shipping issues will be done first > > because they're easier, and it would be pointless to not release that > > feature if we had it. > > > > But since you mention it: one of the plausible answers for fixing the > > vacuum problem for read-only slaves is to have the slaves push an xmin > > back upstream to the master to prevent premature vacuuming. The current > > design of pg_standby is utterly incapable of handling that requirement. > > So there might be an implementation dependency there, depending on how > > we want to solve that problem. > > > > Sure, but whose to say that after synchronous wal shipping is "finished" it > wont need a serious re-write due to new needs from the hot standby feature. I > think going either way carries some risk. The simplest form of synchronous wal shipping would not even need postgresql running on slave, just a small daemon which reports when wal blocks are a) received and b) synced to disk. This setup would just guarantee "no data loss on single machine failure". form there on you could add various features, including support for both switchover and failover, async replication to multiple slaves, etc. the only thing that needs anything additional from slave wal-receiving daemon is when you want the kind of wal-sync which would guarantee that read-only query on slave issued after commit returns from master sees latest data. for this kinds of guarantees you need at least feedback about wal-replay, but possibly also shared transaction numbers and shared snapshots, to be sure that OLTP type queries see the latest and OLAP queries are not denied seeing VACUUMED on master. -------------- Hannu
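To make those acknowledgement levels concrete, here is a deliberately tiny illustration of that "small daemon" idea using ssh as the transport; every host name and path below is made up. The master knows the block was received when the remote cat finishes, and knows it is on disk when the remote sync completes and the acknowledgement comes back:

  # run on the master for one WAL segment (sketch only, not a real protocol)
  SEG=000000010000000000000042
  ssh standby "cat > /walspool/$SEG && sync && echo SYNCED" \
      < "$PGDATA/pg_xlog/$SEG" | grep -q SYNCED \
    && echo "segment $SEG received and synced on the standby"

The different guarantee levels discussed here then correspond to how much of that pipeline the master waits for before acknowledging a commit.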
On Thu, 2008-05-29 at 13:37 -0400, Tom Lane wrote: > David Fetter <david@fetter.org> writes: > > On Thu, May 29, 2008 at 08:46:22AM -0700, Joshua D. Drake wrote: > >> The only question I have is... what does this give us that PITR > >> doesn't give us? > > > It looks like a wrapper for PITR to me, so the gain would be ease of > > use. > > A couple of points about that: > > * Yeah, ease of use is a huge concern here. We're getting beat up > because people have to go find a separate package (and figure out > which one they want), install it, learn how to use it, etc. It doesn't > help that the most mature package is Slony which is, um, not very > novice-friendly or low-admin-complexity. I personally got religion > on this about two months ago when Red Hat switched their bugzilla > from Postgres to MySQL because the admins didn't want to deal with Slony > any more. People want simple. > > * The proposed approach is trying to get to "real" replication > incrementally. Getting rid of the loss window involved in file-by-file > log shipping is step one, Actually we can already do better than file-by-file by using pg_xlogfile_name_offset() which was added sometime in 2006. SkyTools for example does this to get no more than a few seconds failure window. Doing this synchronously would be of course better. probably we should use the same "modes/protocols" as DRBD when determining when a "sync" wal write is "done" quote from http://www.slackworks.com/~dkrovich/DRBD/usingdrbdsetup.html#AEN76 Table 1. DRBD Protocols Protocol Description A A write operation is complete as soon as the data is written to disk and sent to the network. B A write operation is complete as soon as a reception acknowledgement arrives. C A write operation is complete as soon as a write acknowledgement arrives. There are also additional paramaters you can pass to the disk and net options. See the drbdsetup man page for additional information /end quote > and I suspect that step two is going to be > fixing performance issues in WAL replay to ensure that slaves can keep > up. After that we'd start thinking about how to let slaves run > read-only queries. But even without read-only queries, this will be > a useful improvement for HA/backup scenarios. > > regards, tom lane >
On Thu, 2008-05-29 at 12:05 -0700, Robert Hodges wrote: > Hi everyone, > > First of all, I’m absolutely delighted that the PG community is > thinking seriously about replication. > > Second, having a solid, easy-to-use database availability solution > that works more or less out of the box would be an enormous benefit to > customers. Availability is the single biggest problem for customers > in my experience and as other people have commented the alternatives > are not nice. It’s an excellent idea to build off an existing feature > —PITR is already pretty useful and the proposed features are solid > next steps. The fact that it does not solve all problems is not a > drawback but means it’s likely to get done in a reasonable timeframe. > > Third, you can’t stop with just this feature. (This is the BUT part > of the post.) The use cases not covered by this feature area actually > pretty large. Here are a few that concern me: > > 1.) Partial replication. > 2.) WAN replication. 1.) & 2.) are better done asunc, the domain of Slony-I/Londiste > 3.) Bi-directional replication. (Yes, this is evil but there are > problems where it is indispensable.) Sure, it is also a lot harder and always has several dimensions (performanse/availability7locking) which play against each other > 4.) Upgrade support. Aside from database upgrade (how would this ever > really work between versions?), it would not support zero-downtime app > upgrades, which depend on bi-directional replication tricks. Or you could use zero-downtime app upgrades, which don't depend on this :P > 5.) Heterogeneous replication. > 6.) Finally, performance scaling using scale-out over large numbers of > replicas. I think it’s possible to get tunnel vision on this—it’s not > a big requirement in the PG community because people don’t use PG in > the first place when they want to do this. They use MySQL, which has > very good replication for performance scaling, though it’s rather weak > for availability. Again, doing scale-out over large number of replicas should either be async or for sync use some broadcast channel to all slaves (and still be a performance problem on master, as it has to wait for slowest slave). > As a consequence, I don’t see how you can get around doing some sort > of row-based replication like all the other databases. Is'nt WAL-base replication "some sort of row-based replication" ? > Now that people are starting to get religion on this issue I would > strongly advocate a parallel effort to put in a change-set extraction > API that would allow construction of comprehensive master/slave > replication. Triggers. see pgQ's logtrigga()/logutrigga(). See slides for Marko Kreen's presentation at pgCon08. > (Another approach would be to make it possible for third party apps > to read the logs and regenerate SQL.) which logs ? WAL or SQL command logs ? > There are existing models for how to do change set extraction; we have > done it several times at my company already. There are also research > projects like GORDA that have looked fairly comprehensively at this > problem. pgQ with its triggers does a pretty good job of change-set extraction. ------------------ Hannu
Hi Hannu, On 6/1/08 2:14 PM, "Hannu Krosing" <hannu@krosing.net> wrote: > >> As a consequence, I don't see how you can get around doing some sort >> of row-based replication like all the other databases. > > Is'nt WAL-base replication "some sort of row-based replication" ? > Yes, in theory. However, there's a big difference between replicating physical WAL records and doing logical replication with SQL statements. Logical replication requires extra information to reconstruct primary keys. (Somebody tell me if this is already in the WAL; I'm learning the code as fast as possible but assuming for now it's not.) > >> Now that people are starting to get religion on this issue I would >> strongly advocate a parallel effort to put in a change-set extraction >> API that would allow construction of comprehensive master/slave >> replication. > > Triggers. see pgQ's logtrigga()/logutrigga(). See slides for Marko > Kreen's presentation at pgCon08. > > Thanks very much for the pointer. The slides look interesting. Robert
adsmail@wars-nicht.de ("Andreas 'ads' Scherbaum") writes: > On Thu, 29 May 2008 23:02:56 -0400 Andrew Dunstan wrote: > >> Well, yes, but you do know about archive_timeout, right? No need to wait >> 2 hours. > > Then you ship 16 MB binary stuff every 30 second or every minute but > you only have some kbyte real data in the logfile. This must be taken > into account, especially if you ship the logfile over the internet > (means: no high-speed connection, maybe even pay-per-traffic) to the > slave. If you have that kind of scenario, then you have painted yourself into a corner, and there isn't anything that can be done to extract you from it. Consider: If you have so much update traffic that it is too much to replicate via WAL-copying, why should we expect that other mechanisms *wouldn't* also overflow the connection? If you haven't got enough network bandwidth to use this feature, then nobody is requiring that you use it. It seems like a perfectly reasonable prerequisite to say "this requires that you have enough bandwidth." -- (reverse (concatenate 'string "ofni.secnanifxunil" "@" "enworbbc")) http://www3.sympatico.ca/cbbrowne/ "There's nothing worse than having only one drunk head." -- Zaphod Beeblebrox
On Thu, 2008-05-29 at 23:37 +0200, Mathias Brossard wrote: > I pointed out that the NTT solution is synchronous because Tom said in > the first part of his email that: > > > In practice, simple asynchronous single-master-multiple-slave > > replication covers a respectable fraction of use cases, so we have > > concluded that we should allow such a feature to be included in the > > core project. > > ... and yet "the most appropriate base technology for this" is > synchronous and maybe I should have also pointed out in my previous mail > is that it doesn't support multiple slaves. I don't think that you need too many slaves in sync mode. Probably 1-st slave sync and others async from there on will be good enough. > Also, as other have pointed out there are different interpretations of > "synchronous" depending on wether the WAL data has reached the other end > of the network connection, a safe disk checkpoint or the slave DB itself. Probably all DRBD-s levels ( A) data sent to network, B) data received, C) data written to disk) should be supported + C1) data replayed in slave DB. C1 meaning that it can be done in parallel with C) Then each DBA can set it up depending on what he trusts - network, slave's power supply or slaves' disk. Also, the case of slave failure should be addressed. I don't think that the best solution is halting all ops on master if slave/network fails. Maybe we should allow also a setup with 2-3 slaves, where operations can continue when at least 1 slave is "syncing" ? -------------- Hannu
On Mon, 02 Jun 2008 11:52:05 -0400 Chris Browne wrote: > adsmail@wars-nicht.de ("Andreas 'ads' Scherbaum") writes: > > On Thu, 29 May 2008 23:02:56 -0400 Andrew Dunstan wrote: > > > >> Well, yes, but you do know about archive_timeout, right? No need to wait > >> 2 hours. > > > > Then you ship 16 MB binary stuff every 30 second or every minute but > > you only have some kbyte real data in the logfile. This must be taken > > into account, especially if you ship the logfile over the internet > > (means: no high-speed connection, maybe even pay-per-traffic) to the > > slave. > > If you have that kind of scenario, then you have painted yourself into > a corner, and there isn't anything that can be done to extract you > from it. You are misunderstanding something. It's perfectly possible that you have a low-traffic database with changes every now and then. But you have to copy a full 16 MB logfile every 30 seconds or every minute just to have the slave up-to-date. > Consider: If you have so much update traffic that it is too much to > replicate via WAL-copying, why should we expect that other mechanisms > *wouldn't* also overflow the connection? For some MB real data you copy several GB logfiles per day - that's a lot overhead, isn't it? > If you haven't got enough network bandwidth to use this feature, then > nobody is requiring that you use it. It seems like a perfectly > reasonable prerequisite to say "this requires that you have enough > bandwidth." If you have a high-traffic database, then of course you need an other connection as if you only have a low-traffic or a mostly read-only database. But that's not the point. Copying an almost unused 16 MB WAL logfile is just overhead - especially because the logfile is not compressable very much because of all the leftovers from earlier use. Kind regards -- Andreas 'ads' Scherbaum German PostgreSQL User Group
On Mon, 2008-06-02 at 22:40 +0200, Andreas 'ads' Scherbaum wrote: > On Mon, 02 Jun 2008 11:52:05 -0400 Chris Browne wrote: > > > adsmail@wars-nicht.de ("Andreas 'ads' Scherbaum") writes: > > > On Thu, 29 May 2008 23:02:56 -0400 Andrew Dunstan wrote: > > > > > >> Well, yes, but you do know about archive_timeout, right? No need to wait > > >> 2 hours. > > > > > > Then you ship 16 MB binary stuff every 30 second or every minute but > > > you only have some kbyte real data in the logfile. This must be taken > > > into account, especially if you ship the logfile over the internet > > > (means: no high-speed connection, maybe even pay-per-traffic) to the > > > slave. > > > > If you have that kind of scenario, then you have painted yourself into > > a corner, and there isn't anything that can be done to extract you > > from it. > > You are misunderstanding something. It's perfectly possible that you > have a low-traffic database with changes only every now and then, but you > still have to copy a full 16 MB logfile every 30 seconds or every minute > just to keep the slave up-to-date. To repeat my other post in this thread: we can already do better than file-by-file by using pg_xlogfile_name_offset(), which was added sometime in 2006. walmgr.py from the SkyTools package, for example, does this to get a failure window of no more than a few seconds, and it copies just the changed part of the WAL to the slave. pg_xlogfile_name_offset() was added for exactly this purpose - to let WAL shipping scripts query where the current write pointer is inside the logfile. It is not synchronous, but it can be made very close, within a subsecond window, if you poll it frequently enough. ------------------- Hannu
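To make that concrete, here is a minimal sketch of such a polling shipper. It is not the actual walmgr.py code: it assumes the psycopg2 driver, invents a ship_range() helper for the actual byte transfer, and uses the function names as they exist in 8.x (pg_current_xlog_location() / pg_xlogfile_name_offset()).

import time

import psycopg2

SEG_SIZE = 16 * 1024 * 1024   # default WAL segment size

def ship_range(filename, start, end):
    # Hypothetical helper: transfer bytes [start, end) of pg_xlog/<filename>
    # to the slave, e.g. by copying a partial file over the network.
    pass

def poll_and_ship(dsn, poll_seconds=1.0):
    conn = psycopg2.connect(dsn)
    conn.autocommit = True        # don't hold a transaction open between polls
    cur = conn.cursor()
    last_file, last_offset = None, 0
    while True:
        # Ask the master where its WAL write pointer currently is (8.x names).
        cur.execute("SELECT file_name, file_offset FROM"
                    " pg_xlogfile_name_offset(pg_current_xlog_location())")
        file_name, file_offset = cur.fetchone()
        if file_name != last_file:
            if last_file is not None:
                # The pointer moved into a new segment; finish shipping the old one.
                ship_range(last_file, last_offset, SEG_SIZE)
            last_file, last_offset = file_name, 0
        if file_offset > last_offset:
            ship_range(file_name, last_offset, file_offset)
            last_offset = file_offset
        time.sleep(poll_seconds)

Polling at sub-second intervals shrinks the failure window accordingly, at the cost of more round trips to the master.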
On Sun, Jun 01, 2008 at 01:43:22PM -0400, Tom Lane wrote: > power to him. (Is the replica-hooks-discuss list still working?) But Yes. And silent as ever. :-) A -- Andrew Sullivan ajs@commandprompt.com +1 503 667 4564 x104 http://www.commandprompt.com/
Hannu Krosing wrote: > The simplest form of synchronous wal shipping would not even need > postgresql running on slave, just a small daemon which > reports when wal > blocks are a) received and b) synced to disk. While that does sound simple, I'd presume that most people would want the guarantee of the same version of postgresql installed wherever the logs are ending up, with the log receiver speaking the same protocol version as the log sender. I imagine that would be most easily achieved through using something like the continuously restoring startup mode of current postgresql. However, variations on this kind of daemon can be used to perform testing, configuring it to work well, go slow, pause, not respond, disconnect, or fail in particular ways, emulating disk full, etc. Regards, Stephen Denne.
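To make the "small daemon" idea concrete, a toy sketch of such a receiver follows. The framing (a 4-byte length prefix, raw WAL bytes, single-byte 'R'/'S' acknowledgements for received/synced) is invented here purely for illustration and is not any proposed wire protocol.

import os
import socket
import struct

def recv_exact(conn, n):
    # Read exactly n bytes or raise if the sender goes away.
    buf = b""
    while len(buf) < n:
        chunk = conn.recv(n - len(buf))
        if not chunk:
            raise EOFError("sender disconnected")
        buf += chunk
    return buf

def run_receiver(listen_addr, wal_path):
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind(listen_addr)
    srv.listen(1)
    conn, _ = srv.accept()
    fd = os.open(wal_path, os.O_WRONLY | os.O_CREAT | os.O_APPEND)
    try:
        while True:
            (length,) = struct.unpack("!I", recv_exact(conn, 4))
            block = recv_exact(conn, length)
            conn.sendall(b"R")      # ack: WAL block received
            os.write(fd, block)
            os.fsync(fd)
            conn.sendall(b"S")      # ack: WAL block synced to disk
    finally:
        os.close(fd)
        conn.close()

if __name__ == "__main__":
    run_receiver(("0.0.0.0", 5433), "/var/lib/pgsql/standby_wal/current")

A fault-injecting variant for testing, as suggested above, would be this same loop with deliberate sleeps, withheld acknowledgements, or simulated disk-full errors wired in at the appropriate points.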
Hello Andrew, Andrew Sullivan wrote: > Yes. And silent as ever. :-) Are the slides of your PgCon talk available for download somewhere? BTW: up until recently, there was yet another mailing list: pgreplication-general@gborg.postgresql.org. It was less focused on hooks and got at least some traffic. :-) Are those mails still archived somewhere? Regards Markus
Stephen Denne wrote: > Hannu Krosing wrote: >> The simplest form of synchronous wal shipping would not even need >> postgresql running on slave, just a small daemon which >> reports when wal >> blocks are a) received and b) synced to disk. > > While that does sound simple, I'd presume that most people would want the guarantee of the same version of postgresql installed wherever the logs are ending up, with the log receiver speaking the same protocol version as the log sender. I imagine that would be most easily achieved through using something like the continuously restoring startup mode of current postgresql. Hmm, WAL version compatibility is an interesting question. Most minor releases hasn't changed the WAL format, and it would be nice to allow running different minor versions in the master and slave in those cases. But it's certainly not unheard of to change the WAL format. Perhaps we should introduce a WAL version number, similar to catalog version? -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
> > Hmm, WAL version compatibility is an interesting question. Most minor > releases hasn't changed the WAL format, and it would be nice to allow As I remember, a higher minor version should be able to read all WAL from lower ones, but that isn't true in the opposite direction, nor between different major versions. > running different minor versions in the master and slave in those cases. > But it's certainly not unheard of to change the WAL format. Perhaps we > should introduce a WAL version number, similar to catalog version? Agree. Right now this only affects warm-standby servers, but introducing simple log-shipping, and replication based on it, will cause a lot of unobvious errors/bugs. Is it possible to use catalog version number as WAL version? -- Teodor Sigaev E-mail: teodor@sigaev.ru WWW: http://www.sigaev.ru/
On Wed, 2008-06-04 at 11:13 +0300, Heikki Linnakangas wrote: > Hmm, WAL version compatibility is an interesting question. Most minor > releases hasn't changed the WAL format, and it would be nice to allow > running different minor versions in the master and slave in those cases. > But it's certainly not unheard of to change the WAL format. Perhaps we > should introduce a WAL version number, similar to catalog version? Would that also cover possible differences in page size, 32bit OS vs. 64bit OS, different timestamp flavour, etc. issues ? AFAIR, all these things can have an influence on how the data is written and possibly make the WAL incompatible with other postgres instances, even if the exact same version... Cheers, Csaba.
Csaba Nagy wrote: > On Wed, 2008-06-04 at 11:13 +0300, Heikki Linnakangas wrote: >> Hmm, WAL version compatibility is an interesting question. Most minor >> releases hasn't changed the WAL format, and it would be nice to allow >> running different minor versions in the master and slave in those cases. >> But it's certainly not unheard of to change the WAL format. Perhaps we >> should introduce a WAL version number, similar to catalog version? > > Would that also cover possible differences in page size, 32bit OS vs. > 64bit OS, different timestamp flavour, etc. issues ? AFAIR, all these > things can have an influence on how the data is written and possibly > make the WAL incompatible with other postgres instances, even if the > exact same version... These are already covered by the information in pg_control. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Teodor Sigaev wrote: > Is it possible to use catalog version number as WAL version? No, because we don't change the catalog version number in minor releases, even though we might change WAL format. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Wed, Jun 04, 2008 at 09:24:20AM +0200, Markus Schiltknecht wrote: > > Are the slides of your PgCon talk available for download somewhere? There weren't any slides, really (there were 4 that I put up in case the cases I was discussing needed back-references, but they didn't). Joshua tells me that I'm supposed to make the paper readable and put it up on Command Prompt's website, so I will soon. > BTW: up until recently, there was yet another mailing list: > pgreplication-general@gborg.postgresql.org. It was less focused on hooks > and got at least some traffic. :-) Are those mails still archived > somewhere? Unless whoever was operating that list moved it to pgfoundry, I doubt it (except on backups somewhere). A -- Andrew Sullivan ajs@commandprompt.com +1 503 667 4564 x104 http://www.commandprompt.com/
"Heikki Linnakangas" <heikki@enterprisedb.com> writes: > Hmm, WAL version compatibility is an interesting question. Most minor > releases hasn't changed the WAL format, and it would be nice to allow > running different minor versions in the master and slave in those cases. > But it's certainly not unheard of to change the WAL format. Perhaps we > should introduce a WAL version number, similar to catalog version? Yeah, perhaps. In the past we've changed the WAL page ID field for this; I'm not sure if that's enough or not. It does seem like a good idea to have a way to check that the slaves aren't trying to read a WAL version they don't understand. Also, it's possible that the WAL format doesn't change across a major update, but you still couldn't work with say an 8.4 master and an 8.3 slave, so maybe we need the catalog version ID in there too. regards, tom lane
On Wed, 2008-06-04 at 10:40 -0400, Tom Lane wrote: > "Heikki Linnakangas" <heikki@enterprisedb.com> writes: > > Hmm, WAL version compatibility is an interesting question. Most minor > > releases hasn't changed the WAL format, and it would be nice to allow > > running different minor versions in the master and slave in those cases. > > But it's certainly not unheard of to change the WAL format. Perhaps we > > should introduce a WAL version number, similar to catalog version? > > Yeah, perhaps. In the past we've changed the WAL page ID field for > this; I'm not sure if that's enough or not. It does seem like a good > idea to have a way to check that the slaves aren't trying to read a > WAL version they don't understand. Also, it's possible that the WAL > format doesn't change across a major update, but you still couldn't > work with say an 8.4 master and an 8.3 slave, so maybe we need the > catalog version ID in there too. And something dependent on datetime being integer. We probably won't need to encode the presence of user-defined types such as PostGIS, will we? ----- Hannu
Hannu Krosing <hannu@krosing.net> writes: > On Wed, 2008-06-04 at 10:40 -0400, Tom Lane wrote: >> "Heikki Linnakangas" <heikki@enterprisedb.com> writes: > Hmm, WAL version compatibility is an interesting question. Most minor > releases hasn't changed the WAL format, and it would be nice to allow > running different minor versions in the master and slave in those cases. > But it's certainly not unheard of to change the WAL format. Perhaps we > should introduce a WAL version number, similar to catalog version? >> >> Yeah, perhaps. In the past we've changed the WAL page ID field for >> this; I'm not sure if that's enough or not. It does seem like a good >> idea to have a way to check that the slaves aren't trying to read a >> WAL version they don't understand. Also, it's possible that the WAL >> format doesn't change across a major update, but you still couldn't >> work with say an 8.4 master and an 8.3 slave, so maybe we need the >> catalog version ID in there too. > And something dependent on datetime being integer. This thread is getting out of hand, actually. Heikki's earlier comment about pg_control reminded me that we already have a unique system identifier stored in pg_control and check that against WAL headers. So I think we already have enough certainty that the master and slaves have the same pg_control and hence are the same for everything checked by pg_control. However, since by definition pg_control doesn't change in a minor upgrade, there isn't any easy way to enforce a rule like "slaves must be same or newer minor version as the master". I'm not sure that we actually *want* to enforce such a rule, though. Most of the time, the other way around would work fine. regards, tom lane
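For reference, the "WAL page ID field" mentioned above is the 16-bit xlp_magic value at the start of every WAL page header, which is bumped whenever the on-disk WAL format changes. A shipping or restore script could at least sanity-check it before handing a segment to a slave. A rough sketch, assuming a little-endian host and default 8 kB WAL pages; the expected value has to come from the server build (the constant in the usage comment is only a placeholder, not the real XLOG_PAGE_MAGIC).

import struct

WAL_PAGE_SIZE = 8192          # XLOG_BLCKSZ in a default build

def check_segment_magic(path, expected_magic):
    """Verify that every non-empty page of a WAL segment carries the xlp_magic
    value this server version expects."""
    with open(path, "rb") as f:
        page_no = 0
        while True:
            page = f.read(WAL_PAGE_SIZE)
            if not page:
                break
            if page.strip(b"\x00"):   # skip zero-filled (unused or cleared) pages
                (magic,) = struct.unpack_from("<H", page, 0)
                if magic != expected_magic:
                    raise ValueError(
                        "page %d of %s has xlp_magic 0x%04X, expected 0x%04X"
                        % (page_no, path, magic, expected_magic))
            page_no += 1

# Placeholder value only -- take the real XLOG_PAGE_MAGIC from your build:
# check_segment_magic("/wal_archive/000000010000000000000003", 0xD062)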
On Wed, 2008-06-04 at 11:37 -0400, Tom Lane wrote: > This thread is getting out of hand, actually. Agreed. We should start new threads for specific things. Please. > However, since by definition pg_control doesn't change in a minor > upgrade, there isn't any easy way to enforce a rule like "slaves must be > same or newer minor version as the master". I'm not sure that we > actually *want* to enforce such a rule, though. Definitely don't want to prevent minor version mismatches. We want to be able to upgrade a standby, have it catch up with the master, then switch over to the new version. Otherwise we'd have to take the whole replicated system down to do minor upgrades/backouts. Ugh! -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
On Wed, 2008-06-04 at 14:17 +0300, Heikki Linnakangas wrote: > > Would that also cover possible differences in page size, 32bit OS vs. > > 64bit OS, different timestamp flavour, etc. issues ? AFAIR, all these > > things can have an influence on how the data is written and possibly > > make the WAL incompatible with other postgres instances, even if the > > exact same version... > > These are already covered by the information in pg_control. Another thing that can change between systems is the collation behavior, which can corrupt indexes (and other bad things). Regards, Jeff Davis
Jeff Davis <pgsql@j-davis.com> writes: > On Wed, 2008-06-04 at 14:17 +0300, Heikki Linnakangas wrote: >> These are already covered by the information in pg_control. > Another thing that can change between systems is the collation behavior, > which can corrupt indexes (and other bad things). That is covered by pg_control, at least to the extent of forcing the same value of LC_COLLATE. regards, tom lane
On Wed, 2008-06-04 at 14:23 -0400, Tom Lane wrote: > That is covered by pg_control, at least to the extent of forcing the > same value of LC_COLLATE. But the same LC_COLLATE means different things on different systems. Even "en_US" means something different on Mac versus Linux. Regards, Jeff Davis
Well, WAL format doesn't only depend on WAL itself, but also depend on each resource manager. If we introduce WAL format version identification, ISTM that we have to take care of the matching of resource manager in the master and the slave as well. 2008/6/4 Heikki Linnakangas <heikki@enterprisedb.com>: > Stephen Denne wrote: >> >> Hannu Krosing wrote: >>> >>> The simplest form of synchronous wal shipping would not even need >>> postgresql running on slave, just a small daemon which reports when wal >>> blocks are a) received and b) synced to disk. >> >> While that does sound simple, I'd presume that most people would want the >> guarantee of the same version of postgresql installed wherever the logs are >> ending up, with the log receiver speaking the same protocol version as the >> log sender. I imagine that would be most easily achieved through using >> something like the continuously restoring startup mode of current >> postgresql. > > Hmm, WAL version compatibility is an interesting question. Most minor > releases hasn't changed the WAL format, and it would be nice to allow > running different minor versions in the master and slave in those cases. But > it's certainly not unheard of to change the WAL format. Perhaps we should > introduce a WAL version number, similar to catalog version? > > -- > Heikki Linnakangas > EnterpriseDB http://www.enterprisedb.com -- ------ Koichi Suzuki
"Koichi Suzuki" <koichi.szk@gmail.com> writes: > Well, WAL format doesn't only depend on WAL itself, but also depend on > each resource manager. If we introduce WAL format version > identification, ISTM that we have to take care of the matching of > resource manager in the master and the slave as well. That seems a bit overdesigned. What are the prospects that two builds of the same Postgres version are going to have different sets of resource managers in them? regards, tom lane
If the versions of the master and the slave differ and we'd still like to allow log-shipping replication, we need a way to negotiate whether the WAL format of the two is compatible. I hope this is out of scope and I'm just worrying too much. 2008/6/5 Tom Lane <tgl@sss.pgh.pa.us>: > "Koichi Suzuki" <koichi.szk@gmail.com> writes: >> Well, WAL format doesn't only depend on WAL itself, but also depend on >> each resource manager. If we introduce WAL format version >> identification, ISTM that we have to take care of the matching of >> resource manager in the master and the slave as well. > > That seems a bit overdesigned. What are the prospects that two builds > of the same Postgres version are going to have different sets of > resource managers in them? > > regards, tom lane > -- ------ Koichi Suzuki
"Jeff Davis" <pgsql@j-davis.com> writes: > On Wed, 2008-06-04 at 14:17 +0300, Heikki Linnakangas wrote: >> > Would that also cover possible differences in page size, 32bit OS vs. >> > 64bit OS, different timestamp flavour, etc. issues ? AFAIR, all these >> > things can have an influence on how the data is written and possibly >> > make the WAL incompatible with other postgres instances, even if the >> > exact same version... >> >> These are already covered by the information in pg_control. > > Another thing that can change between systems is the collation behavior, > which can corrupt indexes (and other bad things). Well, yes and no. It's entirely possible, for example, for a minor release of an OS to tweak the collation rules for a collation without changing the name. For the sake of argument they might just be fixing a bug in the collation rules. From the point of view of the OS that's a minor bug fix that they might not foresee causing data corruption problems. Pegging pg_control to a particular release of the OS would be pretty terrible though. I don't really see an out for this. But it's another roadblock to consider akin to "not-really-immutable index expressions" for any proposal which depends on re-finding index pointers :( -- Gregory Stark EnterpriseDB http://www.enterprisedb.com Ask me about EnterpriseDB's Slony Replication support!
Gurjeet Singh wrote: > On Fri, May 30, 2008 at 10:40 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > > But since you mention it: one of the plausible answers for fixing the > > vacuum problem for read-only slaves is to have the slaves push an xmin > > back upstream to the master to prevent premature vacuuming. The current > > design of pg_standby is utterly incapable of handling that requirement. > > So there might be an implementation dependency there, depending on how > > we want to solve that problem. > > > > I think it would be best to not make the slave interfere with the master's > operations; that's only going to increase the operational complexity of such > a solution. > > There could be multiple slaves following a master, some serving For the slave to not interfere with the master at all, we would need to delay application of WAL files on each slave until visibility on that slave allows the WAL to be applied, but in that case we would have long-running transactions delay data visibility of all slave sessions. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
Andreas 'ads' Scherbaum wrote: > On Fri, 30 May 2008 16:22:41 -0400 (EDT) Greg Smith wrote: > > > On Fri, 30 May 2008, Andreas 'ads' Scherbaum wrote: > > > > > Then you ship 16 MB binary stuff every 30 second or every minute but > > > you only have some kbyte real data in the logfile. > > > > Not if you use pg_clearxlogtail ( > > http://www.2ndquadrant.com/replication.htm ), which got lost in the giant > > March commitfest queue but should probably wander into contrib as part of > > 8.4. > > Yes, this topic was discussed several times in the past but to > solve this it needs a patch/solution which is integrated into PG > itself, not contrib. Agreed. I realize why we are not zeroing those bytes (for performance), but can't we have the archiver zero those bytes before calling the 'archive_command'? -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
Bruce Momjian wrote: > Agreed. I realize why we are not zeroing those bytes (for performance), > but can't we have the archiver zero those bytes before calling the > 'archive_command'? Perhaps make the zeroing user-settable. -- Alvaro Herrera http://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc.
Bruce Momjian <bruce@momjian.us> writes: > Gurjeet Singh wrote: >> There could be multiple slaves following a master, some serving > For the slave to not interfere with the master at all, we would need to > delay application of WAL files on each slave until visibility on that > slave allows the WAL to be applied, but in that case we would have > long-running transactions delay data visibility of all slave sessions. Right, but you could segregate out long-running queries to one slave server that could be further behind than the others. regards, tom lane
Bruce Momjian <bruce@momjian.us> writes: > Agreed. I realize why we are not zeroing those bytes (for performance), > but can't we have the archiver zero those bytes before calling the > 'archive_command'? The archiver doesn't know any more about where the end-of-data is than the archive_command does. Moreover, the archiver doesn't know whether the archive_command cares. I think the separate module is a fine solution. It should also be pointed out that the whole thing becomes uninteresting if we get real-time log shipping implemented. So I see absolutely no point in spending time integrating pg_clearxlogtail now. regards, tom lane
On Mon, 9 Jun 2008, Tom Lane wrote: > It should also be pointed out that the whole thing becomes uninteresting > if we get real-time log shipping implemented. So I see absolutely no > point in spending time integrating pg_clearxlogtail now. There are remote replication scenarios over a WAN (mainly aimed at disaster recovery) that want to keep a fairly updated database without putting too much traffic over the link. People in that category really want zeroed tail+compressed archives, but probably not the extra overhead that comes with shipping smaller packets in a real-time implementation. -- * Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
Just for information: in terms of archive compression, I have an archive log compression tool, which can be found at http://pgfoundry.org/projects/pglesslog/ This feature is also included in NTT's synchronized log shipping replication presented at the last PGCon. 2008/6/10 Greg Smith <gsmith@gregsmith.com>: > On Mon, 9 Jun 2008, Tom Lane wrote: > >> It should also be pointed out that the whole thing becomes uninteresting >> if we get real-time log shipping implemented. So I see absolutely no >> point in spending time integrating pg_clearxlogtail now. > > There are remote replication scenarios over a WAN (mainly aimed at disaster > recovery) that want to keep a fairly updated database without putting too > much traffic over the link. People in that category really want zeroed > tail+compressed archives, but probably not the extra overhead that comes > with shipping smaller packets in a real-time implementation. > > -- > * Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD -- ------ Koichi Suzuki
"Greg Smith" <gsmith@gregsmith.com> writes: > On Mon, 9 Jun 2008, Tom Lane wrote: > >> It should also be pointed out that the whole thing becomes uninteresting >> if we get real-time log shipping implemented. So I see absolutely no >> point in spending time integrating pg_clearxlogtail now. > > There are remote replication scenarios over a WAN (mainly aimed at disaster > recovery) that want to keep a fairly updated database without putting too much > traffic over the link. People in that category really want zeroed > tail+compressed archives, but probably not the extra overhead that comes with > shipping smaller packets in a real-time implementation. Instead of zeroing bytes and depending on compression why not just pass an extra parameter to the archive command with the offset to the logical end of data. The archive_command could just copy from the start to that point and not bother transferring the rest. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com Ask me about EnterpriseDB's RemoteDBA services!
Gregory Stark wrote: > Instead of zeroing bytes and depending on compression why not just pass an > extra parameter to the archive command with the offset to the logical end of > data. Because the archiver process doesn't have that information. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
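A sketch of what such an archive wrapper might look like, purely to illustrate Greg's suggestion: it assumes a hypothetical extra argument carrying the end-of-data offset (which, as Heikki points out, the server cannot currently supply) alongside the usual %p/%f substitutions, and compresses only the meaningful prefix of the segment. The matching restore side would have to pad the file back out to a full 16 MB before the standby replays it.

#!/usr/bin/env python
# Hypothetical archive_command helper, e.g.
#   archive_command = 'wal_archive.py %p %f <end-of-data-offset>'
# The end-of-data offset is an invented parameter illustrating the proposal above.
import gzip
import shutil
import sys

def archive(src_path, dest_name, end_of_data, archive_dir="/mnt/wal_archive"):
    dest = "%s/%s.gz" % (archive_dir, dest_name)
    with open(src_path, "rb") as src, gzip.open(dest + ".tmp", "wb") as out:
        remaining = end_of_data
        while remaining > 0:
            chunk = src.read(min(65536, remaining))
            if not chunk:
                break
            out.write(chunk)
            remaining -= len(chunk)
    # Rename only once the whole file is written, so that a partially copied
    # segment is never mistaken for a complete archive.
    shutil.move(dest + ".tmp", dest)

if __name__ == "__main__":
    archive(sys.argv[1], sys.argv[2], int(sys.argv[3]))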
All, > > For the slave to not interfere with the master at all, we would need to > > delay application of WAL files on each slave until visibility on that > > slave allows the WAL to be applied, but in that case we would have > > long-running transactions delay data visibility of all slave sessions. > > Right, but you could segregate out long-running queries to one slave > server that could be further behind than the others. I still see having 2 different settings: Synchronous: XID visibility is pushed to the master. Maintains synchronous failover, and users are expected to run *1* master to *1* slave for most installations. Asynchronous: replication stops on the slave whenever minxid gets out of synch. Could have multiple slaves, but noticeable lag between master and slave. -- Josh Berkus PostgreSQL @ Sun San Francisco
Josh Berkus <josh@agliodbs.com> wrote: > I still see having 2 different settings: > > Synchronous: XID visibility is pushed to the master. Maintains synchronous > failover, and users are expected to run *1* master to *1* slave for most > installations. > > Asynchronous: replication stops on the slave whenever minxid gets out of > synch. Could have multiple slaves, but noticeable lag between master and > slave. I agree with you that we should have a sync/async option in log-shipping. Also, we could have another setting - synchronous shipping with asynchronous flushing. We wouldn't lose transactions unless both servers went down at once, and we could avoid the delays of flushing WAL files to the primary's disks. As for multiple slaves, we could have a cascading configuration, where the WAL receiver also delivers WAL records to other servers. I think it would be simplest if the postgres core had only one-to-one replication, with multiple slaves supported by 3rd-party WAL receivers. Regards, --- ITAGAKI Takahiro NTT Open Source Software Center
>>> On Mon, Jun 9, 2008 at 9:48 PM, in message <Pine.GSO.4.64.0806092243080.11286@westnet.com>, Greg Smith <gsmith@gregsmith.com> wrote: > On Mon, 9 Jun 2008, Tom Lane wrote: > >> It should also be pointed out that the whole thing becomes uninteresting >> if we get real-time log shipping implemented. So I see absolutely no >> point in spending time integrating pg_clearxlogtail now. > > There are remote replication scenarios over a WAN (mainly aimed at > disaster recovery) that want to keep a fairly updated database without > putting too much traffic over the link. People in that category really > want zeroed tail+compressed archives, but probably not the extra overhead > that comes with shipping smaller packets in a real-time implementation. We ship the WAL files over a (relatively) slow WAN for disaster recovery purposes, and we would be fine with replacing our current techniques with real-time log shipping as long as: (1) We can do it asynchronously. (i.e., we don't have to wait for WAN latency to commit transactions.) (2) It can ship to multiple targets. (Management dictates that we have backups at the site of origin as well as our central site. A failure to replicate to one must not delay the other.) (3) It doesn't consume substantially more WAN bandwidth overall. A solution which fails to cover any of these leaves pg_clearxlogtail interesting to us. -Kevin
Hi everyone, This is a quick update on a promise I made early in June to suggest requirements as well as ways to add replication hooks that would support logical replication, as opposed to the physical replication work currently underway based on NTT's code. Well, June was a pretty busy month, so it has taken a while to get back to this. However, we are now beginning to examine options for PostgreSQL logical replication. To make a long story short, we are willing to commit resources to this problem or fund other people to do it for us. If you are interested please contact me directly. Meanwhile, we are quite serious about this problem and intend to work on helpful additions to PostgreSQL in this area. I will post more as we make progress. Thanks, Robert -- Robert Hodges, CTO, Continuent, Inc. Email: robert.hodges@continuent.com P.s., Happy 12th birthday everyone!
On 7/10/08, Robert Hodges <robert.hodges@continuent.com> wrote: > This is a quick update on a promise I made early in June to suggest > requirements as well as ways to add replication hooks that would support > logical replication, as opposed to the physical replication work currently > underway based on NTT's code. > > Well, June was a pretty busy month, so it has taken a while to get back to > this. However, we are now beginning to examine options for PostgreSQL > logical replication. To make a long story short we are willing to commit > resources to this problem or fund other people to do it for us. If you are > interested please contact me directly. Meanwhile, we are quite serious > about this problem and intend to work on helpful additions to PostgreSQL in > this area. I will post more as we make progress. Well, I'm not exactly sure what you are planning. It's OK to do the draft design privately, but before actually starting to code, the design should be discussed on -hackers. And I'm not exactly sure what you mean by "logical replication"? A way to log DDL statements? Do you want to log DML also? FWIW, here's a very rough draft design for functionality that could be used to let current Slony-I/Londiste-like solutions replicate DDL as well. 1. A CREATE DDL TRIGGER statement that allows calling a function for all DDL statements. The only filtering that makes sense here is filtering by area: tables/functions/views/etc. It must be possible to do an AFTER trigger. Whether a BEFORE trigger for DDL makes sense or not, I'm not sure. 2. When the function is called, the following information is given: - Object type the event was for (table/view/function) - Array of object names. - SQL statement as text. The trigger function can filter further based on object names, deciding whether it wants to log the event or not. Trying to make the trigger run on only a subset of events is complex, and parsing the SQL into pieces so the trigger can understand it better is also complex, and neither is needed. Unless there are common situations where such a simple design fails to work, I would not make the scheme more complex. Also the design should be based on assumption that the target side is exactly in sync. Eg. DROP CASCADE should be replicated as DROP CASCADE. We should not make scheme more complex to survive cases where target is not in sync. That way madness lies. The effect should be like same SQL statements are applied to target by hand, no more, no less. -- marko
Hi Marko, No fear, we definitely will discuss on pgsql-hackers. I just wanted to make sure that people understood we are still committed to solving this problem and will one way or another commit resources to help. Just to be clear, by logical replication I mean replication based on sending SQL or near-SQL (e.g., generic DML events) between servers. Physical replication on the other hand uses internal formats to replicate changes without intervening conversion to SQL, for example by shipping WAL records. There are advantages to each for different applications. BTW, I heard this nomenclature from Simon Riggs. It seems quite helpful. The DDL trigger proposal is interesting and would be a very useful feature addition to PostgreSQL. To execute correctly it may also be necessary to know which database you were using at the time the SQL was issued. For our part we are looking for ways to replicate most or all data on a server as efficiently as possible. Generic call-outs at commit time or reading the log directly are attractive approaches. Depending on the implementation you can avoid double writes of replicated data on the master host. Also, it avoids the management headache of ensuring that triggers are correctly installed. It seems as if one of these generic approaches could hook into WAL record transport. Cheers, Robert -- Robert Hodges, CTO, Continuent, Inc. Email: robert.hodges@continuent.com Mobile: +1-510-501-3728 Skype: hodgesrm
markokr@gmail.com ("Marko Kreen") writes: > Also the design should be based on assumption that the target side > is exactly in sync. Eg. DROP CASCADE should be replicated as DROP CASCADE. > We should not make scheme more complex to survive cases where target > is not in sync. That way madness lies. The effect should be like > same SQL statements are applied to target by hand, no more, no less. We have, already, in 8.4, a handling of triggers for TRUNCATE; the reason why support hasn't made it into Slony-I yet relates quite exactly to this... The trouble comes in if you do TRUNCATE CASCADE; I'm not quite sure how to collect together the multiple recordings of the trigger functions that would be collected as a result; for it all to work, safely, on the remote node, we'd need to apply all of those truncates at once. Note also that there is an issue with coordination of schemas; Slony-I shuts off the RI triggers on subscribers, so that the target is fairly certain to not be *entirely* in sync, by express intent. Those are legitimate differences between source and target. -- select 'cbbrowne' || '@' || 'linuxfinances.info'; http://cbbrowne.com/info/lsf.html Rules of the Evil Overlord #145. "My dungeon cell decor will not feature exposed pipes. While they add to the gloomy atmosphere, they are good conductors of vibrations and a lot of prisoners know Morse code." <http://www.eviloverlord.com/>
Tom Lane wrote: > Bruce Momjian <bruce@momjian.us> writes: > > Agreed. I realize why we are not zeroing those bytes (for performance), > > but can't we have the archiver zero those bytes before calling the > > 'archive_command'? > > The archiver doesn't know any more about where the end-of-data is than > the archive_command does. Moreover, the archiver doesn't know whether > the archive_command cares. I think the separate module is a fine > solution. > > It should also be pointed out that the whole thing becomes uninteresting > if we get real-time log shipping implemented. So I see absolutely no > point in spending time integrating pg_clearxlogtail now. People doing PITR are still going to be saving these files, and for a long time, so I think this is still something we should try to address. Added to TODO: o Reduce PITR WAL file size by removing full page writes and by removing trailing bytes to improve compression -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
Bruce Momjian wrote: > Added to TODO: > > o Reduce PITR WAL file size by removing full page writes and > by removing trailing bytes to improve compression If we remove full page writes, how does hint bit setting get propagated to the slave? -- Alvaro Herrera http://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc.
Alvaro Herrera wrote: > Bruce Momjian wrote: > > > Added to TODO: > > > > o Reduce PITR WAL file size by removing full page writes and > > by removing trailing bytes to improve compression > > If we remove full page writes, how does hint bit setting get propagated > to the slave? We would remove full page writes that are needed for crash recovery, but perhaps keep other full pages. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
Bruce Momjian wrote: > Alvaro Herrera wrote: > > Bruce Momjian wrote: > > > > > Added to TODO: > > > > > > o Reduce PITR WAL file size by removing full page writes and > > > by removing trailing bytes to improve compression > > > > If we remove full page writes, how does hint bit setting get propagated > > to the slave? > > We would remove full page writes that are needed for crash recovery, but > perhaps keep other full pages. How do you tell which is which? -- Alvaro Herrera http://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Alvaro Herrera wrote: > Bruce Momjian wrote: > > Alvaro Herrera wrote: > > > Bruce Momjian wrote: > > > > > > > Added to TODO: > > > > > > > > o Reduce PITR WAL file size by removing full page writes and > > > > by removing trailing bytes to improve compression > > > > > > If we remove full page writes, how does hint bit setting get propagated > > > to the slave? > > > > We would remove full page writes that are needed for crash recovery, but > > perhaps keep other full pages. > > How do you tell which is which? The WAL format would have to be modified to indicate which entries can be discarded. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + If your life is a hard drive, Christ can be your backup. +