Thread: Constant WAL replay
We have toyed around with PostgreSQL's WAL structure and we are wondering whether it is (theoretically) possible to replay WAL records coming from a remote host while a system is working in read only mode. The idea: We are looking for a way to implement a synchronous single-master / multiple slaves systems. Meanwhile we are able to serialize / deserialize WAL records and send them to a group communication system which transports those records to the slave database. This is not hard to do. The problem is: How can we replay data on the slave while the slave is answering read-only SQL-statements? As far as I understand the code we can either recover the system or answer SQL statements. Which kind of modifications are necessary to replay and answer at the same time? (assuming that locks and transactional information are available on the slave). Input is very much appreciated. Best regards, Hans -- Cybertec Geschwinde u Schoenig Schoengrabern 134, A-2020 Hollabrunn, Austria Tel: +43/664/393 39 74 www.cybertec.at, www.postgresql.at
Hans-Jürgen Schönig <postgres@cybertec.at> writes: > We have toyed around with PostgreSQL's WAL structure and we are > wondering whether it is (theoretically) possible to replay WAL records > coming from a remote host while a system is working in read only mode. There's been some idle speculation about that, and offhand I don't know of any showstopper reasons why it couldn't be done ... but no one's actually looked into making it happen. Feel free ;-) > Which kind of modifications are necessary to > replay and answer at the same time? (assuming that locks and > transactional information are available on the slave). I think you should definitely assume that locks are *not* available, and probably not any live transaction info. The discussions before included the idea of being able to do queries against a known-broken, non-recoverable database without making it any worse than it was before. In the current code there is no such thing as a hard read-only behavior --- for example we will try to update commit-status hint bits no matter what. Allowing that to be turned off would be interesting for a number of purposes, such as burning a database onto CD. regards, tom lane
Hans-Jürgen Schönig wrote: > The idea: We are looking for a way to implement a synchronous > single-master / multiple slaves systems. > Meanwhile we are able to serialize / deserialize WAL records and send > them to a group communication system which transports those records to > the slave database. BTW, what is the benefit to using a GCS here? -Neil
> what. Allowing that to be turned off would be interesting for a number > of purposes, such as burning a database onto CD. FWIW, Oracle suggests a "transportable tablespace" for this feature. Which is a tablespace that is not written too and which can be read by any database. Would that solve the purposes you mean? Greetings, Klaus
Klaus Naumann <lists@distinctmind.de> writes: >> what. Allowing that to be turned off would be interesting for a number >> of purposes, such as burning a database onto CD. > FWIW, Oracle suggests a "transportable tablespace" for this feature. > Which is a tablespace that is not written too and which can be read by > any database. > Would that solve the purposes you mean? It's a very long way from here to there. In particular, since different installations have different transaction histories, the XIDs in the table could not be transportable. You'd almost be forced to build something like the non-MVCC, XID-less table type that was being speculated about up-thread. regards, tom lane
On Sun, Apr 24, 2005 at 11:41:17AM -0400, Tom Lane wrote: > Klaus Naumann <lists@distinctmind.de> writes: > >> what. Allowing that to be turned off would be interesting for a number > >> of purposes, such as burning a database onto CD. > > > FWIW, Oracle suggests a "transportable tablespace" for this feature. > > Which is a tablespace that is not written too and which can be read by > > any database. > > Would that solve the purposes you mean? > > It's a very long way from here to there. In particular, since different > installations have different transaction histories, the XIDs in the > table could not be transportable. Unless the tables are frozen first. -- Alvaro Herrera (<alvherre[@]dcc.uchile.cl>) "En las profundidades de nuestro inconsciente hay una obsesiva necesidad de un universo lógico y coherente. Pero el universo real se halla siempre un paso más allá de la lógica" (Irulan)
On Sun, Apr 24, 2005 at 08:10:34AM +0200, Hans-Jürgen Schönig wrote: > The idea: We are looking for a way to implement a synchronous > single-master / multiple slaves systems. > Meanwhile we are able to serialize / deserialize WAL records and send > them to a group communication system which transports those records to > the slave database. > This is not hard to do. In fact, I believe Command Prompt's Mammoth Replicator does exactly this. -- Alvaro Herrera (<alvherre[@]dcc.uchile.cl>) "La victoria es para quien se atreve a estar solo"
Alvaro Herrera wrote: > On Sun, Apr 24, 2005 at 08:10:34AM +0200, Hans-Jürgen Schönig wrote: > > >>The idea: We are looking for a way to implement a synchronous >>single-master / multiple slaves systems. >>Meanwhile we are able to serialize / deserialize WAL records and send >>them to a group communication system which transports those records to >>the slave database. >>This is not hard to do. > > > In fact, I believe Command Prompt's Mammoth Replicator does exactly > this. Very close. We don't use the WAL (yet, slated for probably 8.1) but we do use a transaction log shipping method. So the implementation is almost the same. Sincerely, Joshua D. Drake Command Prompt, Inc.
Joshua D. Drake wrote: > Alvaro Herrera wrote: > > On Sun, Apr 24, 2005 at 08:10:34AM +0200, Hans-J?rgen Sch?nig wrote: > > > > > >>The idea: We are looking for a way to implement a synchronous > >>single-master / multiple slaves systems. > >>Meanwhile we are able to serialize / deserialize WAL records and send > >>them to a group communication system which transports those records to > >>the slave database. > >>This is not hard to do. > > > > > > In fact, I believe Command Prompt's Mammoth Replicator does exactly > > this. > > Very close. We don't use the WAL (yet, slated for probably 8.1) but we > do use a transaction log shipping method. So the implementation is > almost the same. Can you run queries on the slave? If so, how do you handle xid collisions? -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001+ If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania19073
>> >>Very close. We don't use the WAL (yet, slated for probably 8.1) but we >>do use a transaction log shipping method. So the implementation is >>almost the same. > > > Can you run queries on the slave? If so, how do you handle xid collisions? You can run any query that does not modify data on a replicated table. You can run any non data modifying query on any of the tables. Sincerely, Joshua D. Drake >
Joshua D. Drake wrote: > >> > >>Very close. We don't use the WAL (yet, slated for probably 8.1) but we > >>do use a transaction log shipping method. So the implementation is > >>almost the same. > > > > > > Can you run queries on the slave? If so, how do you handle xid collisions? > > You can run any query that does not modify data on a replicated table. > You can run any non data modifying query on any of the tables. So, do you modify the slave to prevent it from grabbing an xid? -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001+ If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania19073
Neil Conway wrote: > Hans-Jürgen Schönig wrote: > >> The idea: We are looking for a way to implement a synchronous >> single-master / multiple slaves systems. >> Meanwhile we are able to serialize / deserialize WAL records and send >> them to a group communication system which transports those records to >> the slave database. > > > BTW, what is the benefit to using a GCS here? > > -Neil currently i have used it because it is simple. at the moment we are just doing experiments to see how log information can be treated. shipping the data is not the problem - the real problem is getting it back in. good ideas are welcome ;) best regards, hans -- Cybertec Geschwinde u Schoenig Schoengrabern 134, A-2020 Hollabrunn, Austria Tel: +43/664/393 39 74 www.cybertec.at, www.postgresql.at
Joshua D. Drake wrote: > Alvaro Herrera wrote: > >> On Sun, Apr 24, 2005 at 08:10:34AM +0200, Hans-Jürgen Schönig wrote: >> >> >>> The idea: We are looking for a way to implement a synchronous >>> single-master / multiple slaves systems. >>> Meanwhile we are able to serialize / deserialize WAL records and send >>> them to a group communication system which transports those records >>> to the slave database. >>> This is not hard to do. >> >> >> >> In fact, I believe Command Prompt's Mammoth Replicator does exactly >> this. > > > Very close. We don't use the WAL (yet, slated for probably 8.1) but we > do use a transaction log shipping method. So the implementation is > almost the same. > > Sincerely, > > Joshua D. Drake > Command Prompt, Inc. Joshua, This sounds interesting. If you don't use the WAL but a transaction log shipping - what does it mean in terms of PostgreSQL? Do you create your own transaction log? What really interests me here: Where is you code located in order to make sure that this things can work reliably? Best regards, Hans -- Cybertec Geschwinde u Schoenig Schoengrabern 134, A-2020 Hollabrunn, Austria Tel: +43/664/393 39 74 www.cybertec.at, www.postgresql.at
> > > Joshua, > > This sounds interesting. If you don't use the WAL but a transaction log > shipping - what does it mean in terms of PostgreSQL? Do you create your > own transaction log? Yes. > What really interests me here: Where is you code located in order to > make sure that this things can work reliably? We are integrated into the PostgreSQL backend. Sincerely, Joshua D. Drake > > Best regards, > > Hans > -- Your PostgreSQL solutions company - Command Prompt, Inc. 1.800.492.2240 PostgreSQL Replication, Consulting, Custom Programming, 24x7 support Managed Services, Shared and Dedication Hosting Co-Authors: plPHP, plPerlNG - http://www.commandprompt.com/
Joshua D. Drake wrote: > > > > > > Joshua, > > > > This sounds interesting. If you don't use the WAL but a transaction log > > shipping - what does it mean in terms of PostgreSQL? Do you create your > > own transaction log? > > Yes. > > > What really interests me here: Where is you code located in order to > > make sure that this things can work reliably? > > We are integrated into the PostgreSQL backend. So they have to get a new PostgreSQL release from you for every minor upgrade, I assume at no cost? -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001+ If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania19073
Joshua D. Drake wrote: >> Joshua, >> >> This sounds interesting. If you don't use the WAL but a transaction >> log shipping - what does it mean in terms of PostgreSQL? Do you >> create your own transaction log? > > > Yes. > >> What really interests me here: Where is you code located in order to >> make sure that this things can work reliably? > > > We are integrated into the PostgreSQL backend. .... which is a large place ;). best regards, hans
>> >>>What really interests me here: Where is you code located in order to >>>make sure that this things can work reliably? >> >>We are integrated into the PostgreSQL backend. > > > So they have to get a new PostgreSQL release from you for every minor > upgrade, I assume at no cost? Yep :) Sincerely, Joshua D. Drake Command Prompt, Inc. > -- Your PostgreSQL solutions company - Command Prompt, Inc. 1.800.492.2240 PostgreSQL Replication, Consulting, Custom Programming, 24x7 support Managed Services, Shared and Dedication Hosting Co-Authors: plPHP, plPerlNG - http://www.commandprompt.com/
On Wed, 2005-07-20 at 09:24 -0400, Tom Lane wrote: > Simon Riggs <simon@2ndquadrant.com> writes: > > On Tue, 2005-07-19 at 22:24 -0400, Tom Lane wrote: > >> Simon Riggs <simon@2ndquadrant.com> writes: > >>> Short patch enclosed to turn off writing of commit-status hint bits. > >> > >> Doesn't this entirely destroy the ability to truncate clog, and > >> therefore the ability to survive XID wraparound? > > > I hope not for all our sakes, since the hint bits are not WAL logged and > > anything that relies upon them would be fragile. > > We don't rely on any one write of them to work, but that doesn't mean > that we can indefinitely postpone writing them. OK, I think I understand where you're coming from now. A table can't be migrated to read-only media until all of its tuples have an xmin of FrozenTransactionId; I said the following, which was wrong: "This should allow migration of older child tables to hierarchical storage when using a large historical table design." The patch doesn't directly contribute to that goal, though is of value in a large historical table design with many read only child tables (and other situations). But that comment wasn't the only inspiration for the patch. When VACUUM freezes the xid, it *does* make sense at that point to update the hint bits as a performance optimization. That isn't required though, and *can* be indefinitely postponed, AFAICS. All of the tqual routines will still work just as well without the hint bits set. If you know different, I'll need a deeper explanation before I understand. IMHO, the direction the patch is going in is still worthwhile because of these issues: 1. Any block read may attempt to set hint bits, which dirties the block and must be written out. So *reads* can result in heavier write activity at checkpoint time. That effects both OLTP and DW systems: Random read transactions against a large table will be worst effected, since we may end up writing the block once for each read. 2. A lazy vacuum may also dirty a block, even when it has done nothing else useful to that block. Worse, if we vacuum a table that is bigger than shared_buffers (or close), then we will end up having to evict dirty buffers that the vacuum itself has written in order to continue the vacuum. Since Vacuum is two-pass, we may end up writing a block *twice*, once where we set the hint bits and then again later where we remove the tuples and re-write. So this patch will allow a normal VACUUM to perform better on larger tables. (1) is a pain, but there's no point solving it without also solving (2). The patch would fail an Assert test during a VACUUM, since the info bits are not actually set if cache_txn_status_with_data = false during VACUUM. As a result of (2), perhaps we should remove all of the SetBufferCommitInfoNeedsSave calls in HeapTupleSatisfiesVacuum, and add a call to SetBufferCommitInfoNeedsSave that overrides cache_txn_status_with_data when we actually freeze a row. (Or perhaps that should be a VACUUM FAST command?) That way we would set the hint bits *only* when we freeze a row and not at any other time. If we further reduced the number of times we dirty the block *at all* on the first pass of a VACUUM, we would reduce the chance of writing twice. We could save the setting of frozen transactions until the second phase, i.e. only dirty the block if (pgchanged && vacrelstats->num_dead_tuples > prev_dead_count) The patch sets cache_txn_status_with_data as a USERSET, with the intention that particular read-only users would not wish to have their read-only transactions turn into write transactions. There was no intention to prevent VACUUM, not to avoid the optimisation of writing hint bits on a VACUUM FREEZE nor to set full read only status - which we discussed previously but is a much longer project. Sorry for any confusion caused in my initial patch submission. Does my longer explanation make sense of what the patch is trying to achieve. Best Regards, Simon Riggs
Simon Riggs <simon@2ndquadrant.com> writes: > On Wed, 2005-07-20 at 09:24 -0400, Tom Lane wrote: >> We don't rely on any one write of them to work, but that doesn't mean >> that we can indefinitely postpone writing them. > OK, I think I understand where you're coming from now. Apparently not :-( > When VACUUM freezes the xid, it *does* make sense at that point to > update the hint bits as a performance optimization. The hint bits are not really relevant when xmin = FrozenTransactionId, since any examiner of the tuple would consider that XID committed anyway. Besides, the hint bit is guaranteed set in that scenario; note the Assert where vacuum is setting it: HeapTupleHeaderSetXmin(tuple.t_data, FrozenTransactionId); /* infomask should be okay already */ Assert(tuple.t_data->t_infomask & HEAP_XMIN_COMMITTED); The scenario in which the hint bit *must* be set is where it is for an XID for which we have deleted the relevant section of pg_clog, which we are willing to do well before freeze occurs, if we know that all the relevant XIDs have been hinted. See TruncateCLOG. Your patch breaks that logic by not considering hint-bit updates as changes that must be flushed to disk by checkpoint. > 1. Any block read may attempt to set hint bits, which dirties the block > and must be written out. So *reads* can result in heavier write activity > at checkpoint time. Sure, but the alternative is heavier activity in repeated checks of pg_clog to find out commit state that a previous examiner of the tuple already found out. The patch supposes that one write is worse than N reads, which is clearly a loss at some not-exceedingly-large value of N. If we thought that was a good tradeoff, we might as well not have the hint bits at all. regards, tom lane
On Wed, 2005-07-20 at 13:20 -0400, Tom Lane wrote: > > When VACUUM freezes the xid, it *does* make sense at that point to > > update the hint bits as a performance optimization. > > The hint bits are not really relevant when xmin = FrozenTransactionId, > since any examiner of the tuple would consider that XID committed anyway. > Besides, the hint bit is guaranteed set in that scenario; note the > Assert where vacuum is setting it: > > HeapTupleHeaderSetXmin(tuple.t_data, FrozenTransactionId); > /* infomask should be okay already */ > Assert(tuple.t_data->t_infomask & HEAP_XMIN_COMMITTED); > The scenario in which the hint bit *must* be set is where it is for > an XID for which we have deleted the relevant section of pg_clog, > which we are willing to do well before freeze occurs, if we know that > all the relevant XIDs have been hinted. See TruncateCLOG. Your patch > breaks that logic by not considering hint-bit updates as changes that > must be flushed to disk by checkpoint. OK, I see that now. Thank you for showing me that. I will change the patch so that this does not prevent VACUUM from setting hint bits. All of this does raise other questions, but thats probably sensible to raise those on other threads. > > 1. Any block read may attempt to set hint bits, which dirties the block > > and must be written out. So *reads* can result in heavier write activity > > at checkpoint time. > > Sure, but the alternative is heavier activity in repeated checks of > pg_clog to find out commit state that a previous examiner of the tuple > already found out. The patch supposes that one write is worse than N > reads, which is clearly a loss at some not-exceedingly-large value of N. > If we thought that was a good tradeoff, we might as well not have the > hint bits at all. That is not a decision we can make without knowledge of the application. In general, the hint bits are good. In *some* cases, not. I still seek control over that as a designer. Specifically, the scenario I want to optimize is this: - we load a table with lots of real time measurement data, as one child out of a large number of similar child tables - we then immediately create summary tables from the measurements - after this the detailed data is only sporadically accessed, if ever - detail data is dropped from the database after a few weeks - the majority of the database is detail data, so those tables are never vacuumed since no rows are ever deleted from those tables (the tables are dropped) nor is access sufficiently frequent to make it sensible to set hint bits - hence *no* complete database vacuum is run on a regular basis In this design, the second step causes the whole detailed data table to be written out to the database (again) immediately after loading. I would like to be able to prevent that. That design is *not* uncommon, since we might describe it as a "regulatory/monitoring data archiving" application. I have two current clients with exactly this design, plus have worked on similar apps in the past. The PostgreSQL licence cost makes it particularly suitable for that kind of application. Best Regards, Simon Riggs
Simon Riggs <simon@2ndquadrant.com> writes: > In general, the hint bits are good. In *some* cases, not. I still seek > control over that as a designer. > Specifically, the scenario I want to optimize is this: > - we load a table with lots of real time measurement data, as one child > out of a large number of similar child tables > - we then immediately create summary tables from the measurements > - after this the detailed data is only sporadically accessed, if ever > - detail data is dropped from the database after a few weeks > - the majority of the database is detail data, so those tables are never > vacuumed since no rows are ever deleted from those tables (the tables > are dropped) nor is access sufficiently frequent to make it sensible to > set hint bits - hence *no* complete database vacuum is run on a regular > basis > In this design, the second step causes the whole detailed data table to > be written out to the database (again) immediately after loading. I > would like to be able to prevent that. This application seems sufficiently off the beaten path to not be worth solving with a knob as klugy and potentially dangerous as suppress_hint_bits. A better way to avoid the write pass, if you must, is to do the summary table creation in the same transaction that loads the data. regards, tom lane
On Fri, 2005-07-22 at 09:40 -0400, Tom Lane wrote: > Simon Riggs <simon@2ndquadrant.com> writes: > > In general, the hint bits are good. In *some* cases, not. I still seek > > control over that as a designer. ... > ... not be worth > solving with a knob as klugy and potentially dangerous as > suppress_hint_bits. I note that heap_get_latest_tid() relies upon the setting of HEAP_XMIN_COMMITTED in the infomask in lieu of checking any return codes from the HeapTupleSatisfies check. Sequences also touch on that. I agree with you that it is potentially dangerous to mess with that, for now. I'll look at Read-Only Tables for 8.2 rather than this quick-fix. Best Regards, Simon Riggs