Thread: Exposing the Xact commit order to the user
In some systems (data warehousing, replication), the order of commits is important, since that is the order in which changes have become visible. This information could theoretically be extracted from the WAL, but scanning the entire WAL just to extract this tidbit of information would be excruciatingly painful.

The following is based on ideas that emerged during last week's PGCon. Consider it an implementation proposal, if you like.

We introduce a new set of files. The files represent segments of an infinite array of structures. The present segments are the available "window" of data. Similar to CLOG files, the individual file name will represent the high bits of a "serial" number, while the offset of the record inside the file represents the low bits of the "serial".

The system will have postgresql.conf options for enabling/disabling the whole shebang, how many shared buffers to allocate for managing access to the data and to define the retention period of the data based on data volume and/or age of the commit records.

Each record of the Transaction Commit Info consists of

    txid        xci_transaction_id
    timestamptz xci_begin_timestamp
    timestamptz xci_commit_timestamp
    int64       xci_total_rowcount

32 bytes total.

CommitTransaction() inside of xact.c will call a function that inserts a new record into this array. The operation will for most of the time be nothing more than taking a spinlock and adding the record to shared memory. All the data for the record is readily available, does not require further locking and can be collected locally before taking the spinlock.

The begin_timestamp is the transaction's idea of CURRENT_TIMESTAMP, the commit_timestamp is what CommitTransaction() just decided to write into the WAL commit record, and the total_rowcount is the sum of inserted, updated and deleted heap tuples during the transaction, which should be easily available from the statistics collector, unless row stats are disabled, in which case the datum would be zero.

The function will return the "sequence" number, which CommitTransaction() in turn will record in the WAL commit record together with the begin_timestamp. While both the begin and the commit timestamp are crucial to determine what data a particular transaction should have seen, the row count is not and will not be recorded in WAL.

Checkpoint handling will call a function to flush the shared buffers. Together with this, the information from WAL records will be sufficient to recover this data (except for row counts) during crash recovery.

Exposing the data will be done via a set returning function. The SRF takes two arguments: the maximum number of rows to return and the last serial number processed by the reader. The advantage of such an SRF is that the result can be used in a query that right away delivers audit or replication log information in transaction commit order. The SRF can return an empty set if no further transactions have committed since, or an error if data segments needed to answer the request have already been purged.

Purging of the data will be possible in several different ways. Autovacuum will call a function that drops segments of the data that are outside the postgresql.conf configuration with respect to maximum age or data volume. There will also be a function reserved for superusers to explicitly purge the data up to a certain serial number.

Comments, suggestions?

Jan
--
Anyone who trades liberty for security deserves neither liberty nor security. -- Benjamin Franklin
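To make the intended usage a bit more concrete, a reader such as a replication daemon might combine the SRF with its own log table in a single query, roughly like this (the function name pg_commit_order() and the log table app_log with its log_txid and log_seq columns are placeholders invented for this sketch; the proposal does not name them):

    -- hypothetical: pg_commit_order(max_rows, last_serial) returns
    -- (xci_serial, xci_transaction_id, xci_begin_timestamp,
    --  xci_commit_timestamp, xci_total_rowcount)
    SELECT c.xci_serial, c.xci_commit_timestamp, l.*
      FROM pg_commit_order(1000, 123456) AS c
      JOIN app_log l ON l.log_txid = c.xci_transaction_id
     ORDER BY c.xci_serial, l.log_seq;

The reader would remember the highest xci_serial it has processed and pass it back in as the second argument on its next call.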
> Exposing the data will be done via a set returning function. The SRF
> takes two arguments: the maximum number of rows to return and the last
> serial number processed by the reader. The advantage of such an SRF is
> that the result can be used in a query that right away delivers audit
> or replication log information in transaction commit order. The SRF can
> return an empty set if no further transactions have committed since, or
> an error if data segments needed to answer the request have already
> been purged.

In light of the proposed purging scheme, how would it be able to distinguish between those two cases (nothing there yet vs. was there but purged)?

--
Greg Sabino Mullane greg@turnstep.com
End Point Corporation http://www.endpoint.com/
PGP Key: 0x14964AC8 201005231646
http://biglumber.com/x/web?pk=2529DF6AB8F79407E94445B4BC9B906714964AC8
On Sun, May 23, 2010 at 4:21 PM, Jan Wieck <JanWieck@yahoo.com> wrote:
> The system will have postgresql.conf options for enabling/disabling the
> whole shebang, how many shared buffers to allocate for managing access
> to the data and to define the retention period of the data based on
> data volume and/or age of the commit records.

It would be nice if this could just be managed out of shared_buffers rather than needing to configure a separate pool just for this feature. But, I'm not sure how much work that is, and if it turns out to be too ugly then I'd say it's not a hard requirement. In general, I think we talked during the meeting about the desirability of folding specific pools into shared_buffers rather than managing them separately, but I'm not aware that we have any cases where we do that today, so it might be hard (or not).

> Each record of the Transaction Commit Info consists of
>
>     txid        xci_transaction_id
>     timestamptz xci_begin_timestamp
>     timestamptz xci_commit_timestamp
>     int64       xci_total_rowcount
>
> 32 bytes total.

Are we sure it's worth including the row count? I wonder if we ought to leave that out and let individual clients of the mechanism track that if they're so inclined, especially since it won't be reliable anyway.

> CommitTransaction() inside of xact.c will call a function that inserts
> a new record into this array. The operation will for most of the time
> be nothing more than taking a spinlock and adding the record to shared
> memory. All the data for the record is readily available, does not
> require further locking and can be collected locally before taking the
> spinlock.

What happens when you need to switch pages?

> The function will return the "sequence" number, which
> CommitTransaction() in turn will record in the WAL commit record
> together with the begin_timestamp. While both the begin and the commit
> timestamp are crucial to determine what data a particular transaction
> should have seen, the row count is not and will not be recorded in WAL.

It would certainly be better if we didn't have to bloat the commit xlog records to do this. Is there any way to avoid that?

> Checkpoint handling will call a function to flush the shared buffers.
> Together with this, the information from WAL records will be sufficient
> to recover this data (except for row counts) during crash recovery.

Right.

> Exposing the data will be done via a set returning function. The SRF
> takes two arguments: the maximum number of rows to return and the last
> serial number processed by the reader. The advantage of such an SRF is
> that the result can be used in a query that right away delivers audit
> or replication log information in transaction commit order. The SRF can
> return an empty set if no further transactions have committed since, or
> an error if data segments needed to answer the request have already
> been purged.
>
> Purging of the data will be possible in several different ways.
> Autovacuum will call a function that drops segments of the data that
> are outside the postgresql.conf configuration with respect to maximum
> age or data volume. There will also be a function reserved for
> superusers to explicitly purge the data up to a certain serial number.

Dunno if autovacuuming this is the right way to go. Seems like that could lead to replication breaks, and it's also more work than not doing that. I'd just say that if you turn this on you're responsible for pruning it, full stop.

> Anyone who trades liberty for security deserves neither
> liberty nor security. -- Benjamin Franklin

+1.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company
On 5/23/2010 4:48 PM, Greg Sabino Mullane wrote:
>> Exposing the data will be done via a set returning function. The SRF
>> takes two arguments: the maximum number of rows to return and the last
>> serial number processed by the reader. The advantage of such an SRF is
>> that the result can be used in a query that right away delivers audit
>> or replication log information in transaction commit order. The SRF
>> can return an empty set if no further transactions have committed
>> since, or an error if data segments needed to answer the request have
>> already been purged.
>
> In light of the proposed purging scheme, how would it be able to
> distinguish between those two cases (nothing there yet vs. was there
> but purged)?

There is a difference between an empty result set and an exception.

Jan
--
Anyone who trades liberty for security deserves neither liberty nor security. -- Benjamin Franklin
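To illustrate the distinction, the two outcomes might look like this from the reader's side (the function name and serial numbers are placeholders, as in the sketch above):

    -- nothing further has committed: an empty set, not an error
    SELECT * FROM pg_commit_order(1000, 500000);
    -- (0 rows)

    -- the needed segments were already purged: an exception
    SELECT * FROM pg_commit_order(1000, 42);
    -- ERROR:  commit order data up to serial 42 has been purged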
On 5/23/2010 8:38 PM, Robert Haas wrote:
> On Sun, May 23, 2010 at 4:21 PM, Jan Wieck <JanWieck@yahoo.com> wrote:
>> The system will have postgresql.conf options for enabling/disabling
>> the whole shebang, how many shared buffers to allocate for managing
>> access to the data and to define the retention period of the data
>> based on data volume and/or age of the commit records.
>
> It would be nice if this could just be managed out of shared_buffers
> rather than needing to configure a separate pool just for this
> feature. But, I'm not sure how much work that is, and if it turns out
> to be too ugly then I'd say it's not a hard requirement. In general,
> I think we talked during the meeting about the desirability of folding
> specific pools into shared_buffers rather than managing them
> separately, but I'm not aware that we have any cases where we do that
> today, so it might be hard (or not).

I'm not sure the retention policies of the shared buffer cache, the WAL buffers, CLOG buffers and every other thing we try to cache are that easy to fold into one single set of logic. But I'm all ears.

>> Each record of the Transaction Commit Info consists of
>>
>>     txid        xci_transaction_id
>>     timestamptz xci_begin_timestamp
>>     timestamptz xci_commit_timestamp
>>     int64       xci_total_rowcount
>>
>> 32 bytes total.
>
> Are we sure it's worth including the row count? I wonder if we ought
> to leave that out and let individual clients of the mechanism track
> that if they're so inclined, especially since it won't be reliable
> anyway.

Nope, we (my belly and I) are not sure about the absolute worth of the row count. It would be a convenient number to have there, but I can live without it.

>> CommitTransaction() inside of xact.c will call a function that inserts
>> a new record into this array. The operation will for most of the time
>> be nothing more than taking a spinlock and adding the record to shared
>> memory. All the data for the record is readily available, does not
>> require further locking and can be collected locally before taking the
>> spinlock.
>
> What happens when you need to switch pages?

Then the code will have to grab another free buffer or evict one.

>> The function will return the "sequence" number, which
>> CommitTransaction() in turn will record in the WAL commit record
>> together with the begin_timestamp. While both the begin and the commit
>> timestamp are crucial to determine what data a particular transaction
>> should have seen, the row count is not and will not be recorded in
>> WAL.
>
> It would certainly be better if we didn't have to bloat the commit xlog
> records to do this. Is there any way to avoid that?

If you can tell me how a crash-recovering system can figure out what the exact "sequence" number of the WAL commit record at hand should be, let's rip it.

>> Checkpoint handling will call a function to flush the shared buffers.
>> Together with this, the information from WAL records will be
>> sufficient to recover this data (except for row counts) during crash
>> recovery.
>
> Right.
>
>> Exposing the data will be done via a set returning function. The SRF
>> takes two arguments: the maximum number of rows to return and the last
>> serial number processed by the reader. The advantage of such an SRF is
>> that the result can be used in a query that right away delivers audit
>> or replication log information in transaction commit order. The SRF
>> can return an empty set if no further transactions have committed
>> since, or an error if data segments needed to answer the request have
>> already been purged.
>>
>> Purging of the data will be possible in several different ways.
>> Autovacuum will call a function that drops segments of the data that
>> are outside the postgresql.conf configuration with respect to maximum
>> age or data volume. There will also be a function reserved for
>> superusers to explicitly purge the data up to a certain serial number.
>
> Dunno if autovacuuming this is the right way to go. Seems like that
> could lead to replication breaks, and it's also more work than not
> doing that. I'd just say that if you turn this on you're responsible
> for pruning it, full stop.

It is an option. "Keep it until I tell you" is a perfectly valid configuration option. One you probably don't want to forget about, but valid nonetheless.

Jan
--
Anyone who trades liberty for security deserves neither liberty nor security. -- Benjamin Franklin
> In light of the proposed purging scheme, how would it be able to
> distinguish between those two cases (nothing there yet vs. was there
> but purged)?
>
> There is a difference between an empty result set and an exception.

No, I meant how will the *function* know, if a superuser and/or some background process can purge records at any time?

--
Greg Sabino Mullane greg@turnstep.com
End Point Corporation http://www.endpoint.com/
PGP Key: 0x14964AC8 201005240928
http://biglumber.com/x/web?pk=2529DF6AB8F79407E94445B4BC9B906714964AC8
On Sun, May 23, 2010 at 9:44 PM, Jan Wieck <JanWieck@yahoo.com> wrote:
> I'm not sure the retention policies of the shared buffer cache, the WAL
> buffers, CLOG buffers and every other thing we try to cache are that
> easy to fold into one single set of logic. But I'm all ears.

I'm not sure either, although it seems like LRU ought to be good enough for most things. I'm more worried about things like whether the BufferDesc abstraction is going to get in the way.

>>> CommitTransaction() inside of xact.c will call a function that
>>> inserts a new record into this array. The operation will for most of
>>> the time be nothing more than taking a spinlock and adding the record
>>> to shared memory. All the data for the record is readily available,
>>> does not require further locking and can be collected locally before
>>> taking the spinlock.
>>
>> What happens when you need to switch pages?
>
> Then the code will have to grab another free buffer or evict one.

Hopefully not while holding a spinlock. :-)

>>> The function will return the "sequence" number, which
>>> CommitTransaction() in turn will record in the WAL commit record
>>> together with the begin_timestamp. While both the begin and the
>>> commit timestamp are crucial to determine what data a particular
>>> transaction should have seen, the row count is not and will not be
>>> recorded in WAL.
>>
>> It would certainly be better if we didn't have to bloat the commit
>> xlog records to do this. Is there any way to avoid that?
>
> If you can tell me how a crash-recovering system can figure out what
> the exact "sequence" number of the WAL commit record at hand should be,
> let's rip it.

Hmm... could we get away with WAL-logging the next sequence number just once per checkpoint? When you replay the checkpoint record, you update the control file with the sequence number. Then all the commits up through the next checkpoint just use consecutive numbers starting at that value.

> It is an option. "Keep it until I tell you" is a perfectly valid
> configuration option. One you probably don't want to forget about, but
> valid nonetheless.

As Tom is fond of saying, if it breaks, you get to keep both pieces.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company
Jan Wieck wrote:
> In some systems (data warehousing, replication), the order of
> commits is important, since that is the order in which changes
> have become visible.

This issue intersects with the serializable work I've been doing. While in database transactions using S2PL the above is true, in snapshot isolation and the SSI implementation of serializable transactions, it's not. In particular, the snapshot anomalies which can cause non-serializable behavior happen precisely because the apparent order of execution doesn't match anything so linear as order of commit.

I'll raise that receipting example again. You have transactions which grab the current deposit date and insert it into receipts, as payments are received. At some point in the afternoon, the deposit date in a control table is changed to the next day, so that the receipts up to that point can be deposited during banking hours with the current date as their deposit date. A report is printed (and likely a transfer transaction recorded to move "cash in drawer" to "cash in checking", but I'll ignore that aspect for this example). Some receipts may not be committed when the update to the date in the control table is committed. This is "eventually consistent" -- once all the receipts with the old date commit or roll back, the database is OK, but until then you might be able to select the new date in the control table and the set of receipts matching the old date without the database telling you that you're missing data.

The new serializable implementation fixes this, but there are open R&D items (due to the need to discuss the issues) on the related Wiki page concerning hot standby and other replication. Will we be able to support transactional integrity on slave machines? What if the update to the control table and the insert of receipts all happen on the master, but someone decides to move the (now happily working correctly with serializable transactions) reporting to a slave machine?

(And by the way, don't get too hung up on this particular example; I could generate dozens more on demand -- the point is that order of commit doesn't always correspond to apparent order of execution; in this case the receipts *appear* to have executed first, because they are using a value "later" updated to something else by a different transaction, even though that other transaction *committed* first.)

Replicating or recreating the whole predicate locking and conflict detection on slaves is not feasible for performance reasons. (I won't elaborate unless someone feels that's not intuitively obvious.) The only sane way I can see to have a slave database allow serializable behavior is to WAL-log the acquisition of a snapshot by a serializable transaction, and the rollback or commit, on the master, and to have the serializable snapshot build on a slave exclude any serializable transactions for which there are still concurrent serializable transactions. Yes, that does mean WAL-logging the snapshot acquisition even if the transaction doesn't yet have an xid, and WAL-logging the commit or rollback even if it never acquires an xid.

I think this solves the issue Jan raises as long as serializable transactions are used; if they aren't, there are no guarantees of transactional integrity no matter how you track commit sequence, unless it can be based on S2PL-type blocking locks. I'll have to leave that to someone else to sort out.

-Kevin
On Mon, May 24, 2010 at 11:24 AM, Kevin Grittner <Kevin.Grittner@wicourts.gov> wrote: > Jan Wieck wrote: > >> In some systems (data warehousing, replication), the order of >> commits is important, since that is the order in which changes >> have become visible. > > This issue intersects with the serializable work I've been doing. > While in database transactions using S2PL the above is true, in > snapshot isolation and the SSI implementation of serializable > transactions, it's not. I think you're confusing two subtly different things. The way to prove that a set of transactions running under some implementation of serializability is actually serializable is to construct a serial order of execution consistent with the view of the database that each transaction saw. This may or may not match the commit order, as you say. But the commit order is still the order the effects of those transactions have become visible - if we inserted a new read-only transaction into the stream at some arbitrary point in time, it would see all the transactions which committed before it and none of those that committed afterward. So I think Jan's statement is correct. Having said that, I think your concerns about how things will look from a slave's point of view are possibly valid. A transaction running on a slave is essentially a read-only transaction that the master doesn't know about. It's not clear to me whether adding such a transaction to the timeline could result in either (a) that transaction being rolled back or (b) some impact on which other transactions got rolled back. If it did, that would obviously be a problem for serializability on slaves, though your proposed fix sounds like it would be prohibitively expensive for many users. But can this actually happen? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
Robert Haas wrote: > I think you're confusing two subtly different things. The only thing I'm confused about is what benefit anyone expects to get from looking at data between commits in some way other than our current snapshot mechanism. Can someone explain a use case where what Jan is proposing is better than snapshot isolation? It doesn't provide any additional integrity guarantees that I can see. > But the commit order is still the order the effects of those > transactions have become visible - if we inserted a new read-only > transaction into the stream at some arbitrary point in time, it > would see all the transactions which committed before it and none > of those that committed afterward. Isn't that what a snapshot does already? > your proposed fix sounds like it would be prohibitively expensive > for many users. But can this actually happen? How so? The transaction start/end logging, or looking at that data when building a snapshot? -Kevin
On 24/05/10 19:51, Kevin Grittner wrote:
> The only thing I'm confused about is what benefit anyone expects to
> get from looking at data between commits in some way other than our
> current snapshot mechanism. Can someone explain a use case where
> what Jan is proposing is better than snapshot isolation? It doesn't
> provide any additional integrity guarantees that I can see.

Right, it doesn't. What it provides is a way to reconstruct a snapshot at any point in time, after the fact. For example, after transactions A, C, D and B have committed in that order, it allows you to reconstruct a snapshot just like you would've gotten immediately after the commit of A, C, D and B respectively. That's useful for replication tools like Slony that need to commit the changes of those transactions in the slave in the same order as they were committed in the master.

I don't know enough of Slony et al. to understand why that'd be better than the current heartbeat mechanism they use, taking a snapshot every few seconds, batching commits.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
On Mon, May 24, 2010 at 12:51 PM, Kevin Grittner <Kevin.Grittner@wicourts.gov> wrote: > Robert Haas wrote: >> I think you're confusing two subtly different things. > > The only thing I'm confused about is what benefit anyone expects to > get from looking at data between commits in some way other than our > current snapshot mechanism. Can someone explain a use case where > what Jan is proposing is better than snapshot isolation? It doesn't > provide any additional integrity guarantees that I can see. It's a tool for replication solutions to use. >> But the commit order is still the order the effects of those >> transactions have become visible - if we inserted a new read-only >> transaction into the stream at some arbitrary point in time, it >> would see all the transactions which committed before it and none >> of those that committed afterward. > > Isn't that what a snapshot does already? Yes, for a particular transaction. But this is to allow transactions to be replayed (in order) on another node. >> your proposed fix sounds like it would be prohibitively expensive >> for many users. But can this actually happen? > > How so? The transaction start/end logging, or looking at that data > when building a snapshot? I guess what I'm asking is - if the reconstructed transaction order inferred by SSI doesn't match the actual commit order, can we get a serialization anomaly on the standby by replaying transactions there in commit order? Can you give an example and explain how your proposal would solve it? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
Excerpts from Robert Haas's message of Sun May 23 20:38:14 -0400 2010:
> On Sun, May 23, 2010 at 4:21 PM, Jan Wieck <JanWieck@yahoo.com> wrote:
> > The system will have postgresql.conf options for enabling/disabling
> > the whole shebang, how many shared buffers to allocate for managing
> > access to the data and to define the retention period of the data
> > based on data volume and/or age of the commit records.
>
> It would be nice if this could just be managed out of shared_buffers
> rather than needing to configure a separate pool just for this
> feature.

FWIW we've talked about this for years -- see old discussions about how pg_subtrans becomes a bottleneck in certain cases and you want to enlarge the number of buffers allocated to it (probably easy to find by searching posts from Jignesh). I'm guessing the new notify code would benefit from this as well.

It'd be nice to have as a side effect, but if not, IMHO this proposal could simply use a fixed buffer pool like all other slru.c callers until someone gets around to fixing that. Adding more GUC switches for this strikes me as overkill.

--
Álvaro Herrera <alvherre@alvh.no-ip.org>
On Mon, May 24, 2010 at 3:07 PM, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > It'd be nice to have as a side effect, but if not, IMHO this proposal > could simply use a fixed buffer pool like all other slru.c callers until > someone gets around to fixing that. Adding more GUC switches for this > strikes me as overkill. I agree. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
On Sun, May 23, 2010 at 04:21:58PM -0400, Jan Wieck wrote: > In some systems (data warehousing, replication), the order of commits is > important, since that is the order in which changes have become visible. > This information could theoretically be extracted from the WAL, but > scanning the entire WAL just to extract this tidbit of information would > be excruciatingly painful. This is very interesting to me as I've been doing some (research -- nowhere near production-level) work on building a transactional application-level (i.e. memcached-like) cache atop Postgres. One of the features I needed to support it was basically what you describe. Without getting too far into the details of what I'm doing, I needed to make it clear to a higher layer which commits were visible to a given query. That is, I wanted to know both the order of commits and where particular snapshots fit into this ordering. (A SnapshotData struct obviously contains the visibility information, but a representation in terms of the commit ordering is both more succinct and allows for easy ordering comparisons). Something you might want to consider, then, is adding an interface to find out the timestamp of the current transaction's snapshot, i.e. the timestamp of the most recent committed transaction visible to it. I wouldn't expect this to be difficult to implement as transaction completion/visibility is already synchronized via ProcArrayLock. > Each record of the Transaction Commit Info consists of > > txid xci_transaction_id > timestamptz xci_begin_timestamp > timestamptz xci_commit_timestamp > int64 xci_total_rowcount Another piece of information that seems useful to provide here would be the logical timestamp of the transaction, i.e. a counter that's incremented by one for each transaction. But maybe that's implicit in the log ordering? I'm not clear on why the total rowcount is useful, but perhaps I'm missing something obvious. I've actually implemented some semblance of this on Postgres 8.2, but it sounds like what you're interested in is more sophisticated. In particular, I wasn't at all concerned with durability or WAL stuff, and I had some specific requirements about when it was OK to purge the data. Because of this (and very limited development time), I just threw something together with a simple shared buffer. I don't think I have any useful code to offer, but let me know if there's some way I can help out. Dan -- Dan R. K. Ports MIT CSAIL http://drkp.net/
Robert Haas wrote:

> It's a tool for replication solutions to use.

I was thrown by the original post referencing "data warehousing". For replication I definitely see that it would be good to provide some facility to grab a coherent snapshot out of the transaction stream, but I'm still not clear on a use case where other solutions aren't better. If you want a *particular* past snapshot, something akin to the transactional caching that Dan Ports mentioned seems best. If you just want a coherent snapshot like snapshot isolation, the current mechanisms seem to work (unless I'm missing something?). If you want solid data integrity querying the most recent replicated data, the proposal I posted earlier in the thread is the best I can see, so far.

> if the reconstructed transaction order inferred by SSI doesn't
> match the actual commit order, can we get a serialization anomaly
> on the standby by replaying transactions there in commit order?

Yes. If we don't do *something* to address it, the replicas (slaves) will operate as read-only snapshot isolation, not true serializable.

> Can you give an example and explain how your proposal would solve
> it?

I gave an example (without rigorous proof accompanying it, granted) earlier in the thread. In that example, if you allow a selection against a snapshot which includes the earlier commit (the update of the control table) and before the later commits (the receipts which used the old deposit date), you have exactly the kind of serialization anomaly which the work in progress prevents on the source (master) database -- the receipts *appear* to run in earlier transactions because they see the pre-update deposit date, but they show up out of order. As far as I'm concerned this is only a problem if the user *requested* serializable behavior for all transactions involved.

If we send the information I suggested in the WAL stream, then any slave using the WAL stream could build a snapshot for a serializable transaction which excluded serializable transactions from the source which overlap with still-pending serializable transactions on the source. In this example, the update of the control table would not be visible to a serializable transaction on the slave until any overlapping serializable transactions (which would include any receipts using the old date) had also committed, so you could never see the writes out of order.

I don't think that passing detailed predicate locking information would be feasible from a performance perspective, but since the slaves are read-only, I think it is fine to pass just the minimal transaction-level information I described.

-Kevin
On Mon, May 24, 2010 at 4:03 PM, Kevin Grittner <Kevin.Grittner@wicourts.gov> wrote:
> Robert Haas wrote:
>
>> It's a tool for replication solutions to use.
>
> I was thrown by the original post referencing "data warehousing".
> For replication I definitely see that it would be good to provide
> some facility to grab a coherent snapshot out of the transaction
> stream, but I'm still not clear on a use case where other solutions
> aren't better. If you want a *particular* past snapshot, something
> akin to the transactional caching that Dan Ports mentioned seems
> best. If you just want a coherent snapshot like snapshot isolation,
> the current mechanisms seem to work (unless I'm missing something?).
> If you want solid data integrity querying the most recent replicated
> data, the proposal I posted earlier in the thread is the best I can
> see, so far.

Well, AIUI, what you're really trying to do is derive the delta between an old snapshot and a newer snapshot.

>> Can you give an example and explain how your proposal would solve
>> it?
>
> I gave an example (without rigorous proof accompanying it, granted)
> earlier in the thread. In that example, if you allow a selection
> against a snapshot which includes the earlier commit (the update of
> the control table) and before the later commits (the receipts which
> used the old deposit date), you have exactly the kind of
> serialization anomaly which the work in progress prevents on the
> source (master) database -- the receipts *appear* to run in earlier
> transactions because they see the pre-update deposit date, but they
> show up out of order.

Yep, I see it now.

> As far as I'm concerned this is only a
> problem if the user *requested* serializable behavior for all
> transactions involved.

Agreed.

> If we send the information I suggested in the WAL stream, then any
> slave using the WAL stream could build a snapshot for a serializable
> transaction which excluded serializable transactions from the source
> which overlap with still-pending serializable transactions on the
> source. In this example, the update of the control table would not
> be visible to a serializable transaction on the slave until any
> overlapping serializable transactions (which would include any
> receipts using the old date) had also committed, so you could never
> see the writes out of order.
>
> I don't think that passing detailed predicate locking information
> would be feasible from a performance perspective, but since the
> slaves are read-only, I think it is fine to pass just the minimal
> transaction-level information I described.

I suspect that's still going to be sort of hard on performance, but maybe we should get serializable working and committed on one node first and then worry about how to distribute it. I think there might be other approaches to this problem (global transaction coordinator? standby requests snapshot from primary?).

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company
On 5/24/2010 12:51 PM, Kevin Grittner wrote:
> Robert Haas wrote:
>
>> I think you're confusing two subtly different things.
>
> The only thing I'm confused about is what benefit anyone expects to
> get from looking at data between commits in some way other than our
> current snapshot mechanism. Can someone explain a use case where
> what Jan is proposing is better than snapshot isolation? It doesn't
> provide any additional integrity guarantees that I can see.
>
>> But the commit order is still the order the effects of those
>> transactions have become visible - if we inserted a new read-only
>> transaction into the stream at some arbitrary point in time, it
>> would see all the transactions which committed before it and none
>> of those that committed afterward.
>
> Isn't that what a snapshot does already?

It does, and the proposal is merely an alternative serving the same purpose. Have you ever looked at one of those queries that Londiste or Slony issue against the provider DB in order to get all the log data that has been committed between two snapshots? Is that really the best you can think of?

Jan
--
Anyone who trades liberty for security deserves neither liberty nor security. -- Benjamin Franklin
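For readers who have not, the kind of provider-side query Jan is alluding to selects the log rows that are visible to a newer snapshot but were not yet visible to an older one. A heavily simplified sketch (the replication_log table and its columns are invented for illustration; the real Slony/Londiste queries are considerably more involved):

    -- old and new txid_snapshot values would have been captured earlier,
    -- e.g. with txid_current_snapshot(); literals shown for illustration
    SELECT log_actionseq, log_cmdtype, log_cmddata
      FROM replication_log
     WHERE NOT txid_visible_in_snapshot(log_txid, '1000:1100:1050'::txid_snapshot)
       AND txid_visible_in_snapshot(log_txid, '1200:1300:1250'::txid_snapshot)
     ORDER BY log_actionseq;

With the proposed commit-order data, the same batch could instead be selected by joining against the commit records whose serial numbers are above the last one the reader has processed.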
On 5/24/2010 3:10 PM, Dan Ports wrote:
> I'm not clear on why the total rowcount is useful, but perhaps I'm
> missing something obvious.

It is a glimpse into the future. Several years of pain doing replication work have taught me that knowing approximately how much work the next chunk will be "before you select it all" is a really useful thing.

Jan
--
Anyone who trades liberty for security deserves neither liberty nor security. -- Benjamin Franklin
On Mon, May 24, 2010 at 10:24:07AM -0500, Kevin Grittner wrote:
> Jan Wieck wrote:
>
>> In some systems (data warehousing, replication), the order of
>> commits is important, since that is the order in which changes
>> have become visible.
>
> This issue intersects with the serializable work I've been doing.
> While in database transactions using S2PL the above is true, in
> snapshot isolation and the SSI implementation of serializable
> transactions, it's not. In particular, the snapshot anomalies which
> can cause non-serializable behavior happen precisely because the
> apparent order of execution doesn't match anything so linear as
> order of commit.

All true, but this doesn't pose a problem in snapshot isolation. Maybe this is obvious to everyone else, but just to be clear: a transaction's snapshot is determined entirely by which transactions committed before it snapshotted (and hence are visible to it). Thus, replaying update transactions in the same order on a slave makes the same sequence of states visible to it. Of course (as in your example) some of these states could expose snapshot isolation anomalies. But that's true on a single-replica system too.

Now, stepping into the SSI world...

> Replicating or recreating the whole predicate locking and conflict
> detection on slaves is not feasible for performance reasons. (I
> won't elaborate unless someone feels that's not intuitively
> obvious.) The only sane way I can see to have a slave database allow
> serializable behavior is to WAL-log the acquisition of a snapshot by
> a serializable transaction, and the rollback or commit, on the
> master, and to have the serializable snapshot build on a slave
> exclude any serializable transactions for which there are still
> concurrent serializable transactions. Yes, that does mean WAL-logging
> the snapshot acquisition even if the transaction doesn't yet
> have an xid, and WAL-logging the commit or rollback even if it never
> acquires an xid.

One important observation is that any anomaly that occurs on the slave can be resolved by aborting a local read-only transaction. This is a good thing, because the alternatives are too horrible to consider.

You could possibly cut the costs of predicate locking by having the master ship with each transaction the list of predicate locks it acquired. But you'd still have to track locks for read-only transactions, so maybe that's not a significant cost improvement. On the other hand, if you're willing to pay the price of serializability on the master, why not the slaves too?

Dan
--
Dan R. K. Ports MIT CSAIL http://drkp.net/
On May 25, 2010, at 0:42 , Dan Ports wrote:
> On Mon, May 24, 2010 at 10:24:07AM -0500, Kevin Grittner wrote:
>> Jan Wieck wrote:
>>
>>> In some systems (data warehousing, replication), the order of
>>> commits is important, since that is the order in which changes
>>> have become visible.
>>
>> This issue intersects with the serializable work I've been doing.
>> While in database transactions using S2PL the above is true, in
>> snapshot isolation and the SSI implementation of serializable
>> transactions, it's not. In particular, the snapshot anomalies which
>> can cause non-serializable behavior happen precisely because the
>> apparent order of execution doesn't match anything so linear as
>> order of commit.
>
> All true, but this doesn't pose a problem in snapshot isolation. Maybe
> this is obvious to everyone else, but just to be clear: a transaction's
> snapshot is determined entirely by which transactions committed before
> it snapshotted (and hence are visible to it). Thus, replaying update
> transactions in the same order on a slave makes the same sequence of
> states visible to it.

The subtle point here is whether you consider the view from the "outside" (in the sense of what a read-only transaction started at an arbitrary time can or cannot observe), or from the "inside" (what updating transactions can observe and might base their updates on).

The former case is completely determined by the commit ordering of the transactions, while the latter is not - otherwise serializability wouldn't be such a hard problem.

For some problems, like replication, the former ("outside") view is what matters - if the slave synthesizes transactions that insert/update/delete the very same tuples as the original transaction did, and commits them in the same order, no read-only transaction can observe the difference. But that is *not* a serial schedule of the original transactions, since the transactions are *not* the same - they merely touch the same tuples. In fact, if you try replaying the original SQL, you *will* get different results on the slave, and not only because of now() and the like.

best regards,
Florian Pflug
Florian Pflug <fgp@phlo.org> writes:
> The subtle point here is whether you consider the view from the
> "outside" (in the sense of what a read-only transaction started at an
> arbitrary time can or cannot observe), or from the "inside" (what
> updating transactions can observe and might base their updates on).

> The former case is completely determined by the commit ordering of the
> transactions, while the latter is not - otherwise serializability
> wouldn't be such a hard problem.

BTW, doesn't all this logic fall in a heap as soon as you consider read-committed transactions?

regards, tom lane
On May 25, 2010, at 3:21 , Tom Lane wrote:
> Florian Pflug <fgp@phlo.org> writes:
>> The subtle point here is whether you consider the view from the
>> "outside" (in the sense of what a read-only transaction started at an
>> arbitrary time can or cannot observe), or from the "inside" (what
>> updating transactions can observe and might base their updates on).
>
>> The former case is completely determined by the commit ordering of the
>> transactions, while the latter is not - otherwise serializability
>> wouldn't be such a hard problem.
>
> BTW, doesn't all this logic fall in a heap as soon as you consider
> read-committed transactions?

Why would it? There's still a well defined point in time at which the transaction's effects become visible, and every other transaction commits either before that time or after that time. An observer started between two transactions sees the first's changes but not the second's. One can replace observing read committed transactions by a series of smaller repeatable read transactions, since the observers are read-only anyway.

This of course says nothing about what state the updating transactions themselves see as the current state. For e.g. replication that is adequate, since you'd not replay the original commands but rather the effects they had in terms of physical tuple updates. On replay, the effects of a transaction therefore do not depend on the state the transaction sees.

best regards,
Florian Pflug
2010/5/25 Dan Ports <drkp@csail.mit.edu>: > On Mon, May 24, 2010 at 10:24:07AM -0500, Kevin Grittner wrote: > >> Replicating or recreating the whole predicate locking and conflict >> detection on slaves is not feasible for performance reasons. (I >> won't elaborate unless someone feels that's not intuitively >> obvious.) The only sane way I can see to have a slave database allow >> serializable behavior is to WAL-log the acquisition of a snapshot by >> a serializable transaction, and the rollback or commit, on the >> master, and to have the serializable snapshot build on a slave >> exclude any serializable transactions for which there are still >> concurrent serializable transactions. Yes, that does mean WAL- >> logging the snapshot acquisition even if the transaction doesn't yet >> have an xid, and WAL-logging the commit or rollback even if it never >> acquires an xid. > > One important observation is that any anomaly that occurs on the slave > can be resolved by aborting a local read-only transaction. This is a > good thing, because the alternatives are too horrible to consider. > > You could possibly cut the costs of predicate locking by having the > master ship with each transaction the list of predicate locks it > acquired. But you'd still have to track locks for read-only > transactions, so maybe that's not a significant cost improvement. On > the other hand, if you're willing to pay the price of serializability > on the master, why not the slaves too? I don't understand the problem. According to me, in the context of SSI, a read-only slave can just map SERIALIZABLE to the technical implementation of REPEATABLE READ (i.e., the currently-existing "SERIALIZABLE"). The union of the transactions on the master and the slave(s) will still exhibit SERIALIZABLE behavior because the transactions on the slave cannot write anything and are therefore irrelevant. Is anything wrong with that reasoning? Nicolas
On Sun, 2010-05-23 at 16:21 -0400, Jan Wieck wrote:
> In some systems (data warehousing, replication), the order of commits
> is important, since that is the order in which changes have become
> visible. This information could theoretically be extracted from the
> WAL, but scanning the entire WAL just to extract this tidbit of
> information would be excruciatingly painful.

I think it would be quite simple to read WAL. WALSender reads the WAL file after it's been flushed, so it would be simple for it to read a blob of WAL and then extract the commit order from it.

Overall though, it would be easier and more efficient to *add* info to WAL and then do all this processing *after* WAL has been transported elsewhere.

Extracting info with DDL triggers, normal triggers, commit order and everything else seems like too much work to me. Every other RDBMS has moved away from trigger-based replication and we should give that serious consideration also.

--
Simon Riggs
www.2ndQuadrant.com
On Tue, May 25, 2010 at 02:00:42PM +0200, Nicolas Barbier wrote:
> I don't understand the problem. According to me, in the context of
> SSI, a read-only slave can just map SERIALIZABLE to the technical
> implementation of REPEATABLE READ (i.e., the currently-existing
> "SERIALIZABLE"). The union of the transactions on the master and the
> slave(s) will still exhibit SERIALIZABLE behavior because the
> transactions on the slave cannot write anything and are therefore
> irrelevant.

This, unfortunately, isn't true in SSI.

Consider read-only transactions on a single node SSI database -- the situation is the same for read-only transactions that run on a slave. These transactions can be part of anomalies, so they need to be checked for conflicts and potentially aborted.

Consider Kevin's favorite example, where one table contains the current date and the other is a list of receipts (initially empty).

  T1 inserts (select current_date) into receipts, but doesn't commit
  T2 increments current_date and commits
  T3 reads both current_date and the receipt table
  T1 commits

T3, which is a read-only transaction, sees the incremented date and an empty list of receipts. But T1 later commits a new entry in the receipts table with the old date. No serializable ordering allows this. However, if T3 hadn't performed its read, there'd be no problem; we'd just serialize T1 before T2 and no one would be the wiser.

SSI would detect a potential conflict here, which we could resolve by aborting T3. (We could also abort T1, but if this is a replicated system this isn't always an option -- T3 might be running on the slave, so only the slave will know about the conflict, and it can't very well abort an update transaction on the master.)

There's another example of a read-only transaction anomaly that could cause similar problems at http://portal.acm.org/citation.cfm?doid=1031570.1031573, but I think this one is easier to follow.

Dan
--
Dan R. K. Ports MIT CSAIL http://drkp.net/
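For anyone who wants the schedule spelled out, here is a sketch in SQL (table names and values are invented; under the snapshot-isolation SERIALIZABLE of current releases this runs without error and exhibits exactly the anomaly described, which is what the SSI work is meant to detect):

    -- setup: a control table holding the deposit date, and empty receipts
    CREATE TABLE ctl (deposit_date date NOT NULL);
    INSERT INTO ctl VALUES ('2010-05-25');
    CREATE TABLE receipts (amount numeric, deposit_date date);

    -- session 1 (T1)
    BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE;
    INSERT INTO receipts SELECT 100.00, deposit_date FROM ctl;
    -- does not commit yet

    -- session 2 (T2)
    BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE;
    UPDATE ctl SET deposit_date = '2010-05-26';
    COMMIT;

    -- session 2 again (T3, read only)
    BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE;
    SELECT deposit_date FROM ctl;   -- sees the new date, 2010-05-26
    SELECT * FROM receipts;         -- sees no rows
    COMMIT;

    -- session 1 (T1)
    COMMIT;  -- a receipt dated 2010-05-25 now appears "in the past"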
On May 25, 2010, at 20:18 , Dan Ports wrote:
> On Tue, May 25, 2010 at 02:00:42PM +0200, Nicolas Barbier wrote:
>> I don't understand the problem. According to me, in the context of
>> SSI, a read-only slave can just map SERIALIZABLE to the technical
>> implementation of REPEATABLE READ (i.e., the currently-existing
>> "SERIALIZABLE"). The union of the transactions on the master and the
>> slave(s) will still exhibit SERIALIZABLE behavior because the
>> transactions on the slave cannot write anything and are therefore
>> irrelevant.
>
> This, unfortunately, isn't true in SSI.
>
> Consider read-only transactions on a single node SSI database -- the
> situation is the same for read-only transactions that run on a slave.
> These transactions can be part of anomalies, so they need to be checked
> for conflicts and potentially aborted.
>
> Consider Kevin's favorite example, where one table contains the current
> date and the other is a list of receipts (initially empty).
>
>   T1 inserts (select current_date) into receipts, but doesn't commit
>   T2 increments current_date and commits
>   T3 reads both current_date and the receipt table
>   T1 commits
>
> T3, which is a read-only transaction, sees the incremented date and an
> empty list of receipts. But T1 later commits a new entry in the
> receipts table with the old date. No serializable ordering allows this.
>
> However, if T3 hadn't performed its read, there'd be no problem; we'd
> just serialize T1 before T2 and no one would be the wiser.

Hm, so in fact SSI sometimes allows the database to be inconsistent, but only as long as nobody tries to observe it?

Btw, I still don't get how this follows from the Cahill paper. For a transaction to lie on a dangerous circle, it needs incoming and outgoing edges in the conflict graph, right? But I'd have thought that conflicts are always between a reader and a writer or between two writers. So how can a read-only transaction have incoming and outgoing edges?

best regards,
Florian Pflug
Florian Pflug <fgp@phlo.org> wrote: > Hm, so in fact SSI sometimes allows the database to be > inconsistent, but only as long as nobody tries to observe it? Not exactly. The eventually-persisted state is always consistent, but there can be a transitory committed state which would violate user-defined constraints or business rules *if viewed*. This is what I've been on about -- the commit sequence is not necessarily the same as the apparent order of execution. A read-only transaction, if run before the overlapping commits "settle", can view a state which is not consistent with any serial order of execution, and might therefore break the rules. SSI detects that and rolls one of the transactions back if they're all running at serializable transaction isolation in a single SSI database, but the question is how to handle this when the read happens in a replica. > Btw, I still don't get how this follows from the Cahill paper. For > a transaction to lie on a dangerous circle, it needs incoming and > outgoing edges in the conflict graph, right? At least one of the transactions participating in the cycle does. There's no requirement that they all do. -Kevin
On Tue, May 25, 2010 at 08:35:44PM +0200, Florian Pflug wrote:
> Hm, so in fact SSI sometimes allows the database to be inconsistent,
> but only as long as nobody tries to observe it?

Yes. Note that even while it's in an inconsistent state, you can still perform any query that doesn't observe the inconsistency -- hopefully most queries fall into this category.

> Btw, I still don't get how this follows from the Cahill paper. For a
> transaction to lie on a dangerous circle, it needs incoming and
> outgoing edges in the conflict graph, right? But I'd have thought that
> conflicts are always between a reader and a writer or between two
> writers. So how can a read-only transaction have incoming and outgoing
> edges?

Right, the read-only transaction can't have incoming edges, but it can have outgoing edges. So it can't be the "pivot" itself (the transaction with both outgoing and incoming edges), but it can cause *another* transaction to be.

In the example I gave, T3 (the r/o transaction) has an outgoing edge to T1, because it didn't see T1's concurrent update. T1 already had an outgoing edge to T2, so adding in this incoming edge from T3 creates the dangerous structure.

Dan
--
Dan R. K. Ports MIT CSAIL http://drkp.net/
Jan Wieck <JanWieck@Yahoo.com> wrote: > Have you ever looked at one of those queries, that Londiste or > Slony issue against the provider DB in order to get all the log > data that has been committed between two snapshots? Is that really > the best you can think of? No, I admit I haven't. In fact, I was thinking primarily in terms of log-driven situations, like HS. What would be the best place for me to look to come up to speed on your use case? (I'm relatively sure that the issue isn't that there's no information to find, but that a sequential pass over all available information would take a *long* time.) I've been working through the issues on WAL-based replicas, and have some additional ideas and alternatives, but I'd like to see the "big picture", including trigger-based replication, before posting. -Kevin
2010/5/25 Dan Ports <drkp@csail.mit.edu>: > On Tue, May 25, 2010 at 02:00:42PM +0200, Nicolas Barbier wrote: > >> I don't understand the problem. According to me, in the context of >> SSI, a read-only slave can just map SERIALIZABLE to the technical >> implementation of REPEATABLE READ (i.e., the currently-existing >> "SERIALIZABLE"). The union of the transactions on the master and the >> slave(s) will still exhibit SERIALIZABLE behavior because the >> transactions on the slave cannot write anything and are therefore >> irrelevant. > > This, unfortunately, isn't true in SSI. > > Consider read-only transactions on a single node SSI database -- the > situation is the same for read-only transactions that run on a slave. > These transactions can be part of anomalies, so they need to be checked > for conflicts and potentially aborted. > > Consider Kevin's favorite example, where one table contains the current > date and the other is a list of receipts (initially empty). > T1 inserts (select current_date) into receipts, but doesn't commit > T2 increments current_date and commits > T3 reads both current_date and the receipt table > T1 commits > > T3, which is a read-only transaction, sees the incremented date and an > empty list of receipts. But T1 later commits a new entry in the > receipts table with the old date. No serializable ordering allows this. > However, if T3 hadn't performed its read, there'd be no problem; we'd > just serialize T1 before T2 and no one would be the wiser. > > SSI would detect a potential conflict here, which we could resolve by > aborting T3. (We could also abort T1, but if this is a replicated > system this isn't always an option -- T3 might be running on the > slave, so only the slave will know about the conflict, and it can't > very well abort an update transaction on the master.) Ah, indeed. I made the same reasoning mistake as Florian (presumably) did: I didn't think of the fact that the read-only transaction doesn't need to be the pivot. Nicolas
On May 25, 2010, at 20:48 , Dan Ports wrote:
> On Tue, May 25, 2010 at 08:35:44PM +0200, Florian Pflug wrote:
>> Hm, so in fact SSI sometimes allows the database to be inconsistent,
>> but only as long as nobody tries to observe it?
>
> Yes. Note that even while it's in an inconsistent state, you can still
> perform any query that doesn't observe the inconsistency -- hopefully
> most queries fall into this category.

Yeah, as long as you just walk by without looking, the database is happy ;-)

>> Btw, I still don't get how this follows from the Cahill paper. For a
>> transaction to lie on a dangerous circle, it needs incoming and
>> outgoing edges in the conflict graph, right? But I'd have thought that
>> conflicts are always between a reader and a writer or between two
>> writers. So how can a read-only transaction have incoming and outgoing
>> edges?
>
> Right, the read-only transaction can't have incoming edges, but it can
> have outgoing edges. So it can't be the "pivot" itself (the transaction
> with both outgoing and incoming edges), but it can cause *another*
> transaction to be.
>
> In the example I gave, T3 (the r/o transaction) has an outgoing edge to
> T1, because it didn't see T1's concurrent update. T1 already had an
> outgoing edge to T2, so adding in this incoming edge from T3 creates
> the dangerous structure.

Hm, but for there to be an actual problem (and not a false positive), an actual dangerous circle has to exist in the dependency graph. The existence of a dangerous structure is just a necessary (but not sufficient) and easily checked-for condition for that, right? Now, if a read-only transaction only ever has outgoing edges, it cannot be part of a (dangerous or not) circle, and hence any dangerous structure it is part of is a false positive.

I guess my line of reasoning is flawed somehow, but I cannot figure out why...

best regards,
Florian Pflug
2010/5/25 Florian Pflug <fgp@phlo.org>: > On May 25, 2010, at 20:18 , Dan Ports wrote: > >> T3, which is a read-only transaction, sees the incremented date and an >> empty list of receipts. But T1 later commits a new entry in the >> receipts table with the old date. No serializable ordering allows this. >> >> However, if T3 hadn't performed its read, there'd be no problem; we'd >> just serialize T1 before T2 and no one would be the wiser. > > Hm, so in fact SSI sometimes allows the database to be inconsistent, but only as long as nobody tries to observe it? I would not call this an inconsistent state: it would become inconsistent only after someone (e.g., T3) has observed it _and_ T1 commits. Nicolas
2010/5/25 Florian Pflug <fgp@phlo.org>: > Hm, but for there to be an actual problem (and not a false positive), an > actual dangerous circle has to exist in the dependency graph. The > existence of a dangerous structure is just a necessary (but not > sufficient) and easily checked-for condition for that, right? Now, if a > read-only transaction only ever has outgoing edges, it cannot be part > of a (dangerous or not) circle, and hence any dangerous structure it is > part of is a false positive. > > I guess my line of reasoning is flawed somehow, but I cannot figure out why... In the general case, "wr" dependencies also create "must be serialized before" edges. It seems that those edges can be discarded when finding a pivot, but if you want to go "back to basics": ("<" means "must be serialized before".) * T1 < T2, because T1 reads a version of a data element for which T2 later creates a newer version (rw between T1 and T2). * T3 < T1, because T3 reads a version of a data element for which T1 later creates a newer version (rw between T3 and T1). * T2 < T3, because T2 creates a version of a data element, which is then read by T3 (wr between T2 and T3). (As you can see, those 3 edges form a cycle.) Nicolas
Florian Pflug <fgp@phlo.org> wrote: > Hm, but for there to be an actual problem (and not a false > positive), an actual dangerous circle has to exist in the > dependency graph. The existence of a dangerous structure is just a > necessary (but not sufficient) and easily checked-for condition > for that, right? Now, if a read-only transaction only ever has > outgoing edges, it cannot be part of a (dangerous or not) circle, > and hence any dangerous structure it is part of is a false > positive. > > I guess my line of reasoning is flawed somehow, but I cannot > figure out why... Here's why: We're tracking rw-dependencies, where the "time-arrow" showing effective order of execution points from the reader to the writer (since the reader sees a state prior to the write, it effectively executes before it). These are important because there have to be two such dependencies, one in to the pivot and one out from the pivot, for a problem to exist. (See various works by Dr. Alan Fekete, et al, for details.) But other dependencies can imply an order of execution. In particular, a wr-dependency, where a transaction *can* see data committed by another transaction, implies that the *writer* came first in the order of execution. In this example, the transaction which lists the receipts successfully reads the control table update, but is not able to read the receipt insert. This completes the cycle, making it a real anomaly and not a false positive. Note that the wr-dependency can actually exist outside the database, making it pretty much impossible to accurately tell a false positive from a true anomaly when the pivot exists and the transaction writing data which the pivot can't read commits first. For example, let's say that the update to the control table is committed from an application which, seeing that its update came back without error, proceeds to list the receipts for the old date in a subsequent transaction. You have a wr-dependency which is, in reality, quite real and solid with no way to notice it within the database engine. That's why the techniques used in SSI are pretty hard to improve upon beyond more detailed and accurate tracking of rw-conflicts. -Kevin
On 5/24/2010 9:30 AM, Greg Sabino Mullane wrote: >> In light of the proposed purging scheme, how would it be able to distinguish >> between those two cases (nothing there yet vs. was there but purged)? > >> There is a difference between an empty result set and an exception. > > No, I meant how will the *function* know, if a superuser and/or some > background process can purge records at any time? The data contains timestamps which are supposedly taken in commit order. Checking the age of the last entry in the file should be simple enough to determine if the segment matches the "max age" configuration (if set). In the case of a superuser deciding what to purge, he would just call a function with a serial number (identifying the segments that have become obsolete). Jan -- Anyone who trades liberty for security deserves neither liberty nor security. -- Benjamin Franklin
Jan Wieck <JanWieck@Yahoo.com> writes: >> No, I meant how will the *function* know, if a superuser and/or some >> background process can purge records at any time? > The data contains timestamps which are supposedly taken in commit order. You can *not* rely on the commit timestamps to be in exact order. (Perhaps approximate ordering is good enough for what you want here, but just be careful to not fall into the trap of assuming that they're exactly ordered.) regards, tom lane
Robert Haas <robertmhaas@gmail.com> wrote: > maybe we should get serializable working and committed on one > node first and then worry about how to distribute it. I think > there might be other approaches to this problem Well, I've got two or three other ideas on how we can manage this for HS, but since I now realize that I've totally misunderstood the main use case for this (which is to support trigger-based replication), I'd like to be clear on something before letting it drop. The big question is, do such replicas need to support serializable access to the data modified by serializable transactions in the source database? That is, is there a need for such replicas to only see states which are possible in some serial order of execution of serializable transactions on the source database? Or to phrase the same question a third way, should there be a way to run queries on such replicas with confidence that what is viewed is consistent with user-defined constraints and business rules? If not, there's no intersection between this feature and SSI. If there is, I think we should think through at least a general strategy sooner, rather than later. -Kevin
On 5/25/2010 12:03 PM, Simon Riggs wrote: > On Sun, 2010-05-23 at 16:21 -0400, Jan Wieck wrote: > >> In some systems (data warehousing, replication), the order of commits is >> important, since that is the order in which changes have become visible. >> This information could theoretically be extracted from the WAL, but >> scanning the entire WAL just to extract this tidbit of information would >> be excruciatingly painful. > > I think it would be quite simple to read WAL. WALSender reads the WAL > file after its been flushed, so it would be simple for it to read a blob > of WAL and then extract the commit order from it. > > Overall though, it would be easier and more efficient to *add* info to > WAL and then do all this processing *after* WAL has been transported > elsewhere. Extracting info with DDL triggers, normal triggers, commit > order and everything else seems like too much work to me. Every other > RDBMS has moved away from trigger-based replication and we should give > that serious consideration also. Reading the entire WAL just to find all COMMIT records, then go back to the origin database to get the actual replication log you're looking for is simpler and more efficient? I don't think so. Jan -- Anyone who trades liberty for security deserves neither liberty nor security. -- Benjamin Franklin
On Tue, 2010-05-25 at 16:41 -0400, Jan Wieck wrote: > On 5/25/2010 12:03 PM, Simon Riggs wrote: > > On Sun, 2010-05-23 at 16:21 -0400, Jan Wieck wrote: > > > >> In some systems (data warehousing, replication), the order of commits is > >> important, since that is the order in which changes have become visible. > >> This information could theoretically be extracted from the WAL, but > >> scanning the entire WAL just to extract this tidbit of information would > >> be excruciatingly painful. > > > > I think it would be quite simple to read WAL. WALSender reads the WAL > > file after its been flushed, so it would be simple for it to read a blob > > of WAL and then extract the commit order from it. > > > > Overall though, it would be easier and more efficient to *add* info to > > WAL and then do all this processing *after* WAL has been transported > > elsewhere. Extracting info with DDL triggers, normal triggers, commit > > order and everything else seems like too much work to me. Every other > > RDBMS has moved away from trigger-based replication and we should give > > that serious consideration also. > > Reading the entire WAL just to find all COMMIT records, then go back to > the origin database to get the actual replication log you're looking for > is simpler and more efficient? I don't think so. Agreed, but I think I've not explained myself well enough. I proposed two completely separate ideas; the first one was this: If you must get commit order, get it from WAL on *origin*, using exact same code that current WALSender provides, plus some logic to read through the WAL records and extract commit/aborts. That seems much simpler than the proposal you outlined and as SR shows, its low latency as well since commits write to WAL. No need to generate event ticks either, just use XLogRecPtrs as WALSender already does. I see no problem with integrating that into core, technically or philosophically. -- Simon Riggs www.2ndQuadrant.com
On 5/25/2010 4:50 PM, Simon Riggs wrote: > On Tue, 2010-05-25 at 16:41 -0400, Jan Wieck wrote: >> On 5/25/2010 12:03 PM, Simon Riggs wrote: >> > On Sun, 2010-05-23 at 16:21 -0400, Jan Wieck wrote: >> > >> >> In some systems (data warehousing, replication), the order of commits is >> >> important, since that is the order in which changes have become visible. >> >> This information could theoretically be extracted from the WAL, but >> >> scanning the entire WAL just to extract this tidbit of information would >> >> be excruciatingly painful. >> > >> > I think it would be quite simple to read WAL. WALSender reads the WAL >> > file after its been flushed, so it would be simple for it to read a blob >> > of WAL and then extract the commit order from it. >> > >> > Overall though, it would be easier and more efficient to *add* info to >> > WAL and then do all this processing *after* WAL has been transported >> > elsewhere. Extracting info with DDL triggers, normal triggers, commit >> > order and everything else seems like too much work to me. Every other >> > RDBMS has moved away from trigger-based replication and we should give >> > that serious consideration also. >> >> Reading the entire WAL just to find all COMMIT records, then go back to >> the origin database to get the actual replication log you're looking for >> is simpler and more efficient? I don't think so. > > Agreed, but I think I've not explained myself well enough. > > I proposed two completely separate ideas; the first one was this: > > If you must get commit order, get it from WAL on *origin*, using exact > same code that current WALSender provides, plus some logic to read > through the WAL records and extract commit/aborts. That seems much > simpler than the proposal you outlined and as SR shows, its low latency > as well since commits write to WAL. No need to generate event ticks > either, just use XLogRecPtrs as WALSender already does. > > I see no problem with integrating that into core, technically or > philosophically. > Which means that if I want to allow a consumer of that commit order data to go offline for three days or so to replicate the 5 requested, low volume tables, the origin needs to hang on to the entire WAL log from all 100 other high volume tables? Jan -- Anyone who trades liberty for security deserves neither liberty nor security. -- Benjamin Franklin
On 5/25/2010 4:16 PM, Tom Lane wrote: > Jan Wieck <JanWieck@Yahoo.com> writes: >>> No, I meant how will the *function* know, if a superuser and/or some >>> background process can purge records at any time? > >> The data contains timestamps which are supposedly taken in commit order. > > You can *not* rely on the commit timestamps to be in exact order. > (Perhaps approximate ordering is good enough for what you want here, > but just be careful to not fall into the trap of assuming that they're > exactly ordered.) I am well aware of the fact that commit timestamps within the WAL can go backwards, and that the serial numbers of this proposed implementation of commit order can even be different from what the timestamps AND the WAL are saying. As long as the serial number (record position inside the segment) is determined while the transaction still holds all its locks, this is going to be good enough for what async replication users today are used to. Again, it will not magically make it possible to determine, post mortem, a serializable order for actions that happened in transactions running at the read committed isolation level. I don't even think that is possible at all. And I don't think anyone proposed a solution for that problem anyway. Jan -- Anyone who trades liberty for security deserves neither liberty nor security. -- Benjamin Franklin
On 5/25/2010 3:18 PM, Kevin Grittner wrote: > Jan Wieck <JanWieck@Yahoo.com> wrote: > >> Have you ever looked at one of those queries, that Londiste or >> Slony issue against the provider DB in order to get all the log >> data that has been committed between two snapshots? Is that really >> the best you can think of? > > No, I admit I haven't. In fact, I was thinking primarily in terms > of log-driven situations, like HS. What would be the best place for > me to look to come up to speed on your use case? (I'm relatively > sure that the issue isn't that there's no information to find, but > that a sequential pass over all available information would take a > *long* time.) I've been working through the issues on WAL-based > replicas, and have some additional ideas and alternatives, but I'd > like to see the "big picture", including trigger-based replication, > before posting. In short, what both systems are doing is as follows. An AFTER ROW trigger records the OLD PK and all changed columns, as well as the txid and a global, non-cached serial number. Some background process periodically starts a serializable transaction and records the resulting snapshot. To replicate from one consistent state to the next, the replication system now selects all log rows between two snapshots. "Between" here means it simulates MVCC visibility, in the sense that the writing transaction was in progress when the first snapshot was taken and had committed by the second. The resulting WHERE clause looks something like WHERE (xid > s1.xmax OR (xid >= s1.xmin AND xid IN (s1.xip))) AND (xid < s2.xmin OR (xid <= s2.xmax AND xid NOT IN (s2.xip))) Note that xip here is a comma-separated list of txid's. I think it is easy to see that this is not a cheap query. Anyhow, that set of log rows is now ordered by the serial number and applied to the replica. Without this logic, the replication system could not combine multiple origin sessions into one replication session without risking never finding a state in which it can commit. It may be possible to work with two sessions on the replica and not require any reordering of the original actions at all. I need to think about that for a little longer since this idea just occurred to me a second ago. Jan -- Anyone who trades liberty for security deserves neither liberty nor security. -- Benjamin Franklin
On 5/26/2010 7:03 AM, Jan Wieck wrote: > To replicate from one consistent state to the next, the replication > system now selects all log rows between two snapshots. Between here > means it simulates MVCC visibility in the sense of that the writing > transaction was in progress when the first snapshot was taken and had > committed at the second. The resulting WHERE clause looks something like Or it entirely happened between the snapshots, obviously. Jan -- Anyone who trades liberty for security deserves neither liberty nor security. -- Benjamin Franklin
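For illustration, a minimal sketch of the "between two snapshots" selection Jan describes above, written with the txid_* functions available since PostgreSQL 8.3. All table and column names here (my_log, log_xid, log_actionseq, my_snapshots, snap) are made-up stand-ins for the Slony/Londiste structures, so this shows the semantics rather than the actual query either system runs:

    -- Select all log rows written by transactions that were not yet visible
    -- at the older saved snapshot (id 41) but are visible at the newer one
    -- (id 42), i.e. transactions that committed between the two snapshots or
    -- entirely within that window, then apply them in serial number order.
    SELECT l.*
      FROM my_log AS l,
           my_snapshots AS s1,      -- hypothetical table of saved txid_snapshot values
           my_snapshots AS s2
     WHERE s1.snap_id = 41
       AND s2.snap_id = 42
       AND NOT txid_visible_in_snapshot(l.log_xid, s1.snap)
       AND     txid_visible_in_snapshot(l.log_xid, s2.snap)
     ORDER BY l.log_actionseq;      -- today's global, non-cached serial number

This is the same visibility test as the xmin/xmax/xip clause above, only spelled with the built-in snapshot-visibility function instead of writing out the range and in-progress-list checks by hand.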
On Sun, May 23, 2010 at 9:21 PM, Jan Wieck <JanWieck@yahoo.com> wrote: > Each record of the Transaction Commit Info consists of > > txid xci_transaction_id > timestamptz xci_begin_timestamp > timestamptz xci_commit_timestamp > int64 xci_total_rowcount > So I think you're going about this backwards. Instead of discussing implementation I think you should start with the API the replication system needs. In particular I'm not sure you really want a server-side query at all. I'm wondering if you wouldn't be better off with a public machine-parsable text format version of the WAL. Ie, at the same time as writing out all the nitty gritty to the binary wal we would write out a summary of public data to an xml version containing just parts of the data stream that we can promise won't change, such as transaction id, lsn, timestamp. -- greg
On 5/26/2010 10:04 AM, Greg Stark wrote: > On Sun, May 23, 2010 at 9:21 PM, Jan Wieck <JanWieck@yahoo.com> wrote: >> Each record of the Transaction Commit Info consists of >> >> txid xci_transaction_id >> timestamptz xci_begin_timestamp >> timestamptz xci_commit_timestamp >> int64 xci_total_rowcount >> > > So I think you're going about this backward > > Instead of discussing implementation I think you should start with the > API the replication system needs. In particular I'm not sure you > really want a server-side query at all. I'm wondering if you wouldn't > be better off with a public machine-parsable text format version of > the WAL. Ie, at the same time as writing out all the nitty gritty to > the binary wal we would write out a summary of public data to an xml > version containing just parts of the data stream that we can promise > won't change, such as transaction id, lsn, timestamp. Since the actual row level change information and other event data is found inside of regular tables, identified by TXID and sequence number, I am pretty sure I want that data in a server-side query. What you are proposing is to read the xid's and timestamps with an external process, that now forcibly needs to reside on the DB server itself (neither Londiste nor Slony have that requirement as of today), then bring it back into the DB at least inside the WHERE clause of a query. Jan -- Anyone who trades liberty for security deserves neither liberty nor security. -- Benjamin Franklin
Jan Wieck wrote: > Since the actual row level change information and other event data is > found inside of regular tables, identified by TXID and sequence number, > I am pretty sure I want that data in a server-side query. What you are > proposing is to read the xid's and timestamps with an external process, > that now forcibly needs to reside on the DB server itself (neither > Londiste nor Slony have that requirement as of today), then bring it > back into the DB at least inside the WHERE clause of a query. It depends on how you approach the problem. If you had a process that could scan WAL files (or a platform/version-independent representation of these WAL files) you could run that process on any server (the origin server, a replica, or some third server with the software installed). Where you run it involves making trade-offs on the costs of storing, transferring, and processing the files, and would ideally be configurable. You could then have a process that transfers all of the data logged by the triggers to the replicas as soon as it is committed. Basically saying 'copy any rows in sl_log from the origin to the replica that we haven't already sent to that replica'. You could then move the work of figuring out the commit order onto the replica where you would combine the output of the WAL scanning process with the transaction data that has been copied to the replica. > > > Jan > -- Steve Singer Afilias Canada Data Services Developer 416-673-1142
On Wed, May 26, 2010 at 11:43 AM, Steve Singer <ssinger@ca.afilias.info> wrote: > Jan Wieck wrote: > >> Since the actual row level change information and other event data is >> found inside of regular tables, identified by TXID and sequence number, I am >> pretty sure I want that data in a server-side query. What you are proposing >> is to read the xid's and timestamps with an external process, that now >> forcibly needs to reside on the DB server itself (neither Londiste nor Slony >> have that requirement as of today), then bring it back into the DB at least >> inside the WHERE clause of a query. > > > It depends on how you approach the problem. > > If you had a process that could scan WAL files (or a platform/version > independent representation of these WAL files) you could run that process on > any server (the origin server, a replica, or some third server with the > software installed). Where you run it involves making trade-offs on the > costs of storing transferring and processing the files and would ideally be > configurable. > > You could then have a process that transfers all of the data logged by the > triggers to the replicas as soon as it is committed. Basically saying 'copy > any rows in sl_log from the origin to the replica that we haven't already > sent to that replica' > > You could then move the work of figuring out the commit order onto the > replica where you would combine the output of the WAL scanning process with > the transaction data that has been copied to the replica. I'm sure it's possible to make this work however you want to do it, but I don't really see what advantage Greg Stark's proposal has over Jan's original proposal. Recording the commits in one extra place at commit time is practically free, especially compared to the overall cost of replication. Rescanning the WAL seems likely to be much more expensive and potentially introduces more failure paths. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
On 5/26/2010 10:04 AM, Greg Stark wrote: > Instead of discussing implementation I think you should start with the > API the replication system needs. ... but to answer that request, actually I don't even think we should be discussing API specifics. During PGCon, Marko Kreen, Jim Nasby and I were discussing what the requirements of a unified message queue, shared by Londiste and Slony, may look like. For some use cases of pgq, there isn't even any interest in user table changes; it is simply a reliable, database-backed message passing system. Today both systems use an "agreeable" order of changes selected by rather expensive queries based on serializable snapshot information and a global, non-cacheable serial number. This could be replaced with logic based on the actual commit order of the transactions. This order does not need to be 100% accurate. As long as the order is recorded after all user actions have been performed (trigger queue shut down) and while the transaction is still holding onto its locks, that order is good enough. This will not allow a conflicting transaction, waiting on locks to be released, to appear to have committed before the lock-conflict winner. It is obvious that for use cases where only small portions, or even none, of the user table changes are needed, holding on to or even parsing the ENTIRE WAL sounds suboptimal. Jan -- Anyone who trades liberty for security deserves neither liberty nor security. -- Benjamin Franklin
On Wed, May 26, 2010 at 5:10 PM, Jan Wieck <JanWieck@yahoo.com> wrote: > ... but to answer that request, actually I don't even think we should be > discussing API specifics. > How about just API generalities? Like, where do you need this data, on the master or on the slave? Would PGXC like it on the transaction coordinator? What question do you need to answer, do you need to pull out sets of commits in certain ranges or look up specific transaction ids and find out when they committed? Or do you only need to answer which of two transaction ids committed first? -- greg
Could you generate the commit-order log by simply registering a commit hook (RegisterXactCallback(XACT_EVENT_COMMIT)) that writes such a log somewhere in the data directory? That would work with older versions too, no server changes required. It would not get called during recovery, but I believe that would be sufficient for Slony. You could always batch commits that you don't know when they committed as if they committed simultaneously. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On 5/26/2010 1:17 PM, Heikki Linnakangas wrote: > Could you generate the commit-order log by simply registering a commit > hook (RegisterXactCallback(XACT_EVENT_COMMIT)) that writes such a log > somewhere in the data directory? That would work with older versions > too, no server changes required. > That would work, as it seems that the backend keeps holding on to its locks until after calling the callbacks. > It would not get called during recovery, but I believe that would be > sufficient for Slony. You could always batch commits that you don't know > when they committed as if they committed simultaneously. Here you are mistaken. If the origin crashes but recovers transactions whose entries were never flushed to that commit-order log, then the consumer has no idea about the order of those commits, which throws us back to the point where we require a non-cacheable global sequence to replay the individual actions of those "now batched" transactions in an agreeable order. The commit order data needs to be covered by crash recovery. Jan -- Anyone who trades liberty for security deserves neither liberty nor security. -- Benjamin Franklin
On 26/05/10 21:43, Jan Wieck wrote: > On 5/26/2010 1:17 PM, Heikki Linnakangas wrote: >> It would not get called during recovery, but I believe that would be >> sufficient for Slony. You could always batch commits that you don't >> know when they committed as if they committed simultaneously. > > Here you are mistaken. If the origin crashes but can recover not yet > flushed to xlog-commit-order transactions, then the consumer has no idea > about the order of those commits, which throws us back to the point > where we require a non cacheable global sequence to replay the > individual actions of those "now batched" transactions in an agreeable > order. > > The commit order data needs to be covered by crash recovery. Perhaps I'm missing something, but I thought that Slony currently uses a heartbeat, and all transactions committed between two beats are banged together and committed as one in the slave so that their relative commit order doesn't matter. Can we not do the same for commits missing from the commit-order log? I'm thinking that the commit-order log would contain two kinds of records: a) Transaction with XID X committed b) All transactions with XID < X committed During normal operation we write the 1st kind of record at every commit. After crash recovery (perhaps at the first commit after recovery or when the slon daemon first polls the server, as there's no hook for end-of-recovery), we write the 2nd kind of record. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes: > Perhaps I'm missing something, but I thought that Slony currently uses a > heartbeat, and all transactions committed between two beats are banged > together and committed as one in the slave so that their relative commit > order doesn't matter. I guess Slony does the same as pgq here: all events of all those transactions between two given ticks are batched together in the order of the event commits. (In fact the batches are made at the consumer's request, so possibly spanning more than 2 ticks at a time.) If you skip that event ordering (within transactions), you can't maintain foreign keys on the slaves, among other things. The idea of this proposal is to be able to get this commit order directly from where the information is maintained, rather than use some sort of user sequence for that. So even ordering the txid and txid_snapshots with respect to WAL commit time (LSN) won't be the whole story: for any given transaction containing more than one event, we also need to have them in order. I know Jan didn't forget about it, so it must either be in the proposal or easily derived; too tired to recheck. Regards, -- dim
On Wed, May 26, 2010 at 4:11 PM, Dimitri Fontaine <dfontaine@hi-media.com> wrote: > Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes: >> Perhaps I'm missing something, but I thought that Slony currently uses a >> heartbeat, and all transactions committed between two beats are banged >> together and committed as one in the slave so that their relative commit >> order doesn't matter. > > I guess Slony does the same as pgq here: all events of all those > transactions between two given ticks are batched together in the order > of the event commits. (In fact the batches are made at the consumer > request, so possibly spreading more than 2 ticks at a time). > > If you skip that event ordering (within transactions), you can't > maintain foreign keys on the slaves, among other things. > > The idea of this proposal is to be able to get this commit order > directly from where the information is maintained, rather than use some > sort of user sequence for that. Exactly. > So even ordering the txid and txid_snapshots with respect to WAL commit > time (LSN) won't be the whole story, for any given transaction > containing more than one event we also need to have them in order. I > know Jan didn't forget about it so it must either be in the proposal or > easily derived, too tired to recheck. Right, so the point is - with this proposal, he can switch to using a LOCAL sequence number to order events within the session and then order the sessions using the commit ordering. Right now, he has to use a GLOBAL sequence number because there's no way to know the commit order. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
On 5/26/2010 3:16 PM, Heikki Linnakangas wrote: > On 26/05/10 21:43, Jan Wieck wrote: >> On 5/26/2010 1:17 PM, Heikki Linnakangas wrote: >>> It would not get called during recovery, but I believe that would be >>> sufficient for Slony. You could always batch commits that you don't >>> know when they committed as if they committed simultaneously. >> >> Here you are mistaken. If the origin crashes but can recover not yet >> flushed to xlog-commit-order transactions, then the consumer has no idea >> about the order of those commits, which throws us back to the point >> where we require a non cacheable global sequence to replay the >> individual actions of those "now batched" transactions in an agreeable >> order. >> >> The commit order data needs to be covered by crash recovery. > > Perhaps I'm missing something, Apparently, more about that at the end. > I'm thinking that the commit-order log would contain two kinds of records: > > a) Transaction with XID X committed > b) All transactions with XID < X committed If that were true, then long-running transactions would delay all commits for transactions that started after them. Do they? > > During normal operation we write the 1st kind of record at every commit. > After crash recovery (perhaps at the first commit after recovery or when > the slon daemon first polls the server, as there's no hook for > end-of-recovery), we write the 2nd kind of record. I think the callback is also called during backend startup, which means that it could record the first XID to come (which is known from the control file); in that case, all XIDs below it are either committed or aborted. Which leads us to your missing piece above, the need for the global, non-cacheable sequence. Consider two transactions A and B that due to transaction batching between snapshots get applied together. Let the order of actions be 1. A starts 2. B starts 3. B selects a row for update, then updates the row 4. A tries to do the same and blocks 5. B commits 6. A gets the lock, the row, does the update 7. A commits If Slony (or Londiste) did not record the exact order of those individual row actions, then it would not have any idea whether, within that batch, the action of B (the higher XID) actually came first. Without that knowledge there is a 50/50 chance of getting your replica out of sync with that simple conflict. Jan -- Anyone who trades liberty for security deserves neither liberty nor security. -- Benjamin Franklin
Jan Wieck <JanWieck@Yahoo.com> wrote: > Without this logic, the replication system could not combine > multiple origin sessions into one replication session without > risking to never find a state, in which it can commit. My latest idea for handling this in WAL-based replication involves WAL-logging information about the transaction through which the committing transaction makes it safe to view. There are a few options here at the detail level that I'm still thinking through. The idea would be that the xmin from read-only queries on the slaves might be somewhere behind where you would expect based on transactions committed. (The details involve such things as where non-serializable transactions fall into the plan on both sides, and whether it's worth the effort to special-case read-only transactions on the master.) I can't say that I'm 100% sure that some lurking detail won't shoot this technique down for HS, but it seems good to me at a conceptual level. I think, however, that this fails to work for systems like Slony and Londiste because there could be transactions writing to tables which are not replication targets, so the snapshot adjustments wouldn't be safe. True? (If not true, I think that adding some sort of xmin value, depending on the answers to the above questions, to Jan's proposed structure might support better transactional integrity, even to the level of full serializable support, at the cost of delaying visibility of committed data.) -Kevin
On 5/26/2010 4:11 PM, Dimitri Fontaine wrote: > So even ordering the txid and txid_snapshots with respect to WAL commit > time (LSN) won't be the whole story, for any given transaction > containing more than one event we also need to have them in order. I > know Jan didn't forget about it so it must either be in the proposal or > easily derived, too tired to recheck. No, that detail is actually not explained in the proposal. When applying all changes in transaction commit order, there is no need for a global sequence. A local counter per backend is sufficient because the total order of <xact-commit-order>, <local-xact-seq> yields a similarly agreeable order of actions. Jan -- Anyone who trades liberty for security deserves neither liberty nor security. -- Benjamin Franklin
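A short sketch of the ordering Jan describes here, with invented names throughout: commit_order_next(after_serial, max_rows) stands in for the proposed set returning function (assumed to return at least a commit serial number and the xid), and my_log stands in for a trigger-maintained log table whose rows carry the originating xid plus a per-backend local sequence number (log_localseq) instead of today's global sequence:

    -- Replay the next batch of transactions in commit order; within each
    -- transaction, replay its logged actions in local per-backend order.
    SELECT c.commit_serial, l.*
      FROM commit_order_next(12345, 1000) AS c   -- commits after serial 12345, at most 1000
      JOIN my_log AS l ON l.log_xid = c.xid
     ORDER BY c.commit_serial, l.log_localseq;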
On 26/05/10 23:32, Jan Wieck wrote: > Consider two transactions A and B that due to transaction batching > between snapshots get applied together. Let the order of actions be > > 1. A starts > 2. B starts > 3. B selects a row for update, then updates the row > 4. A tries to do the same and blocks > 5. B commits > 6. A gets the lock, the row, does the update > 7. A commits > > If Slony (or Londiste) would not record the exact order of those > individual row actions, then it would not have any idea if within that > batch the action of B (higher XID) actually came first. Without that > knowledge there is a 50/50 chance of getting your replica out of sync > with that simple conflict. Hmm, I don't see how even a fully reliable WAL-logged commit-order log would save you then. It seems that you need to not only know the relative order of commits, but the order of commits relative to actions within the transactions. I.e. in the above example it's not enough to know that B committed before A, you also have to know that A updated the row only after B committed. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On 5/26/2010 12:38 PM, Greg Stark wrote: > On Wed, May 26, 2010 at 5:10 PM, Jan Wieck <JanWieck@yahoo.com> wrote: >> ... but to answer that request, actually I don't even think we should be >> discussing API specifics. >> > > How about just API generalities? Like, where do you need this data, on > the master or on the slave? Would PGXC like it on the transaction > coordinator? > > What question do you need to answer, do you need to pull out sets of > commits in certain ranges or look up specific transaction ids and find > out when they committed? Or do you only need to answer which of two > transaction ids committed first? The question I want answered is "what was the order and xid of the next 0..n transactions, that committed after transaction X?" Preferably I would avoid scanning the entire available WAL just to get the next n xid's to process. The proposal assigned a unique serial number (file segment and position driven) to each xid and used that for the ordering as well as identification of the last known transaction. That is certainly a premature implementation detail. In this implementation it wouldn't even matter if a transaction that was recorded actually never made it because it crashed before the WAL flush. It would be reported by this "commit order" feature, but there would be no traces of whatever it did to be found inside the DB, so that anomaly is harmless. Jan -- Anyone who trades liberty for security deserves neither liberty nor security. -- Benjamin Franklin
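Sketching that question as an interface, with an invented function name and result shape (none of this is an existing API): the SRF returns the commits that happened after a given serial number, an empty result means nothing has committed since, and an error means the requested range has already been purged from the segment files.

    -- Hypothetical: commit_order_next(last_serial, max_rows) returning
    --   (commit_serial bigint, xid bigint, begin_ts timestamptz,
    --    commit_ts timestamptz, total_rowcount bigint)
    SELECT commit_serial, xid
      FROM commit_order_next(12345, 100);   -- last serial processed, max rows wanted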
On 26/05/10 23:45, Heikki Linnakangas wrote: > On 26/05/10 23:32, Jan Wieck wrote: >> Consider two transactions A and B that due to transaction batching >> between snapshots get applied together. Let the order of actions be >> >> 1. A starts >> 2. B starts >> 3. B selects a row for update, then updates the row >> 4. A tries to do the same and blocks >> 5. B commits >> 6. A gets the lock, the row, does the update >> 7. A commits >> >> If Slony (or Londiste) would not record the exact order of those >> individual row actions, then it would not have any idea if within that >> batch the action of B (higher XID) actually came first. Without that >> knowledge there is a 50/50 chance of getting your replica out of sync >> with that simple conflict. > > Hmm, I don't see how even a fully reliable WAL-logged commit-order log > would save you then. It seems that you need to not only know the > relative order of commits, but the order of commits relative to actions > within the transactions. I.e. in the above example it's not enough to > know that B committed before A, you also have to know that A updated the > row only after B committed. Ok, I think I understand it now. The commit order is enough, because replaying the actions in the order "all actions of B, then all actions of A" yields the same result. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On 26/05/10 23:49, Jan Wieck wrote: > In this implementation it wouldn't even matter if a transaction that was > recorded actually never made it because it crashed before the WAL flush. > It would be reported by this "commit order" feature, but there would be > no traces of whatever it did to be found inside the DB, so that anomaly > is harmless. Hmm, I think it would also not matter if the reported commit order doesn't exactly match the order of the commit records, as long as there's no dependency between the two transactions. What I'm after is this: I think it would be enough to establish the commit order using deferred triggers that are fired during pre-commit processing. The trigger could get a number from a global sequence to establish the commit order, and write it to a table. So you still need a global sequence, but it's only needed once per commit. (You have to handle deferred triggers that fire after the commit-order trigger, perhaps by getting another number from the global sequence and replacing the previous number with it.) -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
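A minimal sketch of that idea, with all object names invented (commit_order_seq, commit_order_log, some_replicated_table). Because the constraint trigger is created DEFERRABLE INITIALLY DEFERRED, the function runs during pre-commit processing; each firing replaces the number taken by an earlier firing in the same transaction, so the number that sticks is the one taken closest to commit, roughly matching the parenthetical above. It does not otherwise deal with deferred triggers that fire even later and touch unrelated tables.

    -- Hypothetical sketch only; not Slony/Londiste code.
    CREATE SEQUENCE commit_order_seq;

    CREATE TABLE commit_order_log (
        xid        bigint PRIMARY KEY,   -- txid_current() of the committing transaction
        commit_seq bigint NOT NULL       -- number taken at pre-commit time
    );

    CREATE FUNCTION record_commit_order() RETURNS trigger AS $$
    BEGIN
        -- Replace any number taken by an earlier deferred firing in this
        -- transaction, so the last number taken before commit wins.
        DELETE FROM commit_order_log WHERE xid = txid_current();
        INSERT INTO commit_order_log
            VALUES (txid_current(), nextval('commit_order_seq'));
        RETURN NULL;
    END;
    $$ LANGUAGE plpgsql;

    CREATE CONSTRAINT TRIGGER commit_order_trigger
        AFTER INSERT OR UPDATE OR DELETE ON some_replicated_table
        DEFERRABLE INITIALLY DEFERRED
        FOR EACH ROW EXECUTE PROCEDURE record_commit_order();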
On 5/26/2010 4:52 PM, Heikki Linnakangas wrote: > Ok, I think I understand it now. The commit order is enough, because > replaying the actions in the order "all actions of B, then all actions > of A" yields the same result. Precisely. Jan -- Anyone who trades liberty for security deserves neither liberty nor security. -- Benjamin Franklin
On 5/26/2010 5:12 PM, Heikki Linnakangas wrote: > On 26/05/10 23:49, Jan Wieck wrote: >> In this implementation it wouldn't even matter if a transaction that was >> recorded actually never made it because it crashed before the WAL flush. >> It would be reported by this "commit order" feature, but there would be >> no traces of whatever it did to be found inside the DB, so that anomaly >> is harmless. > > Hmm, I think it would also not matter if the reported commit order > doesn't match exactly the order of the commit records, as long as > there's no dependency between the two transactions. > > What I'm after is that I think it would be enough to establish the > commit order using deferred triggers that are fired during pre-commit > processing. The trigger could get a number from a global sequence to > establish the commit order, and write it to a table. So you still need a > global sequence, but it's only needed once per commit. You're not trying to derail this thread into yet another of our famous "commit trigger" battles, are you? > > (you have to handle deferred triggers that fire after the commit-order > trigger. perhaps by getting another number from the global sequence and > replacing the previous number with it) I could imagine a commit trigger as a special case that is fired AFTER the trigger queue was shut down, so any operation that causes any further triggers to fire would automatically abort the transaction. A draconian, but reasonable restriction. Jan -- Anyone who trades liberty for security deserves neither liberty nor security. -- Benjamin Franklin
On 5/26/2010 4:34 PM, Kevin Grittner wrote: > Jan Wieck <JanWieck@Yahoo.com> wrote: > >> Without this logic, the replication system could not combine >> multiple origin sessions into one replication session without >> risking to never find a state, in which it can commit. > > My latest idea for handling this in WAL-based replication involves > WAL-logging information about the transaction through which a the > committing transaction makes it safe to view. There are a few > options here at the detail level that I'm still thinking through. > The idea would be that the xmin from read-only queries on the slaves > might be somewhere behind where you would expect based on > transactions committed. (The details involve such things as where > non-serializable transactions fall into the plan on both sides, and > whether it's worth the effort to special-case read-only transactions > on the master.) > > I can't say that I'm 100% sure that some lurking detail won't shoot > this technique down for HS, but it seems good to me at a conceptual > level. Without simulating multiple simultaneous transactions during playback, how are you going to ensure that tuples already inserted on behalf of ongoing master transactions disappear when those transactions abort on the master? Jan -- Anyone who trades liberty for security deserves neither liberty nor security. -- Benjamin Franklin
Jan Wieck <JanWieck@Yahoo.com> wrote: > On 5/26/2010 4:34 PM, Kevin Grittner wrote: >> My latest idea for handling this in WAL-based replication >> involves WAL-logging information about the transaction through >> which a the committing transaction makes it safe to view. There >> are a few options here at the detail level that I'm still >> thinking through. The idea would be that the xmin from read-only >> queries on the slaves might be somewhere behind where you would >> expect based on transactions committed. (The details involve >> such things as where non-serializable transactions fall into the >> plan on both sides, and whether it's worth the effort to >> special-case read-only transactions on the master.) >> >> I can't say that I'm 100% sure that some lurking detail won't >> shoot this technique down for HS, but it seems good to me at a >> conceptual level. > > Without simulating multiple simultaneous transactions during > playback, how are you going to manage that the tuples, already > inserted on behalf of the ongoing master transactions, disappear > when they abort on the master? When do writes ever become visible to a snapshot without having been committed? I'm not talking about changing that in any way. I'm talking about deferring visibility of committed transactions until they can be viewed without risking serialization anomalies. This requires, at a minimum, that any concurrent serializable transactions which are not read-only have completed. (Perhaps I'm not understanding your question....) -Kevin
On Wed, May 26, 2010 at 5:38 PM, Greg Stark <gsstark@mit.edu> wrote: > How about just API generalities? Like, where do you need this data, on > the master or on the slave? Would PGXC like it on the transaction > coordinator? > > What question do you need to answer, do you need to pull out sets of > commits in certain ranges or look up specific transaction ids and find > out when they committed? Or do you only need to answer which of two > transaction ids committed first? > This thread has been hard to follow for me. Were any of these questions answered? -- greg
On 5/27/2010 9:59 AM, Greg Stark wrote: > On Wed, May 26, 2010 at 5:38 PM, Greg Stark <gsstark@mit.edu> wrote: >> How about just API generalities? Like, where do you need this data, on >> the master or on the slave? Would PGXC like it on the transaction >> coordinator? >> >> What question do you need to answer, do you need to pull out sets of >> commits in certain ranges or look up specific transaction ids and find >> out when they committed? Or do you only need to answer which of two >> transaction ids committed first? >> > > This thread has been hard to follow for me. Were any of these > questions answered? > Yes. On 5/26/2010 4:49 PM, Jan Wieck wrote: > On 5/26/2010 12:38 PM, Greg Stark wrote: >> > On Wed, May 26, 2010 at 5:10 PM, Jan Wieck <JanWieck@yahoo.com> wrote: >>> >> ... but to answer that request, actually I don't even think we should be >>> >> discussing API specifics. >>> >> >> > >> > How about just API generalities? Like, where do you need this data, on >> > the master or on the slave? Would PGXC like it on the transaction >> > coordinator? >> > >> > What question do you need to answer, do you need to pull out sets of >> > commits in certain ranges or look up specific transaction ids and find >> > out when they committed? Or do you only need to answer which of two >> > transaction ids committed first? > > The question I want answered is > > "what was the order and xid of the next 0..n transactions, that > committed after transaction X?" > > Preferably I would avoid scanning the entire available WAL just to get > the next n xid's to process. > > The proposal assigned a unique serial number (file segment and position > driven) to each xid and used that for the ordering as well as > identification of the last known transaction. That is certainly a > premature implementation detail. > > In this implementation it wouldn't even matter if a transaction that was > recorded actually never made it because it crashed before the WAL flush. > It would be reported by this "commit order" feature, but there would be > no traces of whatever it did to be found inside the DB, so that anomaly > is harmless. > > > Jan > > -- Anyone who trades liberty for security deserves neither liberty nor security. -- Benjamin Franklin > -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers -- Anyone who trades liberty for security deserves neither liberty nor security. -- Benjamin Franklin
On 5/27/2010 12:01 PM, Jan Wieck wrote: > On 5/27/2010 9:59 AM, Greg Stark wrote: >> This thread has been hard to follow for me. Were any of these >> questions answered? >> > > Yes. The thing missing is any sort of answer to that problem description. Jan -- Anyone who trades liberty for security deserves neither liberty nor security. -- Benjamin Franklin
Jan Wieck wrote: > >> Reading the entire WAL just to find all COMMIT records, then go back to > >> the origin database to get the actual replication log you're looking for > >> is simpler and more efficient? I don't think so. > > > > Agreed, but I think I've not explained myself well enough. > > > > I proposed two completely separate ideas; the first one was this: > > > > If you must get commit order, get it from WAL on *origin*, using exact > > same code that current WALSender provides, plus some logic to read > > through the WAL records and extract commit/aborts. That seems much > > simpler than the proposal you outlined and as SR shows, its low latency > > as well since commits write to WAL. No need to generate event ticks > > either, just use XLogRecPtrs as WALSender already does. > > > > I see no problem with integrating that into core, technically or > > philosophically. > > > > Which means that if I want to allow a consumer of that commit order data > to go offline for three days or so to replicate the 5 requested, low > volume tables, the origin needs to hang on to the entire WAL log from > all 100 other high volume tables? I suggest writing an external tool that strips out what you need that can be run at any time, rather than creating a new data format and overhead for this usecase. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com
On May 28, 2010, at 7:19 PM, Bruce Momjian <bruce@momjian.us> wrote: > Jan Wieck wrote: >>>> Reading the entire WAL just to find all COMMIT records, then go >>>> back to >>>> the origin database to get the actual replication log you're >>>> looking for >>>> is simpler and more efficient? I don't think so. >>> >>> Agreed, but I think I've not explained myself well enough. >>> >>> I proposed two completely separate ideas; the first one was this: >>> >>> If you must get commit order, get it from WAL on *origin*, using >>> exact >>> same code that current WALSender provides, plus some logic to read >>> through the WAL records and extract commit/aborts. That seems much >>> simpler than the proposal you outlined and as SR shows, its low >>> latency >>> as well since commits write to WAL. No need to generate event ticks >>> either, just use XLogRecPtrs as WALSender already does. >>> >>> I see no problem with integrating that into core, technically or >>> philosophically. >>> >> >> Which means that if I want to allow a consumer of that commit order >> data >> to go offline for three days or so to replicate the 5 requested, low >> volume tables, the origin needs to hang on to the entire WAL log from >> all 100 other high volume tables? > > I suggest writing an external tool that strips out what you need that > can be run at any time, rather than creating a new data format and > overhead for this usecase. That would be FAR more complex, less robust, and less performant - whereas doing what Jan has proposed is pretty straightforward and should have minimal impact on performance - or none when not enabled. ...Robert
On 5/28/2010 7:19 PM, Bruce Momjian wrote: > Jan Wieck wrote: >> >> Reading the entire WAL just to find all COMMIT records, then go back to >> >> the origin database to get the actual replication log you're looking for >> >> is simpler and more efficient? I don't think so. >> > >> > Agreed, but I think I've not explained myself well enough. >> > >> > I proposed two completely separate ideas; the first one was this: >> > >> > If you must get commit order, get it from WAL on *origin*, using exact >> > same code that current WALSender provides, plus some logic to read >> > through the WAL records and extract commit/aborts. That seems much >> > simpler than the proposal you outlined and as SR shows, its low latency >> > as well since commits write to WAL. No need to generate event ticks >> > either, just use XLogRecPtrs as WALSender already does. >> > >> > I see no problem with integrating that into core, technically or >> > philosophically. >> > >> >> Which means that if I want to allow a consumer of that commit order data >> to go offline for three days or so to replicate the 5 requested, low >> volume tables, the origin needs to hang on to the entire WAL log from >> all 100 other high volume tables? > > I suggest writing an external tool that strips out what you need that > can be run at any time, rather than creating a new data format and > overhead for this usecase. > Stripping it out from what? Jan -- Anyone who trades liberty for security deserves neither liberty nor security. -- Benjamin Franklin
Jan Wieck wrote: > >> > I see no problem with integrating that into core, technically or > >> > philosophically. > >> > > >> > >> Which means that if I want to allow a consumer of that commit order data > >> to go offline for three days or so to replicate the 5 requested, low > >> volume tables, the origin needs to hang on to the entire WAL log from > >> all 100 other high volume tables? > > > > I suggest writing an external tool that strips out what you need that > > can be run at any time, rather than creating a new data format and > > overhead for this usecase. > > > > Stripping it out from what? Stripping it from the WAL. Your system seems to require double-writes on a commit, which is something we have avoided in the past. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + None of us is going to be here forever. +
On 6/1/2010 11:09 AM, Bruce Momjian wrote: > Jan Wieck wrote: >> >> > I see no problem with integrating that into core, technically or >> >> > philosophically. >> >> > >> >> >> >> Which means that if I want to allow a consumer of that commit order data >> >> to go offline for three days or so to replicate the 5 requested, low >> >> volume tables, the origin needs to hang on to the entire WAL log from >> >> all 100 other high volume tables? >> > >> > I suggest writing an external tool that strips out what you need that >> > can be run at any time, rather than creating a new data format and >> > overhead for this usecase. >> > >> >> Stripping it out from what? > > Stripping it from the WAL. Your system seems to require double-writes > on a commit, which is something we have avoided in the past. > Your suggestion seems to be based on several false assumptions. This neither requires additional physical writes on commit, nor is consuming the entire WAL just to filter out commit records even remotely desirable for systems like Londiste or Slony. Jan -- Anyone who trades liberty for security deserves neither liberty nor security. -- Benjamin Franklin
drkp@csail.mit.edu (Dan Ports) writes: > I'm not clear on why the total rowcount is useful, but perhaps I'm > missing something obvious. It would make it easy to conclude: "This next transaction did 8328194 updates. Maybe we should do some kind of checkpoint (e.g. - commit transaction or such) before working on it." versus "This transaction we're thinking of working on had 7 updates. No big deal..." -- (reverse (concatenate 'string "ofni.secnanifxunil" "@" "enworbbc")) http://linuxfinances.info/info/finances.html Rules of the Evil Overlord #189. "I will never tell the hero "Yes I was the one who did it, but you'll never be able to prove it to that incompetent old fool." Chances are, that incompetent old fool is standing behind the curtain." <http://www.eviloverlord.com/>
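For illustration of that use, assuming a hypothetical commit-order SRF (commit_order_next) that also exposed the proposed per-transaction row count, a consumer could use it as a batching hint along these lines (the function name, column names, and threshold are all invented):

    -- Decide which of the next commits deserve their own apply batch.
    SELECT commit_serial, xid, total_rowcount,
           total_rowcount > 1000000 AS apply_in_own_batch
      FROM commit_order_next(12345, 1000);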
heikki.linnakangas@enterprisedb.com (Heikki Linnakangas) writes: > On 24/05/10 19:51, Kevin Grittner wrote: >> The only thing I'm confused about is what benefit anyone expects to >> get from looking at data between commits in some way other than our >> current snapshot mechanism. Can someone explain a use case where >> what Jan is proposing is better than snapshot isolation? It doesn't >> provide any additional integrity guarantees that I can see. > > Right, it doesn't. What it provides is a way to reconstruct a snapshot > at any point in time, after the fact. For example, after transactions > A, C, D and B have committed in that order, it allows you to > reconstruct a snapshot just like you would've gotten immediately after > the commit of A, C, D and B respectively. That's useful replication > tools like Slony that needs to commit the changes of those > transactions in the slave in the same order as they were committed in > the master. > > I don't know enough of Slony et al. to understand why that'd be better > than the current heartbeat mechanism they use, taking a snapshot every > few seconds, batching commits. I see two advantages: a) Identifying things on a transaction-by-transaction basis means that the snapshots ("syncs") don't need to be captured, which is presently an area of fragility. If the slon daemon falls over on Friday evening, and nobody notices until Monday, the "snapshot" reverts to being all updates between Friday and whenever SYNCs start to be collected again. Exposing commit orders eliminates that fragility. SYNCs don't need to be captured anymore, so they can't be missed (which is today's problem). b) The sequence currently used to control log application ordering is a bottleneck, as it is a single sequence shared across all connections. It could be eliminated in favor of (perhaps) an in-memory variable defined on a per-connection basis. It's not a bottleneck that we hear a lot of complaints about, but the sequence certainly is a bottleneck. -- select 'cbbrowne' || '@' || 'cbbrowne.com'; http://cbbrowne.com/info/internet.html "MS apparently now has a team dedicated to tracking problems with Linux and publicizing them. I guess eventually they'll figure out this back fires... ;)" -- William Burrow <aa126@DELETE.fan.nb.ca>
On Wed, Jun 2, 2010 at 6:45 PM, Chris Browne <cbbrowne@acm.org> wrote: > It would make it easy to conclude: > > "This next transaction did 8328194 updates. Maybe we should do > some kind of checkpoint (e.g. - commit transaction or such) before > working on it." > > versus > > "This transaction we're thinking of working on had 7 updates. No > big deal..." I'm puzzled how you would define this value. How do you add 7 inserts, 7 deletes, and 7 updates? Is that 21 rows modified? Why are the 7 inserts and 7 deletes worth twice as much as the 7 updates when they're basically the same thing? What if the inserts fired triggers which inserted 7 more rows, is that 14? What if the 7 updates modified 2 TB of TOAST data but the 8238194 updates were all to the same record and they were all HOT updates so all it did was change 8kB? In any case you'll have all the actual data from your triggers or hooks or whatever so what value does having the system keep track of this add? -- greg
On 6/2/2010 7:49 PM, Greg Stark wrote: > On Wed, Jun 2, 2010 at 6:45 PM, Chris Browne <cbbrowne@acm.org> wrote: >> It would make it easy to conclude: >> >> "This next transaction did 8328194 updates. Maybe we should do >> some kind of checkpoint (e.g. - commit transaction or such) before >> working on it." >> >> versus >> >> "This transaction we're thinking of working on had 7 updates. No >> big deal..." > > I'm puzzled how you would define this value. How do you add 7 inserts, > 7 deletes, and 7 updates? Is that 21 rows modified? Why are the 7 > inserts and 7 deletes worth twice as much as the 7 updates when > they're basically the same thing? What if the inserts fired triggers > which inserted 7 more rows, is that 14? What if the 7 updates modified > 2 TB of TOAST data but the 8238194 updates were all to the same record > and they were all HOT updates so all it did was change 8kB? > > In any case you'll have all the actual data from your triggers or > hooks or whatever so what value does having the system keep track of > this add? The point is not that we don't have that information now. The point is having a hint BEFORE wading through possibly gigabytes of WAL or log data. If getting that information requires reading all the log data twice, or reading gigabytes of otherwise useless WAL data (as per Bruce's suggestion), we had better not get it at all and just keep doing what we are doing now. I actually have a hard time understanding why people are so opposed to a feature that has zero impact at all unless a DBA actually turns it ON. What is the problem with exposing the commit order of transactions? Jan -- Anyone who trades liberty for security deserves neither liberty nor security. -- Benjamin Franklin
Jan Wieck wrote: > The point is not that we don't have that information now. The point is > having a hint BEFORE wading through possibly gigabytes of WAL or log data. > > If getting that information requires to read all the log data twice or > the need to read gigabytes of otherwise useless WAL data (as per Bruce's > suggestion), we better not get it at all and just keep doing what we are > doing now. > > I actually have a hard time understanding why people are so opposed to a > feature that has zero impact at all unless a DBA actually turns in ON. > What is the problem with exposing the commit order of transactions? If you want to fork Postgres and add it, go ahead, but if the community has to maintain the code and document it, we care. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + None of us is going to be here forever. +
Jan Wieck <JanWieck@Yahoo.com> wrote: > I actually have a hard time understanding why people are so > opposed to a feature that has zero impact at all unless a DBA > actually turns in ON. What is the problem with exposing the > commit order of transactions? FWIW, once I came to understand the use case, it seems to me a perfectly reasonable and useful thing to have. It does strike me that there may be value in adding one more xid to support certain types of integrity for some use cases, but that's certainly something which could be added later, if at all. Once I realized that, I just dropped out of the discussion; perhaps I should have bowed out with an endorsement. Unless my memory is failing me worse than usual, Dan Ports, who is working on the serializable implementation so he can use the predicate locking with a transaction-aware caching feature, needs the ability to track commit order of transactions by xid; so the use cases go beyond Slony and Londiste. -Kevin
On 6/3/2010 4:04 PM, Bruce Momjian wrote: > If you want to fork Postgres and add it, go ahead, but if the community > has to maintain the code and document it, we care. That comment was rather unprofessional. I think the rest of us still try to find the best solution for the problem, not kill the discussion. You may want to rejoin that effort. I care about an efficient, low-overhead way to get certain information that is otherwise extremely difficult, expensive and version-dependent to get. I care about cleaning up more of the mistakes made in the original development of Slony, namely using hacks and kluges to implement details not supported by the current version of PostgreSQL. Londiste and Slony made a good leap on that with the txid data type. Slony made another step like that with 2.0, switching to the native trigger configuration (developed and contributed for that very purpose) instead of hacking system catalogs. This would be another step in that direction, and we would be able to unify Londiste's and Slony's transport mechanisms and eliminate the tick/sync kluge. Care to explain what exactly you care about? Jan -- Anyone who trades liberty for security deserves neither liberty nor security. -- Benjamin Franklin
bruce@momjian.us (Bruce Momjian) writes: > Jan Wieck wrote: >> The point is not that we don't have that information now. The point is >> having a hint BEFORE wading through possibly gigabytes of WAL or log data. >> >> If getting that information requires to read all the log data twice or >> the need to read gigabytes of otherwise useless WAL data (as per Bruce's >> suggestion), we better not get it at all and just keep doing what we are >> doing now. >> >> I actually have a hard time understanding why people are so opposed to a >> feature that has zero impact at all unless a DBA actually turns in ON. >> What is the problem with exposing the commit order of transactions? > > If you want to fork Postgres and add it, go ahead, but if the community > has to maintain the code and document it, we care. Are you "caring" or "opposing"? It seems rather uncharitable to imply that Jan doesn't care. I know *I'm* not interested in a forked Postgres for this - I would prefer to find out what things could be done that don't involve gross amounts of WAL file grovelling for data that mayn't necessarily even be available. -- select 'cbbrowne' || '@' || 'cbbrowne.com'; http://cbbrowne.com/info/internet.html "MS apparently now has a team dedicated to tracking problems with Linux and publicizing them. I guess eventually they'll figure out this back fires... ;)" -- William Burrow <aa126@DELETE.fan.nb.ca>
gsstark@mit.edu (Greg Stark) writes: > On Wed, Jun 2, 2010 at 6:45 PM, Chris Browne <cbbrowne@acm.org> wrote: >> It would make it easy to conclude: >> >> "This next transaction did 8328194 updates. Maybe we should do >> some kind of checkpoint (e.g. - commit transaction or such) before >> working on it." >> >> versus >> >> "This transaction we're thinking of working on had 7 updates. No >> big deal..." > > I'm puzzled how you would define this value. How do you add 7 inserts, > 7 deletes, and 7 updates? Is that 21 rows modified? Why are the 7 > inserts and 7 deletes worth twice as much as the 7 updates when > they're basically the same thing? What if the inserts fired triggers > which inserted 7 more rows, is that 14? What if the 7 updates modified > 2 TB of TOAST data but the 8238194 updates were all to the same record > and they were all HOT updates so all it did was change 8kB? The presence of those questions (and their ambiguity) is the reason why there's a little squirming as to whether this is super-useful and super-necessary. What this offers is *SOME* idea of how much updating work a particular transaction did. It's a bit worse than you suggest: - If replication triggers have captured tuples, those would get counted. - TOAST updates might lead to extra updates being counted. But back to where you started, I'd anticipate 7 inserts, 7 deletes, and 7 updates being counted as something around 21 updates. And if that included 5 TOAST changes, it might bump up to 26. If there were replication triggers in place, that might bump the count up to 45 (which I chose arbitrarily). > In any case you'll have all the actual data from your triggers or > hooks or whatever so what value does having the system keep track of > this add? This means that when we'd pull the list of transactions to consider, we'd get something like: select * from next_transactions('4218:23', 50); [list of 50 transactions returned, each with... -> txid -> START timestamp -> COMMIT timestamp -> Approximate # of updates Then, for each of the 50, I'd pull replication log data for the corresponding transaction. If I have the approximate # of updates, that might lead me to stop short, and say: "That next update looks like a doozy! I'm going to stop and commit what I've got before doing that one." It's not strictly necessary, but would surely be useful for flow control. -- select 'cbbrowne' || '@' || 'cbbrowne.com'; http://cbbrowne.com/info/internet.html "MS apparently now has a team dedicated to tracking problems with Linux and publicizing them. I guess eventually they'll figure out this back fires... ;)" -- William Burrow <aa126@DELETE.fan.nb.ca>
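To make the flow-control idea above concrete, a consumer-side query along the following lines would work; next_transactions() and its output columns are only the illustrative names from Chris's example, not a settled interface:

    -- Pull the next batch and keep a running total of the row-count hint.
    SELECT serial,
           txid,
           commit_timestamp,
           approx_rowcount,
           sum(approx_rowcount) OVER (ORDER BY serial) AS running_rowcount
      FROM next_transactions('4218:23', 50)
     ORDER BY serial;

The daemon would stop adding transactions to the current batch once running_rowcount crosses a configured threshold, commit what it has, and then handle the oversized transaction on its own.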
On Thu, Jun 3, 2010 at 8:50 PM, Jan Wieck <JanWieck@yahoo.com> wrote: >> I'm puzzled how you would define this value. How do you add 7 inserts, >> 7 deletes, and 7 updates? Is that 21 rows modified? > > I actually have a hard time understanding why people are so opposed to a > feature that has zero impact at all unless a DBA actually turns in ON. What > is the problem with exposing the commit order of transactions? The post you were responding to was regarding the meaninglessness of the "number of records" attribute you wanted. Your response is a non sequitur. I think the commit order of transactions would be a good thing to expose, though I've asked repeatedly what kind of interface you need and never gotten answers to all the questions. -- greg
Greg Stark <gsstark@mit.edu> wrote: > what kind of interface you need For the potential uses I can see, it would be great to have a SRF which took two parameters: xid of last known commit and a limit how many commits past that to return. Perhaps a negative number could move earlier in time, if that seems reasonable to others. I think that's also consistent with Jan's posts. A GUC to enable it and some way to specify retention (or force cleanup) are the only other user-facing features which come to mind for me. (Not sure what form that last should take, but didn't Jan say something about both of these early in the thread?) Do you see a need for something else (besides, obviously, docs)? -Kevin
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> wrote: > Greg Stark <gsstark@mit.edu> wrote: > >> what kind of interface you need > > For the potential uses I can see, it would be great to have a SRF > which took two parameters: xid of last known commit and a limit > how many commits past that to return. Jan's very first post had it right; my idea was flawed: | Exposing the data will be done via a set returning function. The | SRF takes two arguments. The maximum number of rows to return and | the last serial number processed by the reader. The advantage of | such SRF is that the result can be used in a query that right away | delivers audit or replication log information in transaction | commit order. The SRF can return an empty set if no further | transactions have committed since, or an error if data segments | needed to answer the request have already been purged. | | Purging of the data will be possible in several different ways. | Autovacuum will call a function that drops segments of the data | that are outside the postgresql.conf configuration with respect to | maximum age or data volume. There will also be a function reserved | for superusers to explicitly purge the data up to a certain serial | number. Apologies for not looking back to the start of the thread before that last post. It was all laid out right at the start. -Kevin
Jan Wieck wrote: > On 6/3/2010 4:04 PM, Bruce Momjian wrote: > > If you want to fork Postgres and add it, go ahead, but if the community > > has to maintain the code and document it, we care. > > That comment was rather unprofessional. I think the rest of us still try > to find the best solution for the problem, not kill the discussion. You > may want to rejoin that effort. > > I care about an efficient, low overhead way to get a certain > information, that is otherwise extremely difficult, expensive and > version dependent to get. > > I care about cleaning up more of the mistakes, made in the original > development of Slony. Namely using hacks and kluges to implement > details, not supported by a current version of PostgreSQL. Londiste and > Slony made a good leap on that with the txid data type. Slony made > another step like that with 2.0, switching to the (for that very purpose > developed and contributed) native trigger configuration instead of > hacking system catalogs. This would be another step in that direction > and we would be able to unify Londiste's and Slony's transport mechanism > and eliminating the tick/sync kluge. > > Care to explain what exactly you care about? Here is what I was replying to: > >> I actually have a hard time understanding why people are so opposed t$ > > >> feature that has zero impact at all unless a DBA actually turns in ON. > >> What is the problem with exposing the commit order of transactions? Jan's comment is why should others care what he wants because it has zero impact? I am saying the community cares because we have to maintain the code. I stand by my comment. I remember a dismissive comment by Jan when 'session_replication_role' was added, and a similar strong comment from me at that time as well. It seems we are doing this again. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + None of us is going to be here forever. +
Bruce Momjian wrote: > Jan Wieck wrote: > > On 6/3/2010 4:04 PM, Bruce Momjian wrote: > > > If you want to fork Postgres and add it, go ahead, but if the community > > > has to maintain the code and document it, we care. > > > > That comment was rather unprofessional. I think the rest of us still try > > to find the best solution for the problem, not kill the discussion. You > > may want to rejoin that effort. > > > > I care about an efficient, low overhead way to get a certain > > information, that is otherwise extremely difficult, expensive and > > version dependent to get. > > > > I care about cleaning up more of the mistakes, made in the original > > development of Slony. Namely using hacks and kluges to implement > > details, not supported by a current version of PostgreSQL. Londiste and > > Slony made a good leap on that with the txid data type. Slony made > > another step like that with 2.0, switching to the (for that very purpose > > developed and contributed) native trigger configuration instead of > > hacking system catalogs. This would be another step in that direction > > and we would be able to unify Londiste's and Slony's transport mechanism > > and eliminating the tick/sync kluge. > > > > Care to explain what exactly you care about? > > Here is what I was replying to: > > > >> I actually have a hard time understanding why people are so opposed t$ > > > >> feature that has zero impact at all unless a DBA actually turns in ON. > > >> What is the problem with exposing the commit order of transactions? > > Jan's comment is why should others care what he wants because it has > zero impact? I am saying the community cares because we have to > maintain the code. I stand by my comment. > > I remember a dismissive comment by Jan when 'session_replication_role' > was added, and a similar strong comment from me at that time as well. > It seems we are doing this again. Of course, if I am misintepreting what Jan said, please let me know. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + None of us is going to be here forever. +
Excerpts from Chris Browne's message of jue jun 03 16:21:35 -0400 2010: > What this offers is *SOME* idea of how much updating work a particular > transaction did. It's a bit worse than you suggest: > > - If replication triggers have captured tuples, those would get > counted. > > - TOAST updates might lead to extra updates being counted. > > But back to where you started, I'd anticipate 7 inserts, 7 deletes, > and 7 updates being counted as something around 21 updates. > > And if that included 5 TOAST changes, it might bump up to 26. > > If there were replication triggers in place, that might bump the count > up to 45 (which I chose arbitrarily). Why not send separate numbers of tuple inserts/updates/deletes, which we already have from pgstats? -- Álvaro Herrera <alvherre@commandprompt.com> The PostgreSQL Company - Command Prompt, Inc. PostgreSQL Replication, Consulting, Custom Development, 24x7 support
On 6/3/2010 5:58 PM, Greg Stark wrote: > On Thu, Jun 3, 2010 at 8:50 PM, Jan Wieck <JanWieck@yahoo.com> wrote: >>> I'm puzzled how you would define this value. How do you add 7 inserts, >>> 7 deletes, and 7 updates? Is that 21 rows modified? >> >> I actually have a hard time understanding why people are so opposed to a >> feature that has zero impact at all unless a DBA actually turns in ON. What >> is the problem with exposing the commit order of transactions? > > The post you were responding to was regarding the meaninglessness of > the "number of records" attribute you wanted. Your response is a non > sequitor. I never proposed a "number of records" attribute. I proposed a sum of the row counts in the statistics collector. That row count would be a mix of insert, update, delete and toast operations. It's not an exact indicator of anything, but a good enough hint of how much data may come down the pipe if I were to select all replication data belonging to that transaction. > > I think the commit order of transactions would be a good thing to > expose though I've asked repeatedly what kind of interface you need > and never gotten answers to all the questions. In the original email that started this whole thread I wrote: > Exposing the data will be done via a set returning function. The SRF > takes two arguments. The maximum number of rows to return and the last > serial number processed by the reader. The advantage of such SRF is that > the result can be used in a query that right away delivers audit or > replication log information in transaction commit order. The SRF can > return an empty set if no further transactions have committed since, or > an error if data segments needed to answer the request have already been > purged. Did that not answer your question? Jan -- Anyone who trades liberty for security deserves neither liberty nor security. -- Benjamin Franklin
On 6/3/2010 6:24 PM, Kevin Grittner wrote: > Apologies for not looking back to the start of the thread before > that last post. It was all laid out right at the start. No need to apologize. Happens. Jan -- Anyone who trades liberty for security deserves neither liberty nor security. -- Benjamin Franklin
On 6/3/2010 7:11 PM, Alvaro Herrera wrote: > Excerpts from Chris Browne's message of jue jun 03 16:21:35 -0400 2010: > >> What this offers is *SOME* idea of how much updating work a particular >> transaction did. It's a bit worse than you suggest: >> >> - If replication triggers have captured tuples, those would get >> counted. >> >> - TOAST updates might lead to extra updates being counted. >> >> But back to where you started, I'd anticipate 7 inserts, 7 deletes, >> and 7 updates being counted as something around 21 updates. >> >> And if that included 5 TOAST changes, it might bump up to 26. >> >> If there were replication triggers in place, that might bump the count >> up to 45 (which I chose arbitrarily). > > Why not send separate numbers of tuple inserts/updates/deletes, which we > already have from pgstats? > We only have them for the entire database. The purpose of this is just a guesstimate about what data volume to expect if I were to select all log from a particular transaction. This datum isn't critical, just handy for the overall feature to be useful. Jan -- Anyone who trades liberty for security deserves neither liberty nor security. -- Benjamin Franklin
On Thu, Jun 3, 2010 at 6:29 PM, Bruce Momjian <bruce@momjian.us> wrote: > Jan Wieck wrote: >> On 6/3/2010 4:04 PM, Bruce Momjian wrote: >> > If you want to fork Postgres and add it, go ahead, but if the community >> > has to maintain the code and document it, we care. >> >> That comment was rather unprofessional. I think the rest of us still try >> to find the best solution for the problem, not kill the discussion. You >> may want to rejoin that effort. >> >> I care about an efficient, low overhead way to get a certain >> information, that is otherwise extremely difficult, expensive and >> version dependent to get. >> >> I care about cleaning up more of the mistakes, made in the original >> development of Slony. Namely using hacks and kluges to implement >> details, not supported by a current version of PostgreSQL. Londiste and >> Slony made a good leap on that with the txid data type. Slony made >> another step like that with 2.0, switching to the (for that very purpose >> developed and contributed) native trigger configuration instead of >> hacking system catalogs. This would be another step in that direction >> and we would be able to unify Londiste's and Slony's transport mechanism >> and eliminating the tick/sync kluge. >> >> Care to explain what exactly you care about? > > Here is what I was replying to: > >> >> I actually have a hard time understanding why people are so opposed t$ >> > >> feature that has zero impact at all unless a DBA actually turns in ON. >> >> What is the problem with exposing the commit order of transactions? > > Jan's comment is why should others care what he wants because it has > zero impact? I am saying the community cares because we have to > maintain the code. I stand by my comment. > > I remember a dismissive comment by Jan when 'session_replication_role' > was added, and a similar strong comment from me at that time as well. > It seems we are doing this again. I think it's entirely legitimate and proper for us to make a decision about whether this feature is worth including in core PostgreSQL. We are obviously not in the business of adding random features solely for the benefit of third-party applications. That having been said, there are several reasons why I believe that this particular feature is an excellent candidate for inclusion in core. 1. It solves a problem for which there is no easy workaround. Rereading all the WAL to extract the commit records is not an easy workaround, nor is what Slony and Londiste are doing now. 2. It is usable by multiple projects, not just one. It may well have applications beyond replication (e.g. distributed transactions), but at a very minimum it is usable by and useful to multiple replication solutions. 3. It has a clear specification which can be easily understood even by people who do not fully understand how replication solutions will make use of it, which makes code maintenance much less burdensome. Obviously, Jan's original email on this topic was just a sketch, but I find it to be pretty clear. 4. We have an existing precedent of being willing to add limited support into core to allow replication solutions to do their thing (session_replication_role, ALTER TABLE ... ENABLE REPLICA TRIGGER, etc). Even though we now have built-in replication via HS and SR, there is still a BIG use case for Slony, Londiste, and other add-on tools. Making those tools more successful and performant is good for PostgreSQL. 5. It does not involve highly invasive changes to the core code. 6. 
It can be turned off for users who don't want it. I find the skeptical attitude on this thread altogether unwarranted. Jan made his case and, at least IMHO, presented it pretty clearly. He then answered, multiple times, numerous questions which were already addressed in the original email, as well as various others. I think we should be very careful about assuming that we understand replication and its needs better than someone who has spent many years developing one of the major PostgreSQL replication solutions. Independent of Jan's qualifications, there are clearly several people on this thread who understand why this is useful and valuable, including me. I am obviously not in a position to insist that we accept this feature (assuming Jan produces a patch rather than getting discouraged and giving up) but I would like us to think very, very carefully before rejecting it, and not to do so unless we have a DARN good reason. Most patches add code, and therefore require code maintenance - that is not, by itself, a reason to reject them. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
Bruce Momjian wrote: > > I care about cleaning up more of the mistakes, made in the original > > development of Slony. Namely using hacks and kluges to implement > > details, not supported by a current version of PostgreSQL. Londiste and > > Slony made a good leap on that with the txid data type. Slony made > > another step like that with 2.0, switching to the (for that very purpose > > developed and contributed) native trigger configuration instead of > > hacking system catalogs. This would be another step in that direction > > and we would be able to unify Londiste's and Slony's transport mechanism > > and eliminating the tick/sync kluge. > > > > Care to explain what exactly you care about? > > Here is what I was replying to: > > > >> I actually have a hard time understanding why people are so opposed t$ > > > >> feature that has zero impact at all unless a DBA actually turns in ON. > > >> What is the problem with exposing the commit order of transactions? > > Jan's comment is why should others care what he wants because it has > zero impact? I am saying the community cares because we have to > maintain the code. I stand by my comment. > > I remember a dismissive comment by Jan when 'session_replication_role' > was added, and a similar strong comment from me at that time as well. > It seems we are doing this again. FYI, I talked to Jan on the phone and we have resolved this issue. :-) -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + None of us is going to be here forever. +
On Fri, Jun 4, 2010 at 2:32 AM, Robert Haas <robertmhaas@gmail.com> wrote: > I find the skeptical attitude on this thread altogether unwarranted. > Jan made his case and, at least IMHO, presented it pretty clearly. Just to be clear I think the idea of exposing commit order is a no-brainer. The specific interface is what I was questioning. A function which takes a starting xid and a number of transactions to return seems very tied to one particular application. I could easily see other systems such as a multi-master system instead only wanting to compare two transactions to find out which committed first. Or non-replication applications where you have an LSN and want to know whether a given transaction had committed by that time. So one possible interface would be to do something like xids_committed_between(lsn_start, lsn_end) -- and yes, possibly with an optional argument to limit the number of records returned. So you could do:

    old := select pg_current_xlog_location();
    while (1)
    {
        sleep 60s;
        new := select pg_current_xlog_location();
        process(select xids_committed_between(old, new));
        old := new;
    }

This might be more useful for PITR recovery for example where you want to find out what transactions committed between now and some known point of corruption. I could also see it being useful to have a function pg_xlog_location_of_commit(xid). That would let you run recovery until a particular transaction committed or test whether your replica is caught up to a particular commit. It could be useful for monitoring Hot Standby slaves. > He > then answered, multiple times, numerous questions which were already > addressed in the original email, as well as various others. I think I did miss some of the original description. That might have caused some of the difficulty as I was asking questions about something he assumed he had already answered. > I think > we should be very careful about assuming that we understand > replication and its needs better than someone who has spent many years > developing one of the major PostgreSQL replication solutions. Well the flip side of that is that we want an interface that's useful for more than just one replication system. This is something basic enough that I think it will be useful for more than just replication if we design it generally enough. It should be useful for backup/restore processes and monitoring as well as various forms of replication including master-slave trigger based systems but also including PITR-based replication, log-parsing systems, multi-master trigger based systems, 2PC-based systems, etc. -- greg
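For the Hot Standby monitoring case, the check might look like the following; pg_xlog_location_of_commit() is the hypothetical function from the paragraph above, while pg_current_xlog_location() and 9.0's pg_last_xlog_replay_location() already exist:

    -- On the master: where did transaction 123456's commit record land?
    SELECT pg_xlog_location_of_commit(123456);   -- hypothetical

    -- On a Hot Standby slave: how far has replay progressed?
    SELECT pg_last_xlog_replay_location();

Comparing the two locations tells whether the slave has already replayed that particular commit.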
On Fri, Jun 4, 2010 at 10:44 AM, Greg Stark <gsstark@mit.edu> wrote: > On Fri, Jun 4, 2010 at 2:32 AM, Robert Haas <robertmhaas@gmail.com> wrote: >> I find the skeptical attitude on this thread altogether unwarranted. >> Jan made his case and, at least IMHO, presented it pretty clearly. > > Just to be clear I think the idea of exposing commit order is a > no-brainer. The specific interface is what I was questioning. OK, thanks for that clarification. > A function which takes a starting xid and a number of transactions to > return seems very tied to one particular application. I could easily > see other systems such as a multi-master system instead only wanting > to compare two transactions to find out which committed first. Or > non-replication applications where you have an LSN and want to know > whether a given transaction had committed by that time. > > So one possible interface would be to do something like > xids_committed_between(lsn_start, lsn_end) -- and yes, possibly with > an optional argument to limit the number or records returned. I'm imagining that the backend data storage for this would be a file containing, essentially, a struct for each commit repeated over and over again, packed tightly. It's easy to index into such a file using a sequence number (give me the 1000'th commit) but searching by LSN would require (a) storing the LSNs and (b) binary search. Maybe it's worth adding that complexity, but I'm not sure that it is. Keeping the size of this file small is important for ensuring that it has minimal performance impact (which is also why I'm not sold on trying to include the tuple counters that Jan proposed - I think we can solve the problem he's worried about there more cleanly in other ways). > So you could do: > > old := select pg_current_xlog_location(); > while (1) > { > sleep 60s; > new := select pg_current_xlog_location() > process(select xids_committed_between(old,new)) > old := new > } > > This might be more useful for PITR recovery for example where you want > to find out what transactions committed between now and some known > point of corruption. This could also be done by selecting the current commit sequence number, getting the XIDs committed between the two commit sequence numbers, etc. > I could also see it being useful to have a function > pg_xlog_location_of_commit(xid). That would let you run recovery until > a particular transaction committed or test whether your replica is > caught up to a particular commit. It could be useful for monitoring > Hot Standby slaves. Well, you'd need to index the commit data to make that work, I think, so that adds a lot of complexity. The implementation as proposed lets you find the commits after a known point in order of occurrence, but it doesn't let you inquire about the location of a particular commit. If you want to run recovery until a particular transaction commits, we could teach the recovery code to look for the commit record for that XID and then pause at that point (or just before that point, if someone wanted that as an alternative behavior), which would be much simpler than using this mechanism. And if you want to check whether slaves are caught up, it would probably be better to use LSN rather than commits, because you could be caught up on commits but way behind on WAL replay. >> I think >> we should be very careful about assuming that we understand >> replication and its needs better than someone who has spent many years >> developing one of the major PostgreSQL replication solutions. 
> > Well the flip side of that is that we want an interface that's useful > for more than just one replication system. This is something basic > enough that I think it will be useful for more than just replication > if we design it generally enough. It should be useful for > backup/restore processes and monitoring as well as various forms of > replication including master-slave trigger based systems but also > including PITR-based replication, log-parsing systems, multi-master > trigger based systems, 2PC-based systems, etc. Making it general enough to serve multiple needs is good, but we've got to make sure that the extra complexity is buying us something. Jan seems pretty confident that this could be used by Londiste also, though it would be nice to have some confirmation from the Londiste developer(s) on that. I think it may also have applications for distributed transactions and multi-master replication, but I am not too sure it helps much for PITR-based replication or log-parsing systems. We want to design something that is good, but trying to solve too many problems may end up solving none of them well. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
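To illustrate why lookup by sequence number is cheap: with the fixed 32-byte record from the original proposal and an assumed (made-up) 8192 records per segment file, finding commit number 1000000 is pure arithmetic, whereas looking one up by LSN would require storing the LSN in every record and binary-searching for it:

    -- Which segment file and byte offset hold commit serial 1000000?
    SELECT 1000000 / 8192        AS segment_number,   -- forms the file name
           (1000000 % 8192) * 32 AS byte_offset;      -- position inside it

That is the extra complexity being weighed against the LSN-based interface.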
Excerpts from Jan Wieck's message of jue jun 03 19:52:19 -0400 2010: > On 6/3/2010 7:11 PM, Alvaro Herrera wrote: > > Why not send separate numbers of tuple inserts/updates/deletes, which we > > already have from pgstats? > > We only have them for the entire database. The purpose of this is just a > guesstimate about what data volume to expect if I were to select all log > from a particular transaction. But we already have per table counters. Couldn't we aggregate them per transaction as well, if this feature is enabled? I'm guessing that this is going to have some uses besides Slony; vague measurements could turn out to be unusable for some of these. -- Álvaro Herrera <alvherre@commandprompt.com> The PostgreSQL Company - Command Prompt, Inc. PostgreSQL Replication, Consulting, Custom Development, 24x7 support
On 6/4/2010 10:44 AM, Greg Stark wrote: > On Fri, Jun 4, 2010 at 2:32 AM, Robert Haas <robertmhaas@gmail.com> wrote: >> I find the skeptical attitude on this thread altogether unwarranted. >> Jan made his case and, at least IMHO, presented it pretty clearly. > > Just to be clear I think the idea of exposing commit order is a > no-brainer. The specific interface is what I was questioning. > > A function which takes a starting xid and a number of transactions to > return seems very tied to one particular application. I could easily > see other systems such as a multi-master system instead only wanting > to compare two transactions to find out which committed first. Or > non-replication applications where you have an LSN and want to know > whether a given transaction had committed by that time. Read the proposal again. I mean the original mail that started this thread. The function does NOT take an xid as argument. Being able to compare two xids against each other with respect to their commit order is eventually useful. The serial number of the data set, returned by the SRF as proposed, would perfectly satisfy that need. But not the way you envision for multimaster. Multimaster would ask "did xid X from server A commit before or after xid Y from server B?" That is a question completely outside the scope of this proposal. Please keep it real. Jan -- Anyone who trades liberty for security deserves neither liberty nor security. -- Benjamin Franklin
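For the single-origin case Jan describes, a sketch of that comparison, assuming the reader has copied the SRF's output into a local table (table and column names here are made up): once both transactions appear in the captured commit-order data, their serials answer "which committed first" directly:

    SELECT CASE WHEN a.serial < b.serial THEN a.txid ELSE b.txid END
           AS committed_first
      FROM commit_order_cache AS a,
           commit_order_cache AS b
     WHERE a.txid = 4711
       AND b.txid = 4712;

Cross-origin ordering, as in the multi-master question, is indeed outside what a single node's serial numbers can provide.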
On 6/4/2010 12:52 PM, Alvaro Herrera wrote: > Excerpts from Jan Wieck's message of jue jun 03 19:52:19 -0400 2010: >> On 6/3/2010 7:11 PM, Alvaro Herrera wrote: > >> > Why not send separate numbers of tuple inserts/updates/deletes, which we >> > already have from pgstats? >> >> We only have them for the entire database. The purpose of this is just a >> guesstimate about what data volume to expect if I were to select all log >> from a particular transaction. > > But we already have per table counters. Couldn't we aggregate them per > transaction as well, if this feature is enabled? I'm guessing that this > is going to have some uses besides Slony; vague measurements could turn > out to be unusable for some of these. We have them per table and per index, summarized over all transactions. It is debatable if bloating this feature with detailed statistics is useful or not, but I'd rather not have that bloat at the beginning, because otherwise I know exactly what is going to happen. People will just come back and say "zero impact my a..". Jan -- Anyone who trades liberty for security deserves neither liberty nor security. -- Benjamin Franklin
On 6/4/10, Robert Haas <robertmhaas@gmail.com> wrote: > On Fri, Jun 4, 2010 at 10:44 AM, Greg Stark <gsstark@mit.edu> wrote: > > A function which takes a starting xid and a number of transactions to > > return seems very tied to one particular application. I could easily > > see other systems such as a multi-master system instead only wanting > > to compare two transactions to find out which committed first. Or > > non-replication applications where you have an LSN and want to know > > whether a given transaction had committed by that time. > > > > So one possible interface would be to do something like > > xids_committed_between(lsn_start, lsn_end) -- and yes, possibly with > > an optional argument to limit the number or records returned. > > > I'm imagining that the backend data storage for this would be a file > containing, essentially, a struct for each commit repeated over and > over again, packed tightly. It's easy to index into such a file using > a sequence number (give me the 1000'th commit) but searching by LSN > would require (a) storing the LSNs and (b) binary search. Maybe it's > worth adding that complexity, but I'm not sure that it is. Keeping > the size of this file small is important for ensuring that it has > minimal performance impact (which is also why I'm not sold on trying > to include the tuple counters that Jan proposed - I think we can solve > the problem he's worried about there more cleanly in other ways). AIUI, you index the file by offset. > >> I think > >> we should be very careful about assuming that we understand > >> replication and its needs better than someone who has spent many years > >> developing one of the major PostgreSQL replication solutions. > > > > Well the flip side of that is that we want an interface that's useful > > for more than just one replication system. This is something basic > > enough that I think it will be useful for more than just replication > > if we design it generally enough. It should be useful for > > backup/restore processes and monitoring as well as various forms of > > replication including master-slave trigger based systems but also > > including PITR-based replication, log-parsing systems, multi-master > > trigger based systems, 2PC-based systems, etc. > > > Making it general enough to serve multiple needs is good, but we've > got to make sure that the extra complexity is buying us something. > Jan seems pretty confident that this could be used by Londiste also, > though it would be nice to have some confirmation from the Londiste > developer(s) on that. I think it may also have applications for > distributed transactions and multi-master replication, but I am not > too sure it helps much for PITR-based replication or log-parsing > systems. We want to design something that is good, but trying to > solve too many problems may end up solving none of them well. The potential for single shared queue implementation, with the additional potential for merging async replication implementations sounds attractive. (Merging ~ having single one that satisfies broad range of needs.) Unless the functionality accepted into core will be limited to replication only and/or performs worse than current snapshot-based grouping. Then it is uninteresting, of course. Jan's proposal of storing small struct into segmented files sounds like it could work. Can't say anything more because I can't imagine it as well as Jan. Would need to play with working implementation to say more... -- marko
Excerpts from Marko Kreen's message of jue jun 10 18:10:50 -0400 2010: > Jan's proposal of storing small struct into segmented files > sounds like it could work. Can't say anything more because > I can't imagine it as well as Jan. Would need to play with > working implementation to say more... We already have such a thing -- see pg_multixact -- Álvaro Herrera <alvherre@commandprompt.com> The PostgreSQL Company - Command Prompt, Inc. PostgreSQL Replication, Consulting, Custom Development, 24x7 support
On Sun, 2010-05-23 at 16:21 -0400, Jan Wieck wrote: > In some systems (data warehousing, replication), the order of commits is > important, since that is the order in which changes have become visible. > This information could theoretically be extracted from the WAL, but > scanning the entire WAL just to extract this tidbit of information would > be excruciatingly painful. This idea had support from at least 6 hackers. I'm happy to add my own. Can I suggest it is added as a hook, rather than argue about the details too much? The main use case is in combination with external systems, so that way we can maintain the relevant code with the system that cares about it. > CommitTransaction() inside of xact.c will call a function, that inserts > a new record into this array. The operation will for most of the time be > nothing than taking a spinlock and adding the record to shared memory. > All the data for the record is readily available, does not require > further locking and can be collected locally before taking the spinlock. > The begin_timestamp is the transactions idea of CURRENT_TIMESTAMP, the > commit_timestamp is what CommitTransaction() just decided to write into > the WAL commit record and the total_rowcount is the sum of inserted, > updated and deleted heap tuples during the transaction, which should be > easily available from the statistics collector, unless row stats are > disabled, in which case the datum would be zero. Does this need to be called while in a critical section? Or can we wait until after the actual marking of the commit before calling this? > Checkpoint handling will call a function to flush the shared buffers. > Together with this, the information from WAL records will be sufficient > to recover this data (except for row counts) during crash recovery. So it would need to work identically in recovery also? These two values are not currently stored in the commit WAL record:

    timestamptz xci_begin_timestamp
    int64       xci_total_rowcount

Both of those seem optional, so I don't really want them added to WAL. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Training and Services