Thread: Proposal: Snapshot cloning
Granted this one has a few open ends so far and I'd like to receive some constructive input on how to actually implement it.

The idea is to clone an existing serializable transaction's snapshot visibility information from one backend to another. The semantics would be like this:

backend1: start transaction;
backend1: set transaction isolation level serializable;
backend1: select pg_backend_pid();
backend1: select publish_snapshot();     -- will block

backend2: start transaction;
backend2: set transaction isolation level serializable;
backend2: select clone_snapshot(<pid>);  -- will unblock backend1

backend1: select publish_snapshot();

backend3: start transaction;
backend3: set transaction isolation level serializable;
backend3: select clone_snapshot(<pid>);

...

This will allow a number of separate backends to assume the same MVCC visibility, so that they can query independently but the overall result will be according to one consistent snapshot of the database.

What I am trying to accomplish with this is to widen a bottleneck many current Slony users are facing. The initial copy of a database is currently limited to one single reader copying a snapshot of the data provider. With the above functionality, several tables could be copied in parallel by different client threads, feeding separate backends on the receiving side at the same time. The feature could also be used by a parallel version of pg_dump as well as by data mining tools.

The cloning process needs to make sure that the clone_snapshot() call is made by the same DB user in the same database as the corresponding publish_snapshot() call. Since publish_snapshot() only publishes information that the transaction gained legally and that is visible in the PGPROC shared memory (xmin and xmax being the crucial parts here), there is no risk of creating a snapshot for which data might have been removed by vacuum already.

What I am not sure about yet is what IPC method would best suit the transfer of the arbitrarily sized xip vector. Ideas?

Jan
--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck@Yahoo.com #
On Thursday, 2007-01-25 at 22:19, Jan Wieck wrote:
> Granted this one has a few open ends so far and I'd like to receive some constructive input on how to actually implement it.
>
> The idea is to clone an existing serializable transaction's snapshot visibility information from one backend to another. The semantics would be like this:
>
> backend1: start transaction;
> backend1: set transaction isolation level serializable;
> backend1: select pg_backend_pid();
> backend1: select publish_snapshot();     -- will block
>
> backend2: start transaction;
> backend2: set transaction isolation level serializable;
> backend2: select clone_snapshot(<pid>);  -- will unblock backend1
>
> backend1: select publish_snapshot();
>
> backend3: start transaction;
> backend3: set transaction isolation level serializable;
> backend3: select clone_snapshot(<pid>);
>
> ...
>
> This will allow a number of separate backends to assume the same MVCC visibility, so that they can query independently but the overall result will be according to one consistent snapshot of the database.

I see uses for this in implementing query parallelism in user-level code, like querying two child tables in two separate processes.

> What I am trying to accomplish with this is to widen a bottleneck many current Slony users are facing. The initial copy of a database is currently limited to one single reader copying a snapshot of the data provider. With the above functionality, several tables could be copied in parallel by different client threads, feeding separate backends on the receiving side at the same time.

I'm afraid that for most configurations this would make the copy slower, as there will be more random disk i/o. Maybe it would be better to fix Slony so that it allows initial copies in different parallel transactions, or just do the initial copy in several sets and merge the sets later.

> The feature could also be used by a parallel version of pg_dump as well as by data mining tools.
>
> The cloning process needs to make sure that the clone_snapshot() call is made by the same DB user in the same database as the corresponding publish_snapshot() call.

Why? The snapshot is universal and the same for the whole db instance, so why limit it to the same user/database?

> Since publish_snapshot() only publishes information that the transaction gained legally and that is visible in the PGPROC shared memory (xmin and xmax being the crucial parts here), there is no risk of creating a snapshot for which data might have been removed by vacuum already.
>
> What I am not sure about yet is what IPC method would best suit the transfer of the arbitrarily sized xip vector. Ideas?
>
> Jan

--
----------------
Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia

Skype me: callto:hkrosing
Get Skype for free: http://www.skype.com
On Thu, 2007-01-25 at 22:19 -0500, Jan Wieck wrote:
> The idea is to clone an existing serializable transaction's snapshot visibility information from one backend to another. The semantics would be like this:
>
> backend1: start transaction;
> backend1: set transaction isolation level serializable;
> backend1: select pg_backend_pid();
> backend1: select publish_snapshot();     -- will block

Great idea. It can also be used by pg_dump to publish its snapshot so that we can make VACUUM continue to process effectively while pg_dump is running.

Two questions:

- why does it have to block? I don't see any reason - the first process can begin doing useful work. The second process might fail or itself be blocked by something.

- why just serializable snapshots?

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com
"Jan Wieck" <JanWieck@Yahoo.com> writes: > backend1: select publish_snapshot(); -- will block > > backend2: start transaction; > backend2: set transaction isolation level serializable; > backend2: select clone_snapshot(<pid>); -- will unblock backend1 It seems simpler to have a current_snapshot() function that returns an bytea or a new snapshot data type which set_current_snapshot(bytea) took to change your snapshot. Then you could use tables or out-of-band communication to pass around your snapshots however you please. set_current_snapshot() would have to sanity check that the xmin of the new snapshot isn't older than the current globaloldestxmin. That could be handy for debugging purposes too. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com
On 1/26/2007 8:06 AM, Gregory Stark wrote:
> "Jan Wieck" <JanWieck@Yahoo.com> writes:
>
>> backend1: select publish_snapshot();     -- will block
>>
>> backend2: start transaction;
>> backend2: set transaction isolation level serializable;
>> backend2: select clone_snapshot(<pid>);  -- will unblock backend1
>
> It seems simpler to have a current_snapshot() function that returns a bytea, or a new snapshot data type, which set_current_snapshot(bytea) would take to change your snapshot. Then you could use tables or out-of-band communication to pass around your snapshots however you please.
>
> set_current_snapshot() would have to sanity check that the xmin of the new snapshot isn't older than the current globaloldestxmin.

That would solve the backend to backend IPC problem nicely.

Jan
--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck@Yahoo.com #
On Friday, 2007-01-26 at 12:25, Simon Riggs wrote:
> On Thu, 2007-01-25 at 22:19 -0500, Jan Wieck wrote:
>
>> The idea is to clone an existing serializable transaction's snapshot visibility information from one backend to another. The semantics would be like this:
>>
>> backend1: start transaction;
>> backend1: set transaction isolation level serializable;
>> backend1: select pg_backend_pid();
>> backend1: select publish_snapshot();     -- will block
>
> Great idea. It can also be used by pg_dump to publish its snapshot so that we can make VACUUM continue to process effectively while pg_dump is running.

Do you mean that vacuum would clean up tuples still visible to pg_dump?

> Two questions:
> - why does it have to block? I don't see any reason - the first process can begin doing useful work. The second process might fail or itself be blocked by something.

As I see it, it has to block so that its transaction would not end, so that the system knows that it can't yet remove tuples in that snapshot.

And it should block until all its consumers have ended their use of the published snapshot.

> - why just serializable snapshots?

There's probably no point acquiring it in a read-committed transaction when the next command will revert to its own snapshot anyway.

--
----------------
Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia

Skype me: callto:hkrosing
Get Skype for free: http://www.skype.com
On Fri, 2007-01-26 at 16:09 +0200, Hannu Krosing wrote:
> On Friday, 2007-01-26 at 12:25, Simon Riggs wrote:
>
>> Two questions:
>> - why does it have to block? I don't see any reason - the first process can begin doing useful work. The second process might fail or itself be blocked by something.
>
> As I see it, it has to block so that its transaction would not end, so that the system knows that it can't yet remove tuples in that snapshot.
>
> And it should block until all its consumers have ended their use of the published snapshot.

Agreed that the Snapshot must be visible to all, but that's no reason why the original call has to block, just that we must do something to prevent the Snapshot from disappearing from view.

>> - why just serializable snapshots?
>
> There's probably no point acquiring it in a read-committed transaction when the next command will revert to its own snapshot anyway.

But the stated use case was to share snapshots, which seems valid whatever the type of Snapshot. One of the stated cases was parallel query...

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com
On Fri, 2007-01-26 at 16:09 +0200, Hannu Krosing wrote:
> On Friday, 2007-01-26 at 12:25, Simon Riggs wrote:
>
>> Great idea. It can also be used by pg_dump to publish its snapshot so that we can make VACUUM continue to process effectively while pg_dump is running.
>
> Do you mean that vacuum would clean up tuples still visible to pg_dump?

No, that would break MVCC. But we may have done lots of updates/deletes that are *not* visible to any Snapshot, yet are not yet removable because they are higher than OldestXmin, and we don't know that because previously the Snapshot details were not available. ISTM that this proposal is a way of making the Snapshot limits publicly available so that they can be used by VACUUM. Sure it isn't every backend, but the details may be useful. So this is an additional benefit to this proposal.

(There's a hole in the above idea, so don't jump on my back to explain it - I see it and am trying to work out a way around it...)

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com
stark@enterprisedb.com (Gregory Stark) writes:
> "Jan Wieck" <JanWieck@Yahoo.com> writes:
>
>> backend1: select publish_snapshot();     -- will block
>>
>> backend2: start transaction;
>> backend2: set transaction isolation level serializable;
>> backend2: select clone_snapshot(<pid>);  -- will unblock backend1
>
> It seems simpler to have a current_snapshot() function that returns a bytea, or a new snapshot data type, which set_current_snapshot(bytea) would take to change your snapshot. Then you could use tables or out-of-band communication to pass around your snapshots however you please.
>
> set_current_snapshot() would have to sanity check that the xmin of the new snapshot isn't older than the current globaloldestxmin.
>
> That could be handy for debugging purposes too.

Here's a wild thought...

Would there be any sense in setting up the ability to declare a transaction's visibility parameters expressly?

Consider that the Slony-I sl_event table records ev_minxid, ev_maxxid, and ev_xip. Grabbing a sample from an instance:

  ev_minxid  | ev_maxxid  | ev_xip
  1377591608 | 1377591612 | '1377591608','1377591610'

Would it be plausible to, in effect, assert these things? To say:

  start transaction;
  set transaction isolation level serializable;
  select set_transaction_visibility(1377591608, 1377591612, [1377591608, 1377591610]);

And thus assert the visibility that was recorded at that point in time?

I may very well have the parameters characterized in a wrong way; please assume an appropriate way instead as needed :-).

This would permit, if I am seeing this right, a way that you could, in effect, get a form of "time travel" via this, where you'd be able to arbitrarily point at different forms of data visibility. The wild part being that you could assert data visibility declarations that a normal connection couldn't naturally obtain...

--
let name="cbbrowne" and tld="linuxdatabases.info" in name ^ "@" ^ tld;;
http://linuxfinances.info/info/multiplexor.html
Sturgeon's Law: 90% of *EVERYTHING* is crud.
"Simon Riggs" <simon@2ndquadrant.com> writes: > No, that would break MVCC. But we may have done lots of updates/deletes > that are *not* visible to any Snapshot, yet are not yet removable > because they are higher than OldestXmin but we don't know that because > previously the Snapshot details were not available. ISTM that this > proposal is a way of making the Snapshot limits publicly available so > that they can be used by VACUUM. Certainly not, unless you intend that *every* snapshot *must* be published, which is an overhead up with which we will not put. One pretty serious problem with the proposal as written is the part about the sender blocking until the receiver takes the snap; that means it's not really a "publish" in the sense that you can make it available without worrying about exactly how many readers there might or might not be. That alone is sufficient to kill any thought of VACUUM making use of the info. I'd feel happier with an implementation more like prepared transactions: you stuff the information into shared memory and it sits there, readable by anyone, until such time as you take it down again. Like prepared xacts, GlobalXmin calculations would need to include these snapshots (and hence they'd limit vacuums). A shared-memory area would have to be fixed size, but perhaps backing files, like those used by prepared xacts, could handle the overflow for very large xip lists. Presumably crash safety is not an issue so this wouldn't require any complicated mechanism. regards, tom lane
Hannu Krosing <hannu@skype.net> writes:
> On Thursday, 2007-01-25 at 22:19, Jan Wieck wrote:
>> The cloning process needs to make sure that the clone_snapshot() call is made by the same DB user in the same database as the corresponding publish_snapshot() call.
>
> Why? The snapshot is universal and the same for the whole db instance, so why limit it to the same user/database?

Yeah. Use-case: pg_dumpall could guarantee that it produces consistent snapshots across multiple databases. (Not sure I actually want that, but it's at least arguably useful to someone.)

I think you would want to mark a snapshot with an owner, but that would be for the purpose of restricting who could take it down, not who could copy it.

regards, tom lane
Jan Wieck <JanWieck@Yahoo.com> writes:
> On 1/26/2007 8:06 AM, Gregory Stark wrote:
>> It seems simpler to have a current_snapshot() function that returns a bytea, or a new snapshot data type, which set_current_snapshot(bytea) would take to change your snapshot. Then you could use tables or out-of-band communication to pass around your snapshots however you please.
>>
>> set_current_snapshot() would have to sanity check that the xmin of the new snapshot isn't older than the current globaloldestxmin.
>
> That would solve the backend to backend IPC problem nicely.

But it fails on the count of making sure that globaloldestxmin doesn't advance past the snap you want to use. And exactly how will you pass a snap through a table? It won't become visible until you commit ... whereupon your own xmin isn't blocking the advance of globaloldestxmin.

regards, tom lane
On Fri, 2007-01-26 at 11:36 -0500, Tom Lane wrote:
> "Simon Riggs" <simon@2ndquadrant.com> writes:
>> No, that would break MVCC. But we may have done lots of updates/deletes that are *not* visible to any Snapshot, yet are not yet removable because they are higher than OldestXmin, and we don't know that because previously the Snapshot details were not available. ISTM that this proposal is a way of making the Snapshot limits publicly available so that they can be used by VACUUM.
>
> Certainly not, unless you intend that *every* snapshot *must* be published, which is an overhead up with which we will not put.

Agreed, but that's the general case problem.

What I was hoping was that this would provide a mechanism for long running transactions (LRTs) to publish their min/max Xids. Then if all backends publish the minimum Xid of any Snapshot they have generated in the proc array, we'd be able to decide if there are any large holes in the global set of Snapshots. As a general case that's hard to evaluate, but in the common case of a lone LRT and all the rest short duration transactions, you can end up with a gap of 250,000+ transactions opening up between the two. It would be fairly easy to have VACUUM check for large "visibility gaps" between groups of transactions and then use that to improve its effectiveness in the presence of LRTs.

Theoretically we have to keep the chain of intermediate updates around so it can be traversed by the old transaction, but in practical terms traversing a long chain of updates isn't sensible. Serializable LRTs will never traverse the chain anyway (that's a serializability error), but there are some special cases to consider, hence my mentioning an unresolved problem previously.

We'd need to be much more careful about the way Snapshots are managed, so we can be certain that we take them all into account.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com
"Tom Lane" <tgl@sss.pgh.pa.us> writes: >>> set_current_snapshot() would have to sanity check that the xmin of the new >>> snapshot isn't older than the current globaloldestxmin. > >> That would solve the backend to backend IPC problem nicely. > > But it fails on the count of making sure that globaloldestxmin doesn't > advance past the snap you want to use. And exactly how will you pass > a snap through a table? It won't become visible until you commit ... > whereupon your own xmin isn't blocking the advance of globaloldestxmin. Hm, good point. You could always do it in a separate connection, but that starts to get annoying. I was more envisioning passing it around out-of-band though, something like: $db->execute(SET TRANSACTION ISOLATION LEVEL SERIALIZABLE); $snap = $db->execute(select current_snapshot()); for each db { if (fork()) $slave[i] = $db->connect(); $slave[i]->execute(select set_snapshot($snap)); $slave[i]->execute(copy table[i] to file[i]); } I'm also wondering about something like: $db->execute(SET TRANSACTION ISOLATION LEVEL SERIALIZABLE); $snap = $db->execute(select current_snapshot()); if (fork()) $slave = $db->connect(); $slave->execute(select set_snapshot($snap); $slave->execute(copy tab fromhugefile); signal parent } else { while(no signal yet) { $rows_loaded_so_far = $db->execute(select count(*)from tab); display_progress($rows_loaded_so_far); sleep(60); } } Sorry for the vaguely perlish pseudocode but it's the clearest way I can think to write it. I don't think it would make much sense to try to do anything like this in plpgsql; I think you really do want to be doing it in a language outside the database where it's easier to open multiple connections and handle IPC. I realize the second idea might take more hackery than just setting the snapshot... In particular as written above it wouldn't work because the slave would be writing with a new xid that isn't actually in the snapshot. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com
On 1/26/2007 12:22 PM, Simon Riggs wrote:
> On Fri, 2007-01-26 at 11:36 -0500, Tom Lane wrote:
>> "Simon Riggs" <simon@2ndquadrant.com> writes:
>>> No, that would break MVCC. But we may have done lots of updates/deletes that are *not* visible to any Snapshot, yet are not yet removable because they are higher than OldestXmin, and we don't know that because previously the Snapshot details were not available. ISTM that this proposal is a way of making the Snapshot limits publicly available so that they can be used by VACUUM.
>>
>> Certainly not, unless you intend that *every* snapshot *must* be published, which is an overhead up with which we will not put.
>
> Agreed, but that's the general case problem.
>
> What I was hoping was that this would provide a mechanism for long running transactions (LRTs) to publish their min/max Xids. Then if all backends publish the minimum Xid of any Snapshot they have generated in the proc array, we'd be able to decide if there are any large holes in the global set of Snapshots. As a general case that's hard to evaluate, but in the common case of a lone LRT and all the rest short duration transactions, you can end up with a gap of 250,000+ transactions opening up between the two. It would be fairly easy to have VACUUM check for large "visibility gaps" between groups of transactions and then use that to improve its effectiveness in the presence of LRTs.

There is a flaw in that theory. If you have a single LRT, then each subsequent transaction's xmin will be exactly that one, no?

Jan

> Theoretically we have to keep the chain of intermediate updates around so it can be traversed by the old transaction, but in practical terms traversing a long chain of updates isn't sensible. Serializable LRTs will never traverse the chain anyway (that's a serializability error), but there are some special cases to consider, hence my mentioning an unresolved problem previously.
>
> We'd need to be much more careful about the way Snapshots are managed, so we can be certain that we take them all into account.

--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck@Yahoo.com #
On 1/26/2007 11:58 AM, Tom Lane wrote:
> Jan Wieck <JanWieck@Yahoo.com> writes:
>> On 1/26/2007 8:06 AM, Gregory Stark wrote:
>>> It seems simpler to have a current_snapshot() function that returns a bytea, or a new snapshot data type, which set_current_snapshot(bytea) would take to change your snapshot. Then you could use tables or out-of-band communication to pass around your snapshots however you please.
>>>
>>> set_current_snapshot() would have to sanity check that the xmin of the new snapshot isn't older than the current globaloldestxmin.
>
>> That would solve the backend to backend IPC problem nicely.
>
> But it fails on the count of making sure that globaloldestxmin doesn't advance past the snap you want to use. And exactly how will you pass a snap through a table? It won't become visible until you commit ... whereupon your own xmin isn't blocking the advance of globaloldestxmin.

The client receives the snapshot information as a result of the function call to current_snapshot(). The call to set_current_snapshot(snap) errors out if snap's xmin is older than globaloldestxmin. It is the client app that has to make sure that the transaction that created snap is still in progress. I didn't say anything about passing it through a table.

Take a modified pg_dump as an example. It could write multiple files: a pre-load sql with the first part of the schema, then a post-load sql with the finalization of same (creating indexes, adding constraints). It then builds a list of all relations to COPY and starts n threads, each writing a different file. Each thread connects to the DB and adjusts its snapshot to the one of the main transaction (which is still open). Then each thread grabs the next table to dump from the list and writes the COPY data to its output file. The threads exit when the list of tables is empty. The main thread waits until the last thread has joined and commits the main transaction.

Wouldn't be too hard to write a script that restores such a split dump in parallel as well.

Jan
--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck@Yahoo.com #
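[Editor's sketch] As an illustration of the parallel-dump flow described above, here is a rough multi-session transcript in the same notation as the earlier backend1/backend2 example. It relies on the hypothetical current_snapshot()/set_current_snapshot() functions proposed up-thread, and <snap> stands for the value the dumping client carries between its own connections:

main:    start transaction;
main:    set transaction isolation level serializable;
main:    select current_snapshot();            -- client keeps <snap> and leaves this transaction open
worker1: start transaction;
worker1: set transaction isolation level serializable;
worker1: select set_current_snapshot(<snap>);
worker1: copy table_a to stdout;               -- written to its own output file by the client thread
worker2: start transaction;
worker2: set transaction isolation level serializable;
worker2: select set_current_snapshot(<snap>);
worker2: copy table_b to stdout;               -- written to its own output file by the client thread
main:    commit;                               -- only after the last worker thread has joined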
On Fri, 2007-01-26 at 12:43 -0500, Jan Wieck wrote:
> There is a flaw in that theory. If you have a single LRT, then each subsequent transaction's xmin will be exactly that one, no?

You got me. My description was too loose, but you also got the rough picture. We'll save the detail for another day, but we all know it's a bridge we will have to cross one day, soon. I wasn't meaning to raise this specific discussion now, just to say that publishing snapshots for known LRTs is one way by which we can solve the LRT/VACUUMing issue.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com
"Simon Riggs" <simon@2ndquadrant.com> writes: > You got me. My description was too loose, but you also got the rough > picture. We'll save the detail for another day, but we all know its a > bridge we will have to cross one day, soon. I wasn't meaning to raise > this specific discussion now, just to say that publishing snapshots for > known LRTs is one way by which we can solve the LRT/VACUUMing issue. I don't actually see that it buys you a darn thing ... you still won't be able to delete dead updated tuples because of the possibility of the LRT deciding to chase ctid chains up from the tuples it can see. You also seem to be assuming that a transaction can have only one snapshot, which is not something we can enforce in enough cases to make it a very useful restriction. regards, tom lane
On Jan 26, 2007, at 4:48 PM, Tom Lane wrote:
> "Simon Riggs" <simon@2ndquadrant.com> writes:
>> You got me. My description was too loose, but you also got the rough picture. We'll save the detail for another day, but we all know it's a bridge we will have to cross one day, soon. I wasn't meaning to raise this specific discussion now, just to say that publishing snapshots for known LRTs is one way by which we can solve the LRT/VACUUMing issue.
>
> I don't actually see that it buys you a darn thing ... you still won't be able to delete dead updated tuples because of the possibility of the LRT deciding to chase ctid chains up from the tuples it can see. You also seem to be assuming that a transaction can have only one snapshot, which is not something we can enforce in enough cases to make it a very useful restriction.

Well, Simon was talking about a serialized LRT, which ISTM shouldn't be hunting down ctid chains past the point it serialized at.

Even if that's not the case, there is also the possibility of an LRT publishing information about what tables it will hit. Any tables not being touched by an LRT could be vacuumed past the global minxid. It would be up to the user to do that in many cases, but that's likely to be well worth it if you have LRTs that are only hitting a few tables yet you have other tables that really, really need to stay vacuumed. Believe me, that is a very common use case in the real world (think queue table, or web session table).

--
Jim Nasby                      jim@nasby.net
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)
Jim Nasby <decibel@decibel.org> writes:
> On Jan 26, 2007, at 4:48 PM, Tom Lane wrote:
>> I don't actually see that it buys you a darn thing ... you still won't be able to delete dead updated tuples because of the possibility of the LRT deciding to chase ctid chains up from the tuples it can see.
>
> Well, Simon was talking about a serialized LRT, which ISTM shouldn't be hunting down ctid chains past the point it serialized at.

How do you figure that? If the LRT wants to update a tuple, it's got to chase the ctid chain to see whether the head update committed or not. It's not an error for a serializable transaction to update a tuple that was tentatively updated by a transaction that rolled back.

> Even if that's not the case, there is also the possibility of an LRT publishing information about what tables it will hit.

I think we already bought 99% of the possible win there by fixing vacuum. Most ordinary transactions aren't going to be able to predict which other tables the user might try to touch.

regards, tom lane
On Jan 29, 2007, at 11:28 PM, Tom Lane wrote:
> Jim Nasby <decibel@decibel.org> writes:
>> On Jan 26, 2007, at 4:48 PM, Tom Lane wrote:
>>> I don't actually see that it buys you a darn thing ... you still won't be able to delete dead updated tuples because of the possibility of the LRT deciding to chase ctid chains up from the tuples it can see.
>>
>> Well, Simon was talking about a serialized LRT, which ISTM shouldn't be hunting down ctid chains past the point it serialized at.
>
> How do you figure that? If the LRT wants to update a tuple, it's got to chase the ctid chain to see whether the head update committed or not. It's not an error for a serializable transaction to update a tuple that was tentatively updated by a transaction that rolled back.

Nuts. :(

>> Even if that's not the case, there is also the possibility of an LRT publishing information about what tables it will hit.
>
> I think we already bought 99% of the possible win there by fixing vacuum. Most ordinary transactions aren't going to be able to predict which other tables the user might try to touch.

Presumably a single-statement transaction could do that in most (if not all) cases. But even if we didn't support automatically detecting what tables a transaction was hitting, we could allow the user to specify it and then bomb out if the transaction tried to hit anything that wasn't in that list. That would allow users who are creating LRTs to limit their impact on vacuum.

The safe way to perform that check would be to check each buffer before accessing it, but I'm unsure how large a performance impact that would entail; I don't know how much code we run through to pull a tuple out of a page and do something with it, compared to the cost of checking whether that buffer belongs to a relation/file that's in the "approved list".

Perhaps a better way would be to allow users to mark vacuum-critical tables for "restricted" access. To access a restricted table the user would need to provide a list of restricted tables that a transaction is going to hit (or maybe just lump all restricted tables into one group), and that transaction would log its XID somewhere that vacuum can look at. If a transaction that hasn't specified it will touch the restricted tables tries to do so, it errors out.

We might want some way to flag buffers as belonging to a restricted table (or one of its indexes) so that transactions that aren't hitting restricted tables wouldn't have to pay a large performance penalty to figure that out. But you'd only have to mark those buffers when they're read in from the OS, and presumably a restricted table will be small enough that its buffers should stay put.

Logging the XID could prove to be a serialization point, but we could possibly avoid that by using per-relation locks.

--
Jim Nasby                      jim@nasby.net
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)
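[Editor's sketch] To make the restricted-table idea concrete, a possible surface syntax might look like the sketch below. None of this exists; every keyword, function, table, and column name here is made up purely to illustrate the declare-then-enforce flow described above:

  alter table web_session set restricted;   -- hypothetical: mark a vacuum-critical table
  alter table job_queue set restricted;

  begin;
  select declare_restricted_tables('web_session', 'job_queue');
      -- hypothetical: logs this transaction's xid where vacuum can see it
  update web_session set last_seen = now() where session_id = 42;
      -- touching a restricted table that was not declared would raise an error here
  commit;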