Thread: Lazy Snapshots
One of the problems with Hot Standby is that a long-running query on the standby can conflict with rows removed by VACUUM on the primary, causing queries to be cancelled.
I've been looking at this problem for about a year now from various angles. Jeff Janes' recent thoughts on procarray scalability have led me down an interesting path, described here. Taken together, this has led me to completely rethink the strategy used for avoiding conflicts in the Hot Standby patch.
Currently, we take eager snapshots, meaning we take a snapshot at the start of each statement whether or not it is necessary. Snapshots exist to disambiguate the running state of recent transactions, so if a statement never sees data written by recent transactions then we will never actually use the snapshot.
Another way of doing this is to utilize lazy snapshots: do not take a snapshot when a statement starts and only take one at the point that we need one. No other changes to the MVCC mechanisms are proposed.
Is that possible?
The time the snapshot is taken is the time of the consistent viewpoint from which all data access during a statement is judged. Taking the snapshot later, at an undefined point in the future, means that the consistent viewpoint is actually floating. When we execute the statement we won't actually know which viewpoint will be used to derive the answers to a query.
A floating, yet consistent viewpoint is in my opinion a good thing, since it includes a more recent database state in the answer to a query than we would otherwise have used. Consider the case where a very large table has a "select count(*)" executed on it. The scan begins at block 0 and continues through the table until the end, which for purposes of an example we will say takes 1 hour. Rows are added to the table at a constant rate of 100/sec and immediately committed. So by the time the scan has finished it will deliver an answer that is wrong by 360000. Using a lazy snapshot would give us an answer almost exactly correct, though of course Heisenbuggers may dispute the existence of a "correct" answer in this case.
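As a back-of-envelope check of that figure (this is purely illustrative arithmetic, nothing from the server):

```c
/* Undercount of "select count(*)" under an eager snapshot: every row
 * committed after the statement's snapshot is invisible to the scan. */
double eager_undercount(double rows_per_sec, double scan_seconds)
{
    return rows_per_sec * scan_seconds;
}
```

With the numbers in the example, `eager_undercount(100.0, 3600.0)` gives the 360000-row error quoted above.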
So let's look at the theory in some detail:
* Scan begins, no snapshot set. A row is inserted and transaction commits. Scan progresses until it sees a recent row. Scan takes snapshot; the row is now visible to it and progresses. Another row is inserted and transaction commits. When we later come to second new row, we already have a snapshot, so that row is invisible to us. Results of query are consistent to the point we took the snapshot, which happened when we saw the first row. Are the results consistent only to end of transaction that created that row? No, other transactions can also have committed after it and yet before we take snapshot. The recent transaction is the catalyst for us to take a snapshot, though the snapshot is not dependent upon the xid of the new row we have seen.
* Scan begins, no snapshot set. Ahead of scan a row that would have been visible at start of scan is deleted, commits and removed by VACUUM/HOT. The scan has no evidence that a removal has taken place, never sees contention and thus never takes a snapshot. This isn't a problem; the row removal created an implicit xmin for our scan. If we later took a snapshot the xmin of the snapshot would be equal or later than our previous implicit xmin and so MVCC would be working. This shows that it is wrong to presume that taking no snapshot at all means that the time consistent point on the scan was at the start of a statement, it may not be.
* We open a cursor, then start issuing updates where current of. Does the cursor need a snapshot? I don't think it does, since we have special visibility rules for rows produced by our own transaction and we do not need a snapshot to disambiguate them. ISTM there may be a corner case where we need cursors to take snapshots, but I haven't seen it yet.
Does that cover all the cases? It covers the main ones, but let's see if other problems emerge. Anyone?
OK, in theory it seems to work, so how will it work in practice and does that cause other difficulties?
* We will hold a new global variable LastGlobalXmin, which is maintained by GetSnapshotData(). We can access it atomically, without locks.
* In XidInMVCCSnapshot() if our snapshot is NULL then we update RecentXmin from LastGlobalXmin and test using that because we don't have a snapshot xmin. If this is sufficient to return false then that's all we do. Otherwise, we now get a full snapshot and then continue as normal. (There may be some API rework to allow this to happen, so I think I papered over a few difficulties here, but in broad terms, this appears to work).
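Roughly, the control flow I have in mind looks like this (a sketch only: the struct, the globals and the helper bodies here are stand-ins for the real PostgreSQL internals, and the full-snapshot path would of course scan the ProcArray under ProcArrayLock rather than copy a single xmin):

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;

/* Toy snapshot: the real SnapshotData carries an xip[] array of
 * in-progress xids; a single xmin is enough to show the control flow. */
typedef struct
{
    bool          taken;
    TransactionId xmin;
} LazySnapshot;

/* Stand-ins for the globals named above.  LastGlobalXmin would be
 * maintained by GetSnapshotData() and read atomically, without locks. */
static TransactionId LastGlobalXmin = 90;
static TransactionId RecentXmin     = 80;

/* Placeholder for the expensive path (ProcArray scan under lock). */
static void TakeFullSnapshot(LazySnapshot *snap)
{
    snap->xmin  = LastGlobalXmin;
    snap->taken = true;
}

static bool XidInFullSnapshot(TransactionId xid, const LazySnapshot *snap)
{
    /* Illustrative only: treat anything at/after xmin as possibly
     * still running when the snapshot was taken. */
    return xid >= snap->xmin;
}

/* Lazy variant: no snapshot exists until the cheap RecentXmin test
 * fails to prove the xid is ancient. */
bool XidInMVCCSnapshotLazy(TransactionId xid, LazySnapshot *snap)
{
    if (!snap->taken)
    {
        if (RecentXmin < LastGlobalXmin)
            RecentXmin = LastGlobalXmin;    /* lock-free refresh */
        if (xid < RecentXmin)
            return false;        /* committed long ago; still no snapshot */
        TakeFullSnapshot(snap);  /* first recent xid forces a real snapshot */
    }
    return XidInFullSnapshot(xid, snap);
}
```

The point of the sketch is the cheap early exit: a statement that only ever touches ancient xids never pays for a snapshot at all.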
Lazy snapshots mean that some things normally updated during snapshot taking will fall behind somewhat. This has a couple of effects that we can mitigate in various ways:
* In TransactionIdIsInProgress() if xid < RecentXmin we update RecentXmin from globalxmin and retry the test.
* We probably need to do something with HOT page cleaning as well, but that is a fairly subtle bit of tuning that I expect to see a range of viewpoints on. Various options exist, from doing nothing through to re-checking xmin prior to each cleaning check, or somewhere in between.
I have no idea whether this idea is patented and I would appreciate some help in researching whether this idea is legally able to be implemented by PGDG, so I can remain untainted.
Benefits
* Scalability: The reduction in ProcArrayLock requests from snapshots will drop away considerably as a result of these changes. (It may prove feasible to provide an option to lightly partition the procarray to increase commit rate, but that would be later)
* Hot Standby: Implementing this will likely significantly reduce the number of queries cancelled during Hot Standby. This will be because many queries will not have snapshots at all and the queries that do will typically have much younger snapshots.
* Accuracy: More accurate answers to long database queries.
I will be removing various parts of code from the Hot Standby patch while this is discussed. I'm not very available at the moment, so my replies are likely to be considerably delayed.
Best Regards, Simon Riggs
simon@2ndquadrant.com wrote:
> * Scan begins, no snapshot set. Ahead of scan a row that would have been visible
> at start of scan is deleted, commits and removed by VACUUM/HOT. The scan has no
> evidence that a removal has taken place, never sees contention and thus never
> takes a snapshot. This isn't a problem; the row removal created an implicit xmin
> for our scan. If we later took a snapshot the xmin of the snapshot would be
> equal or later than our previous implicit xmin and so MVCC would be working.
> This shows that it is wrong to presume that taking no snapshot at all means that
> the time consistent point on the scan was at the start of a statement, it may
> not be.

I don't understand this part. Imagine this:

Transaction A: start query "SELECT * FROM foo;"
Transaction B: BEGIN; DELETE FROM foo WHERE id = 1 OR id = 100000; COMMIT;
Transaction A: query finishes.

How do you ensure that the query sees either both or neither of the deleted tuples?

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
"simon@2ndquadrant.com" <simon@2ndquadrant.com> writes:
> Currently, we take eager snapshots, meaning we take a snapshot at the start of
> each statement whether or not it is necessary. Snapshots exist to disambiguate
> the running state of recent transactions, so if a statement never sees data
> written by recent transactions then we will never actually use the snapshot.

> Another way of doing this is to utilize lazy snapshots: do not take a snapshot
> when a statement starts and only take one at the point that we need one.

Simon, this concept is completely broken, as far as I can tell. Consider this example:

* You scan some row R1 with an ancient XMIN and no XMAX. You decide it's visible.
* Transaction T2 deletes R1, inserts R2, and commits.
* You scan R2. Since it has a recent XMIN, you now take a snapshot. The snapshot says T2 committed. So you consider R2 is visible.

This is inconsistent. In the worst case R2 might be an update of R1.

Even if the idea gave self-consistent results, I don't agree that it's okay to not know when the statement's snapshot will be taken. Timing of snapshots relative to other actions is critical for many reasons.

			regards, tom lane
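The sequence Tom describes can be played out with a toy model (all of this is illustrative C, not server code; the visibility rule is deliberately simplified to a single snapshot xmin):

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;

/* Toy heap tuple: xmax == 0 means "not deleted". */
typedef struct
{
    TransactionId xmin;
    TransactionId xmax;
} ToyTuple;

/* 0 = no snapshot taken yet. */
static TransactionId snapshot_xmin = 0;

/* Simplified lazy-snapshot visibility: ancient undeleted rows pass
 * without a snapshot; the first recent xmin forces one to be taken. */
static bool lazy_visible(const ToyTuple *t, TransactionId recent_xmin,
                         TransactionId next_xid)
{
    if (snapshot_xmin == 0)
    {
        if (t->xmin < recent_xmin && t->xmax == 0)
            return true;            /* ancient row: no snapshot needed */
        snapshot_xmin = next_xid;   /* take the snapshot *now* */
    }
    return t->xmin < snapshot_xmin &&
           (t->xmax == 0 || t->xmax >= snapshot_xmin);
}
```

Scanning R1 before T2 commits returns it without taking a snapshot; scanning R2 afterwards takes a late snapshot in which T2 already looks committed, so R2 is returned as well, and both versions of the row appear in one result.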
Simon,

> Another way of doing this is to utilize lazy snapshots: do not take a
> snapshot when a statement starts and only take one at the point that we
> need one. No other changes to the MVCC mechanisms are proposed.

I like your general thinking here, even if this specific idea proves unworkable. It's along the lines of "how can we take fewer snapshots?" I'd love to see other ideas in this direction.

Overall, this sounds like an optimization which would be excellent for large database performance but would mess with OLTP transaction semantics.

> A floating, yet consistent viewpoint is in my opinion a good thing,
> since it includes a more recent database state in the answer to a query
> than we would otherwise have used. Consider the case where a very large
> table has a "select count(*)" executed on it. The scan begins at block 0
> and continues through the table until the end, which for purposes of an
> example we will say takes 1 hour. Rows are added to the table at a
> constant rate of 100/sec and immediately committed. So by the time the
> scan has finished it will deliver an answer that is wrong by 360000.
> Using a lazy snapshot would give us an answer almost exactly correct,
> though of course Heisenbuggers may dispute the existence of a "correct"
> answer in this case.

You're assuming that all of the new rows are at the end of the table, which makes sense for inserts. But what about updates? You're also assuming we begin the scan at the beginning, which for Synchronized Scans isn't necessarily true.

> * Scan begins, no snapshot set. A row is inserted and transaction
> commits. Scan progresses until it sees a recent row.

How do we define "recent"?

The bigger issue with this idea, I think, is that it only *delays* taking snapshots, it doesn't *prevent* taking them. That is, you still have to take the snapshot eventually, you just do it later. So it helps the hot standby case, possibly, but I don't see the other benefits materializing.
And we'd have to deal with a lot of complications which this change would cause.

When we discussed Hot Standby at pgCon 2008, we discussed the possibility of stopping the replication stream whenever a VACUUM came through until any queries currently running on the slave completed. Did that approach turn out to be completely unworkable? Why?

--
Josh Berkus
PostgreSQL Experts Inc.
www.pgexperts.com
On 19 August 2009 at 00:09 Josh Berkus <josh@agliodbs.com> wrote:
> When we discussed Hot Standby at pgCon 2008, we discussed the
> possibility of stopping the replication stream whenever a VACUUM came
> through until any queries currently running on the slave completed. Did
> that approach turn out to be completely unworkable? Why?
It will be an option, in two forms, and is not the part I was removing. Neither of those does all the things I want to be able to do, so I had been searching for another way to do replay and uninterrupted queries.
Best Regards, Simon Riggs
On 18 August 2009 at 16:18 Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Simon, this concept is completely broken, as far as I can tell.
Thanks for the precise example.
Yes, I agree it is not going to work when there could be more than one row version in the relation. Which doesn't leave useful wriggle room, so not pursuing this further.
What a shame.
As discussed on -perform, it does seem as if we might have retrospectively calculated snapshots if we had the right data structure. But that's a different set of thoughts for another day.
Best Regards, Simon Riggs
Hi,

I have given some thought in this direction. I am just providing my idea.

a) Have a structure similar to the commit log, which should also store the transaction id at which the transaction got committed. Every transaction, after committing, should update the transaction id at which the commit has happened.
b) All the transactions, when they start, should have two figures - xmin and also a transaction id after which nothing has got committed.
c) So whenever it sees records which are not inside the window, it can make the decision by itself:
   i) The ones below xmin and committed are visible.
   ii) The ones after xmax and committed are not visible.
   iii) When it's something in between, then check the time at which the commit happened in the structure and make the decision.

Ideally, it would be better to change the structure of the commit log itself. But maintaining another structure would also help.

Thanks,
Gokul.
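A sketch of what Gokul seems to be describing (hypothetical names and layout; a flat array indexed by xid stands in for the commit-log-like structure, and the commit sequence number stands in for "the transaction id at which the commit happened"):

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;

#define TOY_MAX_XID 1024

/* Commit-log-like structure: commit_seq[xid] is the global commit
 * sequence number at which xid committed; 0 means "not committed".
 * Each committing backend would write only its own slot. */
static uint64_t commit_seq[TOY_MAX_XID];

typedef struct
{
    TransactionId xmin;           /* everything below this is long decided */
    TransactionId xmax;           /* nothing at/after this can be visible  */
    uint64_t      commit_horizon; /* commit counter when we started        */
} SeqSnapshot;

/* Visibility following the three cases in the mail. */
static bool seq_visible(TransactionId xid, const SeqSnapshot *snap)
{
    if (xid < snap->xmin)
        return commit_seq[xid] != 0;               /* case (i)  */
    if (xid >= snap->xmax)
        return false;                              /* case (ii) */
    /* case (iii): committed, and before our horizon? */
    return commit_seq[xid] != 0 &&
           commit_seq[xid] <= snap->commit_horizon;
}
```

The attraction is that the in-between case is decided from the shared structure alone, with no per-statement ProcArray snapshot.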
Gokulakannan Somasundaram <gokul007@gmail.com> writes:
> I have given some thought in this direction. I am just providing my idea.

> a) Have a structure similar to commit log, which should also store,
> transaction id at which the transaction got committed. Every transaction,
> after committing should update the transaction id at which the commit has
> happened

The maintenance costs and update contention for such a data structure would render this completely impractical, even if consulting it were free.

			regards, tom lane
> The maintenance costs and update contention for such a data structure
> would render this completely impractical, even if consulting it were
> free.
Thanks for the reply.
a) Only one transaction would be updating its commit status. It's multiple readers vs. a single writer for the position of a particular transaction (a memory location), so a reader-writer lock would reduce the contention. Moreover, it removes the need for the synchronization that happens with the global pg_procs now. For example, a select which would only query old data will never access this structure.
I am right now not able to think of anything from the maintenance cost perspective.
Gokul.