Thread: Lazy Snapshots

Lazy Snapshots

From
"simon@2ndquadrant.com"
Date:

One of the problems with Hot Standby is that a long running query on the standby can conflict with VACUUMed rows on the primary, causing queries to be cancelled.

I've been looking at this problem for about a year now from various angles. Jeff Janes's recent thoughts on procarray scalability have led me down an interesting path, described here. Taken together, this has led me to completely rethink the strategy used for avoiding conflicts in the Hot Standby patch.

Currently, we take eager snapshots, meaning we take a snapshot at the start of each statement whether or not it is necessary. Snapshots exist to disambiguate the running state of recent transactions, so if a statement never sees data written by recent transactions then we will never actually use the snapshot.

Another way of doing this is to utilize lazy snapshots: do not take a snapshot when a statement starts and only take one at the point that we need one. No other changes to the MVCC mechanisms are proposed.

Is that possible?

The time the snapshot is taken is the time of the consistent viewpoint from which all data access during a statement is judged. Taking the snapshot later, at an undefined point in the future, means that the consistent viewpoint is actually floating: when we execute the statement we won't know in advance which viewpoint will be used to derive the answers to the query.

A floating, yet consistent viewpoint is in my opinion a good thing, since it includes a more recent database state in the answer to a query than we would otherwise have used. Consider the case where a very large table has a "select count(*)" executed on it. The scan begins at block 0 and continues through the table until the end, which for purposes of an example we will say takes 1 hour. Rows are added to the table at a constant rate of 100/sec and immediately committed. So by the time the scan has finished it will deliver an answer that is wrong by 360000. Using a lazy snapshot would give us an answer almost exactly correct, though of course Heisenbuggers may dispute the existence of a "correct" answer in this case.
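As a quick sanity check, the arithmetic in the example above works out like this (figures are the ones given in the scenario):

```python
# Error in "select count(*)" under an eager (start-of-statement) snapshot,
# using the figures from the example above: a 1-hour scan with 100 rows/sec
# inserted and committed behind the snapshot's horizon.
scan_seconds = 60 * 60      # the scan takes 1 hour
insert_rate = 100           # rows committed per second

# An eager snapshot freezes the viewpoint at scan start, so every row
# committed during the scan is invisible to it.
eager_error = scan_seconds * insert_rate
print(eager_error)  # 360000
```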

So let's look at some theory details:

* Scan begins, no snapshot set. A row is inserted and its transaction commits. The scan progresses until it sees the recent row. The scan takes a snapshot; the row is now visible to it, and the scan progresses. Another row is inserted and its transaction commits. When we later come to the second new row, we already have a snapshot, so that row is invisible to us. The results of the query are consistent to the point we took the snapshot, which happened when we saw the first new row. Are the results consistent only to the end of the transaction that created that row? No; other transactions can also have committed after it and yet before we take the snapshot. The recent transaction is the catalyst for us to take a snapshot, though the snapshot is not dependent upon the xid of the new row we have seen.

* Scan begins, no snapshot set. Ahead of the scan, a row that would have been visible at the start of the scan is deleted; the deleting transaction commits and the row is removed by VACUUM/HOT. The scan has no evidence that a removal has taken place, never sees contention and thus never takes a snapshot. This isn't a problem; the row removal created an implicit xmin for our scan. If we later took a snapshot, the xmin of that snapshot would be equal to or later than our previous implicit xmin, and so MVCC would still be working. This shows that it is wrong to presume that taking no snapshot at all means the consistency point of the scan was at the start of the statement; it may not be.

* We open a cursor, then start issuing UPDATE ... WHERE CURRENT OF. Does the cursor need a snapshot? I don't think it does, since we have special visibility rules for rows produced by our own transaction and we do not need a snapshot to disambiguate them. ISTM there may be a corner case where we need cursors to take snapshots, but I haven't seen it yet.
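The first scenario can be modelled very roughly like this (all names, xids and thresholds are invented for illustration; this is not PostgreSQL internals):

```python
# Toy model of the first scenario: a scan that takes its snapshot lazily,
# on first contact with a "recent" row.

RECENT_XMIN = 100   # xids below this are known to be old and committed

class LazyScan:
    def __init__(self):
        self.snapshot = None   # no snapshot taken at statement start

    def row_is_visible(self, row_xmin, committed_xids):
        # Old rows need no snapshot to disambiguate.
        if row_xmin < RECENT_XMIN:
            return True
        # First contact with a recent row: take the snapshot now.
        if self.snapshot is None:
            self.snapshot = set(committed_xids)   # freeze the viewpoint
        # From here on, visibility is judged against the frozen snapshot.
        return row_xmin in self.snapshot

scan = LazyScan()
committed = {101}                       # first new row's xid commits mid-scan
assert scan.row_is_visible(101, committed)        # triggers snapshot; visible
committed.add(102)                      # second new row commits after that
assert not scan.row_is_visible(102, committed)    # snapshot frozen; invisible
```

The point the toy model makes is the one in the text: the first recent row triggers the snapshot, and everything committed after that frozen point is invisible.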

Does that cover all the cases? It covers the main ones, but let's see whether other problems emerge. Anyone?

OK, in theory it seems to work, so how will it work in practice and does that cause other difficulties?

* We will hold a new global variable LastGlobalXmin, which is maintained by GetSnapshotData(). We can access it atomically, without locks.

* In XidInMVCCSnapshot() if our snapshot is NULL then we update RecentXmin from LastGlobalXmin and test using that, because we don't have a snapshot xmin. If this is sufficient to return false then that's all we do. Otherwise, we now get a full snapshot and then continue as normal. (There may be some API rework needed to allow this to happen, so I think I have papered over a few difficulties here, but in broad terms this appears to work.)
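The lazy test just described might be sketched like this (the names mirror the description above — LastGlobalXmin, RecentXmin, XidInMVCCSnapshot — but this is illustrative Python pseudocode, not the real PostgreSQL implementation):

```python
last_global_xmin = 500   # maintained by GetSnapshotData(); read atomically

def take_snapshot():
    # Stand-in for a full GetSnapshotData(); pretend only xid 510 is running.
    return {"running_xids": {510}}

def xid_in_mvcc_snapshot(xid, state):
    """Return True if xid must be treated as still in progress."""
    if state["snapshot"] is None:
        # No snapshot yet: refresh RecentXmin from LastGlobalXmin and
        # test against that instead of a snapshot xmin.
        state["recent_xmin"] = last_global_xmin
        if xid < state["recent_xmin"]:
            return False    # certainly finished; still no snapshot needed
        # Otherwise we finally need a real snapshot; take one and fall
        # through to the normal test.
        state["snapshot"] = take_snapshot()
    return xid in state["snapshot"]["running_xids"]

state = {"snapshot": None, "recent_xmin": 0}
assert xid_in_mvcc_snapshot(400, state) is False  # cheap path, no snapshot
assert state["snapshot"] is None                  # still lazy
assert xid_in_mvcc_snapshot(510, state) is True   # this call forces a snapshot
```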

Lazy snapshots mean that some things normally updated during snapshot taking will fall somewhat behind. This has a couple of effects that we can mitigate in various ways:

* In TransactionIdIsInProgress() if xid < RecentXmin we update RecentXmin from globalxmin and retry the test.

* We probably need to do something with HOT page cleaning as well, but that is a fairly subtle bit of tuning that I expect to see a range of viewpoints on. Various options exist, from do-nothing through to re-checking xmin prior to each cleaning check, or somewhere in between.
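The TransactionIdIsInProgress() change could look roughly like this: when the cached horizon cannot prove an xid is old, refresh it from LastGlobalXmin and retry before falling back to the expensive check. Illustrative pseudocode only, not PostgreSQL source:

```python
last_global_xmin = 700   # kept current by whoever takes snapshots

def transaction_id_is_in_progress(xid, cache):
    if xid < cache["recent_xmin"]:
        return False            # older than the cached horizon: done
    # Lazy snapshots let RecentXmin go stale; refresh it and retry
    # before doing the expensive full check.
    cache["recent_xmin"] = last_global_xmin
    if xid < cache["recent_xmin"]:
        return False
    return xid in cache["running"]  # fall back to the full (expensive) check

cache = {"recent_xmin": 600, "running": {705}}
assert transaction_id_is_in_progress(650, cache) is False  # stale-horizon case
assert transaction_id_is_in_progress(705, cache) is True
```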

I have no idea whether this idea is patented and I would appreciate some help in researching whether this idea is legally able to be implemented by PGDG, so I can remain untainted.

Benefits

* Scalability: The number of ProcArrayLock requests from snapshots will drop considerably as a result of these changes. (It may also prove feasible to provide an option to lightly partition the procarray to increase the commit rate, but that would come later.)

* Hot Standby: Implementing this will likely significantly reduce the number of queries cancelled during Hot Standby. This will be because many queries will not have snapshots at all and the queries that do will typically have much younger snapshots.

* Accuracy: More accurate answers to long database queries.

I will be removing various parts of code from the Hot Standby patch while this is discussed. I'm not very available at the moment, so my replies are likely to be considerably delayed.

Best Regards, Simon Riggs

Re: Lazy Snapshots

From
Heikki Linnakangas
Date:
simon@2ndquadrant.com wrote:
> * Scan begins, no snapshot set. Ahead of scan a row that would have been visible
> at start of scan is deleted, commits and removed by VACUUM/HOT. The scan has no
> evidence that a removal has taken place, never sees contention and thus never
> takes a snapshot. This isn't a problem; the row removal created an implicit xmin
> for our scan. If we later took a snapshot the xmin of the snapshot would be
> equal or later than our previous implicit xmin and so MVCC would be working.
> This shows that it is wrong to presume that taking no snapshot at all means that
> the time consistent point on the scan was at the start of a statement, it may
> not be.

I don't understand this part. Imagine this:

Transaction A: start query "SELECT * FROM foo;"
Transaction B: BEGIN; DELETE FROM foo WHERE id = 1 OR id = 100000; COMMIT
Transaction A: query finishes.

How do you ensure that the query sees either both or neither of the
deleted tuples?

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com


Re: Lazy Snapshots

From
Tom Lane
Date:
"simon@2ndquadrant.com" <simon@2ndquadrant.com> writes:
> Currently, we take eager snapshots, meaning we take a snapshot at the start of
> each statement whether or not it is necessary. Snapshots exist to disambiguate
> the running state of recent transactions, so if a statement never sees data
> written by recent transactions then we will never actually use the snapshot.

> Another way of doing this is to utilize lazy snapshots: do not take a snapshot
> when a statement starts and only take one at the point that we need
> one.

Simon, this concept is completely broken, as far as I can tell.
Consider this example:
* You scan some row R1 with an ancient XMIN and no XMAX.
  You decide it's visible.
* Transaction T2 deletes R1, inserts R2, and commits.
* You scan R2.  Since it has a recent XMIN, you now take
  a snapshot.  The snapshot says T2 committed.  So you
  consider R2 is visible.  This is inconsistent.  In the
  worst case R2 might be an update of R1.
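Tom's counterexample can be replayed as a toy simulation. R2 is T2's replacement of R1, so a consistent view must show exactly one of them; the names, xids and "recent" threshold below are invented for illustration:

```python
RECENT = 100          # xids >= RECENT force a snapshot under the lazy rule
snapshot = None
committed = {5}       # xid 5 is ancient and committed
results = []

def scan_row(name, xmin):
    global snapshot
    if xmin < RECENT:
        results.append(name)          # ancient xmin: visible, no snapshot
        return
    if snapshot is None:
        snapshot = set(committed)     # snapshot taken only now
    if xmin in snapshot:
        results.append(name)

scan_row("R1", xmin=5)     # R1 judged visible with no snapshot at all
committed.add(200)         # T2 (xid 200) deletes R1, inserts R2, commits
scan_row("R2", xmin=200)   # forces the snapshot, which says T2 committed

print(results)  # ['R1', 'R2'] -- both versions returned: inconsistent
```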

Even if the idea gave self-consistent results, I don't agree that it's
okay to not know when the statement's snapshot will be taken.  Timing
of snapshots relative to other actions is critical for many reasons.
        regards, tom lane


Re: Lazy Snapshots

From
Josh Berkus
Date:
Simon,

> Another way of doing this is to utilize lazy snapshots: do not take a
> snapshot when a statement starts and only take one at the point that we
> need one. No other changes to the MVCC mechanisms are proposed.

I like your general thinking here, even if this specific idea proves
unworkable.  It's along the lines of "how can we take fewer snapshots?"
I'd love to see other ideas in this direction.

Overall, this sounds like an optimization which would be excellent for
large database performance but would mess with OLTP transaction semantics.

> A floating, yet consistent viewpoint is in my opinion a good thing,
> since it includes a more recent database state in the answer to a query
> than we would otherwise have used. Consider the case where a very large
> table has a "select count(*)" executed on it. The scan begins at block 0
> and continues through the table until the end, which for purposes of an
> example we will say takes 1 hour. Rows are added to the table at a
> constant rate of 100/sec and immediately committed. So by the time the
> scan has finished it will deliver an answer that is wrong by 360000.
> Using a lazy snapshot would give us an answer almost exactly correct,
> though of course Heisenbuggers may dispute the existence of a "correct"
> answer in this case.

You're assuming that all of the new rows are at the end of the table,
which makes sense for inserts.  But what about updates?  You're also
assuming we begin the scan at the beginning of the table, which for
Synchronized Scans isn't necessarily true.

> * Scan begins, no snapshot set. A row is inserted and transaction
> commits. Scan progresses until it sees a recent row. 

How do we define "recent"?

The bigger issue with this idea, I think, is that it only *delays*
taking snapshots, it doesn't *prevent* taking them.  That is, you still
have to take the snapshot eventually, you just do it later.  So it helps
the hot standby case, possibly, but I don't see the other benefits
materializing.  And we'd have to deal with a lot of complications which
this change would cause.

When we discussed Hot Standby at pgCon 2008, we discussed the
possibility of stopping the replications stream whenever a VACUUM came
through until any queries currently running on the slave completed.  Did
that approach turn out to be completely unworkable?  Why?

-- 
Josh Berkus
PostgreSQL Experts Inc.
www.pgexperts.com


Re: Lazy Snapshots

From
"simon@2ndquadrant.com"
Date:

On 19 August 2009 at 00:09 Josh Berkus <josh@agliodbs.com> wrote:
 
> When we discussed Hot Standby at pgCon 2008, we discussed the
> possibility of stopping the replications stream whenever a VACUUM came
> through until any queries currently running on the slave completed.  Did
> that approach turn out to be completely unworkable?  Why?

It will be an option, in two forms, and it was not the part I was removing. Neither of those does all the things I want to be able to do, so I had been searching for another way to get both replay and uninterrupted queries.

Best Regards, Simon Riggs

Re: Lazy Snapshots

From
"simon@2ndquadrant.com"
Date:

On 18 August 2009 at 16:18 Tom Lane <tgl@sss.pgh.pa.us> wrote:

> Simon, this concept is completely broken, as far as I can tell.

Thanks for the precise example.

Yes, I agree it is not going to work when there can be more than one row version in the relation. That doesn't leave useful wriggle room, so I'm not pursuing this further.

What a shame.

As discussed on -perform, it does seem as if we might be able to retrospectively calculate snapshots if we had the right data structure. But that's a different set of thoughts for another day.

Best Regards, Simon Riggs



Re: Lazy Snapshots

From
Gokulakannan Somasundaram
Date:
Hi,
   I have given some thought in this direction. I am just providing my idea.

a) Have a structure similar to the commit log, which should also store the transaction id at which the transaction got committed. Every transaction, after committing, should update the transaction id at which the commit happened.
b) All transactions, when they start, should have two figures: xmin, and also a transaction id after which nothing has been committed.
c) So whenever it sees records which are not inside the window, it can make the decision by itself.
       i) The ones below xmin and committed are visible.
       ii) The ones after xmax and committed are not visible.
       iii) When it's something in between, then check the time at which the commit happened in the structure and make the decision.

Ideally, it would be better to change the structure of the commit log itself, but maintaining another structure would also help.

Thanks,
Gokul.
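A minimal sketch of the visibility rule proposed above: record, for each committed xid, the value of the xid counter at the moment it committed, and judge visibility from that record. Every number and name here is invented for illustration; this is not a PostgreSQL structure:

```python
commit_point = {10: 12, 15: 21, 18: 25}   # xid -> counter value at its commit

def visible_to(reader_xmin, reader_start_point, row_xid):
    """Visibility of a row created by row_xid, for a reader whose horizon is
    reader_xmin and whose start-of-query counter was reader_start_point."""
    if row_xid < reader_xmin:
        return row_xid in commit_point        # (i) old and committed: visible
    if row_xid > reader_start_point:
        return False                          # (ii) began after us: invisible
    # (iii) in between: visible only if it committed before our query started
    return commit_point.get(row_xid, float("inf")) <= reader_start_point

# Reader started when the counter stood at 20, with horizon xmin = 12.
assert visible_to(12, 20, 10) is True    # committed long ago
assert visible_to(12, 20, 15) is False   # committed at point 21, after us
```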

Re: Lazy Snapshots

From
Tom Lane
Date:
Gokulakannan Somasundaram <gokul007@gmail.com> writes:
>    I have given some thought in this direction. I am just providing my idea.

> a) Have a structure similar to commit log, which should also store,
> transaction id at which the transaction got committed. Every transaction,
> after committing should update the transaction id at which the commit has
> happened

The maintenance costs and update contention for such a datastructure
would render this completely impractical, even if consulting it were
free.
        regards, tom lane


Re: Lazy Snapshots

From
Gokulakannan Somasundaram
Date:

> The maintenance costs and update contention for such a datastructure
> would render this completely impractical, even if consulting it were
> free.

Thanks for the reply.

a) Only one transaction would be updating its commit status. It's multiple readers vs. a single writer for the position of a particular transaction (a memory location), so a reader-writer lock would reduce the contention. Moreover, it removes the need for the synchronization that happens with global pg_procs now. For example, a select which would only query old data will never access this structure.

Right now I am not able to think of anything from the maintenance-cost perspective.

Gokul.