Lazy Snapshots - Mailing list pgsql-hackers

From simon@2ndquadrant.com
Subject Lazy Snapshots
Date
Msg-id 1278436501.36016.1250590388532.JavaMail.open-xchange@oxltgw02.schlund.de
Responses Re: Lazy Snapshots  (Heikki Linnakangas <heikki.linnakangas@enterprisedb.com>)
Re: Lazy Snapshots  (Tom Lane <tgl@sss.pgh.pa.us>)
Re: Lazy Snapshots  (Josh Berkus <josh@agliodbs.com>)
List pgsql-hackers

One of the problems with Hot Standby is that a long-running query on the standby can conflict with rows removed by VACUUM on the primary, causing the query to be cancelled.

I've been looking at this problem for about a year now from various angles. Jeff Janes's recent thoughts on procarray scalability have led me down an interesting path, described here. Taken together, this has led me to rethink completely the strategy used for avoiding conflicts in the Hot Standby patch.

Currently, we take eager snapshots, meaning we take a snapshot at the start of each statement whether or not it is necessary. Snapshots exist to disambiguate the running state of recent transactions, so if a statement never sees data written by recent transactions then we will never actually use the snapshot.

Another way of doing this is to utilize lazy snapshots: do not take a snapshot when a statement starts and only take one at the point that we need one. No other changes to the MVCC mechanisms are proposed.

Is that possible?

The time the snapshot is taken is the time of the consistent viewpoint from which all data access during a statement is judged. Taking the snapshot later, at an undefined point in the future, means that the consistent viewpoint is actually floating. When we start executing the statement we won't actually know which viewpoint will be used to derive the answers to the query.

A floating, yet consistent viewpoint is in my opinion a good thing, since it includes a more recent database state in the answer to a query than we would otherwise have used. Consider the case where a very large table has a "select count(*)" executed on it. The scan begins at block 0 and continues through the table until the end, which for the purposes of an example we will say takes 1 hour. Rows are added to the table at a constant rate of 100/sec and immediately committed. With an eager snapshot, by the time the scan has finished it will deliver an answer that is out of date by 360,000 rows (100 rows/sec x 3,600 sec). Using a lazy snapshot would give us an answer almost exactly correct, though of course Heisenbuggers may dispute the existence of a "correct" answer in this case.
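The arithmetic in that example can be checked with a few lines of Python. This is only a sketch: the function name `rows_missed` is made up for illustration, and the assumption that the lazy snapshot is taken one second before the scan ends is hypothetical (it corresponds to new rows being appended near the end of the table, so the scan meets its first recent row late).

```python
def rows_missed(insert_rate_per_sec: int, scan_seconds: int, snapshot_at_sec: int) -> int:
    """Rows committed after the snapshot was taken and before the scan ended.

    Those rows are invisible to the scan, so count(*) is short by this many.
    """
    if not 0 <= snapshot_at_sec <= scan_seconds:
        raise ValueError("snapshot must be taken during the scan")
    return insert_rate_per_sec * (scan_seconds - snapshot_at_sec)


# Eager snapshot: taken at second 0 of a one-hour scan.
eager_error = rows_missed(100, 3600, 0)

# Lazy snapshot: ASSUMED (hypothetically) to be taken 1 second before the end.
lazy_error = rows_missed(100, 3600, 3599)
```

Under these assumptions the eager scan is short by 360,000 rows and the lazy scan by 100; how late the lazy snapshot is really taken depends on where the scan first meets a recent row.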

So let's look at some theory details:

* Scan begins, no snapshot set. A row is inserted and its transaction commits. The scan progresses until it sees a recent row. The scan takes a snapshot; the row is now visible to it and the scan progresses. Another row is inserted and its transaction commits. When we later come to the second new row we already have a snapshot, so that row is invisible to us. The results of the query are consistent to the point at which we took the snapshot, which happened when we saw the first row. Are the results consistent only to the end of the transaction that created that row? No, other transactions can also have committed after it and yet before we took the snapshot. The recent transaction is the catalyst for us to take a snapshot, though the snapshot is not dependent upon the xid of the new row we have seen.

* Scan begins, no snapshot set. Ahead of the scan, a row that would have been visible at the start of the scan is deleted, the deletion commits, and the row is removed by VACUUM/HOT. The scan has no evidence that a removal has taken place, never sees contention and thus never takes a snapshot. This isn't a problem; the row removal created an implicit xmin for our scan. If we later took a snapshot, the xmin of the snapshot would be equal to or later than our previous implicit xmin, and so MVCC would still be working. This shows that it is wrong to presume that taking no snapshot at all means that the time-consistent point of the scan was the start of the statement; it may not be.

* We open a cursor, then start issuing UPDATE ... WHERE CURRENT OF. Does the cursor need a snapshot? I don't think it does, since we have special visibility rules for rows produced by our own transaction and we do not need a snapshot to disambiguate them. ISTM there may be a corner case where we need cursors to take snapshots, but I haven't seen it yet.
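The first case above can be played through in a toy model. This is a sketch only, not PostgreSQL code: `ToyTable`, `lazy_count` and `first_case` are hypothetical names, every transaction is assumed to commit immediately (no aborts, no in-progress writers), a snapshot is modelled as the set of xids committed at that instant, and "recent" is assumed to mean "xid at or above the next xid at scan start".

```python
from dataclasses import dataclass, field
from typing import Callable, FrozenSet, List, Optional, Set, Tuple


@dataclass
class ToyTable:
    """Heap of rows; each row is represented only by the xid that inserted it."""
    rows: List[int] = field(default_factory=list)
    committed: Set[int] = field(default_factory=set)
    next_xid: int = 1

    def insert_and_commit(self) -> int:
        xid = self.next_xid
        self.next_xid += 1
        self.rows.append(xid)
        self.committed.add(xid)
        return xid

    def take_snapshot(self) -> FrozenSet[int]:
        # Toy snapshot: the set of xids committed at this instant.
        return frozenset(self.committed)


def lazy_count(
    table: ToyTable,
    after_row: Optional[Callable[[int], None]] = None,
) -> Tuple[int, Optional[FrozenSet[int]]]:
    """count(*) that takes a snapshot only when it first meets a recent row."""
    horizon = table.next_xid   # ASSUMPTION: every xid below this is finished
    snapshot: Optional[FrozenSet[int]] = None
    count = 0
    pos = 0
    while pos < len(table.rows):   # the table may grow while we scan
        xid = table.rows[pos]
        if xid < horizon:
            count += 1             # old row: no snapshot needed
        else:
            if snapshot is None:
                snapshot = table.take_snapshot()   # the catalyst row
            if xid in snapshot:
                count += 1
        if after_row is not None:
            after_row(pos)         # lets the caller simulate concurrent writers
        pos += 1
    return count, snapshot


def first_case() -> Tuple[int, Optional[FrozenSet[int]]]:
    table = ToyTable()
    for _ in range(3):
        table.insert_and_commit()          # pre-existing rows, xids 1..3

    def concurrent_writer(pos: int) -> None:
        # One insert early in the scan, one just after the snapshot is taken.
        if pos in (0, 3):
            table.insert_and_commit()

    return lazy_count(table, concurrent_writer)
```

In this run the row inserted early in the scan (xid 4) triggers the snapshot and is counted, while the row inserted afterwards (xid 5) is not in the snapshot and stays invisible, so the scan counts 4 rows where an eager snapshot would have counted 3.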

Does that cover all the cases? Some of the main ones, but let's see whether other problems emerge. Anyone?

OK, in theory it seems to work, so how will it work in practice and does that cause other difficulties?

* We will hold a new global variable LastGlobalXmin, which is maintained by GetSnapshotData(). We can access it atomically, without locks.

* In XidInMVCCSnapshot(), if our snapshot is NULL then we update RecentXmin from LastGlobalXmin and test using that, because we don't have a snapshot xmin. If this is sufficient to return false then that's all we do. Otherwise, we now get a full snapshot and then continue as normal. (Some API rework may be needed to allow this to happen, so I have probably papered over a few difficulties here, but in broad terms this appears to work.)
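The two points above can be sketched in a toy Python model. None of this is PostgreSQL code: `ToyProcArray`, `Backend` and the snake_case method names are hypothetical stand-ins for the procarray, GetSnapshotData() and XidInMVCCSnapshot(); subtransactions and xid wraparound are ignored; and setting LastGlobalXmin to the oldest running xid seen by the latest snapshot is an assumption, since the proposal does not say exactly how it is maintained.

```python
from dataclasses import dataclass, field
from typing import FrozenSet, Optional, Set


@dataclass(frozen=True)
class Snapshot:
    xmin: int              # every xid below this is finished
    xmax: int              # every xid at or above this is treated as running
    xip: FrozenSet[int]    # xids in progress when the snapshot was taken

    def contains(self, xid: int) -> bool:
        """True if xid must be treated as still in progress."""
        if xid < self.xmin:
            return False
        if xid >= self.xmax:
            return True
        return xid in self.xip


@dataclass
class ToyProcArray:
    next_xid: int
    running: Set[int] = field(default_factory=set)
    last_global_xmin: int = 0      # stand-in for the proposed LastGlobalXmin
    lock_acquisitions: int = 0     # counts simulated ProcArrayLock requests

    def get_snapshot_data(self) -> Snapshot:
        self.lock_acquisitions += 1
        xmax = self.next_xid
        xmin = min(self.running, default=xmax)
        self.last_global_xmin = xmin   # ASSUMPTION: maintained here, as proposed
        return Snapshot(xmin, xmax, frozenset(self.running))


@dataclass
class Backend:
    procarray: ToyProcArray
    snapshot: Optional[Snapshot] = None
    recent_xmin: int = 0

    def xid_in_mvcc_snapshot(self, xid: int) -> bool:
        if self.snapshot is None:
            # Cheap path: refresh RecentXmin without taking the lock.
            self.recent_xmin = self.procarray.last_global_xmin
            if xid < self.recent_xmin:
                return False
            # Not decidable from the horizon alone: take a full snapshot now.
            self.snapshot = self.procarray.get_snapshot_data()
        return self.snapshot.contains(xid)
```

A backend that only ever tests xids below the horizon never calls `get_snapshot_data()`, which is where the hoped-for reduction in lock traffic would come from; the first xid at or above the horizon forces a real snapshot, and every later test uses it.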

Lazy snapshots mean that some things normally updated during snapshot taking will fall behind somewhat. This has a couple of effects that we can mitigate in various ways:

* In TransactionIdIsInProgress(), if xid < RecentXmin we update RecentXmin from LastGlobalXmin and retry the test.

* We probably need to do something with HOT page cleaning as well, but that is a fairly subtle bit of tuning that I expect to see a range of viewpoints on. Various options exist, from doing nothing through to re-checking xmin prior to each cleaning check, or somewhere in between.

I have no idea whether this idea is patented and I would appreciate some help in researching whether this idea is legally able to be implemented by PGDG, so I can remain untainted.

Benefits

* Scalability: The number of ProcArrayLock requests from snapshots will drop considerably as a result of these changes. (It may prove feasible to provide an option to lightly partition the procarray to increase commit rate, but that would come later.)

* Hot Standby: Implementing this will likely significantly reduce the number of queries cancelled during Hot Standby. This is because many queries will not have snapshots at all, and the queries that do will typically have much younger snapshots.

* Accuracy: More accurate answers to long database queries.

I will be removing various parts of the code from the Hot Standby patch while this is discussed. I'm not very available at the moment, so my replies are likely to be considerably delayed.

Best Regards, Simon Riggs
