On Tue, Nov 9, 2010 at 3:05 PM, Greg Stark <gsstark@mit.edu> wrote:
> On Tue, Nov 9, 2010 at 7:37 PM, Josh Berkus <josh@agliodbs.com> wrote:
>> Well, most of the other MVCC-in-table DBMSes simply don't deal with
>> large, on-disk databases. In fact, I can't think of one which does,
>> currently; while MVCC has been popular for the New Databases, they're
>> all focused on "in-memory" databases. Oracle and InnoDB use rollback
>> segments.
>
> Well rollback segments are still MVCC. However Oracle's MVCC is
> block-based. So they only have to do the visibility check once per
> block, not once per row. Once they find the right block version they
> can process all the rows on it.
>
> Also Oracle's snapshots are just the log position. Instead of having
> to check whether every transaction committed or not, they just find
> the block version which was last modified before the log position for
> when their transaction started.
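
So if I'm reading that right, the visibility check boils down to
something like this (just a sketch of my mental model; the types and
names here are made up, not Oracle's actual structures):

#include <stddef.h>
#include <stdint.h>

typedef uint64_t LogPos;    /* position in the log, an SCN-like value */

typedef struct BlockVersion
{
    LogPos      last_modified;      /* log position of last change to block */
    struct BlockVersion *prior;     /* older version, rebuilt from undo */
    /* ... row data ... */
} BlockVersion;

/*
 * A snapshot is just the log position at which the transaction started.
 * Walk back through older block versions until we find one whose last
 * modification is at or before that position; every row on that version
 * is then visible, with no per-row commit check at all.
 */
static BlockVersion *
visible_block_version(BlockVersion *current, LogPos snapshot_pos)
{
    BlockVersion *ver = current;

    while (ver != NULL && ver->last_modified > snapshot_pos)
        ver = ver->prior;       /* consult undo for the older version */

    return ver;                 /* NULL would mean the undo we need is gone */
}
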
That is cool. One problem is that it might sometimes result in
additional I/O. A transaction begins and writes a tuple. We must
write a preimage of the page (or at least enough information to
reconstruct one) to the undo segment. If the transaction commits
relatively quickly, and every transaction that took its snapshot
before that commit also ends, by committing or aborting, we can
discard that information from the undo segment without ever writing
it to disk. However, if that doesn't happen, the undo log page may
get evicted, and we're now doing three writes (WAL, data page, undo
page) rather than just two (WAL, data page). That's no worse than an
update where the old and new tuples land on different pages, but it
IS worse than an update where the old and new tuples are on the same
page, or at least I think it is.
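
Or, counting writes in a toy model (entirely my own numbers and
names, and obviously simplified; it ignores checkpoints, full-page
writes, and so on):

#include <stdbool.h>

/* Undo-style update: WAL + data page, plus the undo page if it has to
 * be written out before the preimage can be discarded. */
static int
undo_style_update_writes(bool undo_page_spilled)
{
    int writes = 2;             /* WAL record + data page */

    if (undo_page_spilled)
        writes++;               /* preimage page in the undo segment */

    return writes;
}

/* Heap-style update: WAL + one data page if the new tuple fits on the
 * old page, otherwise WAL + old page + new page. */
static int
heap_update_writes(bool new_tuple_on_same_page)
{
    return new_tuple_on_same_page ? 2 : 3;
}

So the extra write only materializes when old snapshots stick around
long enough that the preimage can't simply be discarded.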
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company