Re: Serializable Snapshot Isolation - Mailing list pgsql-hackers

From Kevin Grittner
Subject Re: Serializable Snapshot Isolation
Date
Msg-id AANLkTikkwxs56YkVvW3AzTYxxvjMKj2Nf-nZmxZsfFC0@mail.gmail.com
Whole thread Raw
In response to Re: Serializable Snapshot Isolation  (Heikki Linnakangas <heikki.linnakangas@enterprisedb.com>)
Responses Re: Serializable Snapshot Isolation
List pgsql-hackers
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote:

> When a transaction is commits, its predicate locks must be held,
> but it's not important anymore *who* holds them, as long as
> they're hold for long enough.
>
> Let's move the finishedBefore field from SERIALIZABLEXACT to
> PREDICATELOCK. When a transaction commits, set the finishedBefore
> field in all the PREDICATELOCKs it holds, and then release the
> SERIALIZABLEXACT struct. The predicate locks stay without an
> associated SERIALIZABLEXACT entry until finishedBefore expires.
>
> Whenever there are two predicate locks on the same target that
> both belonged to an already-committed transaction, the one with a
> smaller finishedBefore can be dropped, because the one with higher
> finishedBefore value covers it already.

I don't think this works.  Gory details follow.

The predicate locks only matter when a tuple is being written which
might conflict with one.  In the notation often used for the
dangerous structures, the conflict only occurs if TN writes
something which T1 can't read or T1 writes something which T0 can't
read.  When you combine this with the fact that you don't have a
problem unless TN commits *first*, then you can't have a problem
with TN looking up a predicate lock of a committed transaction; if
it's still writing tuples after T1's commit, the conflict can't
matter and really should be ignored.  If T1 is looking up a
predicate lock for T0 and finds it committed, there are two things
which must be true for this to generate a real conflict: TN must
have committed before T0, and T0 must have overlapped T1 -- T0 must
not have been able to see T1's write.  If we have a way to establish
these two facts without keeping transaction level data for committed
transactions, predicate lock *lookup* wouldn't stand in the way of
your proposal.

Since the writing transaction is active, if the xmin of its starting
transaction comes before the finishedBefore value, they must have
overlapped; so I think we have that part covered, and I can't see a
problem with your proposed use of the earliest finishedBefore value.

There is a rub on the other point, though.  Without transaction
information you have no way of telling whether TN committed before
T0, so you would need to assume that it did.  So on this count,
there is bound to be some increase in false positives leading to
transaction rollback.  Without more study, and maybe some tests, I'm
not sure how significant it is.  (Actually, we might want to track
commit sequence somehow, so we can determine this with greater
accuracy.)

But wait, the bigger problems are yet to come.

The other way we can detect conflicts is a read by a serializable
transaction noticing that a different and overlapping serializable
transaction wrote the tuple we're trying to read.  How do you
propose to know that the other transaction was serializable without
keeping the SERIALIZABLEXACT information?  And how do you propose to
record the conflict without it?  The wheels pretty much fall off the
idea entirely here, as far as I can see.

Finally, this would preclude some optimizations which I *think* will
pay off, which trade a few hundred kB more of shared memory, and
some additional CPU to maintain more detailed conflict data, for a
lower false positive rate -- meaning fewer transactions rolled back
for hard-to-explain reasons.  This more detailed information is also
what seems to be desired by Dan S (on another thread) to be able to
log the information needed to be able to reduce rollbacks.

-Kevin


pgsql-hackers by date:

Previous
From: Robert Haas
Date:
Subject: Re: Another Modest Proposal: Platforms
Next
From: Tom Lane
Date:
Subject: Re: Why is time with timezone 12 bytes?