On Thu, Jun 4, 2020 at 3:11 PM Kyle Kingsbury <aphyr@jepsen.io> wrote:
> Yes! It's not always this obvious--G2-item encompasses any dependency cycle
> between transactions such that at least one dependency involves a transaction
> writing state which was not observed by some (ostensibly prior) transaction's
> read. We call these "rw dependencies" in the paper, because they involve a read
> which must have occurred before a write. Another way to think of G2-item is "A
> transaction failed to see something that happened in its logical past".
Are you familiar with the paper "Serializable Snapshot Isolation in
PostgreSQL"? You might find it helpful:
http://vldb.org/pvldb/vol5/p1850_danrkports_vldb2012.pdf
Is there a difference between "rw dependencies" as you understand the
term, and what the paper calls "rw-antidependencies"?
> A special case of G2-item, G-single, is commonly known as read skew. In Elle, we
> tag G-single separately, so all the G2-item anomalies reported actually involve
> 2+ rw dependencies, not just 1+. I haven't seen G-single yet, which is
> good--that means Postgres isn't violating SI, just SSI. Or, of course, the test
> itself could be broken--maybe the SQL statements themselves are subtly wrong, or
> our inference is incorrect.
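As a rough illustration of the distinction drawn in the quoted text above, the sketch below classifies a single dependency cycle by how many rw edges it contains. It is an assumption-laden toy, not Elle's actual implementation: the function name `classify_cycle` and the string labels for edge types are hypothetical, and it assumes the cycle has already been found and its edges labelled.

```python
from collections import Counter

def classify_cycle(edge_types):
    """Classify one dependency cycle from the labels of its edges.

    edge_types: list of strings, each 'ww', 'wr', or 'rw', one per edge
    in the cycle. Hypothetical helper, for illustration only.
    """
    counts = Counter(edge_types)
    unknown = set(counts) - {"ww", "wr", "rw"}
    if unknown:
        raise ValueError(f"unknown edge types: {sorted(unknown)}")

    rw = counts["rw"]
    if rw == 0:
        # No anti-dependencies at all: a write cycle (G0) if every edge
        # is ww, otherwise a cycle that includes wr edges (G1c).
        return "G1c" if counts["wr"] else "G0"
    if rw == 1:
        # Exactly one rw edge: tagged separately as G-single (read skew).
        return "G-single"
    # Two or more rw edges: what gets reported as G2-item.
    return "G2-item"


print(classify_cycle(["wr", "rw"]))   # one rw edge
print(classify_cycle(["rw", "rw"]))   # two rw edges, e.g. write skew
```

Under this tagging scheme, a report of G2-item with no G-single is what you would expect if SI holds but serializability does not.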
I'm glad that you don't suspect snapshot isolation has been violated.
Frankly I'd be astonished if Postgres is found to be violating SI
here. Anything is possible, but if that happened then it would almost
certainly be far more obvious. The way MVCC works in Postgres is
relatively simple. If an xact in repeatable read mode really did
return a row that wasn't visible to its snapshot, then it's probably
just as likely to return two row versions for the same logical row, or
zero row versions. These are symptoms of various types of data
corruption that we see from time to time. Note, in particular, that an
SI violation cannot be caused by a transaction releasing a lock in an
index when it shouldn't have -- we simply don't have those.
(Actually, we do have something called predicate locks, but those are
not at all like 2PL index value locks -- see the paper I linked to for
more.)
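To make the "two row versions, or zero" point concrete, here is a heavily simplified model of snapshot visibility. It is a sketch under stated assumptions, not the actual Postgres code: it ignores a transaction's own writes, command IDs, subtransactions, hint bits, and xid wraparound, and the class and function names are hypothetical. The only parts borrowed from Postgres are the ideas that each row version records an inserting xid (xmin) and an optional deleting xid (xmax), and that a snapshot records an upper xid bound plus the set of xids in progress when it was taken.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional, Set

@dataclass(frozen=True)
class Snapshot:
    xmax: int                                   # xids >= xmax started after the snapshot
    in_progress: FrozenSet[int] = frozenset()   # xids running when it was taken

@dataclass(frozen=True)
class RowVersion:
    xmin: int                    # xid that created this version
    xmax: Optional[int] = None   # xid that deleted/updated it, if any

def xid_visible(xid: int, snap: Snapshot, committed: Set[int]) -> bool:
    """An xid's effects are visible if it committed before the snapshot."""
    return xid in committed and xid < snap.xmax and xid not in snap.in_progress

def version_visible(v: RowVersion, snap: Snapshot, committed: Set[int]) -> bool:
    """Visible if the insert is visible and the delete (if any) is not."""
    if not xid_visible(v.xmin, snap, committed):
        return False
    return v.xmax is None or not xid_visible(v.xmax, snap, committed)


# One logical row: xid 10 inserted it, xid 20 updated it (new version).
versions = [RowVersion(xmin=10, xmax=20), RowVersion(xmin=20)]
committed = {10, 20}

old_snap = Snapshot(xmax=21, in_progress=frozenset({20}))  # taken while 20 ran
new_snap = Snapshot(xmax=21)                               # taken after 20 committed

old_view = [v for v in versions if version_visible(v, old_snap, committed)]
new_view = [v for v in versions if version_visible(v, new_snap, committed)]
```

In this model each snapshot sees exactly one version of the logical row. A bug in the visibility rule would tend to show up as a snapshot seeing both versions or neither, which is why an SI violation would likely be far more obvious.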
> Give Jepsen a138843d a shot!
>
> 1553 jepsen process 27 16 LOG: execute <unnamed>: select (val) from txn0 where
> sk = $1
>
> 1553 jepsen process 27 17 DETAIL: parameters: $1 = '9'
Attached are:
* A Jepsen failure of the kind we've been talking about
* Log output from Postgres that shows all log lines with the
relevant-to-failure Jepsen worker numbers, as discussed. This is
interleaved based on timestamp order.
Can you explain the anomaly with reference to the actual SQL queries
executed in the log? Is the information that I've provided sufficient?
--
Peter Geoghegan