Tom Lane wrote:
> I agree that we aren't MVCC with respect to DDL operations (and for this
> purpose CLUSTER is DDL). Trying to become so would open a can of worms
> far larger than it's worth, though, IMHO.
I think if we can come up with a reasonable way to handle all the
consequences, it's worth doing. And yes, I realize there are a lot of
consequences, so it may well not be possible.
> Transaction 1 Transaction 2
>
> BEGIN;
>
> ... BEGIN;
>
> ... INSERT INTO a;
>
> CLUSTER a; ...
>
> Currently, because T2 continues to hold a write lock on A until it
> commits, T1's CLUSTER will block until T2 commits; therefore CLUSTER's
> use of SnapshotNow is sufficient to not lose any live tuples. Were T1
> to use a transaction-aware snapshot to scan A, this would not be so.
I think this is somewhat tangential: we're discussing changing the
snapshot used to scan system catalogs, not user relations like A. The
only reason that CLUSTER's use of SnapshotNow is a problem at the moment
is the same reason that TRUNCATE is a problem -- a concurrent
serializable transaction will use the new relfilenode, not the old one.
> Transaction 1 Transaction 2
>
> BEGIN;
>
> ... CLUSTER a;
>
> INSERT INTO a;
>
> Were T1 using a transaction-aware snapshot to read pg_class, it would
> insert its new tuple into the wrong relfilenode for A, causing either
> immediate failure or eventual loss of a live tuple.
Yes, definitely a problem :( The same applies to TRUNCATE, naturally.
The only somewhat reasonable behavior I can think of is to cause
modifications to the oldrelfilenode to fail in a concurrent serializable
transaction. The behavior would be somewhat analogous to an UPDATE in a
serializable transaction failing because of a concurrent data
modification, although in this case we would error out on any
modification (e.g. INSERT).
-Neil