Thread: RFC: Making TRUNCATE more "MVCC-safe"

RFC: Making TRUNCATE more "MVCC-safe"

From
Marti Raudsepp
Date:
Hi!

I've always been a little wary of using the TRUNCATE command due to
the warning in the documentation about it not being "MVCC-safe":
queries may silently give wrong results and it's hard to tell when
they are affected.

That got me thinking, why can't we handle this like a standby server
does -- if some query has data removed from underneath it, it aborts
with a serialization failure.

Does this solution sound like a good idea?

The attached patch is a lame attempt at implementing this. I added a
new pg_class.relvalidxmin attribute which tracks the Xid of the last
TRUNCATE (maybe it should be called reltruncatexid?). Whenever
starting a relation scan with a snapshot older than relvalidxmin, an
error is thrown. This seems to work out well since TRUNCATE updates
pg_class anyway, and anyone who sees the new relfilenode automatically
knows when it was truncated.

Am I on the right track? Are there any better ways to attach this
information to a relation?
Should I also add another counter to pg_stat_database_conflicts?
Currently this table is only used on standby servers.

Since I wrote it just this afternoon, there are a few things still
wrong with the patch (it doesn't handle xid wraparound for one), so
don't be too picky about the code yet. :)

Example:
  CREATE TABLE foo (i int);
Session A:
  BEGIN ISOLATION LEVEL REPEATABLE READ;
  SELECT txid_current(); -- Force snapshot open
Session B:
  TRUNCATE TABLE foo;
Session A:
  SELECT * FROM foo;
ERROR:  canceling statement due to conflict with TRUNCATE TABLE on foo
DETAIL:  Rows visible to this transaction have been removed.


Patch also available in my github 'truncate' branch:
https://github.com/intgr/postgres/commits/truncate

Regards,
Marti

Attachment

Re: RFC: Making TRUNCATE more "MVCC-safe"

From
Noah Misch
Date:
On Thu, Feb 09, 2012 at 11:11:16PM +0200, Marti Raudsepp wrote:
> I've always been a little wary of using the TRUNCATE command due to
> the warning in the documentation about it not being "MVCC-safe":
> queries may silently give wrong results and it's hard to tell when
> they are affected.
> 
> That got me thinking, why can't we handle this like a standby server
> does -- if some query has data removed from underneath it, it aborts
> with a serialization failure.
> 
> Does this solution sound like a good idea?
> 
> The attached patch is a lame attempt at implementing this. I added a
> new pg_class.relvalidxmin attribute which tracks the Xid of the last
> TRUNCATE (maybe it should be called reltruncatexid?). Whenever
> starting a relation scan with a snapshot older than relvalidxmin, an
> error is thrown. This seems to work out well since TRUNCATE updates
> pg_class anyway, and anyone who sees the new relfilenode automatically
> knows when it was truncated.

I like the design you have chosen.  It would find applications beyond
TRUNCATE, so your use of non-specific naming is sound.  For example, older
snapshots will see an empty table "t" after "CREATE TABLE t AS SELECT 1"
commits; that's a comparable MVCC anomaly.  Some of our non-MVCC-safe commands
should perhaps just become MVCC-safe, but there will always be use cases for
operations that shortcut MVCC.  When one truly does want that, your proposal
for keeping behavior consistent makes plenty of sense.

> Should I also add another counter to pg_stat_database_conflicts?
> Currently this table is only used on standby servers.

> ERROR:  canceling statement due to conflict with TRUNCATE TABLE on foo
> DETAIL:  Rows visible to this transaction have been removed.

My initial reaction is not to portray this like a recovery conflict, since
several aspects distinguish it from all recovery conflict types.

(I have not read your actual patch.)

Thanks,
nm


Re: RFC: Making TRUNCATE more "MVCC-safe"

From
Robert Haas
Date:
On Fri, Feb 10, 2012 at 6:42 AM, Noah Misch <noah@leadboat.com> wrote:
> I like the design you have chosen.  It would find applications beyond
> TRUNCATE, so your use of non-specific naming is sound.  For example, older
> snapshots will see an empty table "t" after "CREATE TABLE t AS SELECT 1"
> commits; that's a comparable MVCC anomaly.  Some of our non-MVCC-safe commands
> should perhaps just become MVCC-safe, but there will always be use cases for
> operations that shortcut MVCC.  When one truly does want that, your proposal
> for keeping behavior consistent makes plenty of sense.

I guess I'm not particularly excited by the idea of trying to make
TRUNCATE MVCC-safe.  I notice that the example involves the REPEATABLE
READ isolation level, which is already known to be busted in a variety
of ways; that's why we now have SERIALIZABLE, and why most people use
READ COMMITTED.  Are there examples of this behavior at other
isolation levels?

But I have to admit I'm intrigued by the idea of extending this to
other cases, if there's a reasonable way to do that.  For example, if
we could fix things up so that we don't see a table at all if it was
created after we took our snapshot, that would remove one of the
obstacles to marking pages bulk-loaded into that table with FrozenXID
and PD_ALL_VISIBLE from the get-go.  I think a lot of people would be
mighty happy about that.

But the necessary semantics seem somewhat different.  For TRUNCATE,
you want to throw a serialization error; but is that also what you
want for CREATE TABLE?  Or do you just want it to appear empty?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: RFC: Making TRUNCATE more "MVCC-safe"

From
Noah Misch
Date:
On Fri, Feb 10, 2012 at 01:59:18PM -0500, Robert Haas wrote:
> On Fri, Feb 10, 2012 at 6:42 AM, Noah Misch <noah@leadboat.com> wrote:
> > I like the design you have chosen. ?It would find applications beyond
> > TRUNCATE, so your use of non-specific naming is sound. ?For example, older
> > snapshots will see an empty table "t" after "CREATE TABLE t AS SELECT 1"
> > commits; that's a comparable MVCC anomaly. ?Some of our non-MVCC-safe commands
> > should perhaps just become MVCC-safe, but there will always be use cases for
> > operations that shortcut MVCC. ?When one truly does want that, your proposal
> > for keeping behavior consistent makes plenty of sense.
> 
> I guess I'm not particularly excited by the idea of trying to make
> TRUNCATE MVCC-safe.  I notice that the example involves the REPEATABLE
> READ isolation level, which is already known to be busted in a variety
> of ways; that's why we now have SERIALIZABLE, and why most people use
> READ COMMITTED.  Are there examples of this behavior at other
> isolation levels?

I've yet to see an MVCC anomaly that one can reproduce at REPEATABLE READ and
not at READ COMMITTED.  They tend to be narrow race conditions at READ
COMMITTED, yet easy to demonstrate at REPEATABLE READ.  Related:
http://archives.postgresql.org/pgsql-performance/2011-02/msg00451.php

Incidentally, people use READ COMMITTED because they don't question the
default, not because they know hazards of REPEATABLE READ.  I don't know the
bustedness you speak of; could we improve the documentation to inform folks?

> But I have to admit I'm intrigued by the idea of extending this to
> other cases, if there's a reasonable way to do that.  For example, if
> we could fix things up so that we don't see a table at all if it was
> created after we took our snapshot, that would remove one of the
> obstacles to marking pages bulk-loaded into that table with FrozenXID
> and PD_ALL_VISIBLE from the get-go.  I think a lot of people would be
> mighty happy about that.

Exactly.

> But the necessary semantics seem somewhat different.  For TRUNCATE,
> you want to throw a serialization error; but is that also what you
> want for CREATE TABLE?  Or do you just want it to appear empty?

I think an error helps just as much there.  If you create a table and populate
it with data in the same transaction, letting some snapshot see an empty table
is an atomicity failure.

Your comment illustrates a helpful point: this is just another kind of
ordinary serialization failure, one that can happen at any isolation level.
No serial transaction sequence can yield one of these errors.

Thanks,
nm


Re: RFC: Making TRUNCATE more "MVCC-safe"

From
Dan Ports
Date:
On Fri, Feb 10, 2012 at 01:59:18PM -0500, Robert Haas wrote:
> I guess I'm not particularly excited by the idea of trying to make
> TRUNCATE MVCC-safe.  I notice that the example involves the REPEATABLE
> READ isolation level, which is already known to be busted in a variety
> of ways; that's why we now have SERIALIZABLE, and why most people use
> READ COMMITTED.  Are there examples of this behavior at other
> isolation levels?

Marti's example works for SERIALIZABLE isolation too. In general, when
DDL operations weren't previously MVCC-safe under REPEATABLE READ, we
didn't change that in SERIALIZABLE.

There's some SSI code for TRUNCATE TABLE that tries to do something
reasonable, and it catches some (more subtle) anomalies involving
concurrent truncates -- but it can only do so much when TRUNCATE itself
isn't MVCC-safe. I expect that the combination of that code and this
patch would ensure full serializability for TRUNCATE operations.

Dan

-- 
Dan R. K. Ports              MIT CSAIL                http://drkp.net/


Re: RFC: Making TRUNCATE more "MVCC-safe"

From
Robert Haas
Date:
On Fri, Feb 10, 2012 at 11:46 PM, Noah Misch <noah@leadboat.com> wrote:
> On Fri, Feb 10, 2012 at 01:59:18PM -0500, Robert Haas wrote:
>> On Fri, Feb 10, 2012 at 6:42 AM, Noah Misch <noah@leadboat.com> wrote:
>> > I like the design you have chosen. ?It would find applications beyond
>> > TRUNCATE, so your use of non-specific naming is sound. ?For example, older
>> > snapshots will see an empty table "t" after "CREATE TABLE t AS SELECT 1"
>> > commits; that's a comparable MVCC anomaly. ?Some of our non-MVCC-safe commands
>> > should perhaps just become MVCC-safe, but there will always be use cases for
>> > operations that shortcut MVCC. ?When one truly does want that, your proposal
>> > for keeping behavior consistent makes plenty of sense.
>>
>> I guess I'm not particularly excited by the idea of trying to make
>> TRUNCATE MVCC-safe.  I notice that the example involves the REPEATABLE
>> READ isolation level, which is already known to be busted in a variety
>> of ways; that's why we now have SERIALIZABLE, and why most people use
>> READ COMMITTED.  Are there examples of this behavior at other
>> isolation levels?
>
> I've yet to see an MVCC anomaly that one can reproduce at REPEATABLE READ and
> not at READ COMMITTED.  They tend to be narrow race conditions at READ
> COMMITTED, yet easy to demonstrate at REPEATABLE READ.  Related:
> http://archives.postgresql.org/pgsql-performance/2011-02/msg00451.php

Yeah.  Well, that's actually an interesting example, because it
illustrates how general this problem is.  We could potentially get
ourselves into a situation where just about every system catalog table
needs an xmin field to store the point at which the object came into
existence - or for that matter, was updated.  But it's not quite the
same as the xmin of the row itself, because some updates might be
judged not to matter.  There could also be intermediate cases where
updates are invalidating for some purposes but not others.  I think
we'd better get our hands around more of the problem space before we
start trying to engineer solutions.

> Incidentally, people use READ COMMITTED because they don't question the
> default, not because they know hazards of REPEATABLE READ.  I don't know the
> bustedness you speak of; could we improve the documentation to inform folks?

The example that I remember was related to SELECT FOR UPDATE/SELECT
FOR SHARE.  The idea of those statements is that you want to prevent
the row from being updated or deleted until some other concurrent
action is complete; for example, in the case of a foreign key, we'd
like to prevent the referenced row from being deleted or updated in
the relevant columns until the inserting transaction is committed.
But it doesn't work, because when the updating or deleting process
gets done with the lock wait, they are still using the same snapshot
as before, and merrily do exactly the the thing that the lock-wait was
supposed to prevent.  If an actual UPDATE is used, it's safe (I
think): anyone who was going to UPDATE or DELETE the row will fail
with some kind of serialization error.  But a SELECT FOR UPDATE that
commits is treated more like an UPDATE that rolls back: it's as if the
lock never existed.  Someone (Florian?) proposed a patch to change
this, but it seemed problematic for reasons I no longer exactly
remember.

When using an actual foreign key, we work around this by taking a new
snapshot to cross-check that things haven't changed under us, but
user-level code can't do that.  At READ COMMITTED, depending on the
situation, either the fact that we take new snapshots pretty
frequently or the EPQ machinery sometimes make things work sensibly
anyway, and at SERIALIZABLE, SSI prevents these kinds of anomalies.
But REPEATABLE READ has no protection.  I wish I could find the thread
where we discussed this before.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: RFC: Making TRUNCATE more "MVCC-safe"

From
"Kevin Grittner"
Date:
Robert Haas <robertmhaas@gmail.com> wrote:
> The example that I remember was related to SELECT FOR
> UPDATE/SELECT FOR SHARE.  The idea of those statements is that you
> want to prevent the row from being updated or deleted until some
> other concurrent action is complete; for example, in the case of a
> foreign key, we'd like to prevent the referenced row from being
> deleted or updated in the relevant columns until the inserting
> transaction is committed.  But it doesn't work, because when the
> updating or deleting process gets done with the lock wait, they
> are still using the same snapshot as before, and merrily do
> exactly the the thing that the lock-wait was supposed to prevent.
This issue is one which appears to be a problem for people trying to
migrate from Oracle, where a write conflict would be generated.
> If an actual UPDATE is used, it's safe (I think): anyone who was
> going to UPDATE or DELETE the row will fail with some kind of
> serialization error.
Right; a write conflict.
> But a SELECT FOR UPDATE that commits is treated more like an
> UPDATE that rolls back: it's as if the lock never existed. 
> Someone (Florian?) proposed a patch to change this, but it seemed
> problematic for reasons I no longer exactly remember.
It had to do with only having one xmax and how that worked with
subtransactions.
Of course, besides the technical obstacles, such a semantic change
could break existing code for PostgreSQL users.  :-(
> When using an actual foreign key, we work around this by taking a
> new snapshot to cross-check that things haven't changed under us,
> but user-level code can't do that.  At READ COMMITTED, depending
> on the situation, either the fact that we take new snapshots
> pretty frequently or the EPQ machinery sometimes make things work
> sensibly anyway, and at SERIALIZABLE, SSI prevents these kinds of
> anomalies.  But REPEATABLE READ has no protection.
Well, personally I have a hard time calling READ COMMITTED behavior
sensible.  Consider this:
-- connection 1
test=# create table t (id int not null primary key);
NOTICE:  CREATE TABLE / PRIMARY KEY will create implicit index
"t_pkey" for table "t"
CREATE TABLE
test=# insert into t select generate_series(1, 10);
INSERT 0 10
-- connection 2
test=# begin;
BEGIN
test=# update t set id = id - 1;
UPDATE 10
-- connection 1
test=# select * from t where id = (select min(id) from t) for
update;
[blocks]
-- connection 2
test=# commit;
COMMIT
-- connection 1
[unblocks]id 
----
(0 rows)
-Kevin


Re: RFC: Making TRUNCATE more "MVCC-safe"

From
Robert Haas
Date:
On Mon, Feb 13, 2012 at 10:48 AM, Kevin Grittner
<Kevin.Grittner@wicourts.gov> wrote:
> Well, personally I have a hard time calling READ COMMITTED behavior
> sensible.  Consider this:
>
> [ gigantic pile of fail ]

Yeah, that's bad all right.  I think it's hard to argue that the
current behavior is sensible; the trick is to figure out something
that's better but doesn't impose too much additional overhead.  I
wonder if it's possible to use SSI (or some stripped-down mechanism
along similar lines) to track these kinds of write conflicts, rather
than cluttering the system catalogs with lots more TransactionId
fields.  Like, when we TRUNCATE a table, we could essentially make a
note in memory someplace recording the write conflict.

Unfortunately, the full-blown SSI machinery seems too expensive to use
for this, especially on all-read workloads where there are no actual
conflicts but lots of conflict checks.  But that could probably be
optimized.  The attraction of using something like an in-memory
conflict manager is that it would probably be quite general.  We could
fix problems of this type with no on-disk format changes whenever we
discover them (as we will certainly continue to do) just by adding the
appropriate flag-a-conflict calls to the right places in the code.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: RFC: Making TRUNCATE more "MVCC-safe"

From
"Kevin Grittner"
Date:
Robert Haas <robertmhaas@gmail.com> wrote: 
> Kevin Grittner <Kevin.Grittner@wicourts.gov> wrote:
>> Well, personally I have a hard time calling READ COMMITTED
>> behavior sensible.  Consider this:
>>
>> [ gigantic pile of fail ]
> 
> Yeah, that's bad all right.  I think it's hard to argue that the
> current behavior is sensible; the trick is to figure out something
> that's better but doesn't impose too much additional overhead.  I
> wonder if it's possible to use SSI (or some stripped-down
> mechanism along similar lines) to track these kinds of write
> conflicts, rather than cluttering the system catalogs with lots
> more TransactionId fields.  Like, when we TRUNCATE a table, we
> could essentially make a note in memory someplace recording the
> write conflict.
Potential additional uses of the predicate locking developed for SSI
keep surfacing.  At some point we should probably pick a couple of
them and try to fashion an API for the non-blocking predicate
locking logic that serves them and SSI.  Since this predicate
locking system is explicitly interested only in read-write
conflicts, it does seem like it could work for SELECT FOR UPDATE
versus writes.
As mentioned once or twice before, it was pretty clear that while
predicate locking is required for SSI and is probably 80% of the C
code in the patch, it is really a separate thing -- we just didn't
want to try to create a "generalized" API based on the one initial
usage.  I think that an API based on registering and listening would
be the ticket.
> Unfortunately, the full-blown SSI machinery seems too expensive to
> use for this, especially on all-read workloads where there are no
> actual conflicts but lots of conflict checks.
In an all-read workload, if you actually declare the transactions to
be read-only SSI should not introduce much overhead.  If there's
much overhead showing up there at the moment, it would probably be
pretty easy to eliminate.  When there are any read-write
transactions active, it's a different story.
> But that could probably be optimized.
Undoubtedly.  It's disappointing that neither Dan nor I could find
the round tuits to make the kinds of changes in the SSI locking that
you made in some other areas for 9.2.  I'm not really sure how the
performance impact breaks down between predicate locking and SSI
proper, although I would tend to start from the assumption that,
like the lines of code, it's 80% in the predicate locking.
> The attraction of using something like an in-memory conflict
> manager is that it would probably be quite general.  We could fix
> problems of this type with no on-disk format changes whenever we
> discover them (as we will certainly continue to do) just by adding
> the appropriate flag-a-conflict calls to the right places in the
> code.
Assuming that the problems could be expressed in terms of read-write
conflicts, that's probably largely true.  I'm not sure that holds
for some of the funny READ COMMITTED behaviors, though.  I think the
only real "cure" there would be to make subtransactions cheap enough
that we could wrap execution of each SELECT and DML statement in a
subtransaction and roll back for another try with a new snapshot on
conflict.
If you want to track something other than read-write conflicts
and/or you want blocking when a conflict is found, you might be
better off looking to bend the heavyweight locks to your purposes.
-Kevin


Re: RFC: Making TRUNCATE more "MVCC-safe"

From
Simon Riggs
Date:
On Sat, Feb 11, 2012 at 4:46 AM, Noah Misch <noah@leadboat.com> wrote:

>> But I have to admit I'm intrigued by the idea of extending this to
>> other cases, if there's a reasonable way to do that.  For example, if
>> we could fix things up so that we don't see a table at all if it was
>> created after we took our snapshot, that would remove one of the
>> obstacles to marking pages bulk-loaded into that table with FrozenXID
>> and PD_ALL_VISIBLE from the get-go.  I think a lot of people would be
>> mighty happy about that.
>
> Exactly.
>
>> But the necessary semantics seem somewhat different.  For TRUNCATE,
>> you want to throw a serialization error; but is that also what you
>> want for CREATE TABLE?  Or do you just want it to appear empty?
>
> I think an error helps just as much there.  If you create a table and populate
> it with data in the same transaction, letting some snapshot see an empty table
> is an atomicity failure.
>
> Your comment illustrates a helpful point: this is just another kind of
> ordinary serialization failure, one that can happen at any isolation level.
> No serial transaction sequence can yield one of these errors.

Thanks Noah for drawing attention to this thread. I hadn't been
watching. As you say, this work would allow me to freeze rows at load
time and avoid the overhead of hint bit setting, which avoids
performance issues from hint bit setting in checksum patch.

I've reviewed this patch and it seems OK to me. Good work Marti.

v2 patch attached, updated to latest HEAD. Patch adds
* a GUC called strict_mvcc to isolate the new behaviour if required
* relvalidxid is reset by VACUUM/ANALYZE to avoid wraparound failure

At present this lacks docs for strict_mvcc and doesn't attempt to
handle CREATE TABLE case yet, but should be straightforward to do so.

Hint bit setting is in separate patch on other thread.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: RFC: Making TRUNCATE more "MVCC-safe"

From
Marti Raudsepp
Date:
On Sat, Mar 3, 2012 at 14:53, Simon Riggs <simon@2ndquadrant.com> wrote:
> Thanks Noah for drawing attention to this thread. I hadn't been
> watching. As you say, this work would allow me to freeze rows at load
> time and avoid the overhead of hint bit setting, which avoids
> performance issues from hint bit setting in checksum patch.
>
> I've reviewed this patch and it seems OK to me. Good work Marti.

Thanks! This approach wasn't universally liked, but if it gets us
tangible benefits (COPY with frozenxid) then I guess it's a reason to
reconsider.

> v2 patch attached, updated to latest HEAD. Patch adds
> * relvalidxid is reset by VACUUM/ANALYZE to avoid wraparound failure

Personally I'd rather keep this out of ANALYZE -- since its purpose is
to collect stats; VACUUM is responsible for correctness (xid
wraparound etc). But I don't feel strongly about this.

A more important consideration is how this interacts with hot standby.
Currently you compare OldestXmin to relvalidxmin to decide when to
reset it. But the standby's OldestXmin may be older than the master's.
(If VACUUM removes rows then this causes a recovery conflict, but
AFAICT that won't happen if only relvalidxmin changes)

It might be more robust to wait until relfrozenxid exceeds
relvalidxmin -- by then, recovery conflict mechanisms will have taken
care of killing all older snapshots, or am I mistaken?

And a few typos the code...

+ gettext_noop("When enabled viewing a truncated or newly created table "
+ "will throw a serialization error to prevent MVCC "
+ "discrepancies that were possible priot to 9.2.")

"prior" not "priot"

+ * Reset relvalidxmin if its old enough

Should be "it's" in this context.

Regards,
Marti


Re: RFC: Making TRUNCATE more "MVCC-safe"

From
Simon Riggs
Date:
On Sun, Mar 4, 2012 at 2:28 AM, Marti Raudsepp <marti@juffo.org> wrote:
> On Sat, Mar 3, 2012 at 14:53, Simon Riggs <simon@2ndquadrant.com> wrote:
>> Thanks Noah for drawing attention to this thread. I hadn't been
>> watching. As you say, this work would allow me to freeze rows at load
>> time and avoid the overhead of hint bit setting, which avoids
>> performance issues from hint bit setting in checksum patch.
>>
>> I've reviewed this patch and it seems OK to me. Good work Marti.
>
> Thanks! This approach wasn't universally liked, but if it gets us
> tangible benefits (COPY with frozenxid) then I guess it's a reason to
> reconsider.

Comments I see support this idea. If we did this purely for truncate
correctness I probably wouldn't care either, which is why I didn't
read this thread in the first place.

>> v2 patch attached, updated to latest HEAD. Patch adds
>> * relvalidxid is reset by VACUUM/ANALYZE to avoid wraparound failure
>
> Personally I'd rather keep this out of ANALYZE -- since its purpose is
> to collect stats; VACUUM is responsible for correctness (xid
> wraparound etc). But I don't feel strongly about this.

If there were a reason to do it, then I would agree. Later you point
out a reason, so I will make this change.

> A more important consideration is how this interacts with hot standby.
> Currently you compare OldestXmin to relvalidxmin to decide when to
> reset it. But the standby's OldestXmin may be older than the master's.
> (If VACUUM removes rows then this causes a recovery conflict, but
> AFAICT that won't happen if only relvalidxmin changes)

As of 9.1, the standby's oldestxmin is incorporated into the master's
via hot_standby_feedback, so it wouldn't typically be a problem these
days. It's possible that the standby has this set off by choice, in
which case anomalies could exist in the case that a VACUUM doesn't
clean any rows, as you say.

So we'll use vacrelstats->latestRemovedXid instead of OldestXmin when
we call vac_update_relstats()
which means ANALYZE should always pass InvalidTransactionId.

> It might be more robust to wait until relfrozenxid exceeds
> relvalidxmin -- by then, recovery conflict mechanisms will have taken
> care of killing all older snapshots, or am I mistaken?

It would be better to set it as early as possible to reduce the cost
of the test in heap_begin_scan()

> And a few typos the code...
>
> + gettext_noop("When enabled viewing a truncated or newly created table "
> + "will throw a serialization error to prevent MVCC "
> + "discrepancies that were possible priot to 9.2.")
>
> "prior" not "priot"

Yep

> + * Reset relvalidxmin if its old enough
>
> Should be "it's" in this context.

Cool, thanks for the review.

v3 attached.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: RFC: Making TRUNCATE more "MVCC-safe"

From
Simon Riggs
Date:
On Sun, Mar 4, 2012 at 9:59 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On Sun, Mar 4, 2012 at 2:28 AM, Marti Raudsepp <marti@juffo.org> wrote:
>> On Sat, Mar 3, 2012 at 14:53, Simon Riggs <simon@2ndquadrant.com> wrote:
>>> Thanks Noah for drawing attention to this thread. I hadn't been
>>> watching. As you say, this work would allow me to freeze rows at load
>>> time and avoid the overhead of hint bit setting, which avoids
>>> performance issues from hint bit setting in checksum patch.
>>>
>>> I've reviewed this patch and it seems OK to me. Good work Marti.

...

> v3 attached.

More detailed thoughts show that the test in heap_beginscan_internal()
is not right enough, i.e. wrong.

We need a specific XidInMVCCSnapshot test on the relvalidxid, so it
needs to be a specific xid, not an xmin because otherwise we can get
concurrent transactions failing, not just older transactions.

If we're going freeze tuples on load this needs to be watertight, so
some minor rework needed.

Of course, if we only have a valid xid on the class we might get new
columns added when we do repeated SELECT * statements using the same
snapshot while concurrent DDL occurs. That is impractical, so if we
define this as applying to rows it can work; if we want it to apply to
everything its getting more difficult.

Longer term we might fix this by making all catalog access use MVCC,
but that suffers the same problems with ALTER TABLEs that rewrite rows
to add columns. I don't see a neat future solution that is worth
waiting for.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: RFC: Making TRUNCATE more "MVCC-safe"

From
Simon Riggs
Date:
On Sun, Mar 4, 2012 at 1:02 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On Sun, Mar 4, 2012 at 9:59 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> On Sun, Mar 4, 2012 at 2:28 AM, Marti Raudsepp <marti@juffo.org> wrote:
>>> On Sat, Mar 3, 2012 at 14:53, Simon Riggs <simon@2ndquadrant.com> wrote:
>>>> Thanks Noah for drawing attention to this thread. I hadn't been
>>>> watching. As you say, this work would allow me to freeze rows at load
>>>> time and avoid the overhead of hint bit setting, which avoids
>>>> performance issues from hint bit setting in checksum patch.
>>>>
>>>> I've reviewed this patch and it seems OK to me. Good work Marti.
>
> ...
>
>> v3 attached.
>
> More detailed thoughts show that the test in heap_beginscan_internal()
> is not right enough, i.e. wrong.
>
> We need a specific XidInMVCCSnapshot test on the relvalidxid, so it
> needs to be a specific xid, not an xmin because otherwise we can get
> concurrent transactions failing, not just older transactions.

Marti, please review this latest version which has new isolation tests added.

This does both TRUNCATE and CREATE TABLE.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: RFC: Making TRUNCATE more "MVCC-safe"

From
Robert Haas
Date:
On Sun, Mar 4, 2012 at 11:39 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> Marti, please review this latest version which has new isolation tests added.
>
> This does both TRUNCATE and CREATE TABLE.

I don't see any need for a GUC to control this behavior.  The current
behavior is wrong, so if we're going to choose this path to fix it,
then we should just fix it, period.  The narrow set of circumstances
in which it might be beneficial to disable this behavior doesn't seem
to me to be sufficient to justify a behavior-changing GUC.

It does not seem right that the logic for detecting the serialization
error is in heap_beginscan_internal().  Surely this is just as much of
a problem for an index-scan or index-only-scan.  We don't want to
patch all those places individually, either: I think the check should
happen right around the time we initially lock the relation and build
its relcache entry.

The actual text of the error message could use some work.  Maybe
something like "could not serialize access due to concurrent DDL",
although I think we try to avoid using acronyms like DDL in
translatable strings.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: RFC: Making TRUNCATE more "MVCC-safe"

From
Simon Riggs
Date:
On Mon, Mar 5, 2012 at 4:32 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Sun, Mar 4, 2012 at 11:39 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> Marti, please review this latest version which has new isolation tests added.
>>
>> This does both TRUNCATE and CREATE TABLE.
>
> I don't see any need for a GUC to control this behavior.  The current
> behavior is wrong, so if we're going to choose this path to fix it,
> then we should just fix it, period.  The narrow set of circumstances
> in which it might be beneficial to disable this behavior doesn't seem
> to me to be sufficient to justify a behavior-changing GUC.

I agree behaviour is wrong, the only question is whether our users
rely in some way on that behaviour. Given the long discussion on that
point earlier I thought it best to add a GUC. Easy to remove, now or
later.

> It does not seem right that the logic for detecting the serialization
> error is in heap_beginscan_internal().  Surely this is just as much of
> a problem for an index-scan or index-only-scan.

err, very good point. Doh.

> We don't want to
> patch all those places individually, either: I think the check should
> happen right around the time we initially lock the relation and build
> its relcache entry.

OK, that makes sense and works if we need to rebuild relcache.

> The actual text of the error message could use some work.  Maybe
> something like "could not serialize access due to concurrent DDL",
> although I think we try to avoid using acronyms like DDL in
> translatable strings.

Yeh that was designed-to-be-replaced text. We do use DDL already
elsewhere without really explaining it; its also one of those acronyms
that doesn't actually explain what it really means very well. So I
like the phrase you suggest.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: RFC: Making TRUNCATE more "MVCC-safe"

From
Simon Riggs
Date:
On Mon, Mar 5, 2012 at 4:46 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

>> It does not seem right that the logic for detecting the serialization
>> error is in heap_beginscan_internal().  Surely this is just as much of
>> a problem for an index-scan or index-only-scan.
>
> err, very good point. Doh.
>
>> We don't want to
>> patch all those places individually, either: I think the check should
>> happen right around the time we initially lock the relation and build
>> its relcache entry.
>
> OK, that makes sense and works if we need to rebuild relcache.

Except the reason to do it at the start of the scan is that is the
first time a specific snapshot has been associated with a relation and
also the last point we can apply the check before the errant behaviour
occurs.

If we reject locks against tables that might be used against an
illegal snapshot then we could easily prevent valid snapshot use cases
when a transaction has multiple snapshots, one illegal, one not.

We can certainly have a looser test when we first get the lock and
then another test later, but I don't think we can avoid making all
scans apply this test. And while I'm there, we have to add tests for
things like index build scans.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: RFC: Making TRUNCATE more "MVCC-safe"

From
Robert Haas
Date:
On Mon, Mar 5, 2012 at 11:46 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> I agree behaviour is wrong, the only question is whether our users
> rely in some way on that behaviour. Given the long discussion on that
> point earlier I thought it best to add a GUC. Easy to remove, now or
> later.

AFAICT, all the discussion upthread was about whether we ought to be
trying to implement this in some entirely-different way that doesn't
rely on storing XIDs in the catalog.  I have a feeling that there are
a whole lot more cases like this and that we're in for a lot of
unpleasant nastiness if we go very far down this route; pg_constraint
is another one, as SnapshotNow can see constraints that may not be
valid under the query's MVCC snapshot.  On the other hand, if someone
comes up with a better way, I suppose we can always rip this out.  In
any case, I don't remember anyone saying that this needed to be
configurable.

Speaking of that, though, I have one further thought on this: we need
to be absolutely certain that autovacuum is going to prevent this XID
value from wrapping around.  I suppose this is safe since, even if
autovacuum is turned off, we'll forcibly kick it off every so often to
advance relfrozenxid, and that will reset relvalidxid while it's
there.  But then again on second thought, what if relvalidxid lags
relfrozenxid?  Then the emergency autovacuum might not kick in until
relvalidxid has already wrapped around.  I think that could happen
after a TRUNCATE, perhaps, since I think that would leave relfrozenxid
alone while advancing relvalidxid.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: RFC: Making TRUNCATE more "MVCC-safe"

From
Robert Haas
Date:
On Mon, Mar 5, 2012 at 12:42 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On Mon, Mar 5, 2012 at 4:46 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>>> It does not seem right that the logic for detecting the serialization
>>> error is in heap_beginscan_internal().  Surely this is just as much of
>>> a problem for an index-scan or index-only-scan.
>>
>> err, very good point. Doh.
>>
>>> We don't want to
>>> patch all those places individually, either: I think the check should
>>> happen right around the time we initially lock the relation and build
>>> its relcache entry.
>>
>> OK, that makes sense and works if we need to rebuild relcache.
>
> Except the reason to do it at the start of the scan is that is the
> first time a specific snapshot has been associated with a relation and
> also the last point we can apply the check before the errant behaviour
> occurs.
>
> If we reject locks against tables that might be used against an
> illegal snapshot then we could easily prevent valid snapshot use cases
> when a transaction has multiple snapshots, one illegal, one not.
>
> We can certainly have a looser test when we first get the lock and
> then another test later, but I don't think we can avoid making all
> scans apply this test. And while I'm there, we have to add tests for
> things like index build scans.

Well, there's no point that I can see in having two checks.  I just
dislike the idea that we have to remember to add this check for every
method of accessing the relation - doesn't seem terribly future-proof.It gets even worse if you start adding checks to
DDLcode paths - if 
we're going to do that, we really need to cover them all, and that
doesn't seem very practical if they're going to spread out all over
the place.

I don't understand your comment that a snapshot doesn't get associated
with a relation until scan time.  I believe we associated a snapshot
with each query before we even know what relations are involved; that
query then gets passed down to all the individual scans.  The query
also opens and locks those relations.  We ought to be able to arrange
for the query snapshot to be cross-checked at that point, rather than
waiting until scan-start time.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: RFC: Making TRUNCATE more "MVCC-safe"

From
Noah Misch
Date:
On Mon, Feb 13, 2012 at 09:29:56AM -0500, Robert Haas wrote:
> On Fri, Feb 10, 2012 at 11:46 PM, Noah Misch <noah@leadboat.com> wrote:
> > I've yet to see an MVCC anomaly that one can reproduce at REPEATABLE READ and
> > not at READ COMMITTED. ?They tend to be narrow race conditions at READ
> > COMMITTED, yet easy to demonstrate at REPEATABLE READ. ?Related:
> > http://archives.postgresql.org/pgsql-performance/2011-02/msg00451.php
> 
> Yeah.  Well, that's actually an interesting example, because it
> illustrates how general this problem is.  We could potentially get
> ourselves into a situation where just about every system catalog table
> needs an xmin field to store the point at which the object came into
> existence - or for that matter, was updated.

I can see this strategy applying to many relation-pertinent system catalogs.
Do you foresee applications to non-relation catalogs?

In any event, I think a pg_class.relvalidxmin is the right starting point.
One might imagine a family of relvalidxmin, convalidxmin, indcheckxmin
(already exists), inhvalidxmin, and attvalidxmin.  relvalidxmin is like the
AccessExclusiveLock of that family; it necessarily blocks everything that
might impugn the others.  The value in extending this to more catalogs is the
ability to narrow the impact of failing the check.  A failed indcheckxmin
comparison merely excludes plans involving the index.  A failed inhvalidxmin
check might just skip recursion to the table in question.  Those are further
refinements, much like using weaker heavyweight lock types.

> But it's not quite the
> same as the xmin of the row itself, because some updates might be
> judged not to matter.  There could also be intermediate cases where
> updates are invalidating for some purposes but not others.  I think
> we'd better get our hands around more of the problem space before we
> start trying to engineer solutions.

I'm not seeing that problem.  Any operation that would update some xmin
horizon should set it to the greater of its current value and the value the
operation needs for its own correctness.  If you have something in mind that
needs more, could you elaborate?

> > Incidentally, people use READ COMMITTED because they don't question the
> > default, not because they know hazards of REPEATABLE READ. ?I don't know the
> > bustedness you speak of; could we improve the documentation to inform folks?
> 
> The example that I remember was related to SELECT FOR UPDATE/SELECT
> FOR SHARE.  The idea of those statements is that you want to prevent
> the row from being updated or deleted until some other concurrent
> action is complete; for example, in the case of a foreign key, we'd
> like to prevent the referenced row from being deleted or updated in
> the relevant columns until the inserting transaction is committed.
> But it doesn't work, because when the updating or deleting process
> gets done with the lock wait, they are still using the same snapshot
> as before, and merrily do exactly the the thing that the lock-wait was
> supposed to prevent.  If an actual UPDATE is used, it's safe (I
> think): anyone who was going to UPDATE or DELETE the row will fail
> with some kind of serialization error.  But a SELECT FOR UPDATE that
> commits is treated more like an UPDATE that rolls back: it's as if the
> lock never existed.  Someone (Florian?) proposed a patch to change
> this, but it seemed problematic for reasons I no longer exactly
> remember.

Thanks.  I vaguely remember that thread.

nm


Re: RFC: Making TRUNCATE more "MVCC-safe"

From
Simon Riggs
Date:
On Mon, Mar 5, 2012 at 5:49 PM, Robert Haas <robertmhaas@gmail.com> wrote:

> Well, there's no point that I can see in having two checks.  I just
> dislike the idea that we have to remember to add this check for every
> method of accessing the relation - doesn't seem terribly future-proof.
>  It gets even worse if you start adding checks to DDL code paths - if
> we're going to do that, we really need to cover them all, and that
> doesn't seem very practical if they're going to spread out all over
> the place.

Understood. Will look.

> I don't understand your comment that a snapshot doesn't get associated
> with a relation until scan time.  I believe we associated a snapshot
> with each query before we even know what relations are involved; that
> query then gets passed down to all the individual scans.  The query
> also opens and locks those relations.  We ought to be able to arrange
> for the query snapshot to be cross-checked at that point, rather than
> waiting until scan-start time.

What's to stop other code using an older snapshot explicitly? That
fear may be bogus.

Any suggestions? ISTM we don't know whether we're already locked until
we get to LockAcquire() and there's no easy way to pass down snapshot
information through that, let alone handle RI snapshots. Ideas please.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: RFC: Making TRUNCATE more "MVCC-safe"

From
Robert Haas
Date:
On Mon, Mar 5, 2012 at 2:22 PM, Noah Misch <noah@leadboat.com> wrote:
> On Mon, Feb 13, 2012 at 09:29:56AM -0500, Robert Haas wrote:
>> On Fri, Feb 10, 2012 at 11:46 PM, Noah Misch <noah@leadboat.com> wrote:
>> > I've yet to see an MVCC anomaly that one can reproduce at REPEATABLE READ and
>> > not at READ COMMITTED. ?They tend to be narrow race conditions at READ
>> > COMMITTED, yet easy to demonstrate at REPEATABLE READ. ?Related:
>> > http://archives.postgresql.org/pgsql-performance/2011-02/msg00451.php
>>
>> Yeah.  Well, that's actually an interesting example, because it
>> illustrates how general this problem is.  We could potentially get
>> ourselves into a situation where just about every system catalog table
>> needs an xmin field to store the point at which the object came into
>> existence - or for that matter, was updated.
>
> I can see this strategy applying to many relation-pertinent system catalogs.
> Do you foresee applications to non-relation catalogs?

Well, in theory, we have similar issues if, say, a query uses a
function that didn't exist at the time the snapshot as taken; the
actual results the user sees may not be consistent with any serial
execution schedule.  And the same could be true for any other SQL
object.  It's unclear that those cases are as compelling as this one,
but then again it's unclear that no one will ever want to fix them,
either.  For example, suppose we have a view v over a table t that
calls a function f.  Somebody alters f to give different results and,
in the same transaction, modifies the contents of t (but no DDL).
This doesn't strike me as a terribly unlikely scenario; the change to
t could well be envisioned as a compensating transaction.  But now if
somebody uses the new definition of f against the old contents of t,
the user may fail to get what they were hoping for out of bundling
those changes together in one transaction.

Now, maybe we're never going to fix those kinds of anomalies anyway,
but if we go with this architecture, then I think the chances of it
ever being palatable to try are pretty low.

> In any event, I think a pg_class.relvalidxmin is the right starting point.
> One might imagine a family of relvalidxmin, convalidxmin, indcheckxmin
> (already exists), inhvalidxmin, and attvalidxmin.  relvalidxmin is like the
> AccessExclusiveLock of that family; it necessarily blocks everything that
> might impugn the others.  The value in extending this to more catalogs is the
> ability to narrow the impact of failing the check.  A failed indcheckxmin
> comparison merely excludes plans involving the index.  A failed inhvalidxmin
> check might just skip recursion to the table in question.  Those are further
> refinements, much like using weaker heavyweight lock types.

Yes, good parallel.

>> But it's not quite the
>> same as the xmin of the row itself, because some updates might be
>> judged not to matter.  There could also be intermediate cases where
>> updates are invalidating for some purposes but not others.  I think
>> we'd better get our hands around more of the problem space before we
>> start trying to engineer solutions.
>
> I'm not seeing that problem.  Any operation that would update some xmin
> horizon should set it to the greater of its current value and the value the
> operation needs for its own correctness.  If you have something in mind that
> needs more, could you elaborate?

Well, consider something like CLUSTER.  It's perfectly OK for CLUSTER
to operate on a table that has been truncated since CLUSTER's snapshot
was taken, and no serialization anomaly is created that would not have
already existed as a result of the non-MVCC-safe TRUNCATE.  On the
other hand, if CLUSTER operates on a table that was created since
CLUSTER's snapshot was taken, then you have a bona fide serialization
anomaly.  Maybe not a very important one, but does that prove that
there's no significant problem of this type in general, or just
nobody's thought through all the cases yet?  After all, the issues
with CREATE TABLE/TRUNCATE vs. a concurrent SELECT have been around
for a very long time, and we're only just getting around to looking at
them, so I don't have much confidence that there aren't other cases
floating around out there.

I guess another way to put this is that you could need "locks" of a
great number of different strengths to really handle all the cases.
It's going to be unappealing to, say, set the relation xmin when
setting the constraint xmin would do, or to fail for a concurrent
TRUNCATE as well as a concurrent CREATE TABLE when only the latter
logically requires a failure.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: RFC: Making TRUNCATE more "MVCC-safe"

From
Simon Riggs
Date:
On Mon, Mar 5, 2012 at 8:46 PM, Robert Haas <robertmhaas@gmail.com> wrote:

>> In any event, I think a pg_class.relvalidxmin is the right starting point.
>> One might imagine a family of relvalidxmin, convalidxmin, indcheckxmin
>> (already exists), inhvalidxmin, and attvalidxmin.  relvalidxmin is like the
>> AccessExclusiveLock of that family; it necessarily blocks everything that
>> might impugn the others.  The value in extending this to more catalogs is the
>> ability to narrow the impact of failing the check.  A failed indcheckxmin
>> comparison merely excludes plans involving the index.  A failed inhvalidxmin
>> check might just skip recursion to the table in question.  Those are further
>> refinements, much like using weaker heavyweight lock types.
>
> Yes, good parallel.

Did you guys get my comment about not being able to use an xmin value,
we have to use an xid value and to a an XidInMVCCSnapshot() test? Just
checking whether you agree/disagree.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: RFC: Making TRUNCATE more "MVCC-safe"

From
Noah Misch
Date:
On Sun, Mar 04, 2012 at 01:02:57PM +0000, Simon Riggs wrote:
> More detailed thoughts show that the test in heap_beginscan_internal()
> is not right enough, i.e. wrong.
> 
> We need a specific XidInMVCCSnapshot test on the relvalidxid, so it
> needs to be a specific xid, not an xmin because otherwise we can get
> concurrent transactions failing, not just older transactions.

Good point; I agree.  indcheckxmin's level of pessimism isn't appropriate for
this new check.

> If we're going freeze tuples on load this needs to be watertight, so
> some minor rework needed.
> 
> Of course, if we only have a valid xid on the class we might get new
> columns added when we do repeated SELECT * statements using the same
> snapshot while concurrent DDL occurs. That is impractical, so if we
> define this as applying to rows it can work; if we want it to apply to
> everything its getting more difficult.

Excess columns seem less grave to me than excess or missing rows.  I'm having
difficulty thinking up an explanation for that opinion.


Re: RFC: Making TRUNCATE more "MVCC-safe"

From
Noah Misch
Date:
On Mon, Mar 05, 2012 at 03:46:16PM -0500, Robert Haas wrote:
> On Mon, Mar 5, 2012 at 2:22 PM, Noah Misch <noah@leadboat.com> wrote:
> > I can see this strategy applying to many relation-pertinent system catalogs.
> > Do you foresee applications to non-relation catalogs?
> 
> Well, in theory, we have similar issues if, say, a query uses a
> function that didn't exist at the time the snapshot as taken; the
> actual results the user sees may not be consistent with any serial
> execution schedule.  And the same could be true for any other SQL
> object.  It's unclear that those cases are as compelling as this one,
> but then again it's unclear that no one will ever want to fix them,
> either.  For example, suppose we have a view v over a table t that
> calls a function f.  Somebody alters f to give different results and,
> in the same transaction, modifies the contents of t (but no DDL).
> This doesn't strike me as a terribly unlikely scenario; the change to
> t could well be envisioned as a compensating transaction.  But now if
> somebody uses the new definition of f against the old contents of t,
> the user may fail to get what they were hoping for out of bundling
> those changes together in one transaction.

Good example.

> Now, maybe we're never going to fix those kinds of anomalies anyway,
> but if we go with this architecture, then I think the chances of it
> ever being palatable to try are pretty low.

Why?

> >> But it's not quite the
> >> same as the xmin of the row itself, because some updates might be
> >> judged not to matter. ?There could also be intermediate cases where
> >> updates are invalidating for some purposes but not others. ?I think
> >> we'd better get our hands around more of the problem space before we
> >> start trying to engineer solutions.
> >
> > I'm not seeing that problem. ?Any operation that would update some xmin
> > horizon should set it to the greater of its current value and the value the
> > operation needs for its own correctness. ?If you have something in mind that
> > needs more, could you elaborate?

Simon's point about xmin vs. xid probably leads to an example.  One value is
fine for TRUNCATE, because only the most recent TRUNCATE matters.  Not all DDL
is so simple.

> Well, consider something like CLUSTER.  It's perfectly OK for CLUSTER
> to operate on a table that has been truncated since CLUSTER's snapshot
> was taken, and no serialization anomaly is created that would not have
> already existed as a result of the non-MVCC-safe TRUNCATE.  On the
> other hand, if CLUSTER operates on a table that was created since
> CLUSTER's snapshot was taken, then you have a bona fide serialization
> anomaly.

Core CLUSTER does not use any MVCC snapshot.  We do push one for the benefit
of functions called during the reindex phase, but it does not appear that you
speak of that snapshot.  Could you elaborate this example?

> Maybe not a very important one, but does that prove that
> there's no significant problem of this type in general, or just
> nobody's thought through all the cases yet?  After all, the issues
> with CREATE TABLE/TRUNCATE vs. a concurrent SELECT have been around
> for a very long time, and we're only just getting around to looking at
> them, so I don't have much confidence that there aren't other cases
> floating around out there.

Granted.

Thanks,
nm


Re: RFC: Making TRUNCATE more "MVCC-safe"

From
Robert Haas
Date:
On Tue, Mar 6, 2012 at 5:43 AM, Noah Misch <noah@leadboat.com> wrote:
>> Now, maybe we're never going to fix those kinds of anomalies anyway,
>> but if we go with this architecture, then I think the chances of it
>> ever being palatable to try are pretty low.
>
> Why?

Because it'll require at least one XID column in every system catalog
that represents an SQL catalog to plug all the cases, and I doubt very
much that we want to go there.

> Simon's point about xmin vs. xid probably leads to an example.  One value is
> fine for TRUNCATE, because only the most recent TRUNCATE matters.  Not all DDL
> is so simple.

Yep.

>> Well, consider something like CLUSTER.  It's perfectly OK for CLUSTER
>> to operate on a table that has been truncated since CLUSTER's snapshot
>> was taken, and no serialization anomaly is created that would not have
>> already existed as a result of the non-MVCC-safe TRUNCATE.  On the
>> other hand, if CLUSTER operates on a table that was created since
>> CLUSTER's snapshot was taken, then you have a bona fide serialization
>> anomaly.
>
> Core CLUSTER does not use any MVCC snapshot.  We do push one for the benefit
> of functions called during the reindex phase, but it does not appear that you
> speak of that snapshot.  Could you elaborate this example?

Imagine this:

- Transaction #1 acquires a snapshot.
- Transaction #2 creates tables A, inserts a row into table B, and then commits.
- Transaction #1 tries to CLUSTER A and then select from B.

The only serial execution schedules are T1 < T2, in which case the
transaction fails, or T2 < T1, in which case the row is seen.  But
what actually happens is that the row isn't seen and yet the
transaction doesn't fail.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: RFC: Making TRUNCATE more "MVCC-safe"

From
Noah Misch
Date:
On Tue, Mar 06, 2012 at 08:36:05AM -0500, Robert Haas wrote:
> On Tue, Mar 6, 2012 at 5:43 AM, Noah Misch <noah@leadboat.com> wrote:
> >> Well, consider something like CLUSTER. ?It's perfectly OK for CLUSTER
> >> to operate on a table that has been truncated since CLUSTER's snapshot
> >> was taken, and no serialization anomaly is created that would not have
> >> already existed as a result of the non-MVCC-safe TRUNCATE. ?On the
> >> other hand, if CLUSTER operates on a table that was created since
> >> CLUSTER's snapshot was taken, then you have a bona fide serialization
> >> anomaly.
> >
> > Core CLUSTER does not use any MVCC snapshot. ?We do push one for the benefit
> > of functions called during the reindex phase, but it does not appear that you
> > speak of that snapshot. ?Could you elaborate this example?
> 
> Imagine this:
> 
> - Transaction #1 acquires a snapshot.
> - Transaction #2 creates tables A, inserts a row into table B, and then commits.
> - Transaction #1 tries to CLUSTER A and then select from B.
> 
> The only serial execution schedules are T1 < T2, in which case the
> transaction fails, or T2 < T1, in which case the row is seen.  But
> what actually happens is that the row isn't seen and yet the
> transaction doesn't fail.

For the purpose of contemplating this anomaly, one could just as well replace
CLUSTER with GRANT, COMMENT ON TABLE, or any other command that operates on a
table, correct?

I agree this test case is good to keep in mind while designing, but we could
well conclude not to bother improving it.


Re: RFC: Making TRUNCATE more "MVCC-safe"

From
Robert Haas
Date:
On Wed, Mar 7, 2012 at 7:49 AM, Noah Misch <noah@leadboat.com> wrote:
> On Tue, Mar 06, 2012 at 08:36:05AM -0500, Robert Haas wrote:
>> On Tue, Mar 6, 2012 at 5:43 AM, Noah Misch <noah@leadboat.com> wrote:
>> >> Well, consider something like CLUSTER. ?It's perfectly OK for CLUSTER
>> >> to operate on a table that has been truncated since CLUSTER's snapshot
>> >> was taken, and no serialization anomaly is created that would not have
>> >> already existed as a result of the non-MVCC-safe TRUNCATE. ?On the
>> >> other hand, if CLUSTER operates on a table that was created since
>> >> CLUSTER's snapshot was taken, then you have a bona fide serialization
>> >> anomaly.
>> >
>> > Core CLUSTER does not use any MVCC snapshot. ?We do push one for the benefit
>> > of functions called during the reindex phase, but it does not appear that you
>> > speak of that snapshot. ?Could you elaborate this example?
>>
>> Imagine this:
>>
>> - Transaction #1 acquires a snapshot.
>> - Transaction #2 creates tables A, inserts a row into table B, and then commits.
>> - Transaction #1 tries to CLUSTER A and then select from B.
>>
>> The only serial execution schedules are T1 < T2, in which case the
>> transaction fails, or T2 < T1, in which case the row is seen.  But
>> what actually happens is that the row isn't seen and yet the
>> transaction doesn't fail.
>
> For the purpose of contemplating this anomaly, one could just as well replace
> CLUSTER with GRANT, COMMENT ON TABLE, or any other command that operates on a
> table, correct?
>
> I agree this test case is good to keep in mind while designing, but we could
> well conclude not to bother improving it.

All true.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: RFC: Making TRUNCATE more "MVCC-safe"

From
Simon Riggs
Date:
On Wed, Mar 7, 2012 at 2:59 PM, Robert Haas <robertmhaas@gmail.com> wrote:

> All true.

So gentlemen, do we think this is worth pursuing further for this release?

I'm sure usual arguments apply all round, so I'll skip that part.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: RFC: Making TRUNCATE more "MVCC-safe"

From
Robert Haas
Date:
On Wed, Mar 7, 2012 at 10:26 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On Wed, Mar 7, 2012 at 2:59 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> All true.
>
> So gentlemen, do we think this is worth pursuing further for this release?
>
> I'm sure usual arguments apply all round, so I'll skip that part.

This patch is awfully late to the party, but if we can nail it down
reasonably quickly I guess I'd be in favor of slipping something in.
I am not thrilled with the design as it stands, but bulk loading is a
known and serious pain point for us, so it would be awfully nice to
improve it.  I'm not sure whether we should only go as far as setting
HEAP_XMIN_COMMITTED or whether we should actually try to mark the
tuples with FrozenXID.  The former has the advantage of (I think) not
requiring any other changes to preserve MVCC semantics while the
latter is, obviously, a bigger performance improvement.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: RFC: Making TRUNCATE more "MVCC-safe"

From
Simon Riggs
Date:
On Wed, Mar 7, 2012 at 5:39 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Mar 7, 2012 at 10:26 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> On Wed, Mar 7, 2012 at 2:59 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>>> All true.
>>
>> So gentlemen, do we think this is worth pursuing further for this release?
>>
>> I'm sure usual arguments apply all round, so I'll skip that part.
>
> This patch is awfully late to the party, but if we can nail it down
> reasonably quickly I guess I'd be in favor of slipping something in.

Cool

> I am not thrilled with the design as it stands, but bulk loading is a
> known and serious pain point for us, so it would be awfully nice to
> improve it.  I'm not sure whether we should only go as far as setting
> HEAP_XMIN_COMMITTED or whether we should actually try to mark the
> tuples with FrozenXID.  The former has the advantage of (I think) not
> requiring any other changes to preserve MVCC semantics while the
> latter is, obviously, a bigger performance improvement.

It's the other way around. Setting to FrozenTransactionId makes the
test in XidInMVCCSnapshot() pass when accessed by later commands in
the same transaction. If we just set the hint we need to play around
to get it accepted. So the frozen route is both best for performance
and least impact on fastpath visibility code. That part of the code is
solid.

The only problem is the visibility from older snapshots, we just
need/desire a better place to put the test. So I'll finish that off.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: RFC: Making TRUNCATE more "MVCC-safe"

From
Robert Haas
Date:
On Wed, Mar 7, 2012 at 2:06 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> I am not thrilled with the design as it stands, but bulk loading is a
>> known and serious pain point for us, so it would be awfully nice to
>> improve it.  I'm not sure whether we should only go as far as setting
>> HEAP_XMIN_COMMITTED or whether we should actually try to mark the
>> tuples with FrozenXID.  The former has the advantage of (I think) not
>> requiring any other changes to preserve MVCC semantics while the
>> latter is, obviously, a bigger performance improvement.
>
> It's the other way around. Setting to FrozenTransactionId makes the
> test in XidInMVCCSnapshot() pass when accessed by later commands in
> the same transaction. If we just set the hint we need to play around
> to get it accepted. So the frozen route is both best for performance
> and least impact on fastpath visibility code. That part of the code is
> solid.

Your comment is reminding me that there are actually two problems
here, or at least I think there are.

1. Some other transaction might look at the tuples.
2. An older snapshot (e.g. cursor) might look at the tuples.

Case #1 can happen when we create a table, insert some data, and
commit, and then some other transaction that took a snapshot before we
committed reads the table.  It's OK if the tuples are marked
HEAP_XMIN_COMMITTED, because if we abort no other transaction will
ever see the new pg_class row as alive, and therefore no other
transaction can examine the table contents.  But using FrozenXID as
the tuple xmin would allow those tuples to be seen by a transaction
that took its snapshot before we committed; this is the problem that
relvalidxid is designed to fix, and what I was thinking of when I said
that we need more infrastructure to handle the FrozenXID case.

Case #2 is certainly a problem for FrozenXID as well, because anything
that's marked with FrozenXID is going to look visible to everybody,
including our older snapshots.  And I gather you're saying it's also a
problem for HEAP_XMIN_COMMITTED.  I had assumed that the way we were
fixing this problem was to disable these optimizations for
transactions that had more than one snapshot floating around.  I'm not
sure whether the patch does that or not, but I think it probably needs
to, unless you have some other idea for how to fix this.  It doesn't
seem like an important restriction in practice because it's unlikely
that anyone would keep a cursor open across a bulk data load - and if
they do, this isn't the only problem they're going to have.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: RFC: Making TRUNCATE more "MVCC-safe"

From
Simon Riggs
Date:
On Wed, Mar 7, 2012 at 8:21 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Mar 7, 2012 at 2:06 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>>> I am not thrilled with the design as it stands, but bulk loading is a
>>> known and serious pain point for us, so it would be awfully nice to
>>> improve it.  I'm not sure whether we should only go as far as setting
>>> HEAP_XMIN_COMMITTED or whether we should actually try to mark the
>>> tuples with FrozenXID.  The former has the advantage of (I think) not
>>> requiring any other changes to preserve MVCC semantics while the
>>> latter is, obviously, a bigger performance improvement.
>>
>> It's the other way around. Setting to FrozenTransactionId makes the
>> test in XidInMVCCSnapshot() pass when accessed by later commands in
>> the same transaction. If we just set the hint we need to play around
>> to get it accepted. So the frozen route is both best for performance
>> and least impact on fastpath visibility code. That part of the code is
>> solid.
>
> Your comment is reminding me that there are actually two problems
> here, or at least I think there are.
>
> 1. Some other transaction might look at the tuples.
> 2. An older snapshot (e.g. cursor) might look at the tuples.
>
> Case #1 can happen when we create a table, insert some data, and
> commit, and then some other transaction that took a snapshot before we
> committed reads the table.  It's OK if the tuples are marked
> HEAP_XMIN_COMMITTED, because if we abort no other transaction will
> ever see the new pg_class row as alive, and therefore no other
> transaction can examine the table contents.  But using FrozenXID as
> the tuple xmin would allow those tuples to be seen by a transaction
> that took its snapshot before we committed; this is the problem that
> relvalidxid is designed to fix, and what I was thinking of when I said
> that we need more infrastructure to handle the FrozenXID case.

Yes. If your purpose was to summarise, then yes that's where we're at.

> Case #2 is certainly a problem for FrozenXID as well, because anything
> that's marked with FrozenXID is going to look visible to everybody,
> including our older snapshots.  And I gather you're saying it's also a
> problem for HEAP_XMIN_COMMITTED.

The problem there is that later subtransactions often have xids that
are greater than xmax, so the xid shows as running when we do
XidInMVCCSnapshot(), which must then be altered for this one weird
case. I tried that and not happy with result.

> I had assumed that the way we were
> fixing this problem was to disable these optimizations for
> transactions that had more than one snapshot floating around.  I'm not
> sure whether the patch does that or not, but I think it probably needs
> to

It does. I thought you already read the patch?

>, unless you have some other idea for how to fix this.  It doesn't
> seem like an important restriction in practice because it's unlikely
> that anyone would keep a cursor open across a bulk data load - and if
> they do, this isn't the only problem they're going to have.

Exactly.

So we're all good, apart from deciding the exact place to prevent
other transaction's older snapshots from seeing the newly frozen rows.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: RFC: Making TRUNCATE more "MVCC-safe"

From
Robert Haas
Date:
On Wed, Mar 7, 2012 at 5:41 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> Case #2 is certainly a problem for FrozenXID as well, because anything
>> that's marked with FrozenXID is going to look visible to everybody,
>> including our older snapshots.  And I gather you're saying it's also a
>> problem for HEAP_XMIN_COMMITTED.
>
> The problem there is that later subtransactions often have xids that
> are greater than xmax, so the xid shows as running when we do
> XidInMVCCSnapshot(), which must then be altered for this one weird
> case. I tried that and not happy with result.

Altering XidInMVCCSnapshot() seems like a good thing to avoid, but I
confess I don't quite follow what you're describing here otherwise.

>> I had assumed that the way we were
>> fixing this problem was to disable these optimizations for
>> transactions that had more than one snapshot floating around.  I'm not
>> sure whether the patch does that or not, but I think it probably needs
>> to
>
> It does. I thought you already read the patch?

I glanced over it, but did not look through it in detail.  I'll do a
more careful look at your next version.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: RFC: Making TRUNCATE more "MVCC-safe"

From
Simon Riggs
Date:
On Fri, Mar 9, 2012 at 3:46 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Mar 7, 2012 at 5:41 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>>> Case #2 is certainly a problem for FrozenXID as well, because anything
>>> that's marked with FrozenXID is going to look visible to everybody,
>>> including our older snapshots.  And I gather you're saying it's also a
>>> problem for HEAP_XMIN_COMMITTED.
>>
>> The problem there is that later subtransactions often have xids that
>> are greater than xmax, so the xid shows as running when we do
>> XidInMVCCSnapshot(), which must then be altered for this one weird
>> case. I tried that and not happy with result.
>
> Altering XidInMVCCSnapshot() seems like a good thing to avoid, but I
> confess I don't quite follow what you're describing here otherwise.
>
>>> I had assumed that the way we were
>>> fixing this problem was to disable these optimizations for
>>> transactions that had more than one snapshot floating around.  I'm not
>>> sure whether the patch does that or not, but I think it probably needs
>>> to
>>
>> It does. I thought you already read the patch?
>
> I glanced over it, but did not look through it in detail.  I'll do a
> more careful look at your next version.

I'm not confident about the restrictions this patch imposes and we
aren't close enough to a final version for me to honestly request this
be considered for this release. I think its time to close this door
for now.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: RFC: Making TRUNCATE more "MVCC-safe"

From
Robert Haas
Date:
On Fri, Mar 9, 2012 at 2:59 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On Fri, Mar 9, 2012 at 3:46 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Wed, Mar 7, 2012 at 5:41 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>>>> Case #2 is certainly a problem for FrozenXID as well, because anything
>>>> that's marked with FrozenXID is going to look visible to everybody,
>>>> including our older snapshots.  And I gather you're saying it's also a
>>>> problem for HEAP_XMIN_COMMITTED.
>>>
>>> The problem there is that later subtransactions often have xids that
>>> are greater than xmax, so the xid shows as running when we do
>>> XidInMVCCSnapshot(), which must then be altered for this one weird
>>> case. I tried that and not happy with result.
>>
>> Altering XidInMVCCSnapshot() seems like a good thing to avoid, but I
>> confess I don't quite follow what you're describing here otherwise.
>>
>>>> I had assumed that the way we were
>>>> fixing this problem was to disable these optimizations for
>>>> transactions that had more than one snapshot floating around.  I'm not
>>>> sure whether the patch does that or not, but I think it probably needs
>>>> to
>>>
>>> It does. I thought you already read the patch?
>>
>> I glanced over it, but did not look through it in detail.  I'll do a
>> more careful look at your next version.
>
> I'm not confident about the restrictions this patch imposes and we
> aren't close enough to a final version for me to honestly request this
> be considered for this release. I think its time to close this door
> for now.

OK, makes sense.  I was trying to hold my nose, because I really would
like to see this stuff work better than it does, but I had my doubts,
too.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: RFC: Making TRUNCATE more "MVCC-safe"

From
Jeff Davis
Date:
On Sun, 2012-03-04 at 16:39 +0000, Simon Riggs wrote:
> >> v3 attached.

Added to next commitfest.

Regards,Jeff Davis