Thread: Commits 8de72b and 5457a1 (COPY FREEZE)
http://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=8de72b66a2edcf12c812de0a73bd50b6b7d81d62

Part of that patch was reverted later:

http://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=5457a130d3a66db807d1e0ee2b8e829321809b83

I assume that refers to the discussion here:

http://archives.postgresql.org/pgsql-hackers/2012-02/msg01177.php

But that was quite a while ago, so is there a more recent discussion that prompted this commit now?

I am a little confused about the case where setting HEAP_XMIN_COMMITTED when loading the table in the same transaction is wrong. There was some discussion about subtransactions, but those problems only seemed to appear when the CREATE and the INSERT/COPY happened in different subtransactions.

So, I guess my question is, why the partial revert?

Regards,
Jeff Davis
On 4 December 2012 01:34, Jeff Davis <pgsql@j-davis.com> wrote:
> I assume that refers to the discussion here:
>
> http://archives.postgresql.org/pgsql-hackers/2012-02/msg01177.php
>
> But that was quite a while ago, so is there a more recent discussion
> that prompted this commit now?

Yes, this was discussed within the last month, in the thread "TRUNCATE SERIALIZABLE...".

The patch for that was already posted, so I committed it.

> I am a little confused about the case where setting HEAP_XMIN_COMMITTED
> when loading the table in the same transaction is wrong. There was some
> discussion about subtransactions, but those problems only seemed to
> appear when the CREATE and the INSERT/COPY happened in different
> subtransactions.
>
> So, I guess my question is, why the partial revert?

Well, first, because I realised that it wasn't wanted. I was re-reading the threads trying to figure out a way to help the checksum patch further.

Second, because I realised why it wasn't wanted. Setting HEAP_XMIN_COMMITTED causes MVCC violations within the transaction issuing the COPY. Accepting the MVCC violation requires explicit consent from the user, and if we have that, we may as well set everything we can. So there's no middle ground: nothing, or all frozen.

We could change that, but it would require some complex tweaks to the tqual routines, which I tried and was not happy with, since they would need to get more complex. It's possible a route through that minefield exists; I'm inclined to believe other approaches are more likely to yield benefit.

--
Simon Riggs
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Tue, 2012-12-04 at 10:15 +0000, Simon Riggs wrote:
> On 4 December 2012 01:34, Jeff Davis <pgsql@j-davis.com> wrote:
> > I assume that refers to the discussion here:
> >
> > http://archives.postgresql.org/pgsql-hackers/2012-02/msg01177.php
> >
> > But that was quite a while ago, so is there a more recent discussion
> > that prompted this commit now?
>
> Yes, this was discussed within the last month, in the thread "TRUNCATE SERIALIZABLE...".
>
> The patch for that was already posted, so I committed it.

Thank you for pointing me toward that thread.

While experimenting with COPY FREEZE, I noticed something: the simple case of

    BEGIN; CREATE TABLE ...; COPY ... WITH (FREEZE);

doesn't meet the pre-conditions. It only meets the conditions if preceded by a TRUNCATE, which all of the tests do. I looked into it, and I think the test:

    ... &&
    cstate->rel->rd_newRelfilenodeSubid == GetCurrentSubTransactionId()

should be:

    ... &&
    (cstate->rel->rd_newRelfilenodeSubid == GetCurrentSubTransactionId() ||
     cstate->rel->rd_createSubid == GetCurrentSubTransactionId())

(haven't tested). Another option to consider is that rd_newRelfilenodeSubid could be set on newly-created tables as well as newly-truncated tables.

Also, I recommend a hint along with the NOTICE when the pre-conditions aren't met to tell the user what they need to do. Perhaps something like:

    "Create or truncate the table in the same transaction as COPY FREEZE."

The documentation could expand on that, clarifying that a TRUNCATE in the same transaction (perhaps even ALTER?) allows a COPY FREEZE.

The code comment could be improved a little, while we're at it:

    "Note that the stronger test of exactly which subtransaction created
    it is crucial for correctness of this optimisation."

is a little vague; can you explain using the reasoning from the thread? I would say something like:

    "The transaction and subtransaction creating or truncating the table
    must match that of the COPY FREEZE. Otherwise, it could mix tuples
    from different transactions together, and it would be impossible to
    distinguish them for the purposes of atomicity."

> > I am a little confused about the case where setting HEAP_XMIN_COMMITTED
> > when loading the table in the same transaction is wrong. There was some
> > discussion about subtransactions, but those problems only seemed to
> > appear when the CREATE and the INSERT/COPY happened in different
> > subtransactions.
> >
> > So, I guess my question is, why the partial revert?
>
> Well, first, because I realised that it wasn't wanted. I was
> re-reading the threads trying to figure out a way to help the checksum
> patch further.
>
> Second, because I realised why it wasn't wanted. Setting
> HEAP_XMIN_COMMITTED causes MVCC violations within the transaction
> issuing the COPY.

After reading that thread, I still don't understand why it's unsafe to set HEAP_XMIN_COMMITTED in those conditions. Even if it is, I would think that a sufficiently narrow case -- such as CTAS outside of a transaction block -- would be safe, along with some slightly broader cases (like BEGIN; CREATE TABLE; INSERT/COPY). But perhaps that requires more discussion. I was just curious if there was a concrete problem that I was missing.

Regards,
Jeff Davis
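To make the two cases Jeff describes concrete, here is an untested sketch (table and file names are hypothetical):

    -- Does not qualify under the committed test, because CREATE alone
    -- does not set rd_newRelfilenodeSubid:
    BEGIN;
    CREATE TABLE t (a int);
    COPY t FROM '/path/to/data' WITH (FREEZE);  -- NOTICE: pre-conditions not met
    COMMIT;

    -- Qualifies, because TRUNCATE assigns a new relfilenode and records
    -- the current subtransaction in rd_newRelfilenodeSubid:
    BEGIN;
    TRUNCATE t;
    COPY t FROM '/path/to/data' WITH (FREEZE);  -- rows are written frozen
    COMMIT;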
On Tue, Dec 4, 2012 at 3:38 PM, Jeff Davis <pgsql@j-davis.com> wrote:
> After reading that thread, I still don't understand why it's unsafe to
> set HEAP_XMIN_COMMITTED in those conditions. Even if it is, I would
> think that a sufficiently narrow case -- such as CTAS outside of a
> transaction block -- would be safe, along with some slightly broader
> cases (like BEGIN; CREATE TABLE; INSERT/COPY).

I haven't looked at the committed patch - which seemed a bit precipitous to me given the stage the discussion was at - but I believe the general issue with HEAP_XMIN_COMMITTED is that there might be other snapshots in the same transaction, for example from open cursors.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
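For concreteness, one way a same-transaction snapshot can predate the load is an open cursor. A hypothetical, untested sketch (table and path invented) of the hazard Robert means:

    BEGIN;
    TRUNCATE t;                                -- meets the FREEZE pre-conditions
    DECLARE c CURSOR FOR SELECT * FROM t;      -- c's snapshot is taken here
    COPY t FROM '/path/to/data' WITH (FREEZE);
    FETCH ALL FROM c;  -- ordinary MVCC hides the new rows (their cmin is newer
                       -- than c's snapshot), but a pre-set committed/frozen
                       -- xmin defeats that test and the rows become visible
    COMMIT;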
Robert Haas <robertmhaas@gmail.com> writes:
> On Tue, Dec 4, 2012 at 3:38 PM, Jeff Davis <pgsql@j-davis.com> wrote:
>> After reading that thread, I still don't understand why it's unsafe to
>> set HEAP_XMIN_COMMITTED in those conditions. Even if it is, I would
>> think that a sufficiently narrow case -- such as CTAS outside of a
>> transaction block -- would be safe, along with some slightly broader
>> cases (like BEGIN; CREATE TABLE; INSERT/COPY).

> I haven't looked at the committed patch - which seemed a bit
> precipitous to me given the stage the discussion was at - but I
> believe the general issue with HEAP_XMIN_COMMITTED is that there might
> be other snapshots in the same transaction, for example from open
> cursors.

From memory, the tqual.c code assumes that any tuple with XMIN_COMMITTED couldn't possibly be from its own transaction, and thus it doesn't make the tests that would be appropriate for a tuple that is from the current transaction. Maybe it's all right anyway (i.e. if we should always treat such a tuple as good), but I don't recall exactly what's tested in those paths.

regards, tom lane
On Wed, 2012-12-05 at 19:43 -0500, Tom Lane wrote:
> From memory, the tqual.c code assumes that any tuple with XMIN_COMMITTED
> couldn't possibly be from its own transaction, and thus it doesn't make
> the tests that would be appropriate for a tuple that is from the current
> transaction. Maybe it's all right anyway (i.e. if we should always treat
> such a tuple as good), but I don't recall exactly what's tested in those
> paths.

Oh, I see; it probably warrants some more discussion. It seems like we could solve these problems if we narrow the conditions enough.

Anyway, the partial revert looks good, and the commit message seems appropriate (essentially, the code got ahead of the discussion).

Regards,
Jeff Davis
On 6 December 2012 00:43, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Tue, Dec 4, 2012 at 3:38 PM, Jeff Davis <pgsql@j-davis.com> wrote:
>>> After reading that thread, I still don't understand why it's unsafe to
>>> set HEAP_XMIN_COMMITTED in those conditions. Even if it is, I would
>>> think that a sufficiently narrow case -- such as CTAS outside of a
>>> transaction block -- would be safe, along with some slightly broader
>>> cases (like BEGIN; CREATE TABLE; INSERT/COPY).
>
>> I haven't looked at the committed patch - which seemed a bit
>> precipitous to me given the stage the discussion was at - but I
>> believe the general issue with HEAP_XMIN_COMMITTED is that there might
>> be other snapshots in the same transaction, for example from open
>> cursors.
>
> From memory, the tqual.c code assumes that any tuple with XMIN_COMMITTED
> couldn't possibly be from its own transaction, and thus it doesn't make
> the tests that would be appropriate for a tuple that is from the current
> transaction. Maybe it's all right anyway (i.e. if we should always treat
> such a tuple as good), but I don't recall exactly what's tested in those
> paths.

Yes. We'd need to add in a call to TransactionIdIsCurrentTransactionId(xmin), which is basically just wasted path in 99% of use cases.

I've looked at optimising TransactionIdIsCurrentTransactionId(), with what appears to be some vague success. The attached patch gives a more consistent response.

The other thing to do is to have a backend-local flag that gets set when we use HEAP_XMIN_COMMITTED anywhere. When not set, we just skip past the TransactionIdIsCurrent test altogether.

--
Simon Riggs
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachment
On 4 December 2012 20:38, Jeff Davis <pgsql@j-davis.com> wrote:
> The simple case of BEGIN; CREATE TABLE ...; COPY ... WITH (FREEZE);
> doesn't meet the pre-conditions. It only meets the conditions if
> preceded by a TRUNCATE, which all of the tests do. I looked into it, and
> I think the test:
>
>     ... &&
>     cstate->rel->rd_newRelfilenodeSubid == GetCurrentSubTransactionId()
>
> should be:
>
>     ... &&
>     (cstate->rel->rd_newRelfilenodeSubid == GetCurrentSubTransactionId() ||
>      cstate->rel->rd_createSubid == GetCurrentSubTransactionId())
>
> (haven't tested). Another option to consider is that
> rd_newRelfilenodeSubid could be set on newly-created tables as well as
> newly-truncated tables.

I'll expand the test above and add a new regression test. I focused on correctness ahead of a wide use case and missed this.

> Also, I recommend a hint along with the NOTICE when the pre-conditions
> aren't met to tell the user what they need to do. Perhaps something
> like:
>
>     "Create or truncate the table in the same transaction as COPY FREEZE."

OK, will add hint.

> The documentation could expand on that, clarifying that a TRUNCATE in
> the same transaction (perhaps even ALTER?) allows a COPY FREEZE.
>
> The code comment could be improved a little, while we're at it:
>
>     "Note that the stronger test of exactly which subtransaction created
>     it is crucial for correctness of this optimisation."
>
> is a little vague; can you explain using the reasoning from the thread?
> I would say something like:
>
>     "The transaction and subtransaction creating or truncating the table
>     must match that of the COPY FREEZE. Otherwise, it could mix tuples
>     from different transactions together, and it would be impossible to
>     distinguish them for the purposes of atomicity."

OK, will try.

> After reading that thread, I still don't understand why it's unsafe to
> set HEAP_XMIN_COMMITTED in those conditions. Even if it is, I would
> think that a sufficiently narrow case -- such as CTAS outside of a
> transaction block -- would be safe, along with some slightly broader
> cases (like BEGIN; CREATE TABLE; INSERT/COPY).

It *could* be safe, but it needs changes to the visibility rules, which, as explained, I had not been able to optimise sufficiently. The code is easy enough to add back in once it is sufficiently well optimised and we all agree.

--
Simon Riggs
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Hi,

On 2012-12-03 17:34:01 -0800, Jeff Davis wrote:
> http://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=8de72b66a2edcf12c812de0a73bd50b6b7d81d62

On the subject of that patch: I am not a big fan of only emitting a NOTICE if FREEZE wasn't properly used:

+   if (cstate->freeze && (hi_options & HEAP_INSERT_FROZEN) == 0)
+       ereport(NOTICE,
+               (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+                errmsg("FREEZE option specified but pre-conditions not met")));

Imo it should fail. Imagine adding FREEZE to the loading part of your application and not noticing that the optimization broke because somebody else changed something in another part of the loading procedure.

Not sure whether that was discussed previously.

Greetings,

Andres Freund

--
Andres Freund
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 6 December 2012 13:12, Andres Freund <andres@2ndquadrant.com> wrote:
> On the subject of that patch: I am not a big fan of only emitting a
> NOTICE if FREEZE wasn't properly used:
>
> +   if (cstate->freeze && (hi_options & HEAP_INSERT_FROZEN) == 0)
> +       ereport(NOTICE,
> +               (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> +                errmsg("FREEZE option specified but pre-conditions not met")));
>
> Imo it should fail. Imagine adding FREEZE to the loading part of your
> application and not noticing that the optimization broke because somebody
> else changed something in another part of the loading procedure.
>
> Not sure whether that was discussed previously.

It was. Only Robert and I spoke about that.

Also imagine having to analyze your code in detail to reevaluate the exact optimisation required each time. This way you get to add FREEZE and have it work always, with feedback if it's not optimized.

--
Simon Riggs
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 2012-12-06 14:07:32 +0000, Simon Riggs wrote:
> On 6 December 2012 13:12, Andres Freund <andres@2ndquadrant.com> wrote:
> > Imo it should fail. Imagine adding FREEZE to the loading part of your
> > application and not noticing that the optimization broke because somebody
> > else changed something in another part of the loading procedure.
> >
> > Not sure whether that was discussed previously.
>
> It was. Only Robert and I spoke about that.
>
> Also imagine having to analyze your code in detail to reevaluate the
> exact optimisation required each time. This way you get to add FREEZE
> and have it work always, with feedback if it's not optimized.

I remain unconvinced by that argument, but if I am alone with this, ok. Could we at least make it a WARNING? Nobody ever reads NOTICEs because they contain so much noise. And this isn't noise. It's a bug on the client side.

Greetings,

Andres Freund

--
Andres Freund
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 6 December 2012 14:12, Andres Freund <andres@2ndquadrant.com> wrote:
> I remain unconvinced by that argument, but if I am alone with this, ok.
> Could we at least make it a WARNING? Nobody ever reads NOTICEs because
> they contain so much noise. And this isn't noise. It's a bug on the
> client side.

It's not a bug. Requesting a useful, but not critical, optimisation is just a hint. The preconditions are not easy to understand, so I see no reason to punish people that misunderstand, or to cause programs to fail in ways that need detailed understanding to make them work again.

--
Simon Riggs
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
* Robert Haas (robertmhaas@gmail.com) wrote:
> I haven't looked at the committed patch - which seemed a bit

Disclaimer- neither have I, but..

When I last recall this discussion (likely in some bar in Europe), the problem was also that an independent session would be able to:

a) see that the table exists (due to SnapshotNow being used for lookups)
b) see rows in the table

Part 'a' is something which I think would be good to fix (but hard), and it sounds like this patch might make part 'b' happen, which I think makes the part 'a' problem worse. I'm guessing this isn't a problem for the TRUNCATE case because the second session would still be looking at the pre-TRUNCATE files on disk, right? Is that reference to exactly which files on disk to look at being used to address the CREATE problem too..?

That said, I love the idea of having a way to drop tuples in w/o having to go back and rewrite them three times..

Thanks,
Stephen
* Simon Riggs (simon@2ndQuadrant.com) wrote:
> It's not a bug. Requesting a useful, but not critical, optimisation is
> just a hint. The preconditions are not easy to understand, so I see no
> reason to punish people that misunderstand, or to cause programs to fail
> in ways that need detailed understanding to make them work again.

I tend to agree with Andres on this one. This feels a bit like accepting a command but then not actually following through on it if it turns out we can't actually do it. If it's truly an optimization (and I suspect my other email/question might provide insight into that), then it should be something we can 'just do' without needing to be asked to do it, along the same lines of not WAL'ing when the appropriate conditions are met (table created in this transaction, etc, etc).

Thanks,
Stephen
On Thu, 2012-12-06 at 11:55 -0500, Stephen Frost wrote:
> * Robert Haas (robertmhaas@gmail.com) wrote:
> When I last recall this discussion (likely in some bar in Europe), the
> problem was also that an independent session would be able to:
>
> a) see that the table exists (due to SnapshotNow being used for lookups)
> b) see rows in the table
>
> Part 'a' is something which I think would be good to fix (but hard),
> and it sounds like this patch might make part 'b' happen, which I think
> makes the part 'a' problem worse. I'm guessing this isn't a problem for
> the TRUNCATE case because the second session would still be looking at
> the pre-TRUNCATE files on disk, right? Is that reference to exactly
> which files on disk to look at being used to address the CREATE problem
> too..?

That isn't a problem, because the other session won't see the tuple in pg_class until the creating transaction commits, at which point the rows have committed, too (because this would only kick in when the rows are loaded in the same transaction as the CREATE). So, yes, it's like TRUNCATE in that regard.

Regards,
Jeff Davis
On 6 December 2012 17:02, Stephen Frost <sfrost@snowman.net> wrote:
> * Simon Riggs (simon@2ndQuadrant.com) wrote:
>> It's not a bug. Requesting a useful, but not critical, optimisation is
>> just a hint. The preconditions are not easy to understand, so I see no
>> reason to punish people that misunderstand, or to cause programs to fail
>> in ways that need detailed understanding to make them work again.
>
> I tend to agree with Andres on this one. This feels a bit like
> accepting a command but then not actually following through on it
> if it turns out we can't actually do it. If it's truly an optimization
> (and I suspect my other email/question might provide insight into that),
> then it should be something we can 'just do' without needing to be asked
> to do it, along the same lines of not WAL'ing when the appropriate
> conditions are met (table created in this transaction, etc, etc).

That depends on whether it's a command or a do-if-possible hint. It's documented as the latter.

Similar to the way VACUUM tries to truncate a relation, but gives up if it can't. And an even closer example is VACUUM FREEZE itself, which doesn't guarantee that all rows are frozen at the end of it...

--
Simon Riggs
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Thu, 2012-12-06 at 18:16 +0000, Simon Riggs wrote:
> > I tend to agree with Andres on this one. This feels a bit like
> > accepting a command but then not actually following through on it
> > if it turns out we can't actually do it. If it's truly an optimization
> > (and I suspect my other email/question might provide insight into that),
> > then it should be something we can 'just do' without needing to be asked
> > to do it, along the same lines of not WAL'ing when the appropriate
> > conditions are met (table created in this transaction, etc, etc).
>
> That depends on whether it's a command or a do-if-possible hint. It's
> documented as the latter.
>
> Similar to the way VACUUM tries to truncate a relation, but gives up
> if it can't. And an even closer example is VACUUM FREEZE itself, which
> doesn't guarantee that all rows are frozen at the end of it...

Also, if the set of conditions changes in the future, we would have a problem if that caused new errors to appear.

I think a WARNING might make more sense than a NOTICE, but I don't have a strong opinion about that.

Regards,
Jeff Davis
Jeff,

* Jeff Davis (pgsql@j-davis.com) wrote:
> That isn't a problem, because the other session won't see the tuple in
> pg_class until the creating transaction commits, at which point the rows
> have committed, too (because this would only kick in when the rows are
> loaded in the same transaction as the CREATE).

See, that's what I originally thought, but was corrected on it at one point (by Tom, iirc). I just did a simple test against 9.2.1 using serializable mode. In this mode, if you do something like this:

session a
---------
begin;

session b
---------
begin;
create table q (a integer);
insert into q values (1);
commit;

select * from q;

You'll get an empty table. That's not great, but it's life: once something is in pg_class, all sessions can see it because the table lookups are done using SnapshotNow and aren't truly transactional, but at least you can't see any rows in the table, because the individual rows are marked with the transaction ID which created them and we can't see them in our transaction that started before the table was created.

It sounds like, with this patch/change, this behavior would change. Now, I'm not necessarily against this, but it's clearly something different than what we had before and might be an issue for some. If, in the general case, we're actually 'ok' with this, I'd ask why we don't simply do it by default..? If we're *not* 'ok' with it, perhaps we shouldn't be doing it at all...

If we fix the bigger issue (which, as I understand it from various discussions with Tom and Robert, is very difficult), then this whole problem goes away entirely. I always figured/expected that to be how we'd fix this, not just punt on the issue and tell the user "sure, you can do this, hope you know what you're doing...".

Thanks,
Stephen
On Thu, 2012-12-06 at 14:18 -0500, Stephen Frost wrote:
> begin;

You need to do a SELECT here to actually get a snapshot.

> session b
> ---------
> begin;
> create table q (a integer);
> insert into q values (1);
> commit;
>
> select * from q;
>
> You'll get an empty table. That's not great, but it's life: once
> something is in pg_class, all sessions can see it because the table
> lookups are done using SnapshotNow and aren't truly transactional, but
> at least you can't see any rows in the table, because the individual rows
> are marked with the transaction ID which created them and we can't see
> them in our transaction that started before the table was created.
>
> It sounds like, with this patch/change, this behavior would change.

No, it would not change. Session A would see that the table exists and see that the rows' inserting transaction (in Session B) committed. That is correct because the inserting transaction *did* commit, and it's the same as we have now. However, the rows will *not* be visible, because the serializable snapshot doesn't contain the inserting transaction.

Think about the current behavior: right after the commit, another select could come along and set all those hint bits anyway. Even if the hint bits aren't set, it will do a CLOG lookup and find that the transaction committed. The change being proposed is just to set those hint bits preemptively, because the fate of the INSERT is identical to the fate of the CREATE (they are in the same transaction). There will be no observable problem outside of that CREATE+INSERT transaction. The only catch is what to do about visibility of the tuples when still inside the transaction (which is not a problem for transactions doing a simple load).

The interesting thing about HEAP_XMIN_COMMITTED is that it can be set preemptively if we know that the transaction will actually commit (aside from the visibility issues within the transaction). Even if the transaction doesn't commit, it would still be possible to clean out the dead tuples with a VACUUM, because no information has really been lost in the process. So there may yet be some kind of safe protocol to set these even during a load into an existing table...

Regards,
Jeff Davis
* Jeff Davis (pgsql@j-davis.com) wrote:
> However, the rows will *not* be visible, because the serializable
> snapshot doesn't contain the inserting transaction.

That's what we've got now and what would be expected, however...

> Think about the current behavior: right after the commit, another select
> could come along and set all those hint bits anyway. Even if the hint
> bits aren't set, it will do a CLOG lookup and find that the transaction
> committed.

The command is 'FREEZE', which sounded to me like the transaction ID would be set to FrozenXID, meaning that we wouldn't be able to tell if the inserting transaction was before or after ours...

Your analysis of the hint bits themselves sounds reasonable, but it seems independent of the issue regarding setting the actual transaction ID.

Thanks,
Stephen
On Thu, 2012-12-06 at 20:12 -0500, Stephen Frost wrote:
> The command is 'FREEZE', which sounded to me like the transaction ID
> would be set to FrozenXID, meaning that we wouldn't be able to tell if
> the inserting transaction was before or after ours...

Freezing does lose information, but I thought that this sub-thread was about the HEAP_XMIN_COMMITTED optimization that was in the first version of the commit but removed in a later commit. Setting HEAP_XMIN_COMMITTED does not lose information.

> Your analysis of the hint bits themselves sounds reasonable, but it seems
> independent of the issue regarding setting the actual transaction ID.

Upon re-reading, my last paragraph was worded a little too loosely:

    "The interesting thing about HEAP_XMIN_COMMITTED is that it can be set
    preemptively if we know that the transaction will actually commit
    (aside from the visibility issues within the transaction)."

That should really be: "aside from the visibility issues before it does commit".

Anyway, the HEAP_XMIN_COMMITTED loading optimizations require more discussion, but I think they are worth pursuing. The simpler form is when the table is created and loaded in the same transaction, but there may be some more sophisticated approaches, as well.

Regards,
Jeff Davis
Jeff,

* Jeff Davis (pgsql@j-davis.com) wrote:
> On Thu, 2012-12-06 at 20:12 -0500, Stephen Frost wrote:
> > The command is 'FREEZE', which sounded to me like the transaction ID
> > would be set to FrozenXID, meaning that we wouldn't be able to tell if
> > the inserting transaction was before or after ours...
>
> Freezing does lose information, but I thought that this sub-thread was
> about the HEAP_XMIN_COMMITTED optimization that was in the first version
> of the commit but removed in a later commit. Setting HEAP_XMIN_COMMITTED
> does not lose information.

I'm less concerned about the hint bits and more concerned about the implications of the FrozenXID being used, which would make the rows visible to other transactions even if they began before the rows were loaded.

> That should really be: "aside from the visibility issues before it does
> commit".

My concern is more about what happens to transactions which are opened before this transaction commits and whether they might be able to see data in the table.

As I mentioned before, I'm not *convinced* that this is really a big deal, or even a problem for this patch, but it's something to *consider* and think about because, as much as we like to say that DDL is transaction-safe, it's *not* completely MVCC, and things like creating tables and committing them will show up as visible even in transactions that shouldn't be able to see them. Due to that, we need to think about what happens with data in those tables, etc.

> Anyway, the HEAP_XMIN_COMMITTED loading optimizations require more
> discussion, but I think they are worth pursuing. The simpler form is
> when the table is created and loaded in the same transaction, but there
> may be some more sophisticated approaches, as well.

Sure.

Thanks,
Stephen
On Thu, 2012-12-06 at 20:49 -0500, Stephen Frost wrote:
> I'm less concerned about the hint bits and more concerned about the
> implications of the FrozenXID being used, which would make the rows
> visible to other transactions even if they began before the rows were
> loaded.

That is documented in the committed patch -- it's a trade, basically saying that you lose isolation but avoid extra writes. It seems reasonable that the user gets this behavior if specifically requested.

> My concern is more about what happens to transactions which are opened
> before this transaction commits and whether they might be able to see
> data in the table.

In the simple approach that only affects loads in the same transaction as the create, this is not an issue. The only issue there is for other reads in the same transaction. I think you already know that, but I am repeating for clarity.

> As I mentioned before, I'm not *convinced* that this is really a big
> deal, or even a problem for this patch, but it's something to *consider*
> and think about because, as much as we like to say that DDL is
> transaction-safe, it's *not* completely MVCC, and things like creating
> tables and committing them will show up as visible even in transactions
> that shouldn't be able to see them. Due to that, we need to think about
> what happens with data in those tables, etc.

Agreed. I don't think anyone is suggesting that we change the semantics of existing commands, unless it's to improve the serializability of DDL.

Regards,
Jeff Davis
* Jeff Davis (pgsql@j-davis.com) wrote:
> That is documented in the committed patch -- it's a trade, basically
> saying that you lose isolation but avoid extra writes. It seems
> reasonable that the user gets this behavior if specifically requested.

Strictly speaking, it could actually be two different users..

> In the simple approach that only affects loads in the same transaction
> as the create, this is not an issue. The only issue there is for other
> reads in the same transaction. I think you already know that, but I am
> repeating for clarity.

Actually, no, I'm not convinced of that. If a separate transaction starts before the create/insert, and is still open when the create/insert transaction commits, wouldn't that separate transaction see rows in the newly created table?

That's more-or-less the specific issue I'm bringing up as a potential concern. If it's just a FrozenXID stored in the heap and there isn't anything telling the 2nd transaction to not consider this table that exists through SnapshotNow, how are we going to know to not look at the tuples? Or do we accept that it's "ok" for those to be visible?

How I wish we could fix the table visibility and not have to worry about this issue at all, which would also remove the need for some special keyword to be used to tell us it's 'ok' to use this optimization...

Thanks,
Stephen
On 4 December 2012 20:38, Jeff Davis <pgsql@j-davis.com> wrote:
> Even if it is, I would
> think that a sufficiently narrow case -- such as CTAS outside of a
> transaction block -- would be safe

CTAS would be safe to do that with. CLUSTER and VACUUM FULL are already done. Outside of a transaction block, it would be automatic. Inside of a transaction block, we could use a parameter called initially_frozen, which would act the same as FREEZE on COPY. The parameter would not be recorded in the reloptions field on pg_class, since it is a one-time option and has no meaning after end of xact.

> along with some slightly broader
> cases (like BEGIN; CREATE TABLE; INSERT/COPY).

Historically we've chosen not to allow bulk-insert options on INSERT, so that is a separate discussion.

--
Simon Riggs
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
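Sketched as SQL, the idea might look like the following (initially_frozen is hypothetical syntax being floated here, not an implemented option, and src is an invented source table):

    -- Outside a transaction block, CTAS could freeze automatically:
    CREATE TABLE t AS SELECT * FROM src;

    -- Inside a transaction block, an explicit one-time option would be
    -- needed, acting the same as FREEZE on COPY:
    BEGIN;
    CREATE TABLE t WITH (initially_frozen = true) AS SELECT * FROM src;
    COMMIT;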
On Thu, 2012-12-06 at 22:34 -0500, Stephen Frost wrote:
> * Jeff Davis (pgsql@j-davis.com) wrote:
> > That is documented in the committed patch -- it's a trade, basically
> > saying that you lose isolation but avoid extra writes. It seems
> > reasonable that the user gets this behavior if specifically requested.
>
> Strictly speaking, it could actually be two different users..

That's a good point. Somewhat like SERIALIZABLE, unless everyone is on board, it doesn't really work.

> > In the simple approach that only affects loads in the same transaction
> > as the create, this is not an issue. The only issue there is for other
> > reads in the same transaction. I think you already know that, but I am
> > repeating for clarity.
>
> Actually, no, I'm not convinced of that. If a separate transaction
> starts before the create/insert, and is still open when the create/insert
> transaction commits, wouldn't that separate transaction see rows in the
> newly created table?

Not if all you do is set HEAP_XMIN_COMMITTED. Committed <> visible.

> That's more-or-less the specific issue I'm bringing up as a potential
> concern. If it's just a FrozenXID stored in the heap and there isn't
> anything telling the 2nd transaction to not consider this table that
> exists through SnapshotNow, how are we going to know to not look at the
> tuples? Or do we accept that it's "ok" for those to be visible?

I am talking about HEAP_XMIN_COMMITTED. I think it's understood that freezing has much stricter requirements for safety in various situations, because it loses information.

> How I wish we could fix the table visibility and not have to worry about
> this issue at all, which would also remove the need for some special
> keyword to be used to tell us it's 'ok' to use this optimization...

I think we need to be clear about which one we're discussing: HEAP_XMIN_COMMITTED or FrozenXID. I believe discussing them together has caused a significant amount of confusion in this thread. Most of your concerns seem to be related to freezing, and I'm mostly interested in HEAP_XMIN_COMMITTED optimizations. So I think we're talking past each other.

Regards,
Jeff Davis
Jeff,

* Jeff Davis (pgsql@j-davis.com) wrote:
> Most of your concerns seem to be related to freezing, and I'm mostly
> interested in HEAP_XMIN_COMMITTED optimizations. So I think we're
> talking past each other.

My concern is about this patch/capability as a whole. I agree that the hint bit setting may be fine by itself; I'm not terribly concerned with that. Now, if you (and others) aren't concerned about the overall behavior that is being introduced here, that's fine, but it was my understanding from previous discussions that making tuples visible to all transactions, even those that started before the committing transaction and are running at an isolation level stricter than 'read committed', wasn't 'ok'.

Now, what I've honestly been hoping for on this thread was for someone to speak up and point out why I'm wrong about this concern and explain how this patch addresses that issue. If that's been done, I've missed it..

Thanks,
Stephen
On 7 December 2012 23:51, Stephen Frost <sfrost@snowman.net> wrote:
> Jeff,
>
> * Jeff Davis (pgsql@j-davis.com) wrote:
>> Most of your concerns seem to be related to freezing, and I'm mostly
>> interested in HEAP_XMIN_COMMITTED optimizations. So I think we're
>> talking past each other.
>
> My concern is about this patch/capability as a whole. I agree that the
> hint bit setting may be fine by itself; I'm not terribly concerned with
> that. Now, if you (and others) aren't concerned about the overall
> behavior that is being introduced here, that's fine, but it was my
> understanding from previous discussions that making tuples visible to
> all transactions, even those that started before the committing
> transaction and are running at an isolation level stricter than
> 'read committed', wasn't 'ok'.
>
> Now, what I've honestly been hoping for on this thread was for someone
> to speak up and point out why I'm wrong about this concern and explain
> how this patch addresses that issue. If that's been done, I've missed
> it..

Visibility of pre-hinted data is an issue, and because of that we previously agreed that it would be allowed only when explicitly requested by the user, which has been implemented as the FREEZE option on COPY. This then makes it identical to TRUNCATE, where a separate command differentiates MVCC-compliant row removal (DELETE) from non-MVCC row removal (TRUNCATE).

So the committed feature does address the visibility issue.

Jeff has been speaking about an extension to that behaviour, which would be to allow HEAP_XMIN_COMMITTED to be set in some cases also. The committed feature specifically does not do that.

--
Simon Riggs
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Simon,

* Simon Riggs (simon@2ndQuadrant.com) wrote:
> Visibility of pre-hinted data is an issue, and because of that we
> previously agreed that it would be allowed only when explicitly
> requested by the user, which has been implemented as the FREEZE option
> on COPY. This then makes it identical to TRUNCATE, where a separate
> command differentiates MVCC-compliant row removal (DELETE) from
> non-MVCC row removal (TRUNCATE).

To be honest, I really don't buy off on this. Yes, TRUNCATE (a top-level, individually permissioned command) can violate MVCC and make things which might have been visible to a given transaction no longer visible to that transaction. I find that very different from making rows which should *not* be visible suddenly appear.

> So the committed feature does address the visibility issue.

Not hardly. It lets a user completely violate the basic rules of the overall database. The *correct* solution to this problem is to actually *fix* it, by setting it up such that tables created after a particular transaction starts aren't visible, and similar. Not by punting to the user with very little control over it (or so I'm guessing).

Will we accept a patch to do an INSERT or UPDATE FREEZE...? I realize this might be a bit of a stretch, but why not allow February 31st to be accepted as a date also, should the user request it? This is quite a slippery slope, in my view.

> Jeff has
> been speaking about an extension to that behaviour, which would be to
> allow HEAP_XMIN_COMMITTED to be set in some cases also. The committed
> feature specifically does not do that.

It's not clear to me why you *wouldn't* just go ahead and set everything you can at the same time...? It hardly seems like it could be worse than what is apparently going to happen with this command...

Thanks,
Stephen
On 8 December 2012 22:18, Stephen Frost <sfrost@snowman.net> wrote:
>> So the committed feature does address the visibility issue.
>
> Not hardly. It lets a user completely violate the basic rules of the
> overall database. The *correct* solution to this problem is to actually
> *fix* it, by setting it up such that tables created after a particular
> transaction starts aren't visible, and similar.

Agreed, but that would also be a silent change of current behaviour. And the above would only work for CREATE TABLE, not for TRUNCATE.

I've invested a lot of time in trying to improve the situation and investigated many routes forwards, none of them very satisfying. Until someone comes up with a better plan, FREEZE is a pragmatic way forwards that improves things now rather than waiting for the perfect solution. And if we want checksums anytime soon, we need ways to ameliorate the effect of hints on checksums, which this does, somewhat.

Better plans, with code, welcome.

--
Simon Riggs
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Wed, Dec 05, 2012 at 07:43:08PM -0500, Tom Lane wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
> > I haven't looked at the committed patch - which seemed a bit
> > precipitous to me given the stage the discussion was at - but I
> > believe the general issue with HEAP_XMIN_COMMITTED is that there might
> > be other snapshots in the same transaction, for example from open
> > cursors.
>
> From memory, the tqual.c code assumes that any tuple with XMIN_COMMITTED
> couldn't possibly be from its own transaction, and thus it doesn't make
> the tests that would be appropriate for a tuple that is from the current
> transaction. Maybe it's all right anyway (i.e. if we should always treat
> such a tuple as good), but I don't recall exactly what's tested in those
> paths.

I don't see semantics preservable by freezing, yet omitting HEAP_XMIN_COMMITTED. The "HeapTupleHeaderGetCmin(tuple) >= snapshot->curcid" test is the one at risk. HeapTupleSatisfiesMVCC() does skip that test for HEAP_XMIN_COMMITTED tuples, but seeing xmin == FrozenTransactionId hampers it all the more.

What if one of the preconditions for the optimization were the equivalent of CheckTableNotInUse()? I cannot immediately think of an older-cmin-scan source not caught thereby. Unmodified HeapTupleSatisfiesMVCC() will then suffice. Happily, it's not a restriction users will regularly encounter.

Thanks,
nm
On Fri, Dec 07, 2012 at 06:51:18PM -0500, Stephen Frost wrote:
> * Jeff Davis (pgsql@j-davis.com) wrote:
> > Most of your concerns seem to be related to freezing, and I'm mostly
> > interested in HEAP_XMIN_COMMITTED optimizations. So I think we're
> > talking past each other.
>
> My concern is about this patch/capability as a whole. I agree that the
> hint bit setting may be fine by itself; I'm not terribly concerned with
> that. Now, if you (and others) aren't concerned about the overall
> behavior that is being introduced here, that's fine, but it was my
> understanding from previous discussions that making tuples visible to
> all transactions, even those that started before the committing
> transaction and are running at an isolation level stricter than
> 'read committed', wasn't 'ok'.
>
> Now, what I've honestly been hoping for on this thread was for someone
> to speak up and point out why I'm wrong about this concern and explain
> how this patch addresses that issue. If that's been done, I've missed
> it..

I favor[1] unconditionally letting older snapshots see the new rows after the CREATE+COPY transaction commits. To recap, making affected scans see an empty table is as wrong as making them see those rows. Robert also listed[2] that as a credible option, and I don't recall anyone opining against it in previous discussions. I did perceive an undercurrent preference, all other things being equal, for an optimization free from semantic side-effects. I shared that preference, but investigations showed that we must compromise something.

Though I like the idea of making new relations appear nonexistent to older snapshots, let's keep that as an independent patch. Otherwise, the chance of having none of the above in PostgreSQL 9.3 escalates significantly.

Thanks,
nm

[1] http://archives.postgresql.org/message-id/20120302205834.GC29160@tornado.leadboat.com
[2] http://archives.postgresql.org/message-id/CA+TgmoYh_KXErp9eOejMV6EJJaczeZZcSj3kRtq=yg1SjiMidg@mail.gmail.com
Simon,

* Simon Riggs (simon@2ndQuadrant.com) wrote:
> Agreed, but that would also be a silent change of current behaviour.

Sure, proper MVCC for catalog entries would be a change, but I think it's one which would actually reduce the surprises for our users. Today we tell people we have transactional DDL, and that's somewhat true, but it's not MVCC-safe, and that can lead to surprises.

> And the above would only work for CREATE TABLE, not for TRUNCATE.

I'm trying to figure out why there are all the constraints around this command to begin with. If we're going to support this, why do we require the user to create or truncate the table in the same transaction? Clearly that's going to reduce the usefulness of this feature, and it's not clear to me why that constraint is needed, code-wise.

Also, what about adding FREEZE options to INSERT and maybe even UPDATE? Surely it would reduce the overhead associated with those commands as well.

> I've invested a lot of time in trying to improve the situation and
> investigated many routes forwards, none of them very satisfying. Until
> someone comes up with a better plan, FREEZE is a pragmatic way
> forwards that improves things now rather than waiting for the perfect
> solution.

I agree that the perfect can sometimes be the enemy of the good, but I still feel that this is quite a slippery slope that's going to end up getting ourselves into trouble- regardless of how much we document it or set up constraints around it.

> And if we want checksums anytime soon, we need ways to
> ameliorate the effect of hints on checksums, which this does,
> somewhat.

I don't buy into this argument in the least.

> Better plans, with code, welcome.

While I appreciate the mentality that those-who-code-win, I don't believe that it can be used in an argument based on *correctness*.

Thanks,
Stephen
* Noah Misch (noah@leadboat.com) wrote:
> On Fri, Dec 07, 2012 at 06:51:18PM -0500, Stephen Frost wrote:
> > Now, what I've honestly been hoping for on this thread was for someone
> > to speak up and point out why I'm wrong about this concern and explain
> > how this patch addresses that issue. If that's been done, I've missed
> > it..
>
> [...]

So, apparently I'm not wrong about my concern, but no one seems to share my feelings on this change.

I continue to hold that this could end up being a slippery slope for us to go down wrt 'correctness' vs. 'do whatever the user wants'. If we keep this to only COPY and where the table has to be truncated/created in the same transaction (which requires the user to have sufficient privileges to do non-MVCC-safe things on the table already), perhaps it's alright. It'll definitely reduce the interest in finding a real solution though, which is unfortunate.

Thanks,
Stephen
On Mon, Dec 10, 2012 at 08:32:53AM -0500, Stephen Frost wrote:
> So, apparently I'm not wrong about my concern, but no one seems to share
> my feelings on this change.
>
> I continue to hold that this could end up being a slippery slope for us
> to go down wrt 'correctness' vs. 'do whatever the user wants'.

I agree we should be reticent to compromise correctness for convenience. Compromising mere bug-compatibility, trading one incorrect behavior for another incorrect behavior, is not as bad. Furthermore, today's behavior in question is not something I can see applications deliberately and successfully relying upon.

> If we
> keep this to only COPY and where the table has to be truncated/created
> in the same transaction (which requires the user to have sufficient
> privileges to do non-MVCC-safe things on the table already), perhaps
> it's alright.

Extending it to cases not involving a just-created or just-truncated table really would compromise correctness; errors could leave the table in an otherwise-impossible state. Let's indeed not go there. I see no particular need to restrict this to COPY; that's just the most important case by far.

As a side note, the calculus for whether to extend the optimization to INSERT and UPDATE differs from that for WAL avoidance. WAL avoidance can be a substantial loss when the total change is small. For pre-hinting/freezing, the worst case is having needlessly checked a few local variables to rule out the optimization.

> It'll definitely reduce the interest in finding a real
> solution though, which is unfortunate.

That effect seems likely, but I do not find it unfortunate. The change variant I have advocated does not stand in contrast to some "real solution" to PostgreSQL's treatment of readers of tables created or truncated by a transaction not in the reader's snapshot. The two topics interact at arm's length. Bundling them into one patch, artificially making them stand or fall as a pair, is not a win for PostgreSQL.

That does raise another disadvantage of making the change syntax-controlled: if we someday implement the other improvement, COPY FREEZE will have minimal reason not to be the default. FREEZE then becomes a relic noise word.

Thanks,
nm
On Sun, Dec 9, 2012 at 3:06 PM, Noah Misch <noah@leadboat.com> wrote:
> I favor[1] unconditionally letting older snapshots see the new rows after the
> CREATE+COPY transaction commits. To recap, making affected scans see an empty
> table is as wrong as making them see those rows. Robert also listed[2] that
> as a credible option, and I don't recall anyone opining against it in previous
> discussions. I did perceive an undercurrent preference, all other things
> being equal, for an optimization free from semantic side-effects. I shared
> that preference, but investigations showed that we must compromise something.

You know, I hadn't been taking that option terribly seriously, but maybe we ought to reconsider it. It would certainly be simpler, and as you point out, it's not really any worse from an MVCC point of view than anything else we do. Moreover, it would make this available to clients like pg_dump without further hackery.

I think the current behavior, where we treat FREEZE as a hint, is just awful. Regardless of whether the behavior is automatic or manually requested, the idea that you might get the optimization or not depending on the timing of relcache flushes seems very much undesirable. I mean, if the optimization is actually important for performance, then you want to get it when you ask for it. If it isn't, then why bother having it at all? Let's say that COPY FREEZE normally doubles performance on a data load that therefore takes 8 hours - somebody who suddenly loses that benefit because of a relcache flush that they can't prevent or control, and ends up with a 16-hour data load, is going to pop a gasket.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Noah,

* Noah Misch (noah@leadboat.com) wrote:
> I agree we should be reticent to compromise correctness for convenience.
> Compromising mere bug-compatibility, trading one incorrect behavior for
> another incorrect behavior, is not as bad. Furthermore, today's behavior
> in question is not something I can see applications deliberately and
> successfully relying upon.

I actually don't agree with the notion that one bad bug should allow us to introduce additional such bugs. I agree that it's unlikely that applications are depending on today's behavior of TRUNCATE making concurrent transactions see an empty table, but it does *not* follow that applications *won't* start depending on this new behavior of COPY FREEZE.

> Extending it to cases not involving a just-created or just-truncated table
> really would compromise correctness; errors could leave the table in an
> otherwise-impossible state. Let's indeed not go there.

Even if we could fix that, I'd be against allowing it arbitrarily for any regular user INSERT or UPDATE; I'm still not particularly happy with this approach for COPY.

> That effect seems likely, but I do not find it unfortunate. The change
> variant I have advocated does not stand in contrast to some "real solution"
> to PostgreSQL's treatment of readers of tables created or truncated by a
> transaction not in the reader's snapshot. The two topics interact at arm's
> length. Bundling them into one patch, artificially making them stand or
> fall as a pair, is not a win for PostgreSQL.

Having proper MVCC support for DDL *would* be a win for PostgreSQL, and this *does* reduce the chances of that ever happening.

> That does raise another disadvantage of making the change syntax-controlled:
> if we someday implement the other improvement, COPY FREEZE will have minimal
> reason not to be the default. FREEZE then becomes a relic noise word.

Indeed, that's certainly unfortunate as well. Really, though, it just goes to show how much of a hack this is rather than a real solution.

Thanks,
Stephen
* Robert Haas (robertmhaas@gmail.com) wrote:
> You know, I hadn't been taking that option terribly seriously, but
> maybe we ought to reconsider it. It would certainly be simpler, and
> as you point out, it's not really any worse from an MVCC point of view
> than anything else we do. Moreover, it would make this available to
> clients like pg_dump without further hackery.

I really don't agree with this notion that the behavior of TRUNCATE, a top-level, separately permissioned command, makes it OK to introduce other busted behavior in existing commands.

> I think the current behavior, where we treat FREEZE as a hint, is just
> awful.

I agree that it's pretty grotty, but I had assumed it was at least deterministic, a la TRUNCATE/COPY and WAL... If it isn't, then this certainly gets really ugly really quickly. I don't think that means we should go ahead and try to always optimize it though- even when it isn't explicit, there will be an expectation that it's going to work when all the 'right' conditions are met. I know that's certainly how I feel about TRUNCATE/COPY and WAL'ing.

Thanks,
Stephen
On Mon, Dec 10, 2012 at 08:04:55PM -0500, Robert Haas wrote:
> I think the current behavior, where we treat FREEZE as a hint, is just
> awful. Regardless of whether the behavior is automatic or manually
> requested, the idea that you might get the optimization or not
> depending on the timing of relcache flushes seems very much
> undesirable. I mean, if the optimization is actually important for
> performance, then you want to get it when you ask for it. If it
> isn't, then why bother having it at all? Let's say that COPY FREEZE
> normally doubles performance on a data load that therefore takes 8
> hours - somebody who suddenly loses that benefit because of a relcache
> flush that they can't prevent or control, and ends up with a 16-hour
> data load, is going to pop a gasket.

Until these threads, I did not know that a relcache invalidation could trip up the WAL avoidance optimization, and now this. I poked at the relevant relcache.c code, and it already takes pains to preserve the needed facts. The header comment of RelationCacheInvalidate() indicates that entries bearing an rd_newRelfilenodeSubid can safely survive the invalidation, but the code does not implement that. I think the comment is right, and this is just an oversight in the code going back to its beginning (fba8113c).

I doubt the comment at the declaration of rd_createSubid in rel.h, though I can't presently say what the correct comment to replace it would be. CLUSTER does preserve the old value, at least for the main table relation. CLUSTER probably should *set* rd_newRelfilenodeSubid.

Thanks,
nm
On Mon, Dec 10, 2012 at 08:54:04PM -0500, Stephen Frost wrote:
> I agree that it's unlikely that
> applications are depending on today's behavior of TRUNCATE making
> concurrent transactions see an empty table, but it does *not* follow
> that applications *won't* start depending on this new behavior of COPY
> FREEZE.

That is a good point. I'm not sure whether that should worry us enough to implement an error in that scenario before letting COPY write frozen tuples, but it's the strongest argument I've seen for imposing that prerequisite.

> Even if we could fix that, I'd be against allowing it arbitrarily for
> any regular user INSERT or UPDATE; I'm still not particularly happy with
> this approach for COPY.

Sure; COPY is the 99% important case here.
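To make the hazard concrete, a sketch of the cross-session visibility that an application could come to rely on. Frozen tuples look committed to every snapshot, so the old snapshot sees them anyway; the table name and file path are illustrative:

-- Session 1:
BEGIN ISOLATION LEVEL REPEATABLE READ;
SELECT 1;                                   -- first query fixes this transaction's snapshot
-- Session 2, meanwhile:
BEGIN;
TRUNCATE t;
COPY t FROM '/tmp/data.csv' WITH (FREEZE);
COMMIT;
-- Session 1 again, same snapshot:
SELECT count(*) FROM t;                     -- sees the frozen rows even though its snapshot
                                            -- predates them: the MVCC violation at issue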
On Mon, Dec 10, 2012 at 7:02 PM, Stephen Frost <sfrost@snowman.net> wrote:
>
> I continue to hold that this could end up being a slippery slope for us
> to go down wrt 'correctness' vs. 'do whatever the user wants'. If we
> keep this to only COPY and where the table has to be truncated/created
> in the same transaction (which requires the user to have sufficient
> privileges to do non-MVCC-safe things on the table already), perhaps
> it's alright. It'll definitely reduce the interest in finding a real
> solution though, which is unfortunate.
>

I wonder if something more complete can be done by forcing COPY FREEZE, or whatever we call it, to take an exclusive lock on the table and run the load as an append-only operation. By taking a strong lock, we will block out any concurrent reads and writes on the table. If an error occurs while loading the data, the table will be truncated back to the previously recorded size. We may need some additional bookkeeping and WAL logging to handle crash recovery.

To solve the visibility issue for old snapshots that should not see the new data, we can store some additional visibility information in pg_class itself. For example, we can store the size of the table before the COPY FREEZE started and the XID of the COPY FREEZE operation. An old snapshot that cannot see the XID cannot see the tuples inserted in the new blocks either. Once the COPY FREEZE finishes and the lock on the relation is released, new transactions can start writing to the table and write past the old size of the table. But that should be OK. If an old snapshot can't see the tuples inserted by COPY FREEZE, AFAIK it can't see any of those other tuples as well.

I'm sure there will still be challenges with this approach. But I wonder if we can guarantee correctness by proper use of synchronization and still avoid multiple writes for most common data loading scenarios.

Thanks, Pavan

--
Pavan Deolasee
http://www.linkedin.com/in/pavandeolasee
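A hypothetical sketch of what Pavan's scheme would amount to at the SQL level. None of the proposed bookkeeping exists; the comments merely restate the proposal, and the table name and file path are illustrative:

BEGIN;
LOCK TABLE t IN ACCESS EXCLUSIVE MODE;       -- block all concurrent readers and writers
-- (proposal: record t's current size and this XID in pg_class here)
COPY t FROM '/tmp/data.csv' WITH (FREEZE);   -- append-only, into blocks past the old size
COMMIT;
-- (proposal: on error, truncate t back to the recorded pre-COPY size)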
On Mon, 2012-12-10 at 08:16 -0500, Stephen Frost wrote:
> I'm trying to figure out why there are all the constraints around this
> command to begin with. If we're going to support this, why do we
> require the user to create or truncate the table in the same
> transaction? Clearly that's going to reduce the usefulness of this
> feature and it's not clear to me why that constraint is needed,
> code-wise.

There is a very specific reason: if you insert frozen tuples into a table that already has tuples in it, then you no longer have just an isolation issue, you have an *atomicity* issue (and probably a number of other serious issues), because you can't roll back. Doing it in the same transaction in which the table was created leaves you with a way to roll back: just delete the table (which happens implicitly, because the creation is part of the same transaction anyway).

Perhaps we can take some partial steps toward MVCC-safe access? For instance, how about trying to recheck the xmin of a pg_class tuple when starting a scan?

Regards, Jeff Davis
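A sketch of the two cases Jeff contrasts, assuming for the sake of argument that COPY wrote frozen tuples in both (table name and file path are illustrative):

-- Safe: the frozen tuples live only in storage this transaction created.
BEGIN;
CREATE TABLE t (x int);
COPY t FROM '/tmp/data.csv' WITH (FREEZE);
ROLLBACK;                                    -- the whole table vanishes; nothing leaks

-- Unsafe, hence disallowed: frozen tuples added to a pre-existing heap.
BEGIN;
COPY t FROM '/tmp/data.csv' WITH (FREEZE);   -- hypothetically permitted
ROLLBACK;                                    -- frozen tuples look committed to every
                                             -- snapshot; the abort cannot hide them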
On Mon, Dec 10, 2012 at 08:04:55PM -0500, Robert Haas wrote:
> You know, I hadn't been taking that option terribly seriously, but
> maybe we ought to reconsider it. It would certainly be simpler, and
> as you point out, it's not really any worse from an MVCC point of view
> than anything else we do. Moreover, it would make this available to
> clients like pg_dump without further hackery.
>
> I think the current behavior, where we treat FREEZE as a hint, is just
> awful. Regardless of whether the behavior is automatic or manually
> requested, the idea that you might get the optimization or not
> depending on the timing of relcache flushes seems very much
> undesirable. I mean, if the optimization is actually important for
> performance, then you want to get it when you ask for it. If it
> isn't, then why bother having it at all? Let's say that COPY FREEZE
> normally doubles performance on a data load that therefore takes 8
> hours - somebody who suddenly loses that benefit because of a relcache
> flush that they can't prevent or control and ends up with a 16 hour
> data load is going to pop a gasket.

Why was this patch applied when there are obviously so many concerns about its behavior? Was that not clear at commit time?

--
Bruce Momjian <bruce@momjian.us>        http://momjian.us
EnterpriseDB                            http://enterprisedb.com

+ It's impossible for everything to be true. +
On 11 December 2012 03:01, Noah Misch <noah@leadboat.com> wrote:
> On Mon, Dec 10, 2012 at 08:04:55PM -0500, Robert Haas wrote:
>> I think the current behavior, where we treat FREEZE as a hint, is just
>> awful. Regardless of whether the behavior is automatic or manually
>> requested, the idea that you might get the optimization or not
>> depending on the timing of relcache flushes seems very much
>> undesirable. I mean, if the optimization is actually important for
>> performance, then you want to get it when you ask for it. If it
>> isn't, then why bother having it at all? Let's say that COPY FREEZE
>> normally doubles performance on a data load that therefore takes 8
>> hours - somebody who suddenly loses that benefit because of a relcache
>> flush that they can't prevent or control and ends up with a 16 hour
>> data load is going to pop a gasket.
>
> Until these threads, I did not know that a relcache invalidation could trip up
> the WAL avoidance optimization, and now this. I poked at the relevant
> relcache.c code, and it already takes pains to preserve the needed facts. The
> header comment of RelationCacheInvalidate() indicates that entries bearing an
> rd_newRelfilenodeSubid can safely survive the invalidation, but the code does
> not implement that. I think the comment is right, and this is just an
> oversight in the code going back to its beginning (fba8113c).

I think the comment is right also and so the patch is good. I will apply, barring objections.

The information is only ever non-zero inside a single backend. There isn't any reason I can see why we wouldn't be able to remember this information in all cases, perhaps with a few push-ups.

> I doubt the comment at the declaration of rd_createSubid in rel.h, though I
> can't presently say what correct comment should replace it.

rd_createSubid certainly does *not* get blown away by a message overflow, as copy.c claims. Nor can I see any other way for it to be reset. I can no longer see a reason for us to regard rd_createSubid as merely a hint, so we should change copy.c also.

> CLUSTER does
> preserve the old value, at least for the main table relation. CLUSTER
> probably should *set* rd_newRelfilenodeSubid.

I can't see a reason not to do this in terms of correctness. However, I can't see any reason why you'd want to CLUSTER a table and then load more data into it in the same transaction, so I'm tempted to just leave that as is to avoid introducing other bugs. But I dare say people will think we should fix it there also.

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
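For reference, the usage being weighed here would look something like this, assuming CLUSTER were changed to set rd_newRelfilenodeSubid (the table and index names are illustrative):

BEGIN;
CLUSTER t USING t_pkey;                      -- rewrites t into a new relfilenode
COPY t FROM '/tmp/more.csv';                 -- could then qualify for the same-transaction
                                             -- optimizations discussed in this thread
COMMIT;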
On Fri, Dec 21, 2012 at 06:47:56PM +0000, Simon Riggs wrote:
> On 11 December 2012 03:01, Noah Misch <noah@leadboat.com> wrote:
> > Until these threads, I did not know that a relcache invalidation could trip up
> > the WAL avoidance optimization, and now this. I poked at the relevant
> > relcache.c code, and it already takes pains to preserve the needed facts. The
> > header comment of RelationCacheInvalidate() indicates that entries bearing an
> > rd_newRelfilenodeSubid can safely survive the invalidation, but the code does
> > not implement that. I think the comment is right, and this is just an
> > oversight in the code going back to its beginning (fba8113c).
>
> I think the comment is right also and so the patch is good. I will
> apply, barring objections.
>
> The information is only ever non-zero inside a single backend. There
> isn't any reason I can see why we wouldn't be able to remember this
> information in all cases, perhaps with a few push-ups.
>
> > I doubt the comment at the declaration of rd_createSubid in rel.h, though I
> > can't presently say what correct comment should replace it.
>
> rd_createSubid certainly does *not* get blown away by a message overflow,
> as copy.c claims. Nor can I see any other way for it to be reset.
>
> I can no longer see a reason for us to regard rd_createSubid as merely
> a hint, so we should change copy.c also.

I thought of one case where we do currently forget rd_newRelfilenodeSubid:

BEGIN;
TRUNCATE t;
SAVEPOINT save;
TRUNCATE t;
ROLLBACK TO save;

I don't mind that one too much.

> > CLUSTER does
> > preserve the old value, at least for the main table relation. CLUSTER
> > probably should *set* rd_newRelfilenodeSubid.
>
> I can't see a reason not to do this in terms of correctness.
>
> However, I can't see any reason why you'd want to CLUSTER a table and
> then load more data into it in the same transaction, so I'm tempted to
> just leave that as is to avoid introducing other bugs.
>
> But I dare say people will think we should fix it there also.

I could see using that capability occasionally, but I wouldn't mix such a change in with the goals of this thread.

Thanks, nm
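Spelling out the consequence of that sequence for COPY (a sketch; as discussed upthread, the unmet precondition currently draws a NOTICE and the load proceeds unfrozen rather than erroring):

BEGIN;
TRUNCATE t;
SAVEPOINT save;
TRUNCATE t;
ROLLBACK TO save;
COPY t FROM '/tmp/data.csv' WITH (FREEZE);   -- rd_newRelfilenodeSubid has been forgotten,
                                             -- so the rows are not written frozen
COMMIT;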
On 21 December 2012 20:10, Noah Misch <noah@leadboat.com> wrote:
> I thought of one case where we do currently forget rd_newRelfilenodeSubid:
>
> BEGIN;
> TRUNCATE t;
> SAVEPOINT save;
> TRUNCATE t;
> ROLLBACK TO save;

That's a weird one: aborting a subtransaction that sets it, when it was already set.

The loss of rd_newRelfilenodeSubid in that case is deterministic, but tracking the full complexity of multiple relations and multiple nested subxids isn't worth the trouble for such rare cases [assumption].

I'd go for just setting an its_too_complex flag (with a better name) that we can use to trigger a message in COPY saying that the FREEZE option won't be honoured. That would then be completely consistent, rather than exhibiting the nondeterministic behaviour that Robert rightly objects to.

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Sat, Dec 22, 2012 at 12:42:43AM +0000, Simon Riggs wrote:
> On 21 December 2012 20:10, Noah Misch <noah@leadboat.com> wrote:
> > I thought of one case where we do currently forget rd_newRelfilenodeSubid:
> >
> > BEGIN;
> > TRUNCATE t;
> > SAVEPOINT save;
> > TRUNCATE t;
> > ROLLBACK TO save;
>
> That's a weird one: aborting a subtransaction that sets it, when it was
> already set.
>
> The loss of rd_newRelfilenodeSubid in that case is deterministic, but
> tracking the full complexity of multiple relations and multiple nested
> subxids isn't worth the trouble for such rare cases [assumption].
>
> I'd go for just setting an its_too_complex flag (with a better name)
> that we can use to trigger a message in COPY saying that the FREEZE option
> won't be honoured. That would then be completely consistent, rather than
> exhibiting the nondeterministic behaviour that Robert rightly objects to.

I wouldn't bother. The behavior here is deterministic, the cause clearly traceable to the specific commands issued. Stable software won't suddenly miss the optimization for no visible reason.
[ blast-from-the-past department ]

Simon Riggs <simon@2ndQuadrant.com> writes:
> On 11 December 2012 03:01, Noah Misch <noah@leadboat.com> wrote:
>> Until these threads, I did not know that a relcache invalidation could trip up
>> the WAL avoidance optimization, and now this. I poked at the relevant
>> relcache.c code, and it already takes pains to preserve the needed facts. The
>> header comment of RelationCacheInvalidate() indicates that entries bearing an
>> rd_newRelfilenodeSubid can safely survive the invalidation, but the code does
>> not implement that. I think the comment is right, and this is just an
>> oversight in the code going back to its beginning (fba8113c).

> I think the comment is right also and so the patch is good. I will
> apply, barring objections.

I'm busy looking into the REINDEX-on-pg_class mess, and one thing I found while testing HEAD is that with CLOBBER_CACHE_ALWAYS on, once you've gotten a failure, your session is hosed:

regression=# select 1;
 ?column?
----------
        1
(1 row)

regression=# reindex index pg_class_oid_index;
psql: ERROR: could not read block 0 in file "base/16384/35344": read only 0 of 8192 bytes
regression=# select 1;
psql: ERROR: could not open file "base/16384/35344": No such file or directory

The problem is that the nailed relcache entry for pg_class_oid_index got updated to point to the new relfilenode, and that change did not get undone by transaction abort. There seem to be several bugs contributing to that, but the one I'm asking about right now is that commit ae9aba69a caused RelationCacheInvalidate to skip over relations that have rd_newRelfilenodeSubid set, as indeed pg_class_oid_index will have at the instant of the failure.

This seems quite wrong, because it prevents us from rebuilding the entry with corrected values. In particular notice that the change causes us to skip the RelationInitPhysicalAddr call that would normally be applied to a nailed mapped index in that loop. That's completely fatal in this case --- it keeps us from restoring the correct relfilenode that the mapper would now tell us, if we only bothered to ask.

I think perhaps what we should do is revert ae9aba69a and instead do

 	relcacheInvalsReceived++;

-	if (RelationHasReferenceCountZero(relation))
+	if (RelationHasReferenceCountZero(relation) &&
+		relation->rd_newRelfilenodeSubid == InvalidSubTransactionId)
 	{
 		/* Delete this entry immediately */
 		Assert(!relation->rd_isnailed);

so that entries with rd_newRelfilenodeSubid nonzero are preserved but are rebuilt. There might be an argument for treating rd_createSubid the same way, not sure. That field wouldn't ever be set on a system catalog, so it's less of a hazard, but there's something to be said for treating the two fields alike.

Thoughts?

			regards, tom lane
Hi,

On 2019-05-01 14:44:12 -0400, Tom Lane wrote:
> I'm busy looking into the REINDEX-on-pg_class mess, and one thing I found
> while testing HEAD is that with CLOBBER_CACHE_ALWAYS on, once you've
> gotten a failure, your session is hosed:
>
> regression=# select 1;
>  ?column?
> ----------
>         1
> (1 row)
>
> regression=# reindex index pg_class_oid_index;
> psql: ERROR: could not read block 0 in file "base/16384/35344": read only 0 of 8192 bytes
> regression=# select 1;
> psql: ERROR: could not open file "base/16384/35344": No such file or directory

Yea, I noticed that too. It led me astray for a while, because it triggered apparent REINDEX failures for pg_index too, even though they were not actually related to the reindex.

> The problem is that the nailed relcache entry for pg_class_oid_index got
> updated to point to the new relfilenode, and that change did not get
> undone by transaction abort. There seem to be several bugs contributing
> to that, but the one I'm asking about right now is that commit ae9aba69a
> caused RelationCacheInvalidate to skip over relations that have
> rd_newRelfilenodeSubid set, as indeed pg_class_oid_index will have at the
> instant of the failure.

That commit message is quite unhelpful about what exactly it was intended to fix. I assume it was intended to make COPY FREEZE usage predictable, but...

> This seems quite wrong, because it prevents us from rebuilding the
> entry with corrected values. In particular notice that the change
> causes us to skip the RelationInitPhysicalAddr call that would
> normally be applied to a nailed mapped index in that loop. That's
> completely fatal in this case --- it keeps us from restoring the
> correct relfilenode that the mapper would now tell us, if we only
> bothered to ask.

Indeed. I'm a bit surprised that doesn't lead to more problems.

Not sure I understand where the RelationCacheInvalidate() call is coming from in this case, though. Shouldn't the entry have been invalidated individually through ATEOXact_Inval(false)?

> I think perhaps what we should do is revert ae9aba69a and instead do
>
>  	relcacheInvalsReceived++;
>
> -	if (RelationHasReferenceCountZero(relation))
> +	if (RelationHasReferenceCountZero(relation) &&
> +		relation->rd_newRelfilenodeSubid == InvalidSubTransactionId)
>  	{
>  		/* Delete this entry immediately */
>  		Assert(!relation->rd_isnailed);
>
> so that entries with rd_newRelfilenodeSubid nonzero are preserved but are
> rebuilt.

Seems like a reasonable approach.

> There might be an argument for treating rd_createSubid the same way,
> not sure. That field wouldn't ever be set on a system catalog, so
> it's less of a hazard, but there's something to be said for treating
> the two fields alike.

Seems sensible to treat it the same way. I'm not sure there's an actual case where the current treatment causes a problem, but it seems worth improving.

Greetings,

Andres Freund
Andres Freund <andres@anarazel.de> writes:
> On 2019-05-01 14:44:12 -0400, Tom Lane wrote:
>> This seems quite wrong, because it prevents us from rebuilding the
>> entry with corrected values. In particular notice that the change
>> causes us to skip the RelationInitPhysicalAddr call that would
>> normally be applied to a nailed mapped index in that loop. That's
>> completely fatal in this case --- it keeps us from restoring the
>> correct relfilenode that the mapper would now tell us, if we only
>> bothered to ask.

> Indeed. I'm a bit surprised that doesn't lead to more problems.

> Not sure I understand where the RelationCacheInvalidate() call is coming
> from in this case, though. Shouldn't the entry have been invalidated
> individually through ATEOXact_Inval(false)?

In CLOBBER_CACHE_ALWAYS mode it's likely that we get a RelationCacheInvalidate call first.

Note that the change I'm talking about here is not sufficient to fix the failure; there are more problems behind it. I just wanted to know if there was something I was missing about that old patch.

			regards, tom lane
Andres Freund <andres@anarazel.de> writes:
> On 2019-05-01 14:44:12 -0400, Tom Lane wrote:
>> There might be an argument for treating rd_createSubid the same way,
>> not sure. That field wouldn't ever be set on a system catalog, so
>> it's less of a hazard, but there's something to be said for treating
>> the two fields alike.

> Seems sensible to treat it the same way. I'm not sure there's an actual
> case where the current treatment causes a problem, but it seems worth
> improving.

So if I try to treat rd_createSubid the same way, it passes regression in normal mode, but CLOBBER_CACHE_ALWAYS is a total disaster: all table creations fail, e.g.

regression=# CREATE TABLE BOOLTBL1 (f1 bool);
psql: ERROR: relation 92773 deleted while still in use

What is happening is that once heap_create has created a relcache entry for the new relation, any RelationCacheInvalidate call causes us to try to reload that relcache entry, and that fails because there are no pg_class or pg_attribute entries for the new relation yet.

This gets back to the thing we were poking at in the other thread, which is whether or not a relcache entry is *always* reconstructible from on-disk state. I had in the back of my mind "um, what about initial table creation?" but hadn't looked closely. It seems that the only reason it works is that RelationCacheInvalidate is unwilling to touch new-in-transaction relcache entries.

The ideal fix for this, perhaps, would be to rejigger things far enough that we would not try to create a relcache entry for a new rel until it was supported by some minimal set of catalog entries. But that's more change than I really want to undertake right now, and for sure I wouldn't want to back-patch it.

Anyway, I believe the original justification for skipping new-in-transaction entries here is that they couldn't possibly be the subjects of any cross-backend notification messages. I think the argument that that applies to rd_newRelfilenodeSubid is a whole lot weaker, though.

But I'm going to let this go for the moment, because it's not clear whether a patch like this is actually relevant to our immediate problem. There seem to be some other order-of-operations issues in invalidation processing, and maybe fixing those would suffice. More later.

			regards, tom lane
Andres Freund <andres@anarazel.de> writes:
> On 2019-05-01 14:44:12 -0400, Tom Lane wrote:
>> I'm busy looking into the REINDEX-on-pg_class mess, and one thing I found
>> while testing HEAD is that with CLOBBER_CACHE_ALWAYS on, once you've
>> gotten a failure, your session is hosed:
>>
>> regression=# select 1;
>>  ?column?
>> ----------
>>         1
>> (1 row)
>>
>> regression=# reindex index pg_class_oid_index;
>> psql: ERROR: could not read block 0 in file "base/16384/35344": read only 0 of 8192 bytes
>> regression=# select 1;
>> psql: ERROR: could not open file "base/16384/35344": No such file or directory

> Yea, I noticed that too. It led me astray for a while, because it
> triggered apparent REINDEX failures for pg_index too, even though they
> were not actually related to the reindex.

It seems that the reason we fail to recover is that we detect the error during CommandEndInvalidationMessages, after we've updated the relcache entry for pg_class_oid_index; of course at that point, any additional pg_class access is going to fail because it'll try to use the not-yet-existent index. However, when we then abort the transaction, we fail to revert pg_class_oid_index's relcache entry because *there is not yet an invalidation queue entry telling us to do so*.

This is, therefore, an order-of-operations bug in CommandEndInvalidationMessages. It should attach the "current command" queue entries to PriorCmdInvalidMsgs *before* it processes them, not after. In this way they'll be available to be reprocessed by AtEOXact_Inval or AtEOSubXact_Inval if we fail during CCI.

(A different way we could attack this is to make AtEOXact_Inval and AtEOSubXact_Inval process "current" messages, as they do not today. But I think that's a relatively inefficient solution, as it would force those messages to be reprocessed even in the much-more-common case where we abort before reaching CommandEndInvalidationMessages.)

The attached quick-hack patch changes that around, and seems to survive testing (though I've not completed a CCA run with it). The basic change I had to make to make this work is to make AppendInvalidationMessageList actually append the current-commands list to the prior-cmds list, rather than prepending it as the code has really always done. (In that way, we can still scan the current-commands list just by starting at its head.) The comments for struct InvalidationChunk explicitly disclaim any significance to the order of the entries, so this should be okay, and I'm not seeing any problem in testing.

It's been a long time since I wrote that code, but I think the reason I did it like that was the thought that in a long transaction, the prior-cmds list might be much longer than the current-commands list, so iterating down the prior-cmds list could add an O(N^2) cost. The easy answer to that, of course, is to make the lists doubly-linked, and I'd do that before committing; this version is just for testing. (It's short on comments, too.)

The thing I was worried about in RelationCacheInvalidate does seem to be a red herring; at least, fixing it is not necessary to make the broken-session-state problem go away.

Comments?

			regards, tom lane

diff --git a/src/backend/utils/cache/inval.c b/src/backend/utils/cache/inval.c
index f09e3a9..8894bdf 100644
*** a/src/backend/utils/cache/inval.c
--- b/src/backend/utils/cache/inval.c
*************** AddInvalidationMessage(InvalidationChunk
*** 264,287 ****
  }

  /*
!  * Append one list of invalidation message chunks to another, resetting
!  * the source chunk-list pointer to NULL.
   */
  static void
  AppendInvalidationMessageList(InvalidationChunk **destHdr,
  							  InvalidationChunk **srcHdr)
  {
! 	InvalidationChunk *chunk = *srcHdr;

! 	if (chunk == NULL)
  		return;					/* nothing to do */

! 	while (chunk->next != NULL)
! 		chunk = chunk->next;

! 	chunk->next = *destHdr;

! 	*destHdr = *srcHdr;
  	*srcHdr = NULL;
  }
--- 264,294 ----
  }

  /*
!  * Append the "source" list of invalidation message chunks to the "dest"
!  * list, resetting the source chunk-list pointer to NULL.
!  * Note that if the caller hangs onto the previous source pointer,
!  * the source list is still separately traversable afterwards.
   */
  static void
  AppendInvalidationMessageList(InvalidationChunk **destHdr,
  							  InvalidationChunk **srcHdr)
  {
! 	InvalidationChunk *chunk;

! 	if (*srcHdr == NULL)
  		return;					/* nothing to do */

! 	chunk = *destHdr;
! 	if (chunk == NULL)
! 		*destHdr = *srcHdr;
! 	else
! 	{
! 		while (chunk->next != NULL)
! 			chunk = chunk->next;
! 		chunk->next = *srcHdr;
! 	}
  	*srcHdr = NULL;
  }
*************** xactGetCommittedInvalidationMessages(Sha
*** 858,867 ****
  	 */
  	oldcontext = MemoryContextSwitchTo(CurTransactionContext);

- 	ProcessInvalidationMessagesMulti(&transInvalInfo->CurrentCmdInvalidMsgs,
- 									 MakeSharedInvalidMessagesArray);
  	ProcessInvalidationMessagesMulti(&transInvalInfo->PriorCmdInvalidMsgs,
  									 MakeSharedInvalidMessagesArray);
  	MemoryContextSwitchTo(oldcontext);

  	Assert(!(numSharedInvalidMessagesArray > 0 &&
--- 865,874 ----
  	 */
  	oldcontext = MemoryContextSwitchTo(CurTransactionContext);

  	ProcessInvalidationMessagesMulti(&transInvalInfo->PriorCmdInvalidMsgs,
  									 MakeSharedInvalidMessagesArray);
+ 	ProcessInvalidationMessagesMulti(&transInvalInfo->CurrentCmdInvalidMsgs,
+ 									 MakeSharedInvalidMessagesArray);
  	MemoryContextSwitchTo(oldcontext);

  	Assert(!(numSharedInvalidMessagesArray > 0 &&
*************** AtEOSubXact_Inval(bool isCommit)
*** 1084,1089 ****
--- 1091,1098 ----
  void
  CommandEndInvalidationMessages(void)
  {
+ 	InvalidationListHeader currentCmdInvalidMsgs;
+
  	/*
  	 * You might think this shouldn't be called outside any transaction, but
  	 * bootstrap does it, and also ABORT issued when not in a transaction.  So
*************** CommandEndInvalidationMessages(void)
*** 1092,1101 ****
  	if (transInvalInfo == NULL)
  		return;

! 	ProcessInvalidationMessages(&transInvalInfo->CurrentCmdInvalidMsgs,
! 								LocalExecuteInvalidationMessage);

  	AppendInvalidationMessages(&transInvalInfo->PriorCmdInvalidMsgs,
  							   &transInvalInfo->CurrentCmdInvalidMsgs);
  }
--- 1101,1113 ----
  	if (transInvalInfo == NULL)
  		return;

! 	currentCmdInvalidMsgs = transInvalInfo->CurrentCmdInvalidMsgs;
!
  	AppendInvalidationMessages(&transInvalInfo->PriorCmdInvalidMsgs,
  							   &transInvalInfo->CurrentCmdInvalidMsgs);

+ 	ProcessInvalidationMessages(&currentCmdInvalidMsgs,
+ 								LocalExecuteInvalidationMessage);
  }
On Wed, May 01, 2019 at 07:01:23PM -0400, Tom Lane wrote:
> > On 2019-05-01 14:44:12 -0400, Tom Lane wrote:
> >> I'm busy looking into the REINDEX-on-pg_class mess, and one thing I found
> >> while testing HEAD is that with CLOBBER_CACHE_ALWAYS on, once you've
> >> gotten a failure, your session is hosed:
> >>
> >> regression=# select 1;
> >>  ?column?
> >> ----------
> >>         1
> >> (1 row)
> >>
> >> regression=# reindex index pg_class_oid_index;
> >> psql: ERROR: could not read block 0 in file "base/16384/35344": read only 0 of 8192 bytes
> >> regression=# select 1;
> >> psql: ERROR: could not open file "base/16384/35344": No such file or directory

> It seems that the reason we fail to recover is that we detect the error
> during CommandEndInvalidationMessages, after we've updated the relcache
> entry for pg_class_oid_index; of course at that point, any additional
> pg_class access is going to fail because it'll try to use the
> not-yet-existent index. However, when we then abort the transaction,
> we fail to revert pg_class_oid_index's relcache entry because *there is
> not yet an invalidation queue entry telling us to do so*.
>
> This is, therefore, an order-of-operations bug in
> CommandEndInvalidationMessages. It should attach the "current command"
> queue entries to PriorCmdInvalidMsgs *before* it processes them, not
> after. In this way they'll be available to be reprocessed by
> AtEOXact_Inval or AtEOSubXact_Inval if we fail during CCI.

That makes sense.

> The attached quick-hack patch changes that around, and seems to survive
> testing (though I've not completed a CCA run with it). The basic change
> I had to make to make this work is to make AppendInvalidationMessageList
> actually append the current-commands list to the prior-cmds list, rather
> than prepending it as the code has really always done. (In that way, we
> can still scan the current-commands list just by starting at its head.)
> The comments for struct InvalidationChunk explicitly disclaim any
> significance to the order of the entries, so this should be okay, and
> I'm not seeing any problem in testing. It's been a long time since I
> wrote that code, but I think the reason I did it like that was the
> thought that in a long transaction, the prior-cmds list might be much
> longer than the current-commands list, so iterating down the prior-cmds
> list could add an O(N^2) cost. The easy answer to that, of course, is
> to make the lists doubly-linked, and I'd do that before committing;
> this version is just for testing. (It's short on comments, too.)

Looks reasonable so far.

> The thing I was worried about in RelationCacheInvalidate does seem
> to be a red herring; at least, fixing it is not necessary to make
> the broken-session-state problem go away.

Your earlier proposal would have made RelationCacheInvalidate() work more like RelationFlushRelation() when rd_newRelfilenodeSubid is set. That's a good direction, all else being equal, though I'm not aware of a specific bug reachable today. I think RelationCacheInvalidate() would then need the reference count bits that RelationFlushRelation() has.
> *************** xactGetCommittedInvalidationMessages(Sha
> *** 858,867 ****
>   	 */
>   	oldcontext = MemoryContextSwitchTo(CurTransactionContext);
>
> - 	ProcessInvalidationMessagesMulti(&transInvalInfo->CurrentCmdInvalidMsgs,
> - 									 MakeSharedInvalidMessagesArray);
>   	ProcessInvalidationMessagesMulti(&transInvalInfo->PriorCmdInvalidMsgs,
>   									 MakeSharedInvalidMessagesArray);
>   	MemoryContextSwitchTo(oldcontext);
>
>   	Assert(!(numSharedInvalidMessagesArray > 0 &&
> --- 865,874 ----
>   	 */
>   	oldcontext = MemoryContextSwitchTo(CurTransactionContext);
>
>   	ProcessInvalidationMessagesMulti(&transInvalInfo->PriorCmdInvalidMsgs,
>   									 MakeSharedInvalidMessagesArray);
> + 	ProcessInvalidationMessagesMulti(&transInvalInfo->CurrentCmdInvalidMsgs,
> + 									 MakeSharedInvalidMessagesArray);

Why this order change?
Noah Misch <noah@leadboat.com> writes:
> On Wed, May 01, 2019 at 07:01:23PM -0400, Tom Lane wrote:
>> The thing I was worried about in RelationCacheInvalidate does seem
>> to be a red herring; at least, fixing it is not necessary to make
>> the broken-session-state problem go away.

> Your earlier proposal would have made RelationCacheInvalidate() work more like
> RelationFlushRelation() when rd_newRelfilenodeSubid is set. That's a good
> direction, all else being equal, though I'm not aware of a specific bug
> reachable today. I think RelationCacheInvalidate() would then need the
> reference count bits that RelationFlushRelation() has.

Yeah. I'm not actually convinced that treating rd_createSubid and rd_newRelfilenodeSubid alike here is appropriate, though. If rd_createSubid is set then we certainly can assume that no other sessions can see/modify the relation, but we cannot make the same assumption when rd_newRelfilenodeSubid is set. The comment argues, in essence, that it's okay if we have AEL on the relation, but I'm not 100% convinced about that ... still, I can't construct a counterexample at the moment.

> Why this order change?

Because of the comment just above:

	 * ... Maintain the order that they
	 * would be processed in by AtEOXact_Inval(), to ensure emulated behaviour
	 * in redo is as similar as possible to original.  We want the same bugs,
	 * if any, not new ones.

In principle the order of processing inval events should not matter (if it does, then this patch is much more dangerous than it looks). But I concur with this comment that it's best if standby servers apply the events in the same order the master would; and this patch does cause that order to change.

			regards, tom lane