Thread: Followup Timestamp to timestamp with TZ conversion

Followup Timestamp to timestamp with TZ conversion

From

Peter Volk

Date:

21 July 2021, 08:48:28

Hi,

this is a followup to a performance optimization during the conversion of a column from a timestamp column to a "timestamp with tz" column. The initial patch I am referring to is this one:

https://git.postgresql.org/gitweb/?p=postgresql.git&a=commitdiff&h=3c59263#patch4

and the previous discussion on this list is this one:

https://www.postgresql.org/message-id/CANsFX06xN-vPYxM+YXyfLezK9twjtK3dFJcOHoubTXng40muoQ@mail.gmail.com

The problem is that I have a 60TB+ PG installation for which we need to modify all of the timestamp columns to timestamp with tz. The data in the columns are already in UTC so we can benefit from the patch listed above. Yet there are 2 cases in which we are having an issue.

1) Index rebuilds: The patch is only avoiding a rewrite of the table data but is not avoiding a rebuild of the indexes. Following the logic in the patch above this should also be avoidable under the same condition

2) Partitioned tables with the timestamp as partition column: In this case the current version does not allow a modification of the column data type at all. Yet also following the logic in the patch this can also be allowed under the side condition if no table rewrite is required.

Question: What chances to we have to get the optimisations from the patch above also "promoted" to the other 2 cases I listed?

Cheers,

Peter

Re: Followup Timestamp to timestamp with TZ conversion

From

Tom Lane

Date:

22 July 2021, 18:28:53

Peter Volk <peterb.volk@gmx.net> writes:
> The problem is that I have a 60TB+ PG installation for which we need to
> modify all of the timestamp columns to timestamp with tz. The data in the
> columns are already in UTC so we can benefit from the patch listed above.
> Yet there are 2 cases in which we are having an issue.

> 1) Index rebuilds: The patch is only avoiding a rewrite of the table data
> but is not avoiding a rebuild of the indexes. Following the logic in the
> patch above this should also be avoidable under the same condition

I don't think that follows.  What we are concerned about when determining
whether a heap rewrite can be skipped is whether the stored heap entries
are bit-compatible between the two data types.  To decide that an index
rebuild is not needed, you'd need to further determine that their sort
orders are equivalent (for a btree index, or who-knows-what semantic
condition for other types of indexes).  We don't attempt to do that,
so index rebuilds are always needed.

As a thought experiment to prove that this is an issue, suppose that
somebody invented an unsigned integer type, and made the cast from
regular int4 follow the rules of a C cast, so that e.g. -1 becomes
2^32-1.  Given that, an ALTER TYPE from int4 to the unsigned type
could skip the heap rewrite.  But we absolutely would have to rebuild
any btree index on the column, because the sort ordering of the two
types is different.  OTOH, it's quite likely that a hash index would
not really need to be rebuilt.  So this is a real can of worms and
we've not cared to open it.

> 2) Partitioned tables with the timestamp as partition column: In this case
> the current version does not allow a modification of the column data type
> at all.

PG's partitioning features are still being built out, but I would not
hold my breath for that specific thing to change.  Again, the issue
is that bit-compatibility of individual values doesn't prove much
about comparison semantics, so it's not clear that a change of
data type still allows the value-to-partition assignment to be
identical.  (This is clearly an issue for RANGE partitioning.  Maybe
it could be finessed for HASH or LIST, but you'd still be needing
semantic assumptions that go beyond mere bit-compatibility of values.)

            regards, tom lane

Re: Followup Timestamp to timestamp with TZ conversion

From

Peter Volk

Date:

22 July 2021, 19:36:41

Hi Tom,

thanks for the reply, I do understand that if a rewrite of the table
needs to be avoided the binary image needs to be the same. Since PG 12
there is an optimisation to avoid a rewrite of timestamp columns if
they are converted to timestamp with tz and the target tz offset is 0

I am referring to the function

ATColumnChangeRequiresRewrite(Node *expr, AttrNumber varattno)

in which the following is checked:

(b/src/backend/commands/tablecmds.c)

else if (IsA(expr, FuncExpr))
      {
          FuncExpr   *f = (FuncExpr *) expr;

           switch (f->funcid)
           {
               case F_TIMESTAMPTZ_TIMESTAMP:
               case F_TIMESTAMP_TIMESTAMPTZ:
                   if (TimestampTimestampTzRequiresRewrite())
                       return true;
                   else
                       expr = linitial(f->args);
                   break;
               default:
                 return true;
           }


and TimestampTimestampTzRequiresRewrite checks if the offset is 0:

(b/src/backend/utils/adt/timestamp.c)

 TimestampTimestampTzRequiresRewrite()
 *
 * Returns false if the TimeZone GUC setting causes timestamp_timestamptz and
 * timestamptz_timestamp to be no-ops, where the return value has the same
 * bits as the argument.  Since project convention is to assume a GUC changes
 * no more often than STABLE functions change, the answer is valid that long.
 */
bool
TimestampTimestampTzRequiresRewrite(void)
{
   long        offset;

   if (pg_get_timezone_offset(session_timezone, &offset) && offset == 0)
       PG_RETURN_BOOL(false);
   PG_RETURN_BOOL(true);
}

So in this case it is already proven that there is a binary equality
between the data types timestamp and timestamp with tz if the offset
is considered with 0. Hence this type of optimisation should / could
also apply to indexes as well as the columns used in partitions

Thanks,
Peter


On Thu, Jul 22, 2021 at 5:29 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Peter Volk <peterb.volk@gmx.net> writes:
> > The problem is that I have a 60TB+ PG installation for which we need to
> > modify all of the timestamp columns to timestamp with tz. The data in the
> > columns are already in UTC so we can benefit from the patch listed above.
> > Yet there are 2 cases in which we are having an issue.
>
> > 1) Index rebuilds: The patch is only avoiding a rewrite of the table data
> > but is not avoiding a rebuild of the indexes. Following the logic in the
> > patch above this should also be avoidable under the same condition
>
> I don't think that follows.  What we are concerned about when determining
> whether a heap rewrite can be skipped is whether the stored heap entries
> are bit-compatible between the two data types.  To decide that an index
> rebuild is not needed, you'd need to further determine that their sort
> orders are equivalent (for a btree index, or who-knows-what semantic
> condition for other types of indexes).  We don't attempt to do that,
> so index rebuilds are always needed.
>
> As a thought experiment to prove that this is an issue, suppose that
> somebody invented an unsigned integer type, and made the cast from
> regular int4 follow the rules of a C cast, so that e.g. -1 becomes
> 2^32-1.  Given that, an ALTER TYPE from int4 to the unsigned type
> could skip the heap rewrite.  But we absolutely would have to rebuild
> any btree index on the column, because the sort ordering of the two
> types is different.  OTOH, it's quite likely that a hash index would
> not really need to be rebuilt.  So this is a real can of worms and
> we've not cared to open it.
>
> > 2) Partitioned tables with the timestamp as partition column: In this case
> > the current version does not allow a modification of the column data type
> > at all.
>
> PG's partitioning features are still being built out, but I would not
> hold my breath for that specific thing to change.  Again, the issue
> is that bit-compatibility of individual values doesn't prove much
> about comparison semantics, so it's not clear that a change of
> data type still allows the value-to-partition assignment to be
> identical.  (This is clearly an issue for RANGE partitioning.  Maybe
> it could be finessed for HASH or LIST, but you'd still be needing
> semantic assumptions that go beyond mere bit-compatibility of values.)
>
>                         regards, tom lane
>
>

Re: Followup Timestamp to timestamp with TZ conversion

From

Robert Haas

Date:

22 July 2021, 21:48:44

On Thu, Jul 22, 2021 at 11:29 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> As a thought experiment to prove that this is an issue, suppose that
> somebody invented an unsigned integer type, and made the cast from
> regular int4 follow the rules of a C cast, so that e.g. -1 becomes
> 2^32-1.  Given that, an ALTER TYPE from int4 to the unsigned type
> could skip the heap rewrite.  But we absolutely would have to rebuild
> any btree index on the column, because the sort ordering of the two
> types is different.  OTOH, it's quite likely that a hash index would
> not really need to be rebuilt.  So this is a real can of worms and
> we've not cared to open it.

I agree that it doesn't follow in general. I think it does in the case
of timestamp and timestamptz, because I don't think either the choice
of time zone or the fact that we're reckoning relative to a time zone
can change which of two timestamps is considered earlier. However, I
think the only infrastructure we have for proving that is to look to
see whether it's the same operator family in both cases. Because
timestamp_ops and timestamptz_ops are separate, that doesn't help
here.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: Followup Timestamp to timestamp with TZ conversion

From

Tom Lane

Date:

22 July 2021, 22:09:07

Robert Haas <robertmhaas@gmail.com> writes:
> I agree that it doesn't follow in general. I think it does in the case
> of timestamp and timestamptz, because I don't think either the choice
> of time zone or the fact that we're reckoning relative to a time zone
> can change which of two timestamps is considered earlier. However, I
> think the only infrastructure we have for proving that is to look to
> see whether it's the same operator family in both cases. Because
> timestamp_ops and timestamptz_ops are separate, that doesn't help
> here.

Right.  It would in fact work for these two types, but we do not have
infrastructure that would allow us to know that.  I'm not sure about
your idea that "same operator family" is enough.

(Even for these two types, while a plain btree index should be fine,
I think it wouldn't be hard to construct expression indexes that
would not be compatible.  So there's a lot of worms in that can.)

            regards, tom lane

Re: Followup Timestamp to timestamp with TZ conversion

From

Tom Lane

Date:

23 July 2021, 21:07:32

Peter Volk <peterb.volk@gmx.net> writes:
> thanks for the reply, I do understand that if a rewrite of the table
> needs to be avoided the binary image needs to be the same. Since PG 12
> there is an optimisation to avoid a rewrite of timestamp columns if
> they are converted to timestamp with tz and the target tz offset is 0

Yes, I'm very well aware of that optimization.  While it's certainly
a hack, it fits within a design that isn't a hack, ie that there are
common, well-defined cases where we can skip the table rewrite.
However, for the reasons I explained before, there are no general-purpose
cases where we can skip an index build on a type-changed column, so
there is no place to insert a similar hack for the timestamp[tz] case.
I'm unwilling to kluge up ALTER TYPE to the extent that would be needed
if the result would only be to handle this one case.

            regards, tom lane

Re: Followup Timestamp to timestamp with TZ conversion

From

Robert Haas

Date:

23 July 2021, 23:49:37

On Fri, Jul 23, 2021 at 2:07 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Yes, I'm very well aware of that optimization.  While it's certainly
> a hack, it fits within a design that isn't a hack, ie that there are
> common, well-defined cases where we can skip the table rewrite.
> However, for the reasons I explained before, there are no general-purpose
> cases where we can skip an index build on a type-changed column, so
> there is no place to insert a similar hack for the timestamp[tz] case.

Wouldn't the hack just go into CheckIndexCompatible()?

You seemed to think my previous comments about comparing opfamilies
were hypothetical but I think we actually already have the
optimization Peter wants, and it just doesn't apply in this case for
lack of hacks.

Maybe I am missing something.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: Followup Timestamp to timestamp with TZ conversion

From

Tom Lane

Date:

24 July 2021, 00:47:27

Robert Haas <robertmhaas@gmail.com> writes:
> On Fri, Jul 23, 2021 at 2:07 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> However, for the reasons I explained before, there are no general-purpose
>> cases where we can skip an index build on a type-changed column, so
>> there is no place to insert a similar hack for the timestamp[tz] case.

> Wouldn't the hack just go into CheckIndexCompatible()?

Oh!  I went looking for code to skip rebuilding indexes during ALTER TYPE,
but I guess I looked in the wrong place, because I missed that somehow.

> You seemed to think my previous comments about comparing opfamilies
> were hypothetical but I think we actually already have the
> optimization Peter wants, and it just doesn't apply in this case for
> lack of hacks.

Hmm.  Note that what this is checking for is same operator *class* not
same operator family (if it were doing the latter, Peter's case would
already work).  I think it has to do that.  Extending my previous
thought experiment about an unsigned integer type, if someone were to
invent one, it would make a lot of sense to include it in integer_ops,
and then the logic you suggest is toast.  (Obviously, the cross-type
comparison operators you'd need to write would have to be careful,
but you'd almost certainly wish to write them anyway.)

Given that we require the non-cross-type members of an opclass to be
immutable, what this is actually doing may be safe.  At least I can't
construct a counterexample after five minutes' thought.  On the other
hand, I'm also a bit confused about how it ever succeeds at all.
If we've changed the heap column's type, it should not be using the
same opclass anymore (unless the opclass is polymorphic, but that
case is rejected too).  I'm suspicious that this is just an expensive
way of writing "we can only preserve indexes that aren't on the
column that changed type".

            regards, tom lane

Re: Followup Timestamp to timestamp with TZ conversion

From

Robert Haas

Date:

24 July 2021, 01:03:42

On Fri, Jul 23, 2021 at 5:47 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > You seemed to think my previous comments about comparing opfamilies
> > were hypothetical but I think we actually already have the
> > optimization Peter wants, and it just doesn't apply in this case for
> > lack of hacks.
>
> Hmm.  Note that what this is checking for is same operator *class* not
> same operator family (if it were doing the latter, Peter's case would
> already work).  I think it has to do that.  Extending my previous
> thought experiment about an unsigned integer type, if someone were to
> invent one, it would make a lot of sense to include it in integer_ops,
> and then the logic you suggest is toast.  (Obviously, the cross-type
> comparison operators you'd need to write would have to be careful,
> but you'd almost certainly wish to write them anyway.)

Mumble. I hadn't considered that sort of thing. I assumed that when
the documentation and/or code comments talked about a compatible
notion of equality, it was a strong enough notion of "compatible" to
preclude this sort of case. I'm not really sure why we shouldn't think
of it that way; the example you give here is reasonable, but
artificial.

> Given that we require the non-cross-type members of an opclass to be
> immutable, what this is actually doing may be safe.  At least I can't
> construct a counterexample after five minutes' thought.  On the other
> hand, I'm also a bit confused about how it ever succeeds at all.
> If we've changed the heap column's type, it should not be using the
> same opclass anymore (unless the opclass is polymorphic, but that
> case is rejected too).  I'm suspicious that this is just an expensive
> way of writing "we can only preserve indexes that aren't on the
> column that changed type".

Well, you can change just the typemod, for example, which was a case
that motivated this work originally. People wanted to be able to make
varchar(10) into varchar(20) without doing a ton of work, and I think
this lets that work. I seem to recall that there are at least a few
cases that involve actually changing the type as well, but at 6pm on a
Friday evening when I haven't looked at this in years, I can't tell
you what they are off the top of my head.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: Followup Timestamp to timestamp with TZ conversion

From

Tom Lane

Date:

24 July 2021, 01:18:06

Robert Haas <robertmhaas@gmail.com> writes:
> On Fri, Jul 23, 2021 at 5:47 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Hmm.  Note that what this is checking for is same operator *class* not
>> same operator family (if it were doing the latter, Peter's case would
>> already work).  I think it has to do that.  Extending my previous
>> thought experiment about an unsigned integer type, if someone were to
>> invent one, it would make a lot of sense to include it in integer_ops,
>> and then the logic you suggest is toast.

> Mumble. I hadn't considered that sort of thing. I assumed that when
> the documentation and/or code comments talked about a compatible
> notion of equality, it was a strong enough notion of "compatible" to
> preclude this sort of case.

For btree indexes, you need a compatible notion of ordering, not only
equality.  That's really what's breaking my hypothetical case of a uint
type.  But as long as you implement operators that behave in a consistent
fashion, whether they interpret the same heap bitpattern the same is not
something that matters for constructing a consistent operator family.
datetime_ops (which includes timestamp and timestamptz) is already a
counterexample, since unless the timezone is UTC, its operators *don't*
all agree on what a particular bitpattern means.

>> ... I'm also a bit confused about how it ever succeeds at all.

> Well, you can change just the typemod, for example, which was a case
> that motivated this work originally.

Ah, right.  I guess binary-compatible cases such as text and varchar
would also fit into that.

            regards, tom lane

Re: Followup Timestamp to timestamp with TZ conversion

From

Robert Haas

Date:

24 July 2021, 04:02:29

On Fri, Jul 23, 2021 at 6:18 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> For btree indexes, you need a compatible notion of ordering, not only
> equality.  That's really what's breaking my hypothetical case of a uint
> type.  But as long as you implement operators that behave in a consistent
> fashion, whether they interpret the same heap bitpattern the same is not
> something that matters for constructing a consistent operator family.
> datetime_ops (which includes timestamp and timestamptz) is already a
> counterexample, since unless the timezone is UTC, its operators *don't*
> all agree on what a particular bitpattern means.

Well, that depends a bit on what "means" means. I would argue that the
meaning does not depend on the timezone setting, and that the timezone
merely controls the way that values are printed. That is, I would say
that the meaning is the point in time which the timestamp represents,
considered as an abstract concept, and timezone is merely the user's
way of asking that point in time to be expressed in a way that's easy
for them to understand. Regardless of that philosophical point, I
think it must be true that if a and b are timestamps, a < b implies
a::timestamptz < b::timestamptz, and a > b implies a::timestamptz >
b::timestamptz, and the other way around, which is surely good enough,
but wasn't the case in your example. So the question I suppose is
whether in your example it's really legitimate to include your uint
type in the integer_ops opfamily. I said above that I didn't think so,
but I'm less sure now, because I realize that I read what you wrote
carefully enough, and it's not as artificial as I first thought.

> >> ... I'm also a bit confused about how it ever succeeds at all.
>
> > Well, you can change just the typemod, for example, which was a case
> > that motivated this work originally.
>
> Ah, right.  I guess binary-compatible cases such as text and varchar
> would also fit into that.

Oh, right. That was another one of the cases that motivated that work.

-- 
Robert Haas
EDB: http://www.enterprisedb.com