Thread: Followup Timestamp to timestamp with TZ conversion
Hi,
this is a followup to a performance optimization during the conversion of a column from a timestamp column to a "timestamp with tz" column. The initial patch I am referring to is this one:
and the previous discussion on this list is this one:
The problem is that I have a 60TB+ PG installation for which we need to modify all of the timestamp columns to timestamp with tz. The data in the columns are already in UTC so we can benefit from the patch listed above. Yet there are 2 cases in which we are having an issue.
1) Index rebuilds: The patch is only avoiding a rewrite of the table data but is not avoiding a rebuild of the indexes. Following the logic in the patch above this should also be avoidable under the same condition
2) Partitioned tables with the timestamp as partition column: In this case the current version does not allow a modification of the column data type at all. Yet also following the logic in the patch this can also be allowed under the side condition if no table rewrite is required.
Question: What chances to we have to get the optimisations from the patch above also "promoted" to the other 2 cases I listed?
Cheers,
Peter
Peter Volk <peterb.volk@gmx.net> writes: > The problem is that I have a 60TB+ PG installation for which we need to > modify all of the timestamp columns to timestamp with tz. The data in the > columns are already in UTC so we can benefit from the patch listed above. > Yet there are 2 cases in which we are having an issue. > 1) Index rebuilds: The patch is only avoiding a rewrite of the table data > but is not avoiding a rebuild of the indexes. Following the logic in the > patch above this should also be avoidable under the same condition I don't think that follows. What we are concerned about when determining whether a heap rewrite can be skipped is whether the stored heap entries are bit-compatible between the two data types. To decide that an index rebuild is not needed, you'd need to further determine that their sort orders are equivalent (for a btree index, or who-knows-what semantic condition for other types of indexes). We don't attempt to do that, so index rebuilds are always needed. As a thought experiment to prove that this is an issue, suppose that somebody invented an unsigned integer type, and made the cast from regular int4 follow the rules of a C cast, so that e.g. -1 becomes 2^32-1. Given that, an ALTER TYPE from int4 to the unsigned type could skip the heap rewrite. But we absolutely would have to rebuild any btree index on the column, because the sort ordering of the two types is different. OTOH, it's quite likely that a hash index would not really need to be rebuilt. So this is a real can of worms and we've not cared to open it. > 2) Partitioned tables with the timestamp as partition column: In this case > the current version does not allow a modification of the column data type > at all. PG's partitioning features are still being built out, but I would not hold my breath for that specific thing to change. Again, the issue is that bit-compatibility of individual values doesn't prove much about comparison semantics, so it's not clear that a change of data type still allows the value-to-partition assignment to be identical. (This is clearly an issue for RANGE partitioning. Maybe it could be finessed for HASH or LIST, but you'd still be needing semantic assumptions that go beyond mere bit-compatibility of values.) regards, tom lane
Hi Tom, thanks for the reply, I do understand that if a rewrite of the table needs to be avoided the binary image needs to be the same. Since PG 12 there is an optimisation to avoid a rewrite of timestamp columns if they are converted to timestamp with tz and the target tz offset is 0 I am referring to the function ATColumnChangeRequiresRewrite(Node *expr, AttrNumber varattno) in which the following is checked: (b/src/backend/commands/tablecmds.c) else if (IsA(expr, FuncExpr)) { FuncExpr *f = (FuncExpr *) expr; switch (f->funcid) { case F_TIMESTAMPTZ_TIMESTAMP: case F_TIMESTAMP_TIMESTAMPTZ: if (TimestampTimestampTzRequiresRewrite()) return true; else expr = linitial(f->args); break; default: return true; } and TimestampTimestampTzRequiresRewrite checks if the offset is 0: (b/src/backend/utils/adt/timestamp.c) TimestampTimestampTzRequiresRewrite() * * Returns false if the TimeZone GUC setting causes timestamp_timestamptz and * timestamptz_timestamp to be no-ops, where the return value has the same * bits as the argument. Since project convention is to assume a GUC changes * no more often than STABLE functions change, the answer is valid that long. */ bool TimestampTimestampTzRequiresRewrite(void) { long offset; if (pg_get_timezone_offset(session_timezone, &offset) && offset == 0) PG_RETURN_BOOL(false); PG_RETURN_BOOL(true); } So in this case it is already proven that there is a binary equality between the data types timestamp and timestamp with tz if the offset is considered with 0. Hence this type of optimisation should / could also apply to indexes as well as the columns used in partitions Thanks, Peter On Thu, Jul 22, 2021 at 5:29 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Peter Volk <peterb.volk@gmx.net> writes: > > The problem is that I have a 60TB+ PG installation for which we need to > > modify all of the timestamp columns to timestamp with tz. The data in the > > columns are already in UTC so we can benefit from the patch listed above. > > Yet there are 2 cases in which we are having an issue. > > > 1) Index rebuilds: The patch is only avoiding a rewrite of the table data > > but is not avoiding a rebuild of the indexes. Following the logic in the > > patch above this should also be avoidable under the same condition > > I don't think that follows. What we are concerned about when determining > whether a heap rewrite can be skipped is whether the stored heap entries > are bit-compatible between the two data types. To decide that an index > rebuild is not needed, you'd need to further determine that their sort > orders are equivalent (for a btree index, or who-knows-what semantic > condition for other types of indexes). We don't attempt to do that, > so index rebuilds are always needed. > > As a thought experiment to prove that this is an issue, suppose that > somebody invented an unsigned integer type, and made the cast from > regular int4 follow the rules of a C cast, so that e.g. -1 becomes > 2^32-1. Given that, an ALTER TYPE from int4 to the unsigned type > could skip the heap rewrite. But we absolutely would have to rebuild > any btree index on the column, because the sort ordering of the two > types is different. OTOH, it's quite likely that a hash index would > not really need to be rebuilt. So this is a real can of worms and > we've not cared to open it. > > > 2) Partitioned tables with the timestamp as partition column: In this case > > the current version does not allow a modification of the column data type > > at all. > > PG's partitioning features are still being built out, but I would not > hold my breath for that specific thing to change. Again, the issue > is that bit-compatibility of individual values doesn't prove much > about comparison semantics, so it's not clear that a change of > data type still allows the value-to-partition assignment to be > identical. (This is clearly an issue for RANGE partitioning. Maybe > it could be finessed for HASH or LIST, but you'd still be needing > semantic assumptions that go beyond mere bit-compatibility of values.) > > regards, tom lane > >
On Thu, Jul 22, 2021 at 11:29 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > As a thought experiment to prove that this is an issue, suppose that > somebody invented an unsigned integer type, and made the cast from > regular int4 follow the rules of a C cast, so that e.g. -1 becomes > 2^32-1. Given that, an ALTER TYPE from int4 to the unsigned type > could skip the heap rewrite. But we absolutely would have to rebuild > any btree index on the column, because the sort ordering of the two > types is different. OTOH, it's quite likely that a hash index would > not really need to be rebuilt. So this is a real can of worms and > we've not cared to open it. I agree that it doesn't follow in general. I think it does in the case of timestamp and timestamptz, because I don't think either the choice of time zone or the fact that we're reckoning relative to a time zone can change which of two timestamps is considered earlier. However, I think the only infrastructure we have for proving that is to look to see whether it's the same operator family in both cases. Because timestamp_ops and timestamptz_ops are separate, that doesn't help here. -- Robert Haas EDB: http://www.enterprisedb.com
Robert Haas <robertmhaas@gmail.com> writes: > I agree that it doesn't follow in general. I think it does in the case > of timestamp and timestamptz, because I don't think either the choice > of time zone or the fact that we're reckoning relative to a time zone > can change which of two timestamps is considered earlier. However, I > think the only infrastructure we have for proving that is to look to > see whether it's the same operator family in both cases. Because > timestamp_ops and timestamptz_ops are separate, that doesn't help > here. Right. It would in fact work for these two types, but we do not have infrastructure that would allow us to know that. I'm not sure about your idea that "same operator family" is enough. (Even for these two types, while a plain btree index should be fine, I think it wouldn't be hard to construct expression indexes that would not be compatible. So there's a lot of worms in that can.) regards, tom lane
Peter Volk <peterb.volk@gmx.net> writes: > thanks for the reply, I do understand that if a rewrite of the table > needs to be avoided the binary image needs to be the same. Since PG 12 > there is an optimisation to avoid a rewrite of timestamp columns if > they are converted to timestamp with tz and the target tz offset is 0 Yes, I'm very well aware of that optimization. While it's certainly a hack, it fits within a design that isn't a hack, ie that there are common, well-defined cases where we can skip the table rewrite. However, for the reasons I explained before, there are no general-purpose cases where we can skip an index build on a type-changed column, so there is no place to insert a similar hack for the timestamp[tz] case. I'm unwilling to kluge up ALTER TYPE to the extent that would be needed if the result would only be to handle this one case. regards, tom lane
On Fri, Jul 23, 2021 at 2:07 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > Yes, I'm very well aware of that optimization. While it's certainly > a hack, it fits within a design that isn't a hack, ie that there are > common, well-defined cases where we can skip the table rewrite. > However, for the reasons I explained before, there are no general-purpose > cases where we can skip an index build on a type-changed column, so > there is no place to insert a similar hack for the timestamp[tz] case. Wouldn't the hack just go into CheckIndexCompatible()? You seemed to think my previous comments about comparing opfamilies were hypothetical but I think we actually already have the optimization Peter wants, and it just doesn't apply in this case for lack of hacks. Maybe I am missing something. -- Robert Haas EDB: http://www.enterprisedb.com
Robert Haas <robertmhaas@gmail.com> writes: > On Fri, Jul 23, 2021 at 2:07 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: >> However, for the reasons I explained before, there are no general-purpose >> cases where we can skip an index build on a type-changed column, so >> there is no place to insert a similar hack for the timestamp[tz] case. > Wouldn't the hack just go into CheckIndexCompatible()? Oh! I went looking for code to skip rebuilding indexes during ALTER TYPE, but I guess I looked in the wrong place, because I missed that somehow. > You seemed to think my previous comments about comparing opfamilies > were hypothetical but I think we actually already have the > optimization Peter wants, and it just doesn't apply in this case for > lack of hacks. Hmm. Note that what this is checking for is same operator *class* not same operator family (if it were doing the latter, Peter's case would already work). I think it has to do that. Extending my previous thought experiment about an unsigned integer type, if someone were to invent one, it would make a lot of sense to include it in integer_ops, and then the logic you suggest is toast. (Obviously, the cross-type comparison operators you'd need to write would have to be careful, but you'd almost certainly wish to write them anyway.) Given that we require the non-cross-type members of an opclass to be immutable, what this is actually doing may be safe. At least I can't construct a counterexample after five minutes' thought. On the other hand, I'm also a bit confused about how it ever succeeds at all. If we've changed the heap column's type, it should not be using the same opclass anymore (unless the opclass is polymorphic, but that case is rejected too). I'm suspicious that this is just an expensive way of writing "we can only preserve indexes that aren't on the column that changed type". regards, tom lane
On Fri, Jul 23, 2021 at 5:47 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > You seemed to think my previous comments about comparing opfamilies > > were hypothetical but I think we actually already have the > > optimization Peter wants, and it just doesn't apply in this case for > > lack of hacks. > > Hmm. Note that what this is checking for is same operator *class* not > same operator family (if it were doing the latter, Peter's case would > already work). I think it has to do that. Extending my previous > thought experiment about an unsigned integer type, if someone were to > invent one, it would make a lot of sense to include it in integer_ops, > and then the logic you suggest is toast. (Obviously, the cross-type > comparison operators you'd need to write would have to be careful, > but you'd almost certainly wish to write them anyway.) Mumble. I hadn't considered that sort of thing. I assumed that when the documentation and/or code comments talked about a compatible notion of equality, it was a strong enough notion of "compatible" to preclude this sort of case. I'm not really sure why we shouldn't think of it that way; the example you give here is reasonable, but artificial. > Given that we require the non-cross-type members of an opclass to be > immutable, what this is actually doing may be safe. At least I can't > construct a counterexample after five minutes' thought. On the other > hand, I'm also a bit confused about how it ever succeeds at all. > If we've changed the heap column's type, it should not be using the > same opclass anymore (unless the opclass is polymorphic, but that > case is rejected too). I'm suspicious that this is just an expensive > way of writing "we can only preserve indexes that aren't on the > column that changed type". Well, you can change just the typemod, for example, which was a case that motivated this work originally. People wanted to be able to make varchar(10) into varchar(20) without doing a ton of work, and I think this lets that work. I seem to recall that there are at least a few cases that involve actually changing the type as well, but at 6pm on a Friday evening when I haven't looked at this in years, I can't tell you what they are off the top of my head. -- Robert Haas EDB: http://www.enterprisedb.com
Robert Haas <robertmhaas@gmail.com> writes: > On Fri, Jul 23, 2021 at 5:47 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Hmm. Note that what this is checking for is same operator *class* not >> same operator family (if it were doing the latter, Peter's case would >> already work). I think it has to do that. Extending my previous >> thought experiment about an unsigned integer type, if someone were to >> invent one, it would make a lot of sense to include it in integer_ops, >> and then the logic you suggest is toast. > Mumble. I hadn't considered that sort of thing. I assumed that when > the documentation and/or code comments talked about a compatible > notion of equality, it was a strong enough notion of "compatible" to > preclude this sort of case. For btree indexes, you need a compatible notion of ordering, not only equality. That's really what's breaking my hypothetical case of a uint type. But as long as you implement operators that behave in a consistent fashion, whether they interpret the same heap bitpattern the same is not something that matters for constructing a consistent operator family. datetime_ops (which includes timestamp and timestamptz) is already a counterexample, since unless the timezone is UTC, its operators *don't* all agree on what a particular bitpattern means. >> ... I'm also a bit confused about how it ever succeeds at all. > Well, you can change just the typemod, for example, which was a case > that motivated this work originally. Ah, right. I guess binary-compatible cases such as text and varchar would also fit into that. regards, tom lane
On Fri, Jul 23, 2021 at 6:18 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > For btree indexes, you need a compatible notion of ordering, not only > equality. That's really what's breaking my hypothetical case of a uint > type. But as long as you implement operators that behave in a consistent > fashion, whether they interpret the same heap bitpattern the same is not > something that matters for constructing a consistent operator family. > datetime_ops (which includes timestamp and timestamptz) is already a > counterexample, since unless the timezone is UTC, its operators *don't* > all agree on what a particular bitpattern means. Well, that depends a bit on what "means" means. I would argue that the meaning does not depend on the timezone setting, and that the timezone merely controls the way that values are printed. That is, I would say that the meaning is the point in time which the timestamp represents, considered as an abstract concept, and timezone is merely the user's way of asking that point in time to be expressed in a way that's easy for them to understand. Regardless of that philosophical point, I think it must be true that if a and b are timestamps, a < b implies a::timestamptz < b::timestamptz, and a > b implies a::timestamptz > b::timestamptz, and the other way around, which is surely good enough, but wasn't the case in your example. So the question I suppose is whether in your example it's really legitimate to include your uint type in the integer_ops opfamily. I said above that I didn't think so, but I'm less sure now, because I realize that I read what you wrote carefully enough, and it's not as artificial as I first thought. > >> ... I'm also a bit confused about how it ever succeeds at all. > > > Well, you can change just the typemod, for example, which was a case > > that motivated this work originally. > > Ah, right. I guess binary-compatible cases such as text and varchar > would also fit into that. Oh, right. That was another one of the cases that motivated that work. -- Robert Haas EDB: http://www.enterprisedb.com