Thread: DecodeInterval fixes
Hi all,
This patch does three things in the DecodeInterval function:
1) Removes dead code for handling unit type RESERVE. There used to be
a unit called "invalid" that was of type RESERVE. At some point that
unit was removed and there were no more units of type RESERVE.
Therefore, the code for RESERVE unit handling is unreachable.
2) Restrict the unit "ago" to only appear at the end of the
interval. According to the docs [0], this is the only valid place to
put it, but we allowed it multiple times at any point in the input.
3) Error when the user has multiple consecutive units or a unit without
an accompanying value. I spent a lot of time trying to come up with
robust ways to detect this and ultimately settled on using the "type"
field. I'm not entirely happy with this approach, because it involves
having to look ahead to the next field in a couple of places. The other
approach I was considering was to introduce a boolean flag called
"unhandled_unit". After parsing a unit it would be set to true, after
applying the unit to a number it would be set to false. If it was true
right before parsing a unit, then we would error. Please let me know
if you have any suggestions here.
There is one more problem I noticed, but didn't fix. We allow multiple
"@" to be sprinkled anywhere in the input, even though the docs [0]
only allow it to appear at the beginning of the input. For example,
the following query works fine:
# SELECT INTERVAL '1 @ year @ @ @ 6 days @ 10 @ minutes';
interval
------------------------
1 year 6 days 00:10:00
(1 row)
Unfortunately, this can't be fixed in DecodeInterval, because all of
the "@" fields are filtered out before this method. Additionally, I
believe this means that the lines
if (type == IGNORE_DTF)
continue;
in DecodeInterval, that appears right after decoding the units, are unreachable since
"@" is the only unit of type IGNORE_DTF. Since "@" is filtered out,
we'll never decode a unit of type IGNORE_DTF.
For reference, I previously merged a couple similar patches to this
one, but for other date-time types [1], [2].
1) Removes dead code for handling unit type RESERVE. There used to be
a unit called "invalid" that was of type RESERVE. At some point that
unit was removed and there were no more units of type RESERVE.
Therefore, the code for RESERVE unit handling is unreachable.
2) Restrict the unit "ago" to only appear at the end of the
interval. According to the docs [0], this is the only valid place to
put it, but we allowed it multiple times at any point in the input.
3) Error when the user has multiple consecutive units or a unit without
an accompanying value. I spent a lot of time trying to come up with
robust ways to detect this and ultimately settled on using the "type"
field. I'm not entirely happy with this approach, because it involves
having to look ahead to the next field in a couple of places. The other
approach I was considering was to introduce a boolean flag called
"unhandled_unit". After parsing a unit it would be set to true, after
applying the unit to a number it would be set to false. If it was true
right before parsing a unit, then we would error. Please let me know
if you have any suggestions here.
There is one more problem I noticed, but didn't fix. We allow multiple
"@" to be sprinkled anywhere in the input, even though the docs [0]
only allow it to appear at the beginning of the input. For example,
the following query works fine:
# SELECT INTERVAL '1 @ year @ @ @ 6 days @ 10 @ minutes';
interval
------------------------
1 year 6 days 00:10:00
(1 row)
Unfortunately, this can't be fixed in DecodeInterval, because all of
the "@" fields are filtered out before this method. Additionally, I
believe this means that the lines
if (type == IGNORE_DTF)
continue;
in DecodeInterval, that appears right after decoding the units, are unreachable since
"@" is the only unit of type IGNORE_DTF. Since "@" is filtered out,
we'll never decode a unit of type IGNORE_DTF.
For reference, I previously merged a couple similar patches to this
one, but for other date-time types [1], [2].
Thanks,
Joe Koshakow
[0] https://www.postgresql.org/docs/current/datatype-datetime.html#DATATYPE-INTERVAL-INPUT
[1] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=5b3c5953553bb9fb0b171abc6041e7c7e9ca5b4d
[2] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=bcc704b52490492e6bd73c4444056b3e9644504d
Attachment
Hi Joe, here's a partial review: On Sun, Apr 9, 2023 at 5:44 PM Joseph Koshakow <koshy44@gmail.com> wrote: > 1) Removes dead code for handling unit type RESERVE. Looks good. From a quick skim it looks like the ECPG copy of this code (ecpg/pgtypeslib/interval.c) might need to be updated as well? > 2) Restrict the unit "ago" to only appear at the end of the > interval. According to the docs [0], this is the only valid place to > put it, but we allowed it multiple times at any point in the input. Also looks reasonable to me. (Same note re: ECPG.) > 3) Error when the user has multiple consecutive units or a unit without > an accompanying value. I spent a lot of time trying to come up with > robust ways to detect this and ultimately settled on using the "type" > field. I'm not entirely happy with this approach, because it involves > having to look ahead to the next field in a couple of places. The other > approach I was considering was to introduce a boolean flag called > "unhandled_unit". After parsing a unit it would be set to true, after > applying the unit to a number it would be set to false. If it was true > right before parsing a unit, then we would error. Please let me know > if you have any suggestions here. I'm new to this code, but I agree that the use of `type` and the lookahead are not particularly obvious/intuitive. At the very least, they'd need some more explanation in the code. Your boolean flag idea sounds reasonable, though. > There is one more problem I noticed, but didn't fix. We allow multiple > "@" to be sprinkled anywhere in the input, even though the docs [0] > only allow it to appear at the beginning of the input. (No particular opinion on this.) It looks like this patch needs a rebase for the CI, too, but there are no conflicts. Thanks! --Jacob
Jacob Champion <jchampion@timescale.com> writes: > Hi Joe, here's a partial review: > On Sun, Apr 9, 2023 at 5:44 PM Joseph Koshakow <koshy44@gmail.com> wrote: >> 1) Removes dead code for handling unit type RESERVE. > Looks good. From a quick skim it looks like the ECPG copy of this code > (ecpg/pgtypeslib/interval.c) might need to be updated as well? The ECPG datetime datatype support code has been basically unmaintained for years, and has diverged quite far at this point from the core code. I wouldn't expect that a patch to the core code necessarily applies easily to ECPG, nor would I expect somebody patching the core to bother trying. Perhaps modernizing/resyncing that ECPG code would be a worthwhile undertaking, but it'd be a mighty large one, and I'm not sure about the size of the return. In the meantime, benign neglect is the policy. regards, tom lane
Jacob Champion <jchampion@timescale.com> writes:
> Hi Joe, here's a partial review:
Thanks so much for the review!
> I'm new to this code, but I agree that the use of `type` and the
> lookahead are not particularly obvious/intuitive. At the very least,
> they'd need some more explanation in the code. Your boolean flag idea
> sounds reasonable, though.
I've updated the patch with the boolean flag idea. I think it's a
bit cleaner and more readable.
>> There is one more problem I noticed, but didn't fix. We allow multiple
>> "@" to be sprinkled anywhere in the input, even though the docs [0]
>> only allow it to appear at the beginning of the input.
>
> (No particular opinion on this.)
I looked into this a bit. The reason this works is because the date
time lexer filters out all punctuation. That's what allows us to parse
things like `SELECT date 'January 8, 1999';`. It's probably not worth
trying to be smarter about what punctuation we allow where, at least
for now. Maybe in the future we can exclude "@" from the punctuation
that get's filtered out.
> It looks like this patch needs a rebase for the CI, too, but there are
> no conflicts.
The attached patch is rebased against master.
Thanks,
Joe Koshakow
Attachment
On Fri, Jul 7, 2023 at 4:13 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > The ECPG datetime datatype support code has been basically unmaintained > for years, and has diverged quite far at this point from the core code. I was under the impression that anything in the postgresql.git repository is considered core, and hence maintained as one unit, from release to release. An example of this, to me, were all the contrib/* modules. > I wouldn't expect that a patch to the core code necessarily applies > easily to ECPG, nor would I expect somebody patching the core to bother > trying. The above statement makes me think that only the code inside src/backend/ is considered core. Is that the right assumption? > Perhaps modernizing/resyncing that ECPG code would be a worthwhile > undertaking, but it'd be a mighty large one, and I'm not sure about > the size of the return. In the meantime, benign neglect is the policy. Benign neglect doesn't sound nice from a user's/consumer's perspective. Can it be labeled (i.e. declared as such in docs) as deprecated. Knowing that the tool you use has now been deprecated would be a better message for someone still using it, even if it was left marked deprecated indefinitely. Discovering benign neglect for the tool you depend on, from secondary sources (like this list, forums, etc.), does not evoke a lot of confidence. Best regards, Gurjeet http://Gurje.et
Gurjeet Singh <gurjeet@singh.im> writes: > On Fri, Jul 7, 2023 at 4:13 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: >> The ECPG datetime datatype support code has been basically unmaintained >> for years, and has diverged quite far at this point from the core code. > I was under the impression that anything in the postgresql.git > repository is considered core, and hence maintained as one unit, from > release to release. When I say that ecpglib is next door to unmaintained, I'm just stating the facts on the ground; project policy is irrelevant. That situation is not likely to change until somebody steps up to do (a lot of) work on it, which is probably unlikely to happen unless we start getting actual complaints from ECPG users. For the meantime, what's there now seems to be satisfactory to whoever is using it ... which might be nobody? In any case, you don't have to look far to notice that some parts of the tree are maintained far more actively than others. ecpglib is just one of the more identifiable bits that's receiving little love. The quality of the code under contrib/ is wildly variable, and even the server code itself has backwaters. (For instance, the "bit" types, which aren't even in the standard anymore; or the geometric types, or "money".) By and large, I don't see this unevenness of maintenance effort as a problem. It's more like directing our limited resources into the most useful places. Code that isn't getting worked on is either not used at all by anybody, or it serves the needs of those who use it well enough already. Since it's difficult to tell which of those cases applies, removing code just because it's not been improved lately is a hard choice to sell. But so is putting maintenance effort into code that there might be little audience for. In the end we solve this via the principle of "scratch your own itch": if somebody is concerned enough about a particular piece of code to put their own time into improving it, then great, it'll get improved. > Benign neglect doesn't sound nice from a user's/consumer's > perspective. Can it be labeled (i.e. declared as such in docs) as > deprecated. Deprecation would imply that we're planning to remove it, which we are not. regards, tom lane
On Sat, Jul 8, 2023 at 1:33 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Gurjeet Singh <gurjeet@singh.im> writes: > > On Fri, Jul 7, 2023 at 4:13 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > >> The ECPG datetime datatype support code has been basically unmaintained > >> for years, and has diverged quite far at this point from the core code. > > > I was under the impression that anything in the postgresql.git > > repository is considered core, and hence maintained as one unit, from > > release to release. > > When I say that ecpglib is next door to unmaintained, I'm just stating > the facts on the ground; project policy is irrelevant. That situation > is not likely to change until somebody steps up to do (a lot of) work > on it, which is probably unlikely to happen unless we start getting > actual complaints from ECPG users. For the meantime, what's there now > seems to be satisfactory to whoever is using it ... which might be > nobody? > > In any case, you don't have to look far to notice that some parts of > the tree are maintained far more actively than others. ecpglib is > just one of the more identifiable bits that's receiving little love. > The quality of the code under contrib/ is wildly variable, and even > the server code itself has backwaters. (For instance, the "bit" types, > which aren't even in the standard anymore; or the geometric types, > or "money".) Thanks for sharing your view on different parts of the code. This give a clear direction if someone would be interested in stepping up. As part of my mentoring a GSoC 2023 participant, last night we were looking at the TODO wiki page, for something for the mentee to pick up next. I feel the staleness/deficiencies you mention above are not captured in the TODO wiki page. It'd be nice if these were documented, so that newcomers to the community can pick up work that they feel is an easy lift for them. > By and large, I don't see this unevenness of maintenance effort as > a problem. It's more like directing our limited resources into the > most useful places. Code that isn't getting worked on is either not > used at all by anybody, or it serves the needs of those who use it > well enough already. Since it's difficult to tell which of those > cases applies, removing code just because it's not been improved > lately is a hard choice to sell. But so is putting maintenance effort > into code that there might be little audience for. In the end we > solve this via the principle of "scratch your own itch": if somebody > is concerned enough about a particular piece of code to put their own > time into improving it, then great, it'll get improved. > > > Benign neglect doesn't sound nice from a user's/consumer's > > perspective. Can it be labeled (i.e. declared as such in docs) as > > deprecated. > > Deprecation would imply that we're planning to remove it, which > we are not. Good to know. Sorry I took "benign neglect" to mean that there's no willingness to improve it. This makes it clear that community wants to improve and maintain ECPG, it's just a matter of someone volunteering, and better use of the resources available. Best regards, Gurjeet http://Gurje.et
On Sat, Jul 8, 2023 at 5:06 PM Gurjeet Singh <gurjeet@singh.im> wrote:
> I feel the staleness/deficiencies you mention above are not
> captured in the TODO wiki page. It'd be nice if these were documented,
> so that newcomers to the community can pick up work that they feel is
> an easy lift for them.
I think that's a good idea. I've definitely been confused by this in
previous patches I've submitted.
I've broken up this patch into three logical commits and attached them.
None of the actual code has changed.
Thanks,
Joe Koshakow
> I feel the staleness/deficiencies you mention above are not
> captured in the TODO wiki page. It'd be nice if these were documented,
> so that newcomers to the community can pick up work that they feel is
> an easy lift for them.
I think that's a good idea. I've definitely been confused by this in
previous patches I've submitted.
I've broken up this patch into three logical commits and attached them.
None of the actual code has changed.
Thanks,
Joe Koshakow
Attachment
On Sat, 2023-07-08 at 13:18 -0400, Joseph Koshakow wrote: > Jacob Champion <jchampion@timescale.com> writes: > > Hi Joe, here's a partial review: > > Thanks so much for the review! > > > I'm new to this code, but I agree that the use of `type` and the > > lookahead are not particularly obvious/intuitive. At the very > > least, > > they'd need some more explanation in the code. Your boolean flag > > idea > > sounds reasonable, though. > > I've updated the patch with the boolean flag idea. I think it's a > bit cleaner and more readable. > > > > There is one more problem I noticed, but didn't fix. We allow > > > multiple > > > "@" to be sprinkled anywhere in the input, even though the docs > > > [0] > > > only allow it to appear at the beginning of the input. > > > > (No particular opinion on this.) > > I looked into this a bit. The reason this works is because the date > time lexer filters out all punctuation. That's what allows us to > parse > things like `SELECT date 'January 8, 1999';`. It's probably not worth > trying to be smarter about what punctuation we allow where, at least > for now. Maybe in the future we can exclude "@" from the punctuation > that get's filtered out. > > > It looks like this patch needs a rebase for the CI, too, but there > > are > > no conflicts. > > The attached patch is rebased against master. > > Thanks, > Joe Koshakow Apologies, I'm posting a little behind the curve here. My initial thoughts on the original patch mirrored Jacob's re 1 and 2 - that it looked good, did we need to consider the modified ecpg copy (which has been answered by Tom). I didn't have have any strong thought re 3) or the '@'. The updated patch looks good to me. Seems a little clearer/cleaner. Thanks, Reid
On Sun, 2023-07-09 at 13:24 -0400, Joseph Koshakow wrote: > > I've broken up this patch into three logical commits and attached > them. > None of the actual code has changed. > > Thanks, > Joe Koshakow I made a another pass through the separated patches, it looks good to me. Thanks, Reid
On Mon, Jul 10, 2023 at 10:19 AM Reid Thompson <jreidthompson@nc.rr.com> wrote: > I made a another pass through the separated patches, it looks good to > me. LGTM too. (Thanks Tom for the clarification on ECPG.) Marked RfC. --Jacob
On Mon, Jul 10, 2023 at 10:42:31AM -0700, Jacob Champion wrote: > On Mon, Jul 10, 2023 at 10:19 AM Reid Thompson <jreidthompson@nc.rr.com> wrote: >> I made a another pass through the separated patches, it looks good to >> me. > > LGTM too. (Thanks Tom for the clarification on ECPG.) +SELECT INTERVAL '42 days 2 seconds ago ago'; +SELECT INTERVAL '2 minutes ago 5 days'; [...] +SELECT INTERVAL 'hour 5 months'; +SELECT INTERVAL '1 year months days 5 hours'; 0002 and 0003 make this stuff fail, but isn't there a risk that this breaks applications that relied on these accidental behaviors? Assuming that this is OK makes me nervous. -- Michael
Attachment
On Mon, Aug 21, 2023 at 10:39 PM Michael Paquier <michael@paquier.xyz> wrote: > 0002 and 0003 make this stuff fail, but isn't there a risk that this > breaks applications that relied on these accidental behaviors? > Assuming that this is OK makes me nervous. I wouldn't argue for backpatching, for sure, but I guess I saw this as falling into the same vein as 5b3c5953 and bcc704b52 which were already committed. --Jacob
On Tue, Aug 22, 2023 at 12:58 PM Jacob Champion <jchampion@timescale.com> wrote:
>
> On Mon, Aug 21, 2023 at 10:39 PM Michael Paquier <michael@paquier.xyz> wrote:
> > 0002 and 0003 make this stuff fail, but isn't there a risk that this
> > breaks applications that relied on these accidental behaviors?
> > Assuming that this is OK makes me nervous.
>
> I wouldn't argue for backpatching, for sure, but I guess I saw this as
> falling into the same vein as 5b3c5953 and bcc704b52 which were
> already committed.
I agree, I don't think we should try and backport this. As Jacob
highlighted, we've merged similar patches for other date time types.
If applications were relying on this behavior, the upgrade may be a
good time for them to re-evaluate their usage since it's outside the
documented spec and they may not be getting the units they're expecting
from intervals like '1 day month'.
Thanks,
Joe Koshakow
>
> On Mon, Aug 21, 2023 at 10:39 PM Michael Paquier <michael@paquier.xyz> wrote:
> > 0002 and 0003 make this stuff fail, but isn't there a risk that this
> > breaks applications that relied on these accidental behaviors?
> > Assuming that this is OK makes me nervous.
>
> I wouldn't argue for backpatching, for sure, but I guess I saw this as
> falling into the same vein as 5b3c5953 and bcc704b52 which were
> already committed.
I agree, I don't think we should try and backport this. As Jacob
highlighted, we've merged similar patches for other date time types.
If applications were relying on this behavior, the upgrade may be a
good time for them to re-evaluate their usage since it's outside the
documented spec and they may not be getting the units they're expecting
from intervals like '1 day month'.
Thanks,
Joe Koshakow
On Sun, Aug 27, 2023 at 04:14:00PM -0400, Joseph Koshakow wrote: > On Tue, Aug 22, 2023 at 12:58 PM Jacob Champion <jchampion@timescale.com> > wrote: >> I wouldn't argue for backpatching, for sure, but I guess I saw this as >> falling into the same vein as 5b3c5953 and bcc704b52 which were >> already committed. > > I agree, I don't think we should try and backport this. As Jacob > highlighted, we've merged similar patches for other date time types. > If applications were relying on this behavior, the upgrade may be a > good time for them to re-evaluate their usage since it's outside the > documented spec and they may not be getting the units they're expecting > from intervals like '1 day month'. I felt like asking anyway. I have looked at the patch series and the past compatibility changes in this area, and I kind of agree that this seems like an improvement against confusing interval values. So, I have applied 0001, 0002 and 0003 after more review. 0002 was a bit careless with the code indentation. In 0003, I was wondering a bit if we'd better set parsing_unit_val to false for AGO, but as we do a backward lookup and because after 0002 AGO can only be last, I've let the code as you have suggested, relying on the initial value of this variable. -- Michael