Thread: Re: [PATCHES] to_date() validation

Re: [PATCHES] to_date() validation

From
"Brendan Jurd"
Date:
On Sun, Sep 7, 2008 at 5:58 AM, Alex Hunsaker <badalex@gmail.com> wrote:
> Im just following this:
> http://wiki.postgresql.org/wiki/Reviewing_a_Patch so lets get started.
>

Hi Alex.  Thanks for taking the time to review my patch.

> Feature test: Everything seems to work as advertised. However before
> via sscanf() most limited the max length of the input.  Before they
> were just silently ignored now they get added to the result:
>
> patched:
> # SELECT to_timestamp('11111', 'HHMM');
>         to_timestamp
> ------------------------------
>  0009-03-19 11:00:00-06:59:56
>
> 8.3.3:
> # SELECT to_timestamp('11111', 'HHMM');
>       to_timestamp
> ---------------------------
>  0001-11-01 11:00:00-07 BC
>

It's an interesting point.  I had considered what would happen if the
input string was too short, but hadn't given much thought to what
would happen if it was too long.

In your example case, it isn't clear whether the user wanted to
specify 11 hours and 11 months (and the final '1' is just junk), or if
they really wanted to specify 11 hours and 111 months.

But consider a case like the following:

# SELECT to_date('07-09-20008', 'DD-MM-YYYY');

The documentation says that 'YYYY' is "4 and more digits", so you have
to assume that the user meant to say the year 20,008.  That's why the
code in the patch parses all the remaining digits in the input string
on the final node.

HEAD actually gets this one wrong; in defiance of the documentation it
returns 2000-09-07.  So, it seems to me that the patch shifts the
behaviour in the right direction.

Barring actually teaching the code that some nodes (like YYYY) can
take an open-ended number of characters, while others (like MM) must
take an exact number of characters, I'm not sure what can be done to
improve this.  Perhaps something for a later patch?

> Code review: one minor nit
> *** a/src/backend/utils/adt/formatting.c
> --- b/src/backend/utils/adt/formatting.c
> ***************
> *** 781,787 **** static const KeyWord DCH_keywords[] = {
>        {"y", 1, DCH_Y, TRUE, FROM_CHAR_DATE_GREGORIAN},
>
>        /* last */
> !       {NULL, 0, 0, 0}
>  };
>
>  /* ----------
> --- 781,787 ----
>        {"y", 1, DCH_Y, TRUE, FROM_CHAR_DATE_GREGORIAN},
>
>        /* last */
> !       {NULL, 0, 0, 0, 0}
>  };

Ah, thank you for picking that up.  I will correct this and send in a
new patch version in the next 24 hours.

Cheers,
BJ


Re: [PATCHES] to_date() validation

From
"Brendan Jurd"
Date:
On Mon, Sep 8, 2008 at 6:24 PM, Brendan Jurd <direvus@gmail.com> wrote:
> On Sun, Sep 7, 2008 at 5:58 AM, Alex Hunsaker <badalex@gmail.com> wrote:
>> Code review: one minor nit
>> *** a/src/backend/utils/adt/formatting.c
>> --- b/src/backend/utils/adt/formatting.c
>> ***************
>> *** 781,787 **** static const KeyWord DCH_keywords[] = {
>>        {"y", 1, DCH_Y, TRUE, FROM_CHAR_DATE_GREGORIAN},
>>
>>        /* last */
>> !       {NULL, 0, 0, 0}
>>  };
>>
>>  /* ----------
>> --- 781,787 ----
>>        {"y", 1, DCH_Y, TRUE, FROM_CHAR_DATE_GREGORIAN},
>>
>>        /* last */
>> !       {NULL, 0, 0, 0, 0}
>>  };
>
> Ah, thank you for picking that up.  I will correct this and send in a
> new patch version in the next 24 hours.
>

New version attached (and link added to wiki).

Cheers,
BJ

Attachment

Re: [PATCHES] to_date() validation

From
Martijn van Oosterhout
Date:
On Mon, Sep 08, 2008 at 06:24:14PM +1000, Brendan Jurd wrote:
> On Sun, Sep 7, 2008 at 5:58 AM, Alex Hunsaker <badalex@gmail.com> wrote:
> > Im just following this:
> > http://wiki.postgresql.org/wiki/Reviewing_a_Patch so lets get started.
> >
>
> Hi Alex.  Thanks for taking the time to review my patch.

I actually had a look at this patch also, though not as thoroughly as
Alex. There was one part that I had some thoughts about in from_char_parse_int_len:

!               /*
!                * We need to pull exactly the number of characters given in 'len' out
!                * of the string, and convert those.
!                */
<snip>
!               first = palloc(len + 1);
!               strncpy(first, init, len);
!               first[len] = '\0';

The use of palloc/pfree in this routine seems excessive. Does len have
upper bound? If so a simple array will do it.

!               if (strlen(first) < len)

Here you check the length of the remaining string and complain if it's
too short, but:

<snip>
!               result = strtol(first, &last, 10);
!               *src += (last - first);

Here you do not note if we didn't convert the entire string. So it
seems you are allowed to provide too few characters, as long as it's
not the last field?

Other than that it looks like a good patch.

Have a ncie day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Please line up in a tree and maintain the heap invariant while
> boarding. Thank you for flying nlogn airlines.

Re: [PATCHES] to_date() validation

From
"Brendan Jurd"
Date:
On Tue, Sep 9, 2008 at 7:29 PM, Martijn van Oosterhout
<kleptog@svana.org> wrote:
> I actually had a look at this patch also, though not as thoroughly as
> Alex. There was one part that I had some thoughts about in from_char_parse_int_len:
>

Hi Martijn.  Thanks for your comments.

> The use of palloc/pfree in this routine seems excessive. Does len have
> upper bound? If so a simple array will do it.
>

There isn't any *theoretical* upper bound on len.  However, in
practice, with the current set of formatting nodes, the largest len
that will ever be passed to rom_char_parse_int_len() is 6 (for the
microsecond 'US' node).

I suppose I could define a constant FORMATNODE_MAX_LEN, make it
something like 10 and just use that for copying the string, rather
than palloc().  I'll give it a try.

> !               if (strlen(first) < len)
>
> Here you check the length of the remaining string and complain if it's
> too short, but:
>
> <snip>
> !               result = strtol(first, &last, 10);
> !               *src += (last - first);
>
> Here you do not note if we didn't convert the entire string. So it
> seems you are allowed to provide too few characters, as long as it's
> not the last field?

That's true.  The only way to hit that condition would be to provide a
semi-bogus value in a string with no separators between the numbers.
For example:

postgres=# SELECT to_date('20081o13', 'YYYYMMDD');
ERROR:  invalid value for "DD" in source string
DETAIL:  Value must be an integer.

What happens here is that strtol() happily consumes the '1' for the
format node MM, and as you point out it does not complain that it
expected to consume two characters and only got one.  On the next node
(DD), the function tries to start parsing an integer, but the first
character it encounters is 'o', so it freaks out.

Certainly the error message here should be more apropos, and tell the
user that the problem is in the MM node.  Blaming the DD node is pure
red herring.

However, if the mistake is at the end of the string, there is no error at all:

postgres=# SELECT to_date('2008101!', 'YYYYMMDD'); to_date
------------2008-10-01

This is because the end of the string counts as a "separator", so we
just run an unbounded strtol() on whatever characters remain in the
string.

As discussed in my response to Alex's review, I made the end of the
string a separator so that things like 'DD-MM-YYYY' would actually
work for years with more than four digits.

Now I'm wondering if that was the wrong way to go.  The case for years
with more than four digits is extremely narrow, and if somebody really
wanted to parse '01-01-20008' as 1 Jan 20,008 they could always use
the 'FM' modifier to get the behaviour they want ('DD-MM-FMYYYY').

I'll send in a new version which addresses these issues.

Cheers,
BJ


Re: [PATCHES] to_date() validation

From
"Brendan Jurd"
Date:
On Tue, Sep 9, 2008 at 9:04 PM, Brendan Jurd <direvus@gmail.com> wrote:
> On Tue, Sep 9, 2008 at 7:29 PM, Martijn van Oosterhout
> <kleptog@svana.org> wrote:
>> The use of palloc/pfree in this routine seems excessive. Does len have
>> upper bound? If so a simple array will do it.
>>
>
> I suppose I could define a constant FORMATNODE_MAX_LEN, make it
> something like 10 and just use that for copying the string, rather
> than palloc().  I'll give it a try.
>

Turns out there was already a relevant constant defined in
formatting.c: DCH_MAX_ITEM_SIZ, set to 9.  So I just used that, +1 for
the trailing null.

>>
>> Here you do not note if we didn't convert the entire string. So it
>> seems you are allowed to provide too few characters, as long as it's
>> not the last field?
>
> That's true.  The only way to hit that condition would be to provide a
> semi-bogus value in a string with no separators between the numbers.

I've now plugged this hole.  I added the following error for
fixed-width inputs that are too short:

postgres=# SELECT to_date('200%1010', 'YYYYMMDD');
ERROR:  invalid value for "YYYY" in source string
DETAIL:  Field requires 4 characters, but only 3 could be parsed.
HINT:  If your source string is not fixed-width, try using the "FM" modifier.

I've attached a new version of the patch (version 3), as well as an
incremental patch to show the differences between versions 2 and 3.

Cheers,
BJ

Attachment

Re: [PATCHES] to_date() validation

From
"Alex Hunsaker"
Date:
On Mon, Sep 8, 2008 at 2:24 AM, Brendan Jurd <direvus@gmail.com> wrote:
> HEAD actually gets this one wrong; in defiance of the documentation it
> returns 2000-09-07.  So, it seems to me that the patch shifts the
> behaviour in the right direction.
>
> Barring actually teaching the code that some nodes (like YYYY) can
> take an open-ended number of characters, while others (like MM) must
> take an exact number of characters, I'm not sure what can be done to
> improve this.  Perhaps something for a later patch?

Sound good to me and I would probably argue that things like MM should
not be hard coded to take only 2 chars...
But then again to play devils advocate I can just as easily do things
like to_char(...) + '30 months'::interval;


Re: [PATCHES] to_date() validation

From
"Alex Hunsaker"
Date:
On Tue, Sep 9, 2008 at 6:46 AM, Brendan Jurd <direvus@gmail.com> wrote:
> On Tue, Sep 9, 2008 at 9:04 PM, Brendan Jurd <direvus@gmail.com> wrote:
>> On Tue, Sep 9, 2008 at 7:29 PM, Martijn van Oosterhout
>> <kleptog@svana.org> wrote:
>>> The use of palloc/pfree in this routine seems excessive. Does len have
>>> upper bound? If so a simple array will do it.
>>>
>>

Good catch Martijn!

>> I suppose I could define a constant FORMATNODE_MAX_LEN, make it
>> something like 10 and just use that for copying the string, rather
>> than palloc().  I'll give it a try.
>>
>
> Turns out there was already a relevant constant defined in
> formatting.c: DCH_MAX_ITEM_SIZ, set to 9.  So I just used that, +1 for
> the trailing null.

Cool.

>>>
>>> Here you do not note if we didn't convert the entire string. So it
>>> seems you are allowed to provide too few characters, as long as it's
>>> not the last field?
>>
>> That's true.  The only way to hit that condition would be to provide a
>> semi-bogus value in a string with no separators between the numbers.
>
> I've now plugged this hole.  I added the following error for
> fixed-width inputs that are too short:
>
> postgres=# SELECT to_date('200%1010', 'YYYYMMDD');
> ERROR:  invalid value for "YYYY" in source string
> DETAIL:  Field requires 4 characters, but only 3 could be parsed.
> HINT:  If your source string is not fixed-width, try using the "FM" modifier.

I think thats a big improvement.

> I've attached a new version of the patch (version 3), as well as an
> incremental patch to show the differences between versions 2 and 3.

I looked it over, looks good to me!

> Cheers,
> BJ
>


Re: [PATCHES] to_date() validation

From
Tom Lane
Date:
"Brendan Jurd" <direvus@gmail.com> writes:
> I've attached a new version of the patch (version 3), as well as an
> incremental patch to show the differences between versions 2 and 3.

Applied with minimal modifications.  I changed a couple of error
messages that didn't seem to meet the style guidelines, and I moved all
of the to_timestamp tests into horology.sql, rather than having
essentially duplicate tests in timestamp.sql and timestamptz.sql.
        regards, tom lane


Re: [PATCHES] to_date() validation

From
"Brendan Jurd"
Date:
On Fri, Sep 12, 2008 at 3:34 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Applied with minimal modifications.  I changed a couple of error
> messages that didn't seem to meet the style guidelines,

Great, thanks Tom.  And thanks again to Alex and Martijn for helping
me refine the patch.

> and I moved all
> of the to_timestamp tests into horology.sql, rather than having
> essentially duplicate tests in timestamp.sql and timestamptz.sql.
>

Nice.  That will make maintaining/extending the tests easier in future.

Cheers,
BJ