Thread: Meridiem markers (was: [BUGS] Incorrect "invalid AM/PM string" error from to_timestamp)

Hey folks,

Tom pointed out in the parent thread [1] that the error message for
bogus AM/PM markers in to_timestamp is pretty lame:
   ERROR:  invalid AM/PM string

I agree, and once I started thinking about this, I came up with other
gripes concerning the treatment of 12-hour time in to_timestamp, the
use of the meridiem markers (AM/PM) in particular.

Currently, Postgres accepts four separate flavours for specifying
meridiem markers, given by uppercase/lowercase and with/without
periods:
* am/pm* AM/PM* a.m./p.m.* A.M./P.M.

What I find surprising about the implementation is that you must
specify the correct formatting pattern for the particular flavour of
meridiem marker in your input string, or you get the aforementioned
lame error message.  Consider:

postgres=# select to_timestamp('11:47 pm 27 Sep 2008', 'HH:MI PM DD Mon YYYY');
ERROR:  invalid AM/PM string

Here it seems to me that Postgres is being unnecessarily pedantic.
Yes, the case of the marker differs from the case of the formatting
keyword, but is that really grounds for an ERROR?  The user's
intention is perfectly unambiguous, so why not just accept the input?

I would go so far as to say that we should accept any of the 8 valid
meridiem markers, regardless of which flavour is indicated by the
formatting keyword.

Day and month names already work this way.  We don't throw an error if
a user specifies a mixed-case month name like "Sep" but uses the
uppercase formatting keyword "MON".

I suspect that the current behaviour isn't so much a deliberate choice
to be draconian as it is a consequence of the way the code was
developed; as the inverse of to_char().

One way to tidy this up would be to re-implement the meridiem markers
using the seq_search functions, i.e., make it work like the day and
month names.  This would make it easy to accept any flavour of marker,
and the error messages thrown for bogus input would then be the same
as those for bogus day and month names.

Note that all of the above applies equally to the era markers B.C. and A.D.

Comments?

Cheers,
BJ

[1] http://archives.postgresql.org/message-id/8148.1222359751@sss.pgh.pa.us


"Brendan Jurd" <direvus@gmail.com> writes:
> One way to tidy this up would be to re-implement the meridiem markers
> using the seq_search functions, i.e., make it work like the day and
> month names.  This would make it easy to accept any flavour of marker,
> and the error messages thrown for bogus input would then be the same
> as those for bogus day and month names.

> Note that all of the above applies equally to the era markers B.C. and A.D.

+1 ... making these work like month names makes sense to me.
        regards, tom lane


On Fri, Sep 26, 2008 at 11:25 AM, Brendan Jurd <direvus@gmail.com> wrote:
> One way to tidy this up would be to re-implement the meridiem markers
> using the seq_search functions, i.e., make it work like the day and
> month names.  This would make it easy to accept any flavour of marker,
> and the error messages thrown for bogus input would then be the same
> as those for bogus day and month names.

Yeah if we seq_search then it should be a pretty easy conversion. so +1

However that still leaves the original complaint around (at least IMHO):

select to_timestamp('AN', 'AM');
ERROR:  invalid AM/PM string

select to_timestamp('11:47 PM 27 Sep a2008', 'HH:MI PM DD Mon YYYY');
ERROR: invalid value for "YYYY" in source string


Now arguably most to_timestamp calls are going to be short so i can
easily look for the YYYY in my string and see what I did wrong (and
DETAIL: provides Value must be an integer in the second case)... So
really maybe thats good enough I dunno... Personally I find the output
of my dumb patch to be better than both above:

select to_timestamp('AN', 'AM');
ERROR: invalid AM/PM string for 'AN'

And we could improve that by only showing node->key->len chars.  And
make that work for both AM/PM and the others YYYY etc

Then again maybe its not worth it?


"Alex Hunsaker" <badalex@gmail.com> writes:
> However that still leaves the original complaint around (at least IMHO):

> select to_timestamp('AN', 'AM');
> ERROR:  invalid AM/PM string

> select to_timestamp('11:47 PM 27 Sep a2008', 'HH:MI PM DD Mon YYYY');
> ERROR: invalid value for "YYYY" in source string

Yeah, it would be a lot better if it said
ERROR: invalid value for "YYYY": "a2008"

The DETAIL is good too, but it's no substitute for showing the exact
substring that the code is unhappy about.
        regards, tom lane


On Sat, Sep 27, 2008 at 4:08 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> "Alex Hunsaker" <badalex@gmail.com> writes:
>> However that still leaves the original complaint around (at least IMHO):
>
>> select to_timestamp('AN', 'AM');
>> ERROR:  invalid AM/PM string
>
>> select to_timestamp('11:47 PM 27 Sep a2008', 'HH:MI PM DD Mon YYYY');
>> ERROR: invalid value for "YYYY" in source string
>
> Yeah, it would be a lot better if it said
>
>        ERROR: invalid value for "YYYY": "a2008"
>
> The DETAIL is good too, but it's no substitute for showing the exact
> substring that the code is unhappy about.
>

I agree.  And with my proposed changes to the meridiem/era marker
code, it will be much easier to improve the error messages in a
consistent way.

I'll work on a patch for the Nov commitfest.

Cheers,
BJ


Brendan Jurd wrote:
> On Sat, Sep 27, 2008 at 4:08 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > "Alex Hunsaker" <badalex@gmail.com> writes:
> >> However that still leaves the original complaint around (at least IMHO):
> >
> >> select to_timestamp('AN', 'AM');
> >> ERROR:  invalid AM/PM string
> >
> >> select to_timestamp('11:47 PM 27 Sep a2008', 'HH:MI PM DD Mon YYYY');
> >> ERROR: invalid value for "YYYY" in source string
> >
> > Yeah, it would be a lot better if it said
> >
> >        ERROR: invalid value for "YYYY": "a2008"
> >
> > The DETAIL is good too, but it's no substitute for showing the exact
> > substring that the code is unhappy about.
> >
> 
> I agree.  And with my proposed changes to the meridiem/era marker
> code, it will be much easier to improve the error messages in a
> consistent way.
> 
> I'll work on a patch for the Nov commitfest.

Brendan, did you ever complete this patch?

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com
 + If your life is a hard drive, Christ can be your backup. +


On Thu, Jan 8, 2009 at 2:10 PM, Bruce Momjian <bruce@momjian.us> wrote:
> Brendan, did you ever complete this patch?
>

I didn't, no.  I still intend on doing work in this area, but
obviously it will have to be in the 8.5 cycle.

Cheers,
BJ


Brendan Jurd wrote:
> On Thu, Jan 8, 2009 at 2:10 PM, Bruce Momjian <bruce@momjian.us> wrote:
> > Brendan, did you ever complete this patch?
> >
> 
> I didn't, no.  I still intend on doing work in this area, but
> obviously it will have to be in the 8.5 cycle.

OK, what is the TODO?  It is more than AM/PM, right?

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com
 + If your life is a hard drive, Christ can be your backup. +


On Thu, Jan 8, 2009 at 2:20 PM, Bruce Momjian <bruce@momjian.us> wrote:
> OK, what is the TODO?  It is more than AM/PM, right?
>

I see it happening in two stages.

Stage 1 is updating the AM/PM parse code to use the seq_search
technique, which may involve some minor refactoring around seq_search
itself.  This will get us the relaxed validation rules I was talking
about before (i.e., "am" is an acceptable spelling for "AM" and vice
versa).

Stage 2 is improving the error message reporting for all the bits that
use seq_search.

So I would probably make two TODOs, descriptions could be like:

* Relax validation for AM/PM markers in to_timestamp() by using seq_search
* Make to_timestamp() error messages more specific, and display the
actual value which caused the error.

Cheers,
BJ


On Sat, Sep 27, 2008 at 4:25 AM, Brendan Jurd <direvus@gmail.com> wrote:
> Currently, Postgres accepts four separate flavours for specifying
> meridiem markers, given by uppercase/lowercase and with/without
> periods:
>
>  * am/pm
>  * AM/PM
>  * a.m./p.m.
>  * A.M./P.M.

>
> I would go so far as to say that we should accept any of the 8 valid
> meridiem markers, regardless of which flavour is indicated by the
> formatting keyword.
>
> Day and month names already work this way.  We don't throw an error if
> a user specifies a mixed-case month name like "Sep" but uses the
> uppercase formatting keyword "MON".

I've been thinking further about this lately, and whilst the month and
day name tokens aren't fussy about *case*, they do make a distinction
about *length*.

So, while MON will match "Sep", "SEP" and "sep" just fine, it will
have issues with "September" (it will match the first three characters
as "Sep" and then leave the remaining characters "tember" to bork up
the next token).

Likewise, MONTH will not match "Sep", it needs the full month name.

I think, for to_timestamp(), it's important that the user have a solid
idea of how many characters each formatting token wants to consume.
With the am/pm and bc/ad markers, we've got two possibilities for
length; without periods (2 characters) and with periods (4
characters).  Having the 2-character token match against a 4-character
string might cause more confusion than convenience.

It may make more sense to keep the different lengths separate, so that
a 2-character token will match any of "am", "pm", "AM", "PM", and a
4-character token will match any of "a.m.", "p.m.", "A.M.", "P.M.".

Comments?

Cheers,
BJ


On Sat, Sep 27, 2008 at 4:25 AM, Brendan Jurd <direvus@gmail.com> wrote:
> One way to tidy this up would be to re-implement the meridiem markers
> using the seq_search functions, i.e., make it work like the day and
> month names.  This would make it easy to accept any flavour of marker,
> and the error messages thrown for bogus input would then be the same
> as those for bogus day and month names.
>

>From the better-late-than-never department, comes a patch to improve
the handling of AM/PM and AD/BC markers in to_timestamp(), and up the
ante on error reporting for various kinds of invalid input.

The only difference as far as parsing goes is that we no longer worry
about the case of AM/PM and AD/BC.  As long as you use the right
variation for with/without periods, it will parse.

Internally, the code now uses seq_search to parse the markers, so they
work in much the same way as month and day names do.

I improved the error messages for when a seq_search fails, and when
the code fails to parse an integer.  Some examples of the new error
messages:

postgres=# select to_timestamp('Tue 20 Foo 2009 11:39 PM', 'Dy DD Mon
YYYY HH:MI PM');
ERROR:  invalid value "Foo" for "Mon"
DETAIL:  The given value did not match any of the allowed values for this field.

postgres=# select to_timestamp('Tue 20 Jan 2009 11:39 pn', 'Dy DD Mon
YYYY HH:MI PM');
ERROR:  invalid value "PN" for "PM"
DETAIL:  The given value did not match any of the allowed values for this field.

postgres=# select to_timestamp('Tue 20 Jan 2009 23:39', 'Dy DD Mon YYYY HH:MI');
ERROR:  hour "23" is invalid for the 12-hour clock
HINT:  Use the 24-hour clock, or give an hour between 1 and 12.

postgres=# select to_timestamp('Tue 20 Jan 2009 xx:39 pm', 'Dy DD Mon
YYYY HH:MI PM');
ERROR:  invalid value "xx" for "HH"
DETAIL:  Value must be an integer.

This resolves TODO item "Improve to_timestamp() handling of AM/PM, and
error messages".  I've added the patch to the 2009 commitfest.

Cheers,
BJ

 doc/src/sgml/func.sgml                 |   24 +--
 src/backend/utils/adt/formatting.c     |  235 ++++++++++++++++-----------------
 src/test/regress/expected/horology.out |   10 -
 src/test/regress/sql/horology.sql      |    4
 4 files changed, 137 insertions(+), 136 deletions(-)

Attachment
On Mon, Jan 19, 2009 at 15:26, Brendan Jurd <direvus@gmail.com> wrote:
> From the better-late-than-never department, comes a patch to improve
> the handling of AM/PM and AD/BC markers in to_timestamp(), and up the
> ante on error reporting for various kinds of invalid input.

Ok cool I tried this out and gave it a look over.  Looks good to me,
only a few nits below:

-the various (only moved not added)
strncpy(...)
copy[len]  = '\0';

just use strlcpy?

-it seemed we were going to do the error messages the other way around
"invalid value for "YYYY" in "a009"
instead of
"invalid value "a009" for "YYYY"

but I think I like the latter (what you did in your patch) better
anyways... so never mind


On Mon, Jan 26, 2009 at 8:32 AM, Alex Hunsaker <badalex@gmail.com> wrote:
>
> -the various (only moved not added)
> strncpy(...)
> copy[len]  = '\0';
>
> just use strlcpy?
>

Thanks for having a look Alex.

strlcpy would be easier, but I thought there might be portability
concerns re its non-entirely-standard status.

A quick grep through the backend code shows that strlcpy and strncpy
are both in use, with neither having a clear majority.  I used strncpy
because it is more prevalent within src/backend/utils/adt.

Happy to use strlcpy if it is acceptable though.  Anybody else have an
opinion on this?

Cheers,
BJ


Brendan Jurd <direvus@gmail.com> writes:
> A quick grep through the backend code shows that strlcpy and strncpy
> are both in use, with neither having a clear majority.  I used strncpy
> because it is more prevalent within src/backend/utils/adt.

strncpy is generally deprecated; any remaining uses you find of it
are probably only there for lack of round tuits.  Use strlcpy in new
code, unless there's a pretty strong argument that strncpy is actually
clearer for a given usage.

(No, it's not a portability issue, because we have our own copy for
platforms without it; see src/port/.)
        regards, tom lane


On Tue, Jan 27, 2009 at 3:05 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Brendan Jurd <direvus@gmail.com> writes:
>> A quick grep through the backend code shows that strlcpy and strncpy
>> are both in use, with neither having a clear majority.  I used strncpy
>> because it is more prevalent within src/backend/utils/adt.
>
> strncpy is generally deprecated; any remaining uses you find of it
> are probably only there for lack of round tuits.  Use strlcpy in new
> code, unless there's a pretty strong argument that strncpy is actually
> clearer for a given usage.
>

Thanks for clearing that up Tom.  I will update the patch to use
strlcpy shortly.

Cheers,
BJ


On Tue, Jan 27, 2009 at 3:25 PM, Brendan Jurd <direvus@gmail.com> wrote:
> On Tue, Jan 27, 2009 at 3:05 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>
>> strncpy is generally deprecated; any remaining uses you find of it
>> are probably only there for lack of round tuits.  Use strlcpy in new
>> code, unless there's a pretty strong argument that strncpy is actually
>> clearer for a given usage.
>>
>
> Thanks for clearing that up Tom.  I will update the patch to use
> strlcpy shortly.
>

Okay, I've attached a new version of the patch with strlcpy used in
from_char_set_int_len() and from_char_seq_search().  I've also
included an incremental patch which shows the differences between
version 0 and version 1.

Cheers,
BJ

Attachment
Brendan Jurd wrote:
> On Tue, Jan 27, 2009 at 3:25 PM, Brendan Jurd <direvus@gmail.com> wrote:
> > On Tue, Jan 27, 2009 at 3:05 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >>
> >> strncpy is generally deprecated; any remaining uses you find of it
> >> are probably only there for lack of round tuits.  Use strlcpy in new
> >> code, unless there's a pretty strong argument that strncpy is actually
> >> clearer for a given usage.
> >>
> >
> > Thanks for clearing that up Tom.  I will update the patch to use
> > strlcpy shortly.
> >
> 
> Okay, I've attached a new version of the patch with strlcpy used in
> from_char_set_int_len() and from_char_seq_search().  I've also
> included an incremental patch which shows the differences between
> version 0 and version 1.

I consider this a bug fix so I have applied it to CVS HEAD, even though
the patch arrived after the commit fest started.  Thanks.

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com
 + If your life is a hard drive, Christ can be your backup. +