Thread: Allow to_date() and to_timestamp() to accept localized names
Hello,

Going through the current items in the wiki's todo list, I have been looking into: "Allow to_date () and to_timestamp () to accept localized month names". It seems to me that the code is almost there, so I would like to know if something like the attached patch would be a reasonable way to go.

Regards,

Juan José Santamaría Flecha
Attachment
Re: Allow to_date() and to_timestamp() to accept localized names
From
Juan José Santamaría Flecha
Date:
On Sun, Aug 18, 2019 at 10:42 AM Juan José Santamaría Flecha <juanjo.santamaria@gmail.com> wrote:
>
> Going through the current items in the wiki's todo list, I have been
> looking into: "Allow to_date () and to_timestamp () to accept
> localized month names".
>

I have gone through a second take on this, trying to give it a better shape, but it would surely benefit from some review, so I will open an item in the commitfest.

Regards,

Juan José Santamaría Flecha
Attachment
Re: Allow to_date() and to_timestamp() to accept localized names
From
Juan José Santamaría Flecha
Date:
On Thu, Aug 22, 2019 at 9:38 PM Juan José Santamaría Flecha <juanjo.santamaria@gmail.com> wrote:
>
> > Going through the current items in the wiki's todo list, I have been
> > looking into: "Allow to_date () and to_timestamp () to accept
> > localized month names".
> >
>
> I have gone through a second take on this, trying to give it a better
> shape but it would surely benefit from some review, so I will open an
> item in the commitfest.

For reviewers, the current approach for this patch is:

Break seq_search() into two functions:

* seq_search_sqlascii(), which covers seq_search()'s current usage.
* seq_search_localized(), similar to the previous one but with support for multibyte input.

To avoid code duplication, most of the current seq_search() logic has been moved to a new function, str_compare(). from_char_seq_search() is now responsible for choosing between seq_search_sqlascii() and seq_search_localized().

Also, since the localized-names array is not null-terminated, seq_search() now receives its dimension as input, and the terminating nulls have been removed from the static arrays.

The commitfest item is: https://commitfest.postgresql.org/24/2255/

Regards,

Juan José Santamaría Flecha
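For readers unfamiliar with formatting.c, the core of seq_search() is a case-insensitive prefix match of the input against an array of known names. A minimal standalone sketch of that idea (plain C; toupper() stands in for the backend's case folding, the month array is illustrative, and the explicit nitems parameter mirrors the patch's "dimension" argument for non-null-terminated arrays):

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Return the index of the array entry that is a case-insensitive prefix
 * of "name", or -1 if none matches; *len receives the matched length. */
static int
seq_search_sketch(const char *name, const char *const *array,
                  size_t nitems, int *len)
{
    for (size_t i = 0; i < nitems; i++)
    {
        const char *p = array[i];
        const char *n = name;

        while (*p && *n &&
               toupper((unsigned char) *p) == toupper((unsigned char) *n))
            p++, n++;

        if (*p == '\0')         /* whole array entry matched */
        {
            *len = (int) (n - name);
            return (int) i;
        }
    }
    *len = 0;
    return -1;
}
```

Because the array length is passed explicitly, the same loop works for both null-terminated static arrays and the localized-name arrays the patch introduces.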
On 2019-Aug-22, Juan José Santamaría Flecha wrote:

> On Sun, Aug 18, 2019 at 10:42 AM Juan José Santamaría Flecha
> <juanjo.santamaria@gmail.com> wrote:
> >
> > Going through the current items in the wiki's todo list, I have been
> > looking into: "Allow to_date () and to_timestamp () to accept
> > localized month names".
>
> I have gone through a second take on this, trying to give it a better
> shape but it would surely benefit from some review, so I will open an
> item in the commitfest.

I'm confused why we acquire the MONTH_DIM / etc definitions. Can't we just use lengthof() of the corresponding array? AFAICS it should work just as well.

I wonder if the "compare first char" thing (seq_search_localized) really works when there are multibyte chars in the day/month names. I think the code compares just the first char ... but what if the original string uses those funny Unicode non-normalized letters and the locale translation uses normalized letters? My guess is that the first-char comparison will fail, but surely you'll want the name to match. (There's no month/day name in Spanish that doesn't start with an ASCII letter, but I bet there are some in other languages.) I think the localized comparison should avoid the first-char optimization, just compare the whole string all the time, and avoid possible weird issues.

But then, I'm not 100% sure of this code. formatting.c is madness distilled.

Thanks,

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
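The normalization concern is easy to demonstrate outside the backend: in UTF-8, a precomposed letter and its decomposed (base letter plus combining mark) form do not even share a first byte, so a byte-wise first-char shortcut rejects a pair that should match after Unicode normalization. A minimal illustration (plain C string literals, not PostgreSQL code):

```c
#include <assert.h>
#include <stdbool.h>

/* The first-char shortcut from seq_search, in isolation. */
static bool
first_byte_matches(const char *name, const char *entry)
{
    return name[0] == entry[0];
}

/* "é" in UTF-8: precomposed (U+00E9) vs decomposed ("e" + U+0301). */
static const char e_composed[]   = "\xc3\xa9";
static const char e_decomposed[] = "e\xcc\x81";
```

The two spellings render identically and compare equal after NFC normalization, yet their first bytes are 0xC3 and 'e' respectively, so the shortcut fails.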
Re: Allow to_date() and to_timestamp() to accept localized names
From
Juan José Santamaría Flecha
Date:
On Fri, Sep 13, 2019 at 10:31 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
>

Thanks for taking a look at this.

> I'm confused why we acquire the MONTH_DIM / etc definitions. Can't we
> just use lengthof() of the corresponding array? AFAICS it should work
> just as well.
>

It was because of the length difference between the ascii-name arrays, which were all null-ended, and the localized-name arrays. The attached version uses lengthof().

> I wonder if the "compare first char" thing (seq_search_localized) really
> works when there are multibyte chars in the day/month names. I think
> the code compares just the first char ... but what if the original
> string uses those funny Unicode non-normalized letters and the locale
> translation uses normalized letters? My guess is that the first-char
> comparison will fail, but surely you'll want the name to match.
> (There's no month/day name in Spanish that doesn't start with an ASCII
> letter, but I bet there are some in other languages.) I think the
> localized comparison should avoid the first-char optimization, just
> compare the whole string all the time, and avoid possible weird issues.

The localized search is reformulated in this version to deal with multibyte normalization. A regression test for this issue is included.

Regards,

Juan José Santamaría Flecha
Attachment
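For reference, lengthof() is PostgreSQL's ordinary element-count macro (defined in src/include/c.h), so it works whether or not an array carries a trailing NULL sentinel; a standalone equivalent:

```c
#include <assert.h>

/* Same definition as PostgreSQL's lengthof() in src/include/c.h */
#define lengthof(array) (sizeof (array) / sizeof ((array)[0]))

/* Illustrative copy of the months array without a trailing NULL */
static const char *const months_sketch[] = {
    "January", "February", "March", "April", "May", "June", "July",
    "August", "September", "October", "November", "December"
};
```

Since the count is computed at compile time from the array's declared size, no NULL terminator or separate *_DIM constant is needed.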
Re: Allow to_date() and to_timestamp() to accept localized names
From
Juan José Santamaría Flecha
Date:
On Wed, Sep 18, 2019 at 11:09 AM Juan José Santamaría Flecha <juanjo.santamaria@gmail.com> wrote:
>
> On Fri, Sep 13, 2019 at 10:31 PM Alvaro Herrera
> <alvherre@2ndquadrant.com> wrote:
> >
>
> Thanks for taking a look at this.
>
> > I'm confused why we acquire the MONTH_DIM / etc definitions. Can't we
> > just use lengthof() of the corresponding array? AFAICS it should work
> > just as well.
> >
>
> It was because of the length difference between ascii-name arrays,
> which were all null-ended, and localized-name arrays. The attached
> version uses lengthof().
>
> > I wonder if the "compare first char" thing (seq_search_localized) really
> > works when there are multibyte chars in the day/month names. I think
> > the code compares just the first char ... but what if the original
> > string uses those funny Unicode non-normalized letters and the locale
> > translation uses normalized letters? My guess is that the first-char
> > comparison will fail, but surely you'll want the name to match.
> > (There's no month/day name in Spanish that doesn't start with an ASCII
> > letter, but I bet there are some in other languages.) I think the
> > localized comparison should avoid the first-char optimization, just
> > compare the whole string all the time, and avoid possible weird issues.
>
> The localized search is reformulated in this version to deal with
> multibyte normalization. A regression test for this issue is included.

This version is updated to minimize the need for dynamic allocation.

> Regards,
>
> Juan José Santamaría Flecha
Attachment
This patch no longer applies. Can you please rebase?

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Allow to_date() and to_timestamp() to accept localized names
From
Juan José Santamaría Flecha
Date:
On Wed, Sep 25, 2019 at 9:57 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
>
> This patch no longer applies. Can you please rebase?

Thank you for the notification.

The patch had rotted after commit 1a950f3; a rebased version is attached.

Regards,

Juan José Santamaría Flecha
Attachment
On Thu, Sep 26, 2019 at 08:36:25PM +0200, Juan José Santamaría Flecha wrote:
>On Wed, Sep 25, 2019 at 9:57 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
>>
>> This patch no longer applies. Can you please rebase?
>
>Thank you for the notification.
>
>The patch rot after commit 1a950f3, a rebased version is attached.
>

Thanks. I did a quick review of this patch, and I think it's almost RFC. I only found a couple of minor issues:

- In func.sgml, it seems we've lost this bit:

     <para>
      <literal>TM</literal> does not include trailing blanks.
      <function>to_timestamp</function> and <function>to_date</function> ignore
      the <literal>TM</literal> modifier.
     </para>

  Does that mean the functions no longer ignore the TM modifier? That would be somewhat problematic (i.e. it might break some user code). But looking at the code I don't see why the patch would have this effect, so I suppose it's merely a doc bug.

- I don't think we need to include examples of how to_timestamp ignores case; I'd say just stating the fact is clear enough. But if we want to have examples, I think we should not inline them in the para but use the established pattern:

     <para>
      Some examples:
     <programlisting>
     ...
     </programlisting>
     </para>

  which is used elsewhere in the func.sgml file.

- In formatting.c the "common/unicode_norm.h" include should be right after the includes from "catalog/" to follow the alphabetic order (unless there's a reason why that does not work).

- I rather dislike the "dim" parameter name, because I immediately think "dimension", which makes little sense. I suggest renaming it to "nitems" or "nelements" or something like that.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Allow to_date() and to_timestamp() to accept localized names
From
Juan José Santamaría Flecha
Date:
On Sat, Jan 11, 2020 at 5:06 PM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
Thanks. I did a quick review of this patch, and I think it's almost RFC.
Thanks for reviewing.
- In func.sgml, it seems we've lost this bit:
<para>
<literal>TM</literal> does not include trailing blanks.
<function>to_timestamp</function> and <function>to_date</function> ignore
the <literal>TM</literal> modifier.
</para>
Does that mean the function no longer ignore the TM modifier? That
would be somewhat problematic (i.e. it might break some user code).
But looking at the code I don't see why the patch would have this
effect, so I suppose it's merely a doc bug.
It is intentional. This patch uses the TM modifier to identify the usage of localized names as input for to_timestamp() and to_date().
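Concretely, under the patched semantics this implies behavior along the following lines (illustrative only; the actual names matched depend on the session's lc_time, and an installed Spanish locale is assumed here):

```sql
SET lc_time = 'es_ES.utf8';

-- With the TM prefix, the input is matched against the localized names
SELECT to_date('3 ENERO 1999', 'DD TMMONTH YYYY');

-- Without TM, the English names are expected, as before
SELECT to_date('3 JANUARY 1999', 'DD MONTH YYYY');
```

So instead of being silently ignored on input, TM becomes the switch that selects the localized name set.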
- I don't think we need to include examples how to_timestmap ignores
case, I'd say just stating the fact is clear enough. But if we want to
have examples, I think we should not inline in the para but use the
established pattern:
<para>
Some examples:
<programlisting>
...
</programlisting>
</para>
which is used elsewhere in the func.sgml file.
I was trying to match the style surrounding the usage notes for date/time formatting [1]. Agreed that it is not worth an example on its own, so dropped.
- In formatting.c the "common/unicode_norm.h" should be right after
includes from "catalog/" to follow the alphabetic order (unless
there's a reason why that does not work).
- I rather dislike the "dim" parameter name, because I immediately think
"dimension" which makes little sense. I suggest renaming to "nitems"
or "nelements" or something like that.
Please find attached a version addressing the points mentioned above.
Regards,
Juan José Santamaría Flecha
Attachment
Hello!

On 2020/01/13 21:04, Juan José Santamaría Flecha wrote:
> Please, find attached a version addressing the above mentioned.

I have a couple of picky notes.

> +			if (name_len != norm_len)
> +				pfree(norm_name);

I'm not sure here. Is it safe to assume that if a new norm_name was allocated, name_len and norm_len will always differ?

> static const char *const months_full[] = {
> 	"January", "February", "March", "April", "May", "June", "July",
> -	"August", "September", "October", "November", "December", NULL
> +	"August", "September", "October", "November", "December"
> };

Unfortunately I didn't see from the patch why it was necessary to remove the last null element from months_full, days_short, adbc_strings, adbc_strings_long and the other arrays. I think it is not good because it unnecessarily increases the patch and adds code like the following:

> +				from_char_seq_search(&value, &s, months, localized_names,
> +									 ONE_UPPER, MAX_MON_LEN, n, have_error,
> +									 lengthof(months_full));

Here it passes "months" from datetime.c to from_char_seq_search() and calculates its length using the "months_full" array from formatting.c.

--
Arthur
Re: Allow to_date() and to_timestamp() to accept localized names
From
Juan José Santamaría Flecha
Date:
On Wed, Jan 15, 2020 at 11:20 AM Arthur Zakirov <zaartur@gmail.com> wrote:
I have a couple of picky notes.
Thanks for the review.
> + if (name_len != norm_len)
> + pfree(norm_name);
I'm not sure here. Is it safe to assume that if a new
norm_name was allocated, name_len and norm_len will always differ?
Good call, that check can be more robust.
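A more robust formulation is to key the release on pointer identity rather than on length, freeing only when the normalization step actually handed back a new allocation. Sketched standalone (plain C, with malloc/free standing in for palloc/pfree, and maybe_normalize() a hypothetical stand-in for the patch's normalization step):

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the normalization step: returns its input
 * unchanged when nothing needs to change, otherwise a freshly allocated
 * copy (which could, in principle, have the same length). */
static char *
maybe_normalize(char *name, int needs_work)
{
    if (!needs_work)
        return name;

    char *copy = malloc(strlen(name) + 1);
    strcpy(copy, name);
    return copy;
}

/* Release the result only if it is a new allocation: this stays correct
 * even when normalization returns a different string of equal length. */
static void
done_with(char *name, char *norm_name)
{
    if (norm_name != name)
        free(norm_name);
}
```

The pointer comparison makes no assumptions about how normalization affects the string length.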
> static const char *const months_full[] = {
> "January", "February", "March", "April", "May", "June", "July",
> - "August", "September", "October", "November", "December", NULL
> + "August", "September", "October", "November", "December"
> };
Unfortunately I didn't see from the patch why it was necessary to remove
the last null element from months_full, days_short, adbc_strings,
adbc_strings_long and the other arrays. I think it is not good because
it unnecessarily increases the patch and adds code like the following:
> + from_char_seq_search(&value, &s, months, localized_names,
> + ONE_UPPER, MAX_MON_LEN, n, have_error,
> + lengthof(months_full));
Here it passes "months" from datetime.c to from_char_seq_search() and
calculates its length using "months_full" array from formatting.c.
With the introduction of seq_search_localized() that can be avoided, minimizing code churn.
Please find attached a version addressing the points mentioned above.
Regards,
Juan José Santamaría Flecha
Attachment
Juan José Santamaría Flecha <juanjo.santamaria@gmail.com> writes:
> [ 0001-Allow-localized-month-names-to_date-v6.patch ]

I took a quick look through this.

One thing I completely don't understand is why it's sufficient for seq_search_localized to do "initcap" semantics. Shouldn't it cover the all-upper and all-lower cases as well, as the existing seq_search code does? (That is, why is it okay to ignore the "type" argument?)

On the other hand, you probably *should* be ignoring the "max" argument, because AFAICS the values that are passed for that are specific to the English ASCII spellings of the day and month names. It seems very likely that they could be too small for some sets of non-English names.

Related to that, the business with

+		mb_max = max * pg_encoding_max_length(encoding);
+		name_len = strlen(name);
+		name_len = name_len < mb_max ? name_len : mb_max;

seems wrong even if we assume that "max" is a sane limit on the number of characters to consider. This coding is likely to truncate a long "name" string in the middle of a multibyte character, leading to bad-encoding errors from later functions.

I also am confused by the "if (mb_max > max ...)" bit. It looks to me like that's an obscure way of writing "if (pg_encoding_max_length() > 1)", which is a rather pointless check considering that the if() then goes on to insist on encoding == PG_UTF8.

A bit further down, you have an "if (name_wlen > norm_wlen)" condition that seems pretty fishy. Is it really guaranteed that unicode_normalize_kc cannot transform the string without shortening it? I don't see that specified in its header comment, for sure.

Also, it's purely accidental if this:

		norm_name = (char *) palloc((norm_wlen + 1) * sizeof(pg_wchar));

allocates a long enough string for the conversion back to multibyte form, because that output is not pg_wchars. I think you want something more like "norm_wlen * MAX_MULTIBYTE_CHAR_LEN + 1".
(I wonder whether we need to be worrying about integer overflow in any of this.)

It seems like you could eliminate a lot of messiness by extending localized_abbrev_days[] and related arrays by one element that's always NULL. That would waste, hmm, four 8-byte pointers on typical machines --- but you're eating a lot more than 32 bytes of code to pass around the array lengths, plus it's really ugly that the plain and localized arrays are handled differently.

I don't think the way you're managing the "localized_names" variable in DCH_from_char is very sensible. The reader has to do a lot of reverse engineering just to discover that the value is constant NULL in most references, something that you've not helped by declaring it outside the loop rather than inside. I think you could drop the variable and write constant NULL in most places, with something like

	S_TM(n->suffix) ? localized_full_days : NULL

in those from_char_seq_search calls where it's relevant.

In general I'm displeased with the lack of attention to comments. Notably, you added arguments to from_char_seq_search without updating its header comment ... not that that comment is a great API spec, but at the least you shouldn't be making it worse. I think the bug I complained of above is directly traceable to the lack of a clear spec here for what "max" means, so failure to keep these comments up to speed does have demonstrable bad consequences.

regards, tom lane
I wrote:
> One thing I completely don't understand is why it's sufficient for
> seq_search_localized to do "initcap" semantics. Shouldn't it cover
> the all-upper and all-lower cases as well, as the existing seq_search
> code does? (That is, why is it okay to ignore the "type" argument?)

I took a closer look at this and decided you were probably doing that just because the seq_search code uses initcap-style case folding to match month and day names, relying on the assumption that the constant arrays it's passed are cased that way. But we shouldn't (and the patch doesn't) assume that the localized names we'll get from pg_locale.c are cased that way.

However ... it seems to me that the way seq_search is defined is plenty bogus. In the first place, it scribbles on its source string, which is surely not okay. You can see that in error messages that are thrown on match failures:

regression=# select to_date('3 JANUZRY 1999', 'DD MONTH YYYY');
ERROR:  invalid value "JanuzRY 1" for "MONTH"
DETAIL:  The given value did not match any of the allowed values for this field.

Now, maybe somebody out there thinks this is a cute way of highlighting how much of the string we examined, but it looks more like a bug from where I sit. It's mere luck that what's being clobbered is a local string copy created by do_to_timestamp(), and not the original input data passed to to_date().

In the second place, there's really no need at all for seq_search to rely on any assumptions about the case state of the array it's searching. pg_toupper() is pretty cheap already, and we can make it guaranteed cheap if we use pg_ascii_toupper() instead. So I think we ought to just remove the "type" argument altogether and have seq_search dynamically convert all the strings it examines to upper case (or lower case would work as well, at least for English).
> On the other hand, you probably *should* be ignoring the "max"
> argument, because AFAICS the values that are passed for that are
> specific to the English ASCII spellings of the day and month names.
> It seems very likely that they could be too small for some sets of
> non-English names.

Closer examination shows that the "max" argument is pretty bogus as well. It doesn't do anything except confuse the reader, because there are no cases where the value passed is less than the maximum array entry length, so it never acts to change seq_search's behavior. So we should just drop that behavior from seq_search, too, and redefine "max" as having no purpose except to specify how much of the string to show in error messages. There's still a question of what that should be for non-English cases, but at least we now have a clear idea of what we need the value to do.

I also noted while I was looking at it that from_char_seq_search() would truncate at "max" bytes even when dealing with multibyte input. That has a clear risk of generating an invalidly-encoded error message.

The 0001 patch attached cleans up all the above complaints. I felt that given the business about scribbling on strings we shouldn't, it would also be wise to const-ify the string pointers if possible. That seems not too painful, per 0002 attached.

I'm tempted to go a bit further than 0001 does, and remove the 'max' argument from from_char_seq_search() altogether. Since we only need 'max' in error cases, which don't need to be super-quick, we could drop the requirement for the caller to specify that and instead compute it when we do need it as the largest of the constant array's string lengths. That'd carry over into the localized-names case in a reasonably straightforward way (though we might need to count characters not bytes for that to work nicely).
Because of the risk of incorrectly-encoded error messages, I'm rather tempted to claim that these patches (or at least 0001) are a bug fix and should be back-patched. In any case I think we should apply this to HEAD and then rebase the localized-names patch over it. It makes a whole lot more sense, IMO, for the localized-names comparison logic to match what this is doing than what seq_search does today.

Comments?

regards, tom lane

diff --git a/src/backend/utils/adt/formatting.c b/src/backend/utils/adt/formatting.c
index ca3c48d..3ef10dc 100644
--- a/src/backend/utils/adt/formatting.c
+++ b/src/backend/utils/adt/formatting.c
@@ -317,10 +317,6 @@ static const char *const numth[] = {"st", "nd", "rd", "th", NULL};
  * Flags & Options:
  * ----------
  */
-#define ONE_UPPER		1		/* Name */
-#define ALL_UPPER		2		/* NAME */
-#define ALL_LOWER		3		/* name */
-
 #define MAX_MONTH_LEN	9
 #define MAX_MON_LEN		3
 #define MAX_DAY_LEN		9
@@ -1068,9 +1064,9 @@ static int	from_char_parse_int_len(int *dest, char **src, const int len,
 									FormatNode *node, bool *have_error);
 static int	from_char_parse_int(int *dest, char **src, FormatNode *node,
 								bool *have_error);
-static int	seq_search(char *name, const char *const *array, int type, int max, int *len);
+static int	seq_search(const char *name, const char *const *array, int *len);
 static int	from_char_seq_search(int *dest, char **src,
-								 const char *const *array, int type, int max,
+								 const char *const *array, int max,
 								 FormatNode *node, bool *have_error);
 static void do_to_timestamp(text *date_txt, text *fmt, bool std,
 							struct pg_tm *tm, fsec_t *fsec, int *fprec,
@@ -2454,70 +2450,57 @@ from_char_parse_int(int *dest, char **src, FormatNode *node, bool *have_error)
 }
 
 /* ----------
- * Sequential search with to upper/lower conversion
+ * Sequentially search null-terminated "array" for a case-insensitive match
+ * to the initial character(s) of "name".
+ *
+ * Returns array index of match, or -1 for no match.
+ *
+ * *len is set to the length of the match, or 0 for no match.
+ *
+ * Case-insensitivity is defined per pg_ascii_toupper, so this is only
+ * suitable for comparisons to ASCII strings.
  * ----------
  */
 static int
-seq_search(char *name, const char *const *array, int type, int max, int *len)
+seq_search(const char *name, const char *const *array, int *len)
 {
-	const char *p;
+	unsigned char firstc;
 	const char *const *a;
-	char	   *n;
-	int			last,
-				i;
 
 	*len = 0;
 
+	/* empty string can't match anything */
 	if (!*name)
 		return -1;
 
-	/* set first char */
-	if (type == ONE_UPPER || type == ALL_UPPER)
-		*name = pg_toupper((unsigned char) *name);
-	else if (type == ALL_LOWER)
-		*name = pg_tolower((unsigned char) *name);
+	/* we handle first char specially to gain some speed */
+	firstc = pg_ascii_toupper((unsigned char) *name);
 
-	for (last = 0, a = array; *a != NULL; a++)
+	for (a = array; *a != NULL; a++)
 	{
+		int			i;
+		const char *p;
+		const char *n;
+
 		/* compare first chars */
-		if (*name != **a)
+		if (pg_ascii_toupper((unsigned char) **a) != firstc)
 			continue;
 
 		for (i = 1, p = *a + 1, n = name + 1;; n++, p++, i++)
 		{
-			/* search fragment (max) only */
-			if (max && i == max)
-			{
-				*len = i;
-				return a - array;
-			}
-			/* full size */
+			/* return success if we matched whole array entry */
 			if (*p == '\0')
 			{
 				*len = i;
 				return a - array;
 			}
-			/* Not found in array 'a' */
+			/* else, must have another character in "name" ... */
 			if (*n == '\0')
 				break;
 
-			/*
-			 * Convert (but convert new chars only)
-			 */
-			if (i > last)
-			{
-				if (type == ONE_UPPER || type == ALL_LOWER)
-					*n = pg_tolower((unsigned char) *n);
-				else if (type == ALL_UPPER)
-					*n = pg_toupper((unsigned char) *n);
-				last = i;
-			}
-
-#ifdef DEBUG_TO_FROM_CHAR
-			elog(DEBUG_elog_output, "N: %c, P: %c, A: %s (%s)",
-				 *n, *p, *a, name);
-#endif
-			if (*n != *p)
+			/* ... and it must match */
+			if (pg_ascii_toupper((unsigned char) *p) !=
+				pg_ascii_toupper((unsigned char) *n))
 				break;
 		}
 	}
@@ -2526,8 +2509,8 @@ seq_search(char *name, const char *const *array, int type, int max, int *len)
 }
 
 /*
- * Perform a sequential search in 'array' for text matching the first 'max'
- * characters of the source string.
+ * Perform a sequential search in 'array' for an entry matching the first
+ * character(s) of the 'src' string case-insensitively.
  *
  * If a match is found, copy the array index of the match into the integer
  * pointed to by 'dest', advance 'src' to the end of the part of the string
@@ -2535,19 +2518,28 @@
  *
  * If the string doesn't match, throw an error if 'have_error' is NULL,
 * otherwise set '*have_error' and return -1.
+ *
+ * 'max' and 'node' are used only for error reports: 'max' is the max
+ * number of bytes of the string to include in the error report, and
+ * node->key->name identifies what we were searching for.
 */
 static int
-from_char_seq_search(int *dest, char **src, const char *const *array, int type,
+from_char_seq_search(int *dest, char **src, const char *const *array,
 					 int max, FormatNode *node, bool *have_error)
 {
 	int			len;
 
-	*dest = seq_search(*src, array, type, max, &len);
+	/* Assert that no call gives us a 'max' too large for the error buffer */
+	Assert(max <= DCH_MAX_ITEM_SIZ);
+
+	*dest = seq_search(*src, array, &len);
+
 	if (len <= 0)
 	{
 		char		copy[DCH_MAX_ITEM_SIZ + 1];
 
-		Assert(max <= DCH_MAX_ITEM_SIZ);
+		/* Use multibyte-aware truncation to avoid generating a bogus string */
+		max = pg_mbcliplen(*src, strlen(*src), max);
+
 		strlcpy(copy, *src, max + 1);
 
 		RETURN_ERROR(ereport(ERROR,
@@ -3279,7 +3271,7 @@ DCH_from_char(FormatNode *node, char *in, TmFromChar *out, bool std,
 			case DCH_a_m:
 			case DCH_p_m:
 				from_char_seq_search(&value, &s, ampm_strings_long,
-									 ALL_UPPER, n->key->len, n, have_error);
+									 n->key->len, n, have_error);
 				CHECK_ERROR;
 				from_char_set_int(&out->pm, value % 2, n, have_error);
 				CHECK_ERROR;
@@ -3290,7 +3282,7 @@ DCH_from_char(FormatNode *node, char *in, TmFromChar *out, bool std,
 			case DCH_am:
 			case DCH_pm:
 				from_char_seq_search(&value, &s, ampm_strings,
-									 ALL_UPPER, n->key->len, n, have_error);
+									 n->key->len, n, have_error);
 				CHECK_ERROR;
 				from_char_set_int(&out->pm, value % 2, n, have_error);
 				CHECK_ERROR;
@@ -3403,7 +3395,7 @@ DCH_from_char(FormatNode *node, char *in, TmFromChar *out, bool std,
 			case DCH_a_d:
 			case DCH_b_c:
 				from_char_seq_search(&value, &s, adbc_strings_long,
-									 ALL_UPPER, n->key->len, n, have_error);
+									 n->key->len, n, have_error);
 				CHECK_ERROR;
 				from_char_set_int(&out->bc, value % 2, n, have_error);
 				CHECK_ERROR;
@@ -3413,7 +3405,7 @@ DCH_from_char(FormatNode *node, char *in, TmFromChar *out, bool std,
 			case DCH_ad:
 			case DCH_bc:
 				from_char_seq_search(&value, &s, adbc_strings,
-									 ALL_UPPER, n->key->len, n, have_error);
+									 n->key->len, n, have_error);
 				CHECK_ERROR;
 				from_char_set_int(&out->bc, value % 2, n, have_error);
 				CHECK_ERROR;
@@ -3421,7 +3413,7 @@ DCH_from_char(FormatNode *node, char *in, TmFromChar *out, bool std,
 			case DCH_MONTH:
 			case DCH_Month:
 			case DCH_month:
-				from_char_seq_search(&value, &s, months_full, ONE_UPPER,
+				from_char_seq_search(&value, &s, months_full,
 									 MAX_MONTH_LEN, n, have_error);
 				CHECK_ERROR;
 				from_char_set_int(&out->mm, value + 1, n, have_error);
@@ -3430,7 +3422,7 @@ DCH_from_char(FormatNode *node, char *in, TmFromChar *out, bool std,
 			case DCH_MON:
 			case DCH_Mon:
 			case DCH_mon:
-				from_char_seq_search(&value, &s, months, ONE_UPPER,
+				from_char_seq_search(&value, &s, months,
 									 MAX_MON_LEN, n, have_error);
 				CHECK_ERROR;
 				from_char_set_int(&out->mm, value + 1, n, have_error);
@@ -3444,7 +3436,7 @@ DCH_from_char(FormatNode *node, char *in, TmFromChar *out, bool std,
 			case DCH_DAY:
 			case DCH_Day:
 			case DCH_day:
-				from_char_seq_search(&value, &s, days, ONE_UPPER,
+				from_char_seq_search(&value, &s, days,
 									 MAX_DAY_LEN, n, have_error);
 				CHECK_ERROR;
 				from_char_set_int(&out->d, value, n, have_error);
@@ -3454,7 +3446,7 @@ DCH_from_char(FormatNode *node, char *in, TmFromChar *out, bool std,
 			case DCH_DY:
 			case DCH_Dy:
 			case DCH_dy:
-				from_char_seq_search(&value, &s, days, ONE_UPPER,
+				from_char_seq_search(&value, &s, days,
 									 MAX_DY_LEN, n, have_error);
 				CHECK_ERROR;
 				from_char_set_int(&out->d, value, n, have_error);
@@ -3572,7 +3564,7 @@ DCH_from_char(FormatNode *node, char *in, TmFromChar *out, bool std,
 				break;
 			case DCH_RM:
 				from_char_seq_search(&value, &s, rm_months_upper,
-									 ALL_UPPER, MAX_RM_LEN, n, have_error);
+									 MAX_RM_LEN, n, have_error);
 				CHECK_ERROR;
 				from_char_set_int(&out->mm, MONTHS_PER_YEAR - value,
 								  n, have_error);
@@ -3580,7 +3572,7 @@ DCH_from_char(FormatNode *node, char *in, TmFromChar *out, bool std,
 				break;
 			case DCH_rm:
 				from_char_seq_search(&value, &s, rm_months_lower,
-									 ALL_LOWER, MAX_RM_LEN, n, have_error);
+									 MAX_RM_LEN, n, have_error);
 				CHECK_ERROR;
 				from_char_set_int(&out->mm, MONTHS_PER_YEAR - value,
 								  n, have_error);
diff --git a/src/backend/utils/adt/formatting.c b/src/backend/utils/adt/formatting.c
index 3ef10dc..ef21b7e 100644
--- a/src/backend/utils/adt/formatting.c
+++ b/src/backend/utils/adt/formatting.c
@@ -1044,7 +1044,7 @@ static void parse_format(FormatNode *node, const char *str, const KeyWord *kw,
 static void DCH_to_char(FormatNode *node, bool is_interval,
 						TmToChar *in, char *out, Oid collid);
-static void DCH_from_char(FormatNode *node, char *in, TmFromChar *out,
+static void DCH_from_char(FormatNode *node, const char *in, TmFromChar *out,
 						  bool std, bool *have_error);
 
 #ifdef DEBUG_TO_FROM_CHAR
@@ -1055,17 +1055,17 @@ static void dump_node(FormatNode *node, int max);
 static const char *get_th(char *num, int type);
 static char *str_numth(char *dest, char *num, int type);
 static int	adjust_partial_year_to_2020(int year);
-static int	strspace_len(char *str);
+static int	strspace_len(const char *str);
 static void from_char_set_mode(TmFromChar *tmfc, const FromCharDateMode mode,
 							   bool *have_error);
 static void from_char_set_int(int *dest, const int value, const FormatNode *node,
 							  bool *have_error);
-static int	from_char_parse_int_len(int *dest, char **src, const int len,
+static int	from_char_parse_int_len(int *dest, const char **src, const int len,
 									FormatNode *node, bool *have_error);
-static int	from_char_parse_int(int *dest, char **src, FormatNode *node,
+static int	from_char_parse_int(int *dest, const char **src, FormatNode *node,
 								bool *have_error);
 static int	seq_search(const char *name, const char *const *array, int *len);
-static int	from_char_seq_search(int *dest, char **src,
+static int	from_char_seq_search(int *dest, const char **src,
 								 const char *const *array, int max,
 								 FormatNode *node, bool *have_error);
 static void do_to_timestamp(text *date_txt, text *fmt, bool std,
@@ -2255,7 +2255,7 @@ adjust_partial_year_to_2020(int year)
 
 static int
-strspace_len(char *str)
+strspace_len(const char *str)
 {
 	int			len = 0;
 
@@ -2344,12 +2344,12 @@ on_error:
 * and -1 is returned.
 */
 static int
-from_char_parse_int_len(int *dest, char **src, const int len, FormatNode *node,
+from_char_parse_int_len(int *dest, const char **src, const int len, FormatNode *node,
 						bool *have_error)
 {
 	long		result;
 	char		copy[DCH_MAX_ITEM_SIZ + 1];
-	char	   *init = *src;
+	const char *init = *src;
 	int			used;
 
 	/*
@@ -2366,8 +2366,11 @@ from_char_parse_int_len(int *dest, const char **src, const int len, FormatNode *
 		 * This node is in Fill Mode, or the next node is known to be a
 		 * non-digit value, so we just slurp as many characters as we can get.
 		 */
+		char	   *endptr;
+
 		errno = 0;
-		result = strtol(init, src, 10);
+		result = strtol(init, &endptr, 10);
+		*src = endptr;
 	}
 	else
 	{
@@ -2444,7 +2447,7 @@ on_error:
 * required length explicitly.
 */
 static int
-from_char_parse_int(int *dest, char **src, FormatNode *node, bool *have_error)
+from_char_parse_int(int *dest, const char **src, FormatNode *node, bool *have_error)
 {
 	return from_char_parse_int_len(dest, src, node->key->len, node, have_error);
 }
@@ -2524,7 +2527,7 @@ seq_search(const char *name, const char *const *array, int *len)
 * node->key->name identifies what we were searching for.
 */
 static int
-from_char_seq_search(int *dest, char **src, const char *const *array,
+from_char_seq_search(int *dest, const char **src, const char *const *array,
 					 int max, FormatNode *node, bool *have_error)
 {
 	int			len;
@@ -3158,11 +3161,11 @@ DCH_to_char(FormatNode *node, bool is_interval, TmToChar *in, char *out, Oid col
 * ----------
 */
 static void
-DCH_from_char(FormatNode *node, char *in, TmFromChar *out, bool std,
+DCH_from_char(FormatNode *node, const char *in, TmFromChar *out, bool std,
 			  bool *have_error)
 {
 	FormatNode *n;
-	char	   *s;
+	const char *s;
 	int			len,
 				value;
 	bool		fx_mode = std;
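For readers who want to exercise the new matching rule outside the backend, here is a standalone mirror of the patched seq_search() (plain toupper() standing in for pg_ascii_toupper(); the month array is illustrative). Because the input is now const, the clobbered-looking error text ("JanuzRY 1") from the earlier example can no longer occur:

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Mirror of the patched seq_search(): case-insensitive match of the
 * initial character(s) of "name" against a NULL-terminated "array".
 * Returns the array index of the match (with *len set to the matched
 * length), or -1 with *len = 0 for no match. */
static int
seq_search(const char *name, const char *const *array, int *len)
{
    unsigned char firstc;
    const char *const *a;

    *len = 0;

    /* empty string can't match anything */
    if (!*name)
        return -1;

    /* first char is handled specially to gain some speed */
    firstc = (unsigned char) toupper((unsigned char) *name);

    for (a = array; *a != NULL; a++)
    {
        const char *p;
        const char *n;
        int i;

        if ((unsigned char) toupper((unsigned char) **a) != firstc)
            continue;

        for (i = 1, p = *a + 1, n = name + 1;; n++, p++, i++)
        {
            if (*p == '\0')     /* matched the whole array entry */
            {
                *len = i;
                return (int) (a - array);
            }
            if (*n == '\0')     /* name ran out first */
                break;
            if (toupper((unsigned char) *p) != toupper((unsigned char) *n))
                break;
        }
    }
    return -1;
}
```

Note that a match requires consuming a whole array entry, and the input string is never modified, only read.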
On 2020/01/23 7:11, Tom Lane wrote:
> Closer examination shows that the "max" argument is pretty bogus as
> well.  It doesn't do anything except confuse the reader, because there
> are no cases where the value passed is less than the maximum array entry
> length, so it never acts to change seq_search's behavior.  So we should
> just drop that behavior from seq_search, too, and redefine "max" as
> having no purpose except to specify how much of the string to show in
> error messages.  There's still a question of what that should be for
> non-English cases, but at least we now have a clear idea of what we
> need the value to do.

Shouldn't we just show all remaining string instead of truncating it?
For example I get the following output:

=# select to_timestamp('3 ЯНВАРЯ 1999', 'DD MONTH YYYY');
ERROR:  invalid value "ЯНВА" for "MONTH"
DETAIL:  The given value did not match any of the allowed values for this field.

This behavior is reproduced without the patch though (on Postgres 12).
And the English case might confuse too I think:

=# select to_date('3 JANUZRY 1999', 'DD MONTH YYYY');
ERROR:  invalid value "JANUZRY 1" for "MONTH"
DETAIL:  The given value did not match any of the allowed values for this field.

Users might think what means "1" in "JANUZRY 1" string.

I attached the draft patch, which shows all remaining string. So the
query above will show the following:

=# select to_timestamp('3 ЯНВАРЯ 1999', 'DD MONTH YYYY');
ERROR:  invalid value "ЯНВАРЯ 1999" for "MONTH"
DETAIL:  The remaining value did not match any of the allowed values for this field.

> The 0001 patch attached cleans up all the above complaints.  I felt
> that given the business about scribbling on strings we shouldn't,
> it would also be wise to const-ify the string pointers if possible.
> That seems not too painful, per 0002 attached.

+1 on the patch.

-- 
Arthur
Arthur Zakirov <zaartur@gmail.com> writes: > On 2020/01/23 7:11, Tom Lane wrote: >> Closer examination shows that the "max" argument is pretty bogus as >> well. It doesn't do anything except confuse the reader, because there >> are no cases where the value passed is less than the maximum array entry >> length, so it never acts to change seq_search's behavior. So we should >> just drop that behavior from seq_search, too, and redefine "max" as >> having no purpose except to specify how much of the string to show in >> error messages. There's still a question of what that should be for >> non-English cases, but at least we now have a clear idea of what we >> need the value to do. > Shouldn't we just show all remaining string instead of truncating it? That would avoid a bunch of arbitrary decisions, for sure. Anybody have an objection? regards, tom lane
On 2020-Jan-22, Tom Lane wrote: > Arthur Zakirov <zaartur@gmail.com> writes: > > On 2020/01/23 7:11, Tom Lane wrote: > >> Closer examination shows that the "max" argument is pretty bogus as > >> well. It doesn't do anything except confuse the reader, because there > >> are no cases where the value passed is less than the maximum array entry > >> length, so it never acts to change seq_search's behavior. So we should > >> just drop that behavior from seq_search, too, and redefine "max" as > >> having no purpose except to specify how much of the string to show in > >> error messages. There's still a question of what that should be for > >> non-English cases, but at least we now have a clear idea of what we > >> need the value to do. > > > Shouldn't we just show all remaining string instead of truncating it? > > That would avoid a bunch of arbitrary decisions, for sure. > Anybody have an objection? I think it would be clearer to search for whitespace starting at the start of the bogus token and stop there. It might not be perfect, particularly if any language has whitespace in a month etc name (I don't know any example but I guess it's not impossible for it to exist; Portuguese weekday names maybe?), but it seems better than either of the behaviors shown above. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> On 2020-Jan-22, Tom Lane wrote:
>> Arthur Zakirov <zaartur@gmail.com> writes:
>>> Shouldn't we just show all remaining string instead of truncating it?

>> That would avoid a bunch of arbitrary decisions, for sure.
>> Anybody have an objection?

> I think it would be clearer to search for whitespace starting at the
> start of the bogus token and stop there.  It might not be perfect,
> particularly if any language has whitespace in a month etc name (I don't
> know any example but I guess it's not impossible for it to exist;
> Portuguese weekday names maybe?), but it seems better than either of the
> behaviors shown above.

I'm okay with that.  It won't work so well for cases like
"1-Januzry-1999", but it's still better than what happens now:

regression=# select to_date('1-Januzry-1999', 'DD MONTH YYYY');
ERROR:  invalid value "Januzry-1" for "MONTH"

That particular case could be improved by stopping at a dash ... but
since this code is also used to match strings like "A.M.", we can't
just exclude punctuation in general.  Breaking at whitespace seems
like a reasonable compromise.

			regards, tom lane
On 2020-Jan-23, Tom Lane wrote:

> That particular case could be improved by stopping at a dash ... but
> since this code is also used to match strings like "A.M.", we can't
> just exclude punctuation in general.  Breaking at whitespace seems
> like a reasonable compromise.

Yeah, and there are cases where dashes are used in names -- browsing
through glibc for example I quickly found Akan, for which the month
names are:

mon         "Sanda-<U0186>p<U025B>p<U0254>n";/
            "Kwakwar-<U0186>gyefuo";/
            "Eb<U0254>w-<U0186>benem";/

and so on.  Even whitespace is problematic for some languages, such as
Afan,

mon         "Qunxa Garablu";/
            "Naharsi Kudo";/
            "Ciggilta Kudo";/

(etc) but I think whitespace-splitting is going to be more
comprehensible in the vast majority of cases, even if not perfect.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Alvaro Herrera <alvherre@2ndquadrant.com> writes: > Even whitespace is problematic for some languages, such as Afan, > mon "Qunxa Garablu";/ > "Naharsi Kudo";/ > "Ciggilta Kudo";/ > (etc) but I think whitespace-splitting is going to be more comprehensible > in the vast majority of cases, even if not perfect. Interesting. Given that to_date et al are often willing to ignore whitespace in input, I wonder whether we won't have some other issues with names like that --- not so much that basic cases wouldn't work, as that people might reasonably expect that, say, we'd be flexible about the amount of whitespace. Seems like a problem for another day though. In the meantime, I agree that "truncate at whitespace" is a better heuristic for the error messages than what we've got. I'll go make it so. regards, tom lane
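The "break at whitespace" heuristic agreed on above can be sketched as a small standalone helper (hypothetical code for illustration, not the committed implementation):

```c
#include <assert.h>
#include <string.h>

/*
 * Sketch of the "truncate at whitespace" heuristic: when a match
 * fails, report only the input bytes up to the next whitespace
 * character, so that "JANUZRY 1999" is excerpted as "JANUZRY"
 * rather than an arbitrary byte count.
 */
static size_t
error_excerpt_len(const char *src)
{
	size_t		len = 0;

	while (src[len] != '\0' && strchr(" \t\n\r\f\v", src[len]) == NULL)
		len++;
	return len;
}
```

As noted upthread, only whitespace ends the excerpt, so an input like "1-Januzry-1999" would still show "Januzry-1999" in the error message; dashes and other punctuation cannot be treated as break points because the same matcher handles strings like "A.M.".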
Here's a v7 patch, rebased over my recent hacking, and addressing
most of the complaints I raised in <31691.1579648824@sss.pgh.pa.us>.

However, I've got some new complaints now that I've looked harder
at the code:

* It's not exactly apparent to me why this code should be concerned
about non-normalized characters when noplace else in the backend is.
I think we should either rip that out entirely, or move the logic
into str_tolower (and hence also str_toupper etc).  I'd tend to favor
the former, but of course I don't speak any languages where this would
be an issue.  Perhaps a better idea is to invent a new SQL function
that normalizes a given string, and expect users to call that first
if they'd like to_date() to match unnormalized text.

* I have no faith in this calculation that decides how long the match
length was:

	*len = element_len + name_len - norm_len;

I seriously doubt that this does the right thing if either the
normalization or the downcasing changed the byte length of the
string.  I'm not actually sure how we can do that correctly.
There's no good way to know whether the changes happened within
the part of the "name" string that we matched, or the part beyond
what we matched, but we only want to count the former.  So this
seems like a pretty hard problem, and even if this logic is somehow
correct as it stands, it needs a comment explaining why.

* I'm still concerned about the risk of integer overflow in the
string allocations in seq_search_localized().  Those need to be
adjusted to be more paranoid, similar to the code in e.g. str_tolower.
			regards, tom lane

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 6c4359d..ff44496 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -5934,7 +5934,7 @@ SELECT regexp_match('abc01234xyz', '(?:(.*?)(\d+)(.*)){1,1}');
       </row>
       <row>
        <entry><literal>TM</literal> prefix</entry>
-       <entry>translation mode (print localized day and month names based on
+       <entry>translation mode (use localized day and month names based on
        <xref linkend="guc-lc-time"/>)</entry>
        <entry><literal>TMMonth</literal></entry>
       </row>
@@ -5966,8 +5966,13 @@ SELECT regexp_match('abc01234xyz', '(?:(.*?)(\d+)(.*)){1,1}');
     <listitem>
      <para>
       <literal>TM</literal> does not include trailing blanks.
+     </para>
+    </listitem>
+
+    <listitem>
+     <para>
       <function>to_timestamp</function> and <function>to_date</function> ignore
-      the <literal>TM</literal> modifier.
+      the case when receiving names as an input.
      </para>
     </listitem>
diff --git a/src/backend/utils/adt/formatting.c b/src/backend/utils/adt/formatting.c
index 792f9ca..2f39050 100644
--- a/src/backend/utils/adt/formatting.c
+++ b/src/backend/utils/adt/formatting.c
@@ -87,6 +87,7 @@
 #include "catalog/pg_collation.h"
 #include "catalog/pg_type.h"
+#include "common/unicode_norm.h"
 #include "mb/pg_wchar.h"
 #include "parser/scansup.h"
 #include "utils/builtins.h"
@@ -1059,9 +1060,11 @@ static int	from_char_parse_int_len(int *dest, const char **src, const int len,
 									FormatNode *node, bool *have_error);
 static int	from_char_parse_int(int *dest, const char **src, FormatNode *node,
 								bool *have_error);
-static int	seq_search(const char *name, const char *const *array, int *len);
+static int	seq_search_ascii(const char *name, const char *const *array, int *len);
+static int	seq_search_localized(const char *name, char **array, int *len);
 static int	from_char_seq_search(int *dest, const char **src,
 								 const char *const *array,
+								 char **localized_array,
 								 FormatNode *node, bool *have_error);
 static void do_to_timestamp(text *date_txt, text *fmt, bool std,
 							struct pg_tm *tm, fsec_t *fsec, int *fprec,
@@ -2459,7 +2462,7 @@ from_char_parse_int(int *dest, const char **src, FormatNode *node, bool *have_er
  * suitable for comparisons to ASCII strings.
  */
 static int
-seq_search(const char *name, const char *const *array, int *len)
+seq_search_ascii(const char *name, const char *const *array, int *len)
 {
 	unsigned char firstc;
 	const char *const *a;
@@ -2505,8 +2508,92 @@ seq_search_ascii(const char *name, const char *const *array, int *len)
 }
 
 /*
- * Perform a sequential search in 'array' for an entry matching the first
- * character(s) of the 'src' string case-insensitively.
+ * Sequentially search an array of possibly non-English words for
+ * a case-insensitive match to the initial character(s) of "name".
+ *
+ * This has the same API as seq_search_ascii(), but we use a more general
+ * downcasing transformation to achieve case-insensitivity.
+ *
+ * The array is treated as const, but we don't declare it that way because
+ * the arrays exported by pg_locale.c aren't const.
+ */
+static int
+seq_search_localized(const char *name, char **array, int *len)
+{
+	char	  **a;
+	char	   *lower_name;
+	char	   *norm_name;
+	int			name_len;
+	int			norm_len;
+
+	*len = 0;
+
+	/* empty string can't match anything */
+	if (!*name)
+		return -1;
+
+	name_len = strlen(name);
+
+	/* Normalize name, if working in Unicode */
+	if (GetDatabaseEncoding() == PG_UTF8)
+	{
+		pg_wchar   *wchar_name;
+		pg_wchar   *norm_wname;
+		size_t		name_wlen;
+		size_t		norm_wlen;
+
+		wchar_name = (pg_wchar *) palloc((name_len + 1) * sizeof(pg_wchar));
+		name_wlen = pg_mb2wchar_with_len(name, wchar_name, name_len);
+		norm_wname = unicode_normalize_kc(wchar_name);
+		pfree(wchar_name);
+		norm_wlen = pg_wchar_strlen(norm_wname);
+		norm_name = (char *) palloc(norm_wlen * MAX_MULTIBYTE_CHAR_LEN + 1);
+		norm_len = pg_wchar2mb_with_len(norm_wname, norm_name, norm_wlen);
+		pfree(norm_wname);
+	}
+	else
+	{
+		norm_name = unconstify(char *, name);
+		norm_len = name_len;
+	}
+
+	/* Now lower-case it */
+	lower_name = str_tolower(norm_name, norm_len, DEFAULT_COLLATION_OID);
+	if ((const char *) norm_name != name)
+		pfree(norm_name);
+
+	for (a = array; *a != NULL; a++)
+	{
+		char	   *lower_element;
+		int			element_len;
+
+		/* Lower-case array element, assuming it is normalized */
+		lower_element = str_tolower(*a, strlen(*a), DEFAULT_COLLATION_OID);
+		element_len = strlen(lower_element);
+
+		/* Match? */
+		if (strncmp(lower_name, lower_element, element_len) == 0)
+		{
+			*len = element_len + name_len - norm_len;
+			pfree(lower_element);
+			pfree(lower_name);
+			return a - array;
+		}
+		pfree(lower_element);
+	}
+
+	pfree(lower_name);
+	return -1;
+}
+
+/*
+ * Perform a sequential search in 'array' (or 'localized_array', if that's
+ * not NULL) for an entry matching the first character(s) of the 'src'
+ * string case-insensitively.
+ *
+ * The 'array' is presumed to be English words (all-ASCII), but
+ * if 'localized_array' is supplied, that might be non-English
+ * so we need a more expensive downcasing transformation.
  *
  * If a match is found, copy the array index of the match into the integer
  * pointed to by 'dest', advance 'src' to the end of the part of the string
@@ -2520,11 +2607,15 @@ seq_search(const char *name, const char *const *array, int *len)
  */
 static int
 from_char_seq_search(int *dest, const char **src, const char *const *array,
+					 char **localized_array,
 					 FormatNode *node, bool *have_error)
 {
 	int			len;
 
-	*dest = seq_search(*src, array, &len);
+	if (localized_array == NULL)
+		*dest = seq_search_ascii(*src, array, &len);
+	else
+		*dest = seq_search_localized(*src, localized_array, &len);
 
 	if (len <= 0)
 	{
@@ -3172,6 +3263,9 @@ DCH_from_char(FormatNode *node, const char *in, TmFromChar *out, bool std,
 	/* number of extra skipped characters (more than given in format string) */
 	int			extra_skip = 0;
 
+	/* cache localized days and months */
+	cache_locale_time();
+
 	for (n = node, s = in; n->type != NODE_TYPE_END && *s != '\0'; n++)
 	{
 		/*
@@ -3272,7 +3366,7 @@ DCH_from_char(FormatNode *node, const char *in, TmFromChar *out, bool std,
 			case DCH_P_M:
 			case DCH_a_m:
 			case DCH_p_m:
-				from_char_seq_search(&value, &s, ampm_strings_long,
+				from_char_seq_search(&value, &s, ampm_strings_long, NULL,
 									 n, have_error);
 				CHECK_ERROR;
 				from_char_set_int(&out->pm, value % 2, n, have_error);
@@ -3283,7 +3377,7 @@ DCH_from_char(FormatNode *node, const char *in, TmFromChar *out, bool std,
 			case DCH_PM:
 			case DCH_am:
 			case DCH_pm:
-				from_char_seq_search(&value, &s, ampm_strings,
+				from_char_seq_search(&value, &s, ampm_strings, NULL,
 									 n, have_error);
 				CHECK_ERROR;
 				from_char_set_int(&out->pm, value % 2, n, have_error);
@@ -3396,7 +3490,7 @@ DCH_from_char(FormatNode *node, const char *in, TmFromChar *out, bool std,
 			case DCH_B_C:
 			case DCH_a_d:
 			case DCH_b_c:
-				from_char_seq_search(&value, &s, adbc_strings_long,
+				from_char_seq_search(&value, &s, adbc_strings_long, NULL,
 									 n, have_error);
 				CHECK_ERROR;
 				from_char_set_int(&out->bc, value % 2, n, have_error);
@@ -3406,7 +3500,7 @@ DCH_from_char(FormatNode *node, const char *in, TmFromChar *out, bool std,
 			case DCH_BC:
 			case DCH_ad:
 			case DCH_bc:
-				from_char_seq_search(&value, &s, adbc_strings,
+				from_char_seq_search(&value, &s, adbc_strings, NULL,
 									 n, have_error);
 				CHECK_ERROR;
 				from_char_set_int(&out->bc, value % 2, n, have_error);
@@ -3416,6 +3510,7 @@ DCH_from_char(FormatNode *node, const char *in, TmFromChar *out, bool std,
 			case DCH_Month:
 			case DCH_month:
 				from_char_seq_search(&value, &s, months_full,
+									 S_TM(n->suffix) ? localized_full_months : NULL,
 									 n, have_error);
 				CHECK_ERROR;
 				from_char_set_int(&out->mm, value + 1, n, have_error);
@@ -3425,6 +3520,7 @@ DCH_from_char(FormatNode *node, const char *in, TmFromChar *out, bool std,
 			case DCH_Mon:
 			case DCH_mon:
 				from_char_seq_search(&value, &s, months,
+									 S_TM(n->suffix) ? localized_abbrev_months : NULL,
 									 n, have_error);
 				CHECK_ERROR;
 				from_char_set_int(&out->mm, value + 1, n, have_error);
@@ -3439,6 +3535,7 @@ DCH_from_char(FormatNode *node, const char *in, TmFromChar *out, bool std,
 			case DCH_Day:
 			case DCH_day:
 				from_char_seq_search(&value, &s, days,
+									 S_TM(n->suffix) ? localized_full_days : NULL,
 									 n, have_error);
 				CHECK_ERROR;
 				from_char_set_int(&out->d, value, n, have_error);
@@ -3449,6 +3546,7 @@ DCH_from_char(FormatNode *node, const char *in, TmFromChar *out, bool std,
 			case DCH_Dy:
 			case DCH_dy:
 				from_char_seq_search(&value, &s, days_short,
+									 S_TM(n->suffix) ? localized_abbrev_days : NULL,
 									 n, have_error);
 				CHECK_ERROR;
 				from_char_set_int(&out->d, value, n, have_error);
@@ -3566,7 +3664,7 @@ DCH_from_char(FormatNode *node, const char *in, TmFromChar *out, bool std,
 				break;
 			case DCH_RM:
 			case DCH_rm:
-				from_char_seq_search(&value, &s, rm_months_lower,
+				from_char_seq_search(&value, &s, rm_months_lower, NULL,
 									 n, have_error);
 				CHECK_ERROR;
 				from_char_set_int(&out->mm, MONTHS_PER_YEAR - value,
diff --git a/src/backend/utils/adt/pg_locale.c b/src/backend/utils/adt/pg_locale.c
index 25fb7e2..64fd3ae 100644
--- a/src/backend/utils/adt/pg_locale.c
+++ b/src/backend/utils/adt/pg_locale.c
@@ -96,11 +96,17 @@ char	   *locale_monetary;
 char	   *locale_numeric;
 char	   *locale_time;
 
-/* lc_time localization cache */
-char	   *localized_abbrev_days[7];
-char	   *localized_full_days[7];
-char	   *localized_abbrev_months[12];
-char	   *localized_full_months[12];
+/*
+ * lc_time localization cache.
+ *
+ * We use only the first 7 or 12 entries of these arrays.  The last array
+ * element is left as NULL for the convenience of outside code that wants
+ * to sequentially scan these arrays.
+ */
+char	   *localized_abbrev_days[7 + 1];
+char	   *localized_full_days[7 + 1];
+char	   *localized_abbrev_months[12 + 1];
+char	   *localized_full_months[12 + 1];
 
 /* indicates whether locale information cache is valid */
 static bool CurrentLocaleConvValid = false;
@@ -922,6 +928,8 @@ cache_locale_time(void)
 		cache_single_string(&localized_full_days[i], bufptr, encoding);
 		bufptr += MAX_L10N_DATA;
 	}
+	localized_abbrev_days[7] = NULL;
+	localized_full_days[7] = NULL;
 
 	/* localized months */
 	for (i = 0; i < 12; i++)
@@ -931,6 +939,8 @@ cache_locale_time(void)
 		cache_single_string(&localized_full_months[i], bufptr, encoding);
 		bufptr += MAX_L10N_DATA;
 	}
+	localized_abbrev_months[12] = NULL;
+	localized_full_months[12] = NULL;
 
 	CurrentLCTimeValid = true;
 }
diff --git a/src/test/regress/expected/collate.linux.utf8.out b/src/test/regress/expected/collate.linux.utf8.out
index 37c6add..af3762e 100644
--- a/src/test/regress/expected/collate.linux.utf8.out
+++ b/src/test/regress/expected/collate.linux.utf8.out
@@ -461,6 +461,22 @@ SELECT to_char(date '2010-04-01', 'DD TMMON YYYY' COLLATE "tr_TR");
  01 NİS 2010
(1 row)
 
+-- to_date
+SELECT to_date('01 ŞUB 2010', 'DD TMMON YYYY'); -- normalized \u015E
+  to_date
+------------
+ 02-01-2010
+(1 row)
+
+SELECT to_date('01 ŞUB 2010', 'DD TMMON YYYY'); -- not normalized \u0053+\u0327
+  to_date
+------------
+ 02-01-2010
+(1 row)
+
+SELECT to_date('1234567890ab 2010', 'TMMONTH YYYY'); -- fail
+ERROR:  invalid value "1234567890ab" for "MONTH"
+DETAIL:  The given value did not match any of the allowed values for this field.
 -- backwards parsing
 CREATE VIEW collview1 AS SELECT * FROM collate_test1 WHERE b COLLATE "C" >= 'bbc';
 CREATE VIEW collview2 AS SELECT a, b FROM collate_test1 ORDER BY b COLLATE "C";
diff --git a/src/test/regress/sql/collate.linux.utf8.sql b/src/test/regress/sql/collate.linux.utf8.sql
index 8c26f16..b0a85c2 100644
--- a/src/test/regress/sql/collate.linux.utf8.sql
+++ b/src/test/regress/sql/collate.linux.utf8.sql
@@ -182,6 +182,12 @@ SELECT to_char(date '2010-02-01', 'DD TMMON YYYY' COLLATE "tr_TR");
 SELECT to_char(date '2010-04-01', 'DD TMMON YYYY');
 SELECT to_char(date '2010-04-01', 'DD TMMON YYYY' COLLATE "tr_TR");
 
+-- to_date
+
+SELECT to_date('01 ŞUB 2010', 'DD TMMON YYYY'); -- normalized \u015E
+SELECT to_date('01 ŞUB 2010', 'DD TMMON YYYY'); -- not normalized \u0053+\u0327
+SELECT to_date('1234567890ab 2010', 'TMMONTH YYYY'); -- fail
+
 -- backwards parsing
Re: Allow to_date() and to_timestamp() to accept localized names
From: Juan José Santamaría Flecha
On Thu, Jan 23, 2020 at 11:00 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Thank you for your time looking into this.
> Here's a v7 patch, rebased over my recent hacking, and addressing
> most of the complaints I raised in <31691.1579648824@sss.pgh.pa.us>.
> However, I've got some new complaints now that I've looked harder
> at the code:
>
> * It's not exactly apparent to me why this code should be concerned
> about non-normalized characters when noplace else in the backend is.
> I think we should either rip that out entirely, or move the logic
> into str_tolower (and hence also str_toupper etc). I'd tend to favor
> the former, but of course I don't speak any languages where this would
> be an issue. Perhaps a better idea is to invent a new SQL function
> that normalizes a given string, and expect users to call that first
> if they'd like to_date() to match unnormalized text.
There is an open patch that will make the normalization functionality user visible [1]. So, if a user can call to_date(normalize('01 ŞUB 2010'), 'DD TMMON YYYY') I would vote to drop the normalization logic inside this patch altogether.
> * I have no faith in this calculation that decides how long the match
> length was:
>
>     *len = element_len + name_len - norm_len;
>
> I seriously doubt that this does the right thing if either the
> normalization or the downcasing changed the byte length of the
> string. I'm not actually sure how we can do that correctly.
> There's no good way to know whether the changes happened within
> the part of the "name" string that we matched, or the part beyond
> what we matched, but we only want to count the former. So this
> seems like a pretty hard problem, and even if this logic is somehow
> correct as it stands, it needs a comment explaining why.
The proper logic would come from do_to_timestamp() receiving a normalized "date_txt" input, so we do not operate with unnormalized and normalized strings at the same time.
> * I'm still concerned about the risk of integer overflow in the
> string allocations in seq_search_localized(). Those need to be
> adjusted to be more paranoid, similar to the code in e.g. str_tolower.
Please find attached a patch with the normalization logic removed, thus no direct allocations in seq_search_localized().
I would like to raise a couple of questions myself:
* When compiled with DEBUG_TO_FROM_CHAR, there is a warning "‘dump_node’ defined but not used". Should we drop this function or uncomment its usage?
* Would it be worth moving str_tolower(localized_name) from seq_search_localized() into cache_locale_time()?
Regards,
Juan José Santamaría Flecha
On 2020-Jan-24, Juan José Santamaría Flecha wrote: > There is an open patch that will make the normalization functionality user > visible [1]. So, if a user can call to_date(normalize('01 ŞUB 2010'), 'DD > TMMON YYYY') I would vote to drop the normalization logic inside this patch > altogether. I was reading the SQL standard on this point, and it says this (4.2.8 Universal character sets): An SQL-implementation may assume that all UCS strings are normalized in one of [Unicode normalization forms]. which seems to agree with what you're saying. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Juan José Santamaría Flecha <juanjo.santamaria@gmail.com> writes:
> On Thu, Jan 23, 2020 at 11:00 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> * It's not exactly apparent to me why this code should be concerned
>> about non-normalized characters when noplace else in the backend is.

> There is an open patch that will make the normalization functionality user
> visible [1]. So, if a user can call to_date(normalize('01 ŞUB 2010'), 'DD
> TMMON YYYY') I would vote to drop the normalization logic inside this patch
> altogether.

Works for me.

>> * I have no faith in this calculation that decides how long the match
>> length was:
>> *len = element_len + name_len - norm_len;

> The proper logic would come from do_to_timestamp() receiving a normalized
> "date_txt" input, so we do not operate with unnormalized and normalized
> strings at the same time.

No, that only solves half the problem, because the downcasing
transformation can change the string length too.  Two easy examples:

* In German, I believe "ß" downcases to "ss".  In Latin-1 encoding
that's a length change, though I think it might accidentally not be
in UTF8.

* The Turks distinguish dotted and dotless "i", so that "İ" downcases
to "i", and conversely "I" downcases to "ı".  Those are length changes
in UTF8, though not in whichever Latin-N encoding works for Turkish.

Even if these cases happen not to apply to any month or day name of
the relevant language, we still have a problem, arising from the fact
that we're downcasing the whole remaining string --- so length changes
after the match could occur anyway.

> I would like to raise a couple of questions myself:
> * When compiled with DEBUG_TO_FROM_CHAR, there is a warning "'dump_node'
> defined but not used". Should we drop this function or uncomment its usage?

Maybe, but I don't think it belongs in this patch.

> * Would it be worth moving str_tolower(localized_name)
> from seq_search_localized() into cache_locale_time()?

I think it'd complicate tracking when that cache has to be invalidated
(i.e. it now depends on more than just LC_TIME).  On the whole I
wouldn't bother unless someone does the measurements to show there'd
be a useful speedup.

			regards, tom lane
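The Turkish example above can be checked at the byte level without any locale support: the UTF-8 encodings of the dotted/dotless case pairs differ in length, so arithmetic that assumes a downcased copy stays byte-aligned with the original input is suspect. (Hard-coded UTF-8 byte sequences for illustration only; these are the standard encodings of U+0130, U+0069, U+0049, and U+0131.)

```c
#include <assert.h>
#include <string.h>

/* UTF-8 encodings of the Turkish case pairs mentioned upthread. */
static const char upper_dotted[] = "\xC4\xB0";	/* 'İ' U+0130, 2 bytes */
static const char lower_dotted[] = "i";			/* U+0069, 1 byte */
static const char upper_dotless[] = "I";		/* U+0049, 1 byte */
static const char lower_dotless[] = "\xC4\xB1";	/* 'ı' U+0131, 2 bytes */

/* Byte-length change incurred by mapping an upper-case form to lower. */
static int
case_pair_length_delta(const char *upper, const char *lower)
{
	return (int) strlen(lower) - (int) strlen(upper);
}
```

Downcasing 'İ' shrinks the string by one byte while downcasing 'I' (in a Turkish locale) grows it by one, so a formula like `element_len + name_len - norm_len` cannot tell whether the length change happened inside or beyond the matched portion.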
On 2020-01-24 12:44, Alvaro Herrera wrote: > On 2020-Jan-24, Juan José Santamaría Flecha wrote: > >> There is an open patch that will make the normalization functionality user >> visible [1]. So, if a user can call to_date(normalize('01 ŞUB 2010'), 'DD >> TMMON YYYY') I would vote to drop the normalization logic inside this patch >> altogether. > > I was reading the SQL standard on this point, and it says this (4.2.8 > Universal character sets): > > An SQL-implementation may assume that all UCS strings are normalized > in one of [Unicode normalization forms]. > > which seems to agree with what you're saying. When you interact with glibc locale functions, you essentially have to assume that everything is in NFC. When using ICU locale functions (which we don't here, but just for the sake of argument), then I would expect that ICU does appropriate normalization itself when I call icu_what_month_is_this(string) returns int. So I think it is appropriate to not deal with normalization in this patch. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2020-Jan-24, Peter Eisentraut wrote: > When you interact with glibc locale functions, you essentially have to > assume that everything is in NFC. When using ICU locale functions (which we > don't here, but just for the sake of argument), then I would expect that ICU > does appropriate normalization itself when I call > icu_what_month_is_this(string) returns int. So I think it is appropriate to > not deal with normalization in this patch. But that's a different POV. The input to this function could come from arbitrary user input from any application whatsoever. So the only reason we can get away with that is because the example regression case Juan José added (which uses non-normals) does not conform to the standard. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Alvaro Herrera <alvherre@2ndquadrant.com> writes: > But that's a different POV. The input to this function could come from > arbitrary user input from any application whatsoever. So the only > reason we can get away with that is because the example regression case > Juan José added (which uses non-normals) does not conform to the > standard. I'm unsure about "conforming to standard", but I think it's reasonable to put the onus of doing normalization when necessary on the user. Otherwise, we need to move normalization logic into basically all the string processing functions (even texteq), which seems like a pretty huge cost that will benefit only a small minority of people. (If it's not a small minority, then where's the bug reports complaining that we don't do it today?) regards, tom lane
On 2020-01-24 17:22, Tom Lane wrote: > Alvaro Herrera <alvherre@2ndquadrant.com> writes: >> But that's a different POV. The input to this function could come from >> arbitrary user input from any application whatsoever. So the only >> reason we can get away with that is because the example regression case >> Juan José added (which uses non-normals) does not conform to the >> standard. > > I'm unsure about "conforming to standard", but I think it's reasonable > to put the onus of doing normalization when necessary on the user. > Otherwise, we need to move normalization logic into basically all > the string processing functions (even texteq), which seems like a > pretty huge cost that will benefit only a small minority of people. > (If it's not a small minority, then where's the bug reports complaining > that we don't do it today?) These reports do exist, and this behavior is known. However, the impact is mostly that results "look wrong" (looks the same but doesn't compare as equal) rather than causing inconsistency and corruption, so it's mostly shrugged off. The nondeterministic collation feature was introduced in part to be able to deal with this; the pending normalization patch is another. However, this behavior is baked deeply into Unicode, so no single feature or facility will simply make it go away. AFAICT, we haven't so far had any code that does a lookup of non-ASCII strings in a table, so that's why we haven't had this discussion yet. Now that I think about it, you could also make an argument that this should be handled through collation, so the function that looks up the string in the locale table should go through texteq. However, this would mostly satisfy the purists but create a bizarre user experience. Looking through the patch quickly, if you want to get Unicode-fancy, doing a case-insensitive comparison by running lower-case on both strings is also wrong in corner cases. 
All the Greek month names end in sigma, so I suspect that this patch might not work correctly in such cases. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
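The sigma concern is concrete at the byte level: Greek has two lower-case sigmas, and a context-free per-character downcasing of an upper-case name yields the non-final form, which then fails a bytewise comparison against text typed with the final form. (Hard-coded UTF-8 sequences for illustration only.)

```c
#include <assert.h>
#include <string.h>

/* The three sigmas, as UTF-8 byte sequences. */
static const char capital_sigma[] = "\xCE\xA3";	/* 'Σ' U+03A3 */
static const char small_sigma[] = "\xCF\x83";	/* 'σ' U+03C3 */
static const char final_sigma[] = "\xCF\x82";	/* 'ς' U+03C2 */

/*
 * A context-free tolower() maps 'Σ' to 'σ' everywhere, so a name that
 * legitimately ends in 'ς' never compares equal, byte for byte, to the
 * downcased form of its all-upper-case spelling.
 */
static int
bytes_equal(const char *a, const char *b)
{
	return strcmp(a, b) == 0;
}
```

This is why lowercasing both sides and using strncmp() can fail for Greek month names even when the user's input is, linguistically, a correct spelling.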
Peter Eisentraut <peter.eisentraut@2ndquadrant.com> writes: > Looking through the patch quickly, if you want to get Unicode-fancy, > doing a case-insensitive comparison by running lower-case on both > strings is also wrong in corner cases. All the Greek month names end in > sigma, so I suspect that this patch might not work correctly in such cases. Hm. That's basically what citext does, and I don't recall hearing complaints about that. What other definition of "case insensitive" would you suggest? regards, tom lane
Re: Allow to_date() and to_timestamp() to accept localized names
From: Juan José Santamaría Flecha
On Fri, Jan 24, 2020 at 5:46 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Peter Eisentraut <peter.eisentraut@2ndquadrant.com> writes:
>> Looking through the patch quickly, if you want to get Unicode-fancy,
>> doing a case-insensitive comparison by running lower-case on both
>> strings is also wrong in corner cases. All the Greek month names end in
>> sigma, so I suspect that this patch might not work correctly in such cases.
>
> Hm.  That's basically what citext does, and I don't recall hearing
> complaints about that.  What other definition of "case insensitive"
> would you suggest?
To illustrate the issue, it does not work as expected:
postgres=# select lower(to_char(now(),'TMMONTH'));
lower
------------
ιανουάριοσ
(1 row)
postgres=# select to_char(now(),'TMmonth');
to_char
------------
ιανουάριος
(1 row)
Regards,
Juan José Santamaría Flecha
On 2020-01-24 18:25, Juan José Santamaría Flecha wrote: > To illustrate the issue, it does not work as expected: > > postgres=# select lower(to_char(now(),'TMMONTH')); > lower > ------------ > ιανουάριοσ > (1 row) > > postgres=# select to_char(now(),'TMmonth'); > to_char > ------------ > ιανουάριος > (1 row) Well, this is interesting, because on macOS and Debian stable I get postgres=# select to_char(now(),'TMmonth'); to_char ------------ ιανουαρίου (1 row) which is the genitive of ιανουάριος. You use the genitive form for a date (24th of January) but the nominative otherwise. But the reverse mapping can only take one of these forms. So here select to_date('Ιανουαριος', 'TMMonth'); fails, which is bad. In the glibc locale data sources they have both forms listed, but apparently the API we are using only accepts one of them. (I don't know any Greek, I'm just extrapolating from Wikipedia and locale data.) -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2020-01-24 19:01, Peter Eisentraut wrote: > postgres=# select to_char(now(),'TMmonth'); > to_char > ------------ > ιανουαρίου > (1 row) > > which is the genitive of ιανουάριος. You use the genitive form for a > date (24th of January) but the nominative otherwise. But the reverse > mapping can only take one of these forms. So here > > select to_date('Ιανουαριος', 'TMMonth'); > > fails, which is bad. For the record, the correct form of that would appear to be select to_date('Ιανουάριος', 'TMMonth'); with the accent. I had tried different variations of that and they all failed. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Peter Eisentraut <peter.eisentraut@2ndquadrant.com> writes: > For the record, the correct form of that would appear to be > select to_date('Ιανουάριος', 'TMMonth'); > with the accent. I had tried different variations of that and they all > failed. OK, so for anyone who is as confused as I was, the main point here seems to be this: the upper case form of Greek sigma is 'Σ', and the lower case form is 'σ' ... except as the final letter of a word, where it is supposed to be written like 'ς'. If I set lc_collate, lc_ctype, and lc_time to 'el_GR.utf8', then (on a somewhat hoary glibc platform) I get u8=# select to_char('2020-01-01'::timestamptz, 'TMMONTH'); to_char ---------------------- ΙΑΝΟΥΆΡΙΟΣ (1 row) u8=# select to_char('2020-01-01'::timestamptz, 'TMMonth'); to_char ---------------------- Ιανουάριος (1 row) u8=# select to_char('2020-01-01'::timestamptz, 'TMmonth'); to_char ---------------------- ιανουάριος (1 row) which is correct AFAICS ... but u8=# select lower(to_char('2020-01-01'::timestamptz, 'TMMONTH')); lower ---------------------- ιανουάριοσ (1 row) So what we actually have here, ISTM, is a bug in lower() not to_char(). The bug is unsurprising because str_tolower() simply applies towlower_l() to each character independently, so there's no way for it to account for the word-final rule. I'm not aware that glibc provides any API whereby that could be done correctly. On the other hand, we get it right when using an ICU collation for lower(): u8=# select lower(to_char('2020-01-01'::timestamptz, 'TMMONTH') collate "el-gr-x-icu"); lower ---------------------- ιανουάριος (1 row) because that code path passes the whole string to ICU at once, and of course getting this right is ICU's entire job. I haven't double-checked, but I imagine that the reason that to_char gets the month name case-folding right is that what comes out of strftime(..."%B"...) is "Ιανουάριος" which we are able to upcase correctly, while the downcasing code paths don't affect 'ς'. 
I thought for a little bit about trying to dodge this issue in the patch by folding to upper case, not lower, before comparing month/day names. I fear that that would just shift the problem cases to some other language(s). However, it makes Greek better, and I think it makes German better (does 'ß' appear in any month/day names there?), so maybe we should just roll with that. In the end, it doesn't seem right to reject this patch just because lower() is broken on some platforms. The other question your example raises is whether we should be trying to de-accent before comparison, ie was it right for 'Ιανουάριος' to be treated differently from 'Ιανουαριος'. I don't know enough Greek to say, but it kind of feels like that should be outside to_date's purview. regards, tom lane
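Tom's diagnosis is easy to reproduce outside Postgres. A minimal Python sketch (Python's own Unicode tables stand in here for the C library's; this illustrates the Final_Sigma rule itself, not glibc's exact behavior):

```python
# Reproduce the final-sigma problem described above, using Python's
# Unicode case mappings as a stand-in for the C library (illustration only).

MONTH_UPPER = "ΙΑΝΟΥΆΡΙΟΣ"   # what to_char(..., 'TMMONTH') produces

# Lowering each character independently, as str_tolower() does via
# towlower_l(): U+03A3 (Σ) always maps to U+03C3 (σ).
per_char = "".join(ch.lower() for ch in MONTH_UPPER)

# Lowering the whole string lets Unicode's SpecialCasing rule apply:
# a word-final Σ becomes U+03C2 (ς) instead.
whole = MONTH_UPPER.lower()

print(per_char)            # ιανουάριοσ  (wrong final letter)
print(whole)               # ιανουάριος  (correct Greek)
print(per_char == whole)   # False

# Unicode case folding maps both lowercase sigmas to σ,
# so the folded forms compare equal.
print(per_char.casefold() == whole.casefold())  # True
```

This is the same reason ICU's whole-string lower() gets it right while a per-character loop cannot: the mapping for Σ depends on its position in the word.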
> On Jan 27, 2020, at 6:11 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > and I think it > makes German better (does 'ß' appear in any month/day names there?), > so maybe we should just roll with that. To my knowledge, neither /ß/ nor /ss/ occur in day or month names in German, neither presently nor in recent times, so I wouldn’t expect it to matter for German whether you use uppercase or lowercase. > > The other question your example raises is whether we should be trying > to de-accent before comparison, ie was it right for 'Ιανουάριος' to > be treated differently from 'Ιανουαριος'. I don't know enough Greek > to say, but it kind of feels like that should be outside to_date's > purview. I’m getting a little lost here. German uses both Sonnabend and Samstag for Saturday, so don’t you have to compare to a list of values anyway? I don’t know Norwegian, but Wikipedia shows both sundag and søndag for Sunday in Norwegian Nynorsk. Faroese seems to have a few similar cases. Whether those languages have alternate day names both in common usage, I can’t say, but both Sonnabend and Samstag are still in use in the German-speaking world. If you need to compare against lists, then I would think you could put both ιανουάριοσ and ιανουάριος into a list, even if only one of them is grammatically correct Greek. I’d think that unaccented versions of names might also go into the list, depending on whether speakers of that language consider them equivalent. That sounds like something for the translators to hash out. — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2020-01-28 04:05, Mark Dilger wrote: > German uses both Sonnabend and Samstag for Saturday, so don’t you have to compare to a list of values anyway? Yeah, good point. If it doesn't accept both "Sonnabend" and "Samstag", then it's not really usable. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2020-01-28 03:11, Tom Lane wrote: > The other question your example raises is whether we should be trying > to de-accent before comparison, ie was it right for 'Ιανουάριος' to > be treated differently from 'Ιανουαριος'. I don't know enough Greek > to say, but it kind of feels like that should be outside to_date's > purview. To clarify, nothing in my examples was meant to imply that de-accenting might be necessary. AFAICT, the accented forms are the only linguistically acceptable forms and all the locale libraries accept them correctly in their accented forms. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Peter Eisentraut <peter.eisentraut@2ndquadrant.com> writes: > On 2020-01-28 04:05, Mark Dilger wrote: >> German uses both Sonnabend and Samstag for Saturday, so don’t you have to compare to a list of values anyway? > Yeah, good point. If it doesn't accept both "Sonnabend" and "Samstag", > then it's not really usable. If we're going to insist on that, then the entire patch is junk. Indeed, I don't even know where we could get the knowledge of which name(s) to accept, because strftime is surely only going to tell us one of them. regards, tom lane
> On Jan 28, 2020, at 6:51 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Peter Eisentraut <peter.eisentraut@2ndquadrant.com> writes: >> On 2020-01-28 04:05, Mark Dilger wrote: >>> German uses both Sonnabend and Samstag for Saturday, so don’t you have to compare to a list of values anyway? > >> Yeah, good point. If it doesn't accept both "Sonnabend" and "Samstag", >> then it's not really usable. > > If we're going to insist on that, then the entire patch is junk. I’m not insisting, just asking about the design. If it only works with one name supported per weekday per language, and per month per language, I’m not going to object. I was just curious if you were going to support multiple names anyway, and if that made the question about the Greek case folding less pressing. > Indeed, I don't even know where we could get the knowledge of which > name(s) to accept, because strftime is surely only going to tell us > one of them. I haven’t played with this yet, but per the manual page for strftime_l: > %A is replaced by national representation of the full weekday name. > > %a is replaced by national representation of the abbreviated weekday name. > > %B is replaced by national representation of the full month name. > > %b is replaced by national representation of the abbreviated month name. Which I think gives you just the preferred name, by whatever means the library decides which of Sonnabend/Samstag is the preferred name. But then the manual page goes on to say: > %E* %O* > POSIX locale extensions. The sequences %Ec %EC %Ex %EX %Ey %EY %Od %Oe %OH %OI %Om %OM %OS %Ou %OU %OV %Ow %OW %Oy are supposed to provide alternate representations. > > Additionally %OB implemented to represent alternative months names (used standalone, without day mentioned). This is the part I haven’t played with, but it sounds like it can handle at least one alternate name. Perhaps you can get the alternates this way?
— Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
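For context, the kind of per-locale name cache being discussed (what cache_locale_time() builds on the C side from strftime()) can be sketched in Python using locale.nl_langinfo. This is an illustration only: nl_langinfo is Unix-only, and which named locales (e.g. "el_GR.UTF-8") are installed varies by system, so the sketch uses the always-present "C" locale:

```python
# Sketch: ask the C library for the localized full month names once and
# keep them in an array for later lookups, roughly like cache_locale_time().
# locale.nl_langinfo is Unix-only; installed locale names vary by platform.
import locale

def cache_month_names(lc_time: str) -> list[str]:
    old = locale.setlocale(locale.LC_TIME)
    try:
        locale.setlocale(locale.LC_TIME, lc_time)
        # MON_1 .. MON_12 yield the full month names for the active locale
        return [locale.nl_langinfo(getattr(locale, f"MON_{i}"))
                for i in range(1, 13)]
    finally:
        locale.setlocale(locale.LC_TIME, old)

print(cache_month_names("C"))  # ['January', ..., 'December']
```

As Mark notes, this interface reports only the locale's one preferred name per month or weekday; alternates like Sonnabend never appear.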
Re: Allow to_date() and to_timestamp() to accept localized names
From
Juan José Santamaría Flecha
Date:
On Tue, Jan 28, 2020 at 4:06 PM Mark Dilger <mark.dilger@enterprisedb.com> wrote:
I’m not insisting, just asking about the design. If it only works with one name supported per weekday per language, and per month per language, I’m not going to object. I was just curious if you were going to support multiple names anyway, and if that made the question about the Greek case folding less pressing.
This patch aims to be symmetrical with to_char(), which will just return a single value.
The issue with the Greek locale is that it cannot preserve this round-trip behaviour, as in:
postgres=# select to_date(to_char(now(), 'TMMONTH'), 'TMMONTH');
ERROR: invalid value "ΙΑΝΟΥΆΡΙΟΣ" for "MONTH"
This is because of the special casing behaviour of sigma (Σ/σ/ς), which is also user-visible through lower().
> %E* %O*
> POSIX locale extensions. The sequences %Ec %EC %Ex %EX %Ey %EY %Od %Oe %OH %OI %Om %OM %OS %Ou %OU %OV %Ow %OW %Oy are supposed to provide alternate representations.
>
> Additionally %OB implemented to represent alternative months names (used standalone, without day mentioned).
This is the part I haven’t played with, but it sounds like it can handle at least one alternate name. Perhaps you can get the alternates this way?
This looks like a POSIX extension that some systems might not support (Windows [1]). But if this is something the patch should aim for, I am fine with a RWF and giving it another try in the future.
Regards,
Juan José Santamaría Flecha
> On Jan 28, 2020, at 7:47 AM, Juan José Santamaría Flecha <juanjo.santamaria@gmail.com> wrote: > > This looks like a POSIX feature that some systems might not like (Windows [1]). But if this is something that the patch should aim to, I am fine with a RWF and give it another try in the future. As long as this implementation doesn’t create a backward-compatibility problem when doing a more complete implementation later, I’m fine with this patch not tackling the whole problem. — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2020-Jan-28, Peter Eisentraut wrote: > On 2020-01-28 04:05, Mark Dilger wrote: > > German uses both Sonnabend and Samstag for Saturday, so don’t you have to compare to a list of values anyway? > > Yeah, good point. If it doesn't accept both "Sonnabend" and "Samstag", then > it's not really usable. The string "Sonnabend" never appears in the glibc sources, so that will certainly not work. I vote not to care about that, but of course my language is not one that has alternate weekday/month names. I guess if we're intent on recognizing alternate names, we'll have to build our own list of them :-( I don't have the ICU sources here to check the status there. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Mark Dilger <mark.dilger@enterprisedb.com> writes: > But then the manual page goes on to say: >> %E* %O* >> POSIX locale extensions. The sequences %Ec %EC %Ex %EX %Ey %EY %Od %Oe %OH %OI %Om %OM %OS %Ou %OU %OV %Ow %OW %Oy are supposed to provide alternate representations. >> >> Additionally %OB implemented to represent alternative months names (used standalone, without day mentioned). > This is the part I haven’t played with, but it sounds like it can handle at least one alternate name. Perhaps you can get the alternates this way? This sounded promising, but the POSIX strftime spec doesn't mention %OB, so I'm afraid we can't count on it to do much. At this point I'm not really convinced that there are no languages with more than two forms, anyway :-(. I also wondered whether we could get any further by using strptime() to convert localized month and day names on-the-fly, rather than the patch's current approach of re-using strftime() results. If strptime() fails to support alternative names, it's their bug not ours. Unfortunately, glibc has got said bug (AFAICS anyway), so in practice this would only offer us plausible deniability and not much of any real functionality. In the end it seems like we could only handle alternative names by keeping our own lists of them. There are probably few enough cases that that wouldn't be a tremendous maintenance problem, but what I'm not quite seeing is how we'd decide which list to use when. Right now, locale identifiers are pretty much opaque to us ... do we really want to get into the business of recognizing that such a name refers to German, or Greek? A brute-force answer, if there are few enough cases, is to recognize them regardless of the specific value of LC_TIME. Do we think anybody would notice or care if to_date('Sonnabend', 'TMDay') works even when in a non-German locale? regards, tom lane
> On Jan 28, 2020, at 9:30 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > A brute-force answer, if there are few enough cases, is to recognize > them regardless of the specific value of LC_TIME. Do we think > anybody would notice or care if to_date('Sonnabend', 'TMDay') works > even when in a non-German locale? I think this only becomes a problem if there are weekday or month name collisions between languages where they have different meanings. As an absurd hypothetical, if “Sunday” in Tagalog means what “Tuesday” means in English, then you’ve got a problem. This does happen for month abbreviations. “Mar” is short for “March” and variations in a number of languages, but short for “November” in Finnish. For day abbreviations, “Su” collides between fi_FI and hr_HR, and “tor” collides between sl_SL and no_NO. I don’t see any collisions with full month or full weekday names, but I’m also only looking at the 53 locales installed in /usr/share/locale/.*/LC_TIME on my mac. — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
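The collision scan Mark describes could be sketched as follows. The sample table is hand-made for illustration (modeled on his Finnish "mar" = marraskuu example), not real locale output:

```python
# Sketch: given per-locale abbreviation tables mapping name -> month
# number, report any name that is claimed for different months by
# different locales. The SAMPLE data is illustrative, not real.
SAMPLE = {
    "en_US": {"jan": 1, "mar": 3, "nov": 11},
    "fi_FI": {"tammi": 1, "maalis": 3, "mar": 11},  # "mar" = marraskuu
}

def find_collisions(tables):
    seen = {}   # folded name -> set of (locale, month number)
    for loc, names in tables.items():
        for name, month in names.items():
            seen.setdefault(name.casefold(), set()).add((loc, month))
    # A collision is a name that maps to more than one month number
    return {name: hits for name, hits in seen.items()
            if len({m for _, m in hits}) > 1}

print(find_collisions(SAMPLE))  # {'mar': {('en_US', 3), ('fi_FI', 11)}}
```

Such collisions only matter for the brute-force "accept names from any locale" approach; a per-locale lookup never mixes the tables.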
Mark Dilger <mark.dilger@enterprisedb.com> writes: >> On Jan 28, 2020, at 9:30 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> A brute-force answer, if there are few enough cases, is to recognize >> them regardless of the specific value of LC_TIME. Do we think >> anybody would notice or care if to_date('Sonnabend', 'TMDay') works >> even when in a non-German locale? > I think this only becomes a problem if there are weekday or month name collisions between languages where they have differentmeanings. As an absurd hypothetical, if “Sunday” in Tagalog means what “Tuesday” means in English, then you’vegot a problem. > This does happen for month abbreviations. “Mar” is short for “March” and variations in a number of languages, but shortfor “November” in Finnish. > For day abbreviations, “Su” collides between fi_FI and hr_HR, and “tor” collides between sl_SL and no_NO. But none of those cases are alternative names, so we wouldn't have entries for them in this hypothetical list. They'd only be recognized when strftime() returns them. I suspect also that the abbreviated names have few if any alternatives, so we may only need this whole hack for full names. A possible way of tightening things up without explicitly decoding the locale name is to make the entries of the list be string pairs: "if strftime() returned this, then also consider that". That puts a bit of a premium on knowing which names strftime considers primary, but I think we'd have had to know that anyway. regards, tom lane
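Tom's pair-list idea ("if strftime() returned this, then also consider that") could look like this in outline. The ALTERNATES table here is hypothetical; a real one would have to be curated by hand:

```python
# Sketch: expand the single name reported by strftime() into the full
# set of spellings to accept. The table is illustrative only; building
# and maintaining a real one is the open question in the thread.
ALTERNATES = {
    "Samstag": ["Sonnabend"],   # German Saturday
    "søndag": ["sundag"],       # Norwegian Nynorsk Sunday (per Wikipedia)
}

def candidate_names(strftime_name: str) -> list[str]:
    """All spellings to accept when strftime() reports strftime_name."""
    return [strftime_name] + ALTERNATES.get(strftime_name, [])

print(candidate_names("Samstag"))  # ['Samstag', 'Sonnabend']
print(candidate_names("Montag"))   # ['Montag']
```

Keying the table on the primary name sidesteps decoding locale identifiers: entries fire only when strftime() actually reports that primary name.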
On 2020-01-28 16:47, Juan José Santamaría Flecha wrote: > This patch targets to do something symmetrical to to_char(), which will > just return a single value. I didn't fully realize while reading this thread that to_char() already supports localized output and this patch indeed just wants to do the opposite. So I'm withdrawing my concerns with respect to this patch. As long as it can do a roundtrip conversion with to_char(), it's fine. It's pretty clear that this interface cannot be useful for producing or parsing fully general localized dates. But since it exists now (and it's quasi SQL standard now), we might as well fill in this gap. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Allow to_date() and to_timestamp() to accept localized names
From
Juan José Santamaría Flecha
Date:
On Tue, Jan 28, 2020 at 5:21 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
On 2020-Jan-28, Peter Eisentraut wrote:
> On 2020-01-28 04:05, Mark Dilger wrote:
> > German uses both Sonnabend and Samstag for Saturday, so don’t you have to compare to a list of values anyway?
>
> Yeah, good point. If it doesn't accept both "Sonnabend" and "Samstag", then
> it's not really usable.
The string "Sonnabend" never appears in the glibc sources, so that will
certainly not work. I vote not to care about that, but of course my
language is not one that has alternate weekday/month names. I guess if
we're intent on recognizing alternate names, we'll have to build our own
list of them :-(
I don't have the ICU sources here to check the status there.
"Sonnabend" is neither available in ICU.
What is available are both genitive and nominative forms for months, as reported up thread by Peter. See formats "M" and "L" in:
Regards,
Juan José Santamaría Flecha
Re: Allow to_date() and to_timestamp() to accept localized names
From
Juan José Santamaría Flecha
Date:
On Tue, Jan 28, 2020 at 9:35 PM Peter Eisentraut <peter.eisentraut@2ndquadrant.com> wrote:
On 2020-01-28 16:47, Juan José Santamaría Flecha wrote:
> This patch targets to do something symmetrical to to_char(), which will
> just return a single value.
I didn't fully realize while reading this thread that to_char() already
supports localized output and this patch indeed just wants to do the
opposite.
So I'm withdrawing my concerns with respect to this patch. As long as
it can do a roundtrip conversion with to_char(), it's fine.
We can avoid issues with languages whose case conversions are non-injective by doing a double conversion, so that both strings in the comparison end up in the same state.
I propose an upper/lower conversion as in the attached patch.
Regards,
Juan José Santamaría Flecha
Attachment
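The double conversion proposed above can be illustrated outside Postgres; Python's built-in case mappings stand in for the collation-aware str_toupper()/str_tolower() (a sketch of the idea, not the patch itself):

```python
# Sketch: fold both sides to upper case and then to lower case, so that
# strings whose one-step case conversions are not injective (Greek σ/ς
# both uppercasing to Σ) still end up byte-identical for comparison.
def fold(s: str) -> str:
    return s.upper().lower()

# lower() alone cannot unify the two lowercase sigmas:
assert "σ".lower() != "ς".lower()
# but the upper/lower round trip maps both through Σ:
assert fold("σ") == fold("ς") == "σ"

# So a per-character-lowered to_char() result and the locale's month
# name compare equal after folding:
assert fold("ιανουάριοσ") == fold("ιανουάριος")
print("folded forms match")
```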
=?UTF-8?Q?Juan_Jos=C3=A9_Santamar=C3=ADa_Flecha?= <juanjo.santamaria@gmail.com> writes: > On Tue, Jan 28, 2020 at 9:35 PM Peter Eisentraut < > peter.eisentraut@2ndquadrant.com> wrote: >> So I'm withdrawing my concerns with respect to this patch. As long as >> it can do a roundtrip conversion with to_char(), it's fine. > We can avoid issues with non injective case conversion languages with a > double conversion, so both strings in the comparison end up in the same > state. > I propose an upper/lower conversion as in the attached patch. This seems pretty reasonable to me, with a couple of caveats: * I don't think it's appropriate to hard-wire DEFAULT_COLLATION_OID as the collation to do case-folding with. For to_date/to_timestamp, we have collatable text input so we can rely on the function's input collation instead, at the cost of having to pass down the collation OID through a few layers of subroutines :-(. For parse_datetime, I punted for now and let it use DEFAULT_COLLATION_OID, because that's currently only called by JSONB code that probably hasn't got a usable input collation anyway (since jsonb isn't considered collatable). * I think it's likely worthwhile to do a quick check for an exact match before we go through all these case-folding pushups. If the expected use-case is reversing to_char() output, that will win all the time. We're probably ahead of the game even if it only matches a few percent of the time. Attached v10 incorporates those improvements, plus a bit of polishing of docs and comments. Barring objections, this seems committable to me. 
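As a rough outline of what the v10 seq_search_localized() does (exact-match fast path, then case-folded prefix comparison), here is a Python analogue. Note the real C code folds with the collation's rules and measures lengths in bytes, while this sketch uses Python's built-in case mappings and character counts:

```python
# Sketch of seq_search_localized(): search an array of localized names
# for a case-insensitive match to the start of 'name'. Returns
# (index, matched_length), or (-1, 0) on no match, mirroring the C
# function's int return plus *len output parameter.
def seq_search_localized(name: str, array: list[str]) -> tuple[int, int]:
    if not name:
        return -1, 0
    # Fast path: exact prefix match — the common case when the input
    # is unmodified to_char() output.
    for i, entry in enumerate(array):
        if name.startswith(entry):
            return i, len(entry)
    # Slow path: upper-then-lower fold both sides, then prefix-compare.
    fold = lambda s: s.upper().lower()
    folded_name = fold(name)
    for i, entry in enumerate(array):
        folded_entry = fold(entry)
        if folded_name.startswith(folded_entry):
            return i, len(folded_entry)
    return -1, 0

months = ["Ιανουάριος", "Φεβρουάριος"]
print(seq_search_localized("ΙΑΝΟΥΆΡΙΟΣ 2020", months))  # (0, 10)
```

The matched length tells the caller how far to advance the input pointer past the consumed month name before parsing the rest of the format string.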
regards, tom lane diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml index 28035f1..8b73e05 100644 --- a/doc/src/sgml/func.sgml +++ b/doc/src/sgml/func.sgml @@ -5968,7 +5968,7 @@ SELECT regexp_match('abc01234xyz', '(?:(.*?)(\d+)(.*)){1,1}'); </row> <row> <entry><literal>TM</literal> prefix</entry> - <entry>translation mode (print localized day and month names based on + <entry>translation mode (use localized day and month names based on <xref linkend="guc-lc-time"/>)</entry> <entry><literal>TMMonth</literal></entry> </row> @@ -5999,9 +5999,20 @@ SELECT regexp_match('abc01234xyz', '(?:(.*?)(\d+)(.*)){1,1}'); <listitem> <para> - <literal>TM</literal> does not include trailing blanks. - <function>to_timestamp</function> and <function>to_date</function> ignore - the <literal>TM</literal> modifier. + <literal>TM</literal> suppresses trailing blanks whether or + not <literal>FM</literal> is specified. + </para> + </listitem> + + <listitem> + <para> + <function>to_timestamp</function> and <function>to_date</function> + ignore letter case in the input; so for + example <literal>MON</literal>, <literal>Mon</literal>, + and <literal>mon</literal> all accept the same strings. When using + the <literal>TM</literal> modifier, case-folding is done according to + the rules of the function's input collation (see + <xref linkend="collation"/>). 
</para> </listitem> diff --git a/src/backend/utils/adt/formatting.c b/src/backend/utils/adt/formatting.c index d029468..db99a6b 100644 --- a/src/backend/utils/adt/formatting.c +++ b/src/backend/utils/adt/formatting.c @@ -1038,7 +1038,7 @@ static void parse_format(FormatNode *node, const char *str, const KeyWord *kw, static void DCH_to_char(FormatNode *node, bool is_interval, TmToChar *in, char *out, Oid collid); static void DCH_from_char(FormatNode *node, const char *in, TmFromChar *out, - bool std, bool *have_error); + Oid collid, bool std, bool *have_error); #ifdef DEBUG_TO_FROM_CHAR static void dump_index(const KeyWord *k, const int *index); @@ -1057,11 +1057,14 @@ static int from_char_parse_int_len(int *dest, const char **src, const int len, FormatNode *node, bool *have_error); static int from_char_parse_int(int *dest, const char **src, FormatNode *node, bool *have_error); -static int seq_search(const char *name, const char *const *array, int *len); +static int seq_search_ascii(const char *name, const char *const *array, int *len); +static int seq_search_localized(const char *name, char **array, int *len, + Oid collid); static int from_char_seq_search(int *dest, const char **src, const char *const *array, + char **localized_array, Oid collid, FormatNode *node, bool *have_error); -static void do_to_timestamp(text *date_txt, text *fmt, bool std, +static void do_to_timestamp(text *date_txt, text *fmt, Oid collid, bool std, struct pg_tm *tm, fsec_t *fsec, int *fprec, uint32 *flags, bool *have_error); static char *fill_str(char *str, int c, int max); @@ -2457,7 +2460,7 @@ from_char_parse_int(int *dest, const char **src, FormatNode *node, bool *have_er * suitable for comparisons to ASCII strings. 
*/ static int -seq_search(const char *name, const char *const *array, int *len) +seq_search_ascii(const char *name, const char *const *array, int *len) { unsigned char firstc; const char *const *a; @@ -2503,8 +2506,89 @@ seq_search(const char *name, const char *const *array, int *len) } /* - * Perform a sequential search in 'array' for an entry matching the first - * character(s) of the 'src' string case-insensitively. + * Sequentially search an array of possibly non-English words for + * a case-insensitive match to the initial character(s) of "name". + * + * This has the same API as seq_search_ascii(), but we use a more general + * case-folding transformation to achieve case-insensitivity. Case folding + * is done per the rules of the collation identified by "collid". + * + * The array is treated as const, but we don't declare it that way because + * the arrays exported by pg_locale.c aren't const. + */ +static int +seq_search_localized(const char *name, char **array, int *len, Oid collid) +{ + char **a; + char *upper_name; + char *lower_name; + + *len = 0; + + /* empty string can't match anything */ + if (!*name) + return -1; + + /* + * The case-folding processing done below is fairly expensive, so before + * doing that, make a quick pass to see if there is an exact match. + */ + for (a = array; *a != NULL; a++) + { + int element_len = strlen(*a); + + if (strncmp(name, *a, element_len) == 0) + { + *len = element_len; + return a - array; + } + } + + /* + * Fold to upper case, then to lower case, so that we can match reliably + * even in languages in which case conversions are not injective. 
+ */ + upper_name = str_toupper(unconstify(char *, name), strlen(name), collid); + lower_name = str_tolower(upper_name, strlen(upper_name), collid); + pfree(upper_name); + + for (a = array; *a != NULL; a++) + { + char *upper_element; + char *lower_element; + int element_len; + + /* Likewise upper/lower-case array element */ + upper_element = str_toupper(*a, strlen(*a), collid); + lower_element = str_tolower(upper_element, strlen(upper_element), + collid); + pfree(upper_element); + element_len = strlen(lower_element); + + /* Match? */ + if (strncmp(lower_name, lower_element, element_len) == 0) + { + *len = element_len; + pfree(lower_element); + pfree(lower_name); + return a - array; + } + pfree(lower_element); + } + + pfree(lower_name); + return -1; +} + +/* + * Perform a sequential search in 'array' (or 'localized_array', if that's + * not NULL) for an entry matching the first character(s) of the 'src' + * string case-insensitively. + * + * The 'array' is presumed to be English words (all-ASCII), but + * if 'localized_array' is supplied, that might be non-English + * so we need a more expensive case-folding transformation + * (which will follow the rules of the collation 'collid'). 
* * If a match is found, copy the array index of the match into the integer * pointed to by 'dest', advance 'src' to the end of the part of the string @@ -2518,11 +2602,15 @@ seq_search(const char *name, const char *const *array, int *len) */ static int from_char_seq_search(int *dest, const char **src, const char *const *array, + char **localized_array, Oid collid, FormatNode *node, bool *have_error) { int len; - *dest = seq_search(*src, array, &len); + if (localized_array == NULL) + *dest = seq_search_ascii(*src, array, &len); + else + *dest = seq_search_localized(*src, localized_array, &len, collid); if (len <= 0) { @@ -3147,19 +3235,20 @@ DCH_to_char(FormatNode *node, bool is_interval, TmToChar *in, char *out, Oid col *s = '\0'; } -/* ---------- - * Process a string as denoted by a list of FormatNodes. +/* + * Process the string 'in' as denoted by the array of FormatNodes 'node[]'. * The TmFromChar struct pointed to by 'out' is populated with the results. * + * 'collid' identifies the collation to use, if needed. + * 'std' specifies standard parsing mode. + * If 'have_error' is NULL, then errors are thrown, else '*have_error' is set. + * * Note: we currently don't have any to_interval() function, so there * is no need here for INVALID_FOR_INTERVAL checks. - * - * If 'have_error' is NULL, then errors are thrown, else '*have_error' is set. 
- * ----------
  */
 static void
-DCH_from_char(FormatNode *node, const char *in, TmFromChar *out, bool std,
-			  bool *have_error)
+DCH_from_char(FormatNode *node, const char *in, TmFromChar *out,
+			  Oid collid, bool std, bool *have_error)
 {
 	FormatNode *n;
 	const char *s;
@@ -3170,6 +3259,9 @@ DCH_from_char(FormatNode *node, const char *in, TmFromChar *out, bool std,
 	/* number of extra skipped characters (more than given in format string) */
 	int			extra_skip = 0;
 
+	/* cache localized days and months */
+	cache_locale_time();
+
 	for (n = node, s = in; n->type != NODE_TYPE_END && *s != '\0'; n++)
 	{
 		/*
@@ -3271,6 +3363,7 @@ DCH_from_char(FormatNode *node, const char *in, TmFromChar *out, bool std,
 			case DCH_a_m:
 			case DCH_p_m:
 				from_char_seq_search(&value, &s, ampm_strings_long,
+									 NULL, InvalidOid,
 									 n, have_error);
 				CHECK_ERROR;
 				from_char_set_int(&out->pm, value % 2, n, have_error);
@@ -3282,6 +3375,7 @@ DCH_from_char(FormatNode *node, const char *in, TmFromChar *out, bool std,
 			case DCH_am:
 			case DCH_pm:
 				from_char_seq_search(&value, &s, ampm_strings,
+									 NULL, InvalidOid,
 									 n, have_error);
 				CHECK_ERROR;
 				from_char_set_int(&out->pm, value % 2, n, have_error);
@@ -3395,6 +3489,7 @@ DCH_from_char(FormatNode *node, const char *in, TmFromChar *out, bool std,
 			case DCH_a_d:
 			case DCH_b_c:
 				from_char_seq_search(&value, &s, adbc_strings_long,
+									 NULL, InvalidOid,
 									 n, have_error);
 				CHECK_ERROR;
 				from_char_set_int(&out->bc, value % 2, n, have_error);
@@ -3405,6 +3500,7 @@ DCH_from_char(FormatNode *node, const char *in, TmFromChar *out, bool std,
 			case DCH_ad:
 			case DCH_bc:
 				from_char_seq_search(&value, &s, adbc_strings,
+									 NULL, InvalidOid,
 									 n, have_error);
 				CHECK_ERROR;
 				from_char_set_int(&out->bc, value % 2, n, have_error);
@@ -3414,6 +3510,8 @@ DCH_from_char(FormatNode *node, const char *in, TmFromChar *out, bool std,
 			case DCH_Month:
 			case DCH_month:
 				from_char_seq_search(&value, &s, months_full,
+									 S_TM(n->suffix) ? localized_full_months : NULL,
+									 collid,
 									 n, have_error);
 				CHECK_ERROR;
 				from_char_set_int(&out->mm, value + 1, n, have_error);
@@ -3423,6 +3521,8 @@ DCH_from_char(FormatNode *node, const char *in, TmFromChar *out, bool std,
 			case DCH_Mon:
 			case DCH_mon:
 				from_char_seq_search(&value, &s, months,
+									 S_TM(n->suffix) ? localized_abbrev_months : NULL,
+									 collid,
 									 n, have_error);
 				CHECK_ERROR;
 				from_char_set_int(&out->mm, value + 1, n, have_error);
@@ -3437,6 +3537,8 @@ DCH_from_char(FormatNode *node, const char *in, TmFromChar *out, bool std,
 			case DCH_Day:
 			case DCH_day:
 				from_char_seq_search(&value, &s, days,
+									 S_TM(n->suffix) ? localized_full_days : NULL,
+									 collid,
 									 n, have_error);
 				CHECK_ERROR;
 				from_char_set_int(&out->d, value, n, have_error);
@@ -3447,6 +3549,8 @@ DCH_from_char(FormatNode *node, const char *in, TmFromChar *out, bool std,
 			case DCH_Dy:
 			case DCH_dy:
 				from_char_seq_search(&value, &s, days_short,
+									 S_TM(n->suffix) ? localized_abbrev_days : NULL,
+									 collid,
 									 n, have_error);
 				CHECK_ERROR;
 				from_char_set_int(&out->d, value, n, have_error);
@@ -3565,6 +3669,7 @@ DCH_from_char(FormatNode *node, const char *in, TmFromChar *out, bool std,
 			case DCH_RM:
 			case DCH_rm:
 				from_char_seq_search(&value, &s, rm_months_lower,
+									 NULL, InvalidOid,
 									 n, have_error);
 				CHECK_ERROR;
 				from_char_set_int(&out->mm, MONTHS_PER_YEAR - value,
@@ -4031,13 +4136,15 @@ to_timestamp(PG_FUNCTION_ARGS)
 {
 	text	   *date_txt = PG_GETARG_TEXT_PP(0);
 	text	   *fmt = PG_GETARG_TEXT_PP(1);
+	Oid			collid = PG_GET_COLLATION();
 	Timestamp	result;
 	int			tz;
 	struct pg_tm tm;
 	fsec_t		fsec;
 	int			fprec;
 
-	do_to_timestamp(date_txt, fmt, false, &tm, &fsec, &fprec, NULL, NULL);
+	do_to_timestamp(date_txt, fmt, collid, false,
+					&tm, &fsec, &fprec, NULL, NULL);
 
 	/* Use the specified time zone, if any. */
 	if (tm.tm_zone)
@@ -4072,11 +4179,13 @@ to_date(PG_FUNCTION_ARGS)
 {
 	text	   *date_txt = PG_GETARG_TEXT_PP(0);
 	text	   *fmt = PG_GETARG_TEXT_PP(1);
+	Oid			collid = PG_GET_COLLATION();
 	DateADT		result;
 	struct pg_tm tm;
 	fsec_t		fsec;
 
-	do_to_timestamp(date_txt, fmt, false, &tm, &fsec, NULL, NULL, NULL);
+	do_to_timestamp(date_txt, fmt, collid, false,
+					&tm, &fsec, NULL, NULL, NULL);
 
 	/* Prevent overflow in Julian-day routines */
 	if (!IS_VALID_JULIAN(tm.tm_year, tm.tm_mon, tm.tm_mday))
@@ -4116,7 +4225,13 @@ parse_datetime(text *date_txt, text *fmt, bool strict, Oid *typid,
 	int			fprec;
 	uint32		flags;
 
-	do_to_timestamp(date_txt, fmt, strict, &tm, &fsec, &fprec, &flags, have_error);
+	/*
+	 * At some point we might wish to have callers supply the collation to
+	 * use, but right now it's not clear that they'd be able to do better than
+	 * DEFAULT_COLLATION_OID anyway.
+	 */
+	do_to_timestamp(date_txt, fmt, DEFAULT_COLLATION_OID, strict,
+					&tm, &fsec, &fprec, &flags, have_error);
 	CHECK_ERROR;
 
 	*typmod = fprec ? fprec : -1;	/* fractional part precision */
@@ -4278,21 +4393,21 @@ on_error:
  * Parse the 'date_txt' according to 'fmt', return results as a struct pg_tm,
  * fractional seconds, and fractional precision.
  *
+ * 'collid' identifies the collation to use, if needed.
+ * 'std' specifies standard parsing mode.
+ * Bit mask of date/time/zone components found in 'fmt' is returned in 'flags',
+ * if that is not NULL.
+ * If 'have_error' is NULL, then errors are thrown, else '*have_error' is set.
+ *
  * We parse 'fmt' into a list of FormatNodes, which is then passed to
  * DCH_from_char to populate a TmFromChar with the parsed contents of
  * 'date_txt'.
  *
  * The TmFromChar is then analysed and converted into the final results in
- * struct 'tm' and 'fsec'.
- *
- * Bit mask of date/time/zone components found in 'fmt' is returned in 'flags'.
- *
- * 'std' specifies standard parsing mode.
- *
- * If 'have_error' is NULL, then errors are thrown, else '*have_error' is set.
+ * struct 'tm', 'fsec', and 'fprec'.
  */
 static void
-do_to_timestamp(text *date_txt, text *fmt, bool std,
+do_to_timestamp(text *date_txt, text *fmt, Oid collid, bool std,
 				struct pg_tm *tm, fsec_t *fsec, int *fprec,
 				uint32 *flags, bool *have_error)
 {
@@ -4352,7 +4467,7 @@ do_to_timestamp(text *date_txt, text *fmt, bool std,
 	/* dump_index(DCH_keywords, DCH_index); */
 #endif
 
-	DCH_from_char(format, date_str, &tmfc, std, have_error);
+	DCH_from_char(format, date_str, &tmfc, collid, std, have_error);
 	CHECK_ERROR;
 
 	pfree(fmt_str);
diff --git a/src/backend/utils/adt/pg_locale.c b/src/backend/utils/adt/pg_locale.c
index 25fb7e2..64fd3ae 100644
--- a/src/backend/utils/adt/pg_locale.c
+++ b/src/backend/utils/adt/pg_locale.c
@@ -96,11 +96,17 @@ char	   *locale_monetary;
 char	   *locale_numeric;
 char	   *locale_time;
 
-/* lc_time localization cache */
-char	   *localized_abbrev_days[7];
-char	   *localized_full_days[7];
-char	   *localized_abbrev_months[12];
-char	   *localized_full_months[12];
+/*
+ * lc_time localization cache.
+ *
+ * We use only the first 7 or 12 entries of these arrays.  The last array
+ * element is left as NULL for the convenience of outside code that wants
+ * to sequentially scan these arrays.
+ */
+char	   *localized_abbrev_days[7 + 1];
+char	   *localized_full_days[7 + 1];
+char	   *localized_abbrev_months[12 + 1];
+char	   *localized_full_months[12 + 1];
 
 /* indicates whether locale information cache is valid */
 static bool CurrentLocaleConvValid = false;
@@ -922,6 +928,8 @@ cache_locale_time(void)
 		cache_single_string(&localized_full_days[i], bufptr, encoding);
 		bufptr += MAX_L10N_DATA;
 	}
+	localized_abbrev_days[7] = NULL;
+	localized_full_days[7] = NULL;
 
 	/* localized months */
 	for (i = 0; i < 12; i++)
@@ -931,6 +939,8 @@ cache_locale_time(void)
 		cache_single_string(&localized_full_months[i], bufptr, encoding);
 		bufptr += MAX_L10N_DATA;
 	}
+	localized_abbrev_months[12] = NULL;
+	localized_full_months[12] = NULL;
 
 	CurrentLCTimeValid = true;
 }
diff --git a/src/test/regress/expected/collate.linux.utf8.out b/src/test/regress/expected/collate.linux.utf8.out
index 37c6add..f06ae54 100644
--- a/src/test/regress/expected/collate.linux.utf8.out
+++ b/src/test/regress/expected/collate.linux.utf8.out
@@ -461,6 +461,22 @@ SELECT to_char(date '2010-04-01', 'DD TMMON YYYY' COLLATE "tr_TR");
  01 NİS 2010
 (1 row)
 
+-- to_date
+SELECT to_date('01 ŞUB 2010', 'DD TMMON YYYY');
+  to_date
+------------
+ 02-01-2010
+(1 row)
+
+SELECT to_date('01 Şub 2010', 'DD TMMON YYYY');
+  to_date
+------------
+ 02-01-2010
+(1 row)
+
+SELECT to_date('1234567890ab 2010', 'TMMONTH YYYY'); -- fail
+ERROR:  invalid value "1234567890ab" for "MONTH"
+DETAIL:  The given value did not match any of the allowed values for this field.
 -- backwards parsing
 CREATE VIEW collview1 AS SELECT * FROM collate_test1 WHERE b COLLATE "C" >= 'bbc';
 CREATE VIEW collview2 AS SELECT a, b FROM collate_test1 ORDER BY b COLLATE "C";
diff --git a/src/test/regress/sql/collate.linux.utf8.sql b/src/test/regress/sql/collate.linux.utf8.sql
index 8c26f16..cbbd220 100644
--- a/src/test/regress/sql/collate.linux.utf8.sql
+++ b/src/test/regress/sql/collate.linux.utf8.sql
@@ -182,6 +182,12 @@ SELECT to_char(date '2010-02-01', 'DD TMMON YYYY' COLLATE "tr_TR");
 SELECT to_char(date '2010-04-01', 'DD TMMON YYYY');
 SELECT to_char(date '2010-04-01', 'DD TMMON YYYY' COLLATE "tr_TR");
 
+-- to_date
+
+SELECT to_date('01 ŞUB 2010', 'DD TMMON YYYY');
+SELECT to_date('01 Şub 2010', 'DD TMMON YYYY');
+SELECT to_date('1234567890ab 2010', 'TMMONTH YYYY'); -- fail
+
 -- backwards parsing
I wrote:
> * I don't think it's appropriate to hard-wire DEFAULT_COLLATION_OID
> as the collation to do case-folding with.  For to_date/to_timestamp,
> we have collatable text input so we can rely on the function's input
> collation instead, at the cost of having to pass down the collation
> OID through a few layers of subroutines :-(.  For parse_datetime,
> I punted for now and let it use DEFAULT_COLLATION_OID, because that's
> currently only called by JSONB code that probably hasn't got a usable
> input collation anyway (since jsonb isn't considered collatable).

On closer look, it's probably a wise idea to change the signature of
parse_datetime() to include a collation argument, because that function
is new in v13 so there's no API-break argument against it.  It will
never be cheaper to change it than today.  So v11 below does that,
pushing the use of DEFAULT_COLLATION_OID into the json-specific code.
Perhaps somebody else would like to look at whether there's something
brighter for that code to do, but I still suspect there isn't, so I
didn't chase it further.

> Barring objections, this seems committable to me.

Going once, going twice ...

			regards, tom lane

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 28035f1..8b73e05 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -5968,7 +5968,7 @@ SELECT regexp_match('abc01234xyz', '(?:(.*?)(\d+)(.*)){1,1}');
       </row>
       <row>
        <entry><literal>TM</literal> prefix</entry>
-       <entry>translation mode (print localized day and month names based on
+       <entry>translation mode (use localized day and month names based on
        <xref linkend="guc-lc-time"/>)</entry>
        <entry><literal>TMMonth</literal></entry>
       </row>
@@ -5999,9 +5999,20 @@ SELECT regexp_match('abc01234xyz', '(?:(.*?)(\d+)(.*)){1,1}');
 
     <listitem>
      <para>
-      <literal>TM</literal> does not include trailing blanks.
-      <function>to_timestamp</function> and <function>to_date</function> ignore
-      the <literal>TM</literal> modifier.
+      <literal>TM</literal> suppresses trailing blanks whether or
+      not <literal>FM</literal> is specified.
+     </para>
+    </listitem>
+
+    <listitem>
+     <para>
+      <function>to_timestamp</function> and <function>to_date</function>
+      ignore letter case in the input; so for
+      example <literal>MON</literal>, <literal>Mon</literal>,
+      and <literal>mon</literal> all accept the same strings.  When using
+      the <literal>TM</literal> modifier, case-folding is done according to
+      the rules of the function's input collation (see
+      <xref linkend="collation"/>).
      </para>
     </listitem>
 
diff --git a/src/backend/utils/adt/formatting.c b/src/backend/utils/adt/formatting.c
index d029468..95f7d05 100644
--- a/src/backend/utils/adt/formatting.c
+++ b/src/backend/utils/adt/formatting.c
@@ -1038,7 +1038,7 @@ static void parse_format(FormatNode *node, const char *str, const KeyWord *kw,
 static void DCH_to_char(FormatNode *node, bool is_interval,
 						TmToChar *in, char *out, Oid collid);
 static void DCH_from_char(FormatNode *node, const char *in, TmFromChar *out,
-						  bool std, bool *have_error);
+						  Oid collid, bool std, bool *have_error);
 
 #ifdef DEBUG_TO_FROM_CHAR
 static void dump_index(const KeyWord *k, const int *index);
@@ -1057,11 +1057,14 @@ static int	from_char_parse_int_len(int *dest, const char **src, const int len,
 									FormatNode *node, bool *have_error);
 static int	from_char_parse_int(int *dest, const char **src, FormatNode *node,
 								bool *have_error);
-static int	seq_search(const char *name, const char *const *array, int *len);
+static int	seq_search_ascii(const char *name, const char *const *array, int *len);
+static int	seq_search_localized(const char *name, char **array, int *len,
+								 Oid collid);
 static int	from_char_seq_search(int *dest, const char **src,
 								 const char *const *array,
+								 char **localized_array, Oid collid,
 								 FormatNode *node, bool *have_error);
-static void do_to_timestamp(text *date_txt, text *fmt, bool std,
+static void do_to_timestamp(text *date_txt, text *fmt, Oid collid, bool std,
 							struct pg_tm *tm, fsec_t *fsec, int *fprec,
 							uint32 *flags, bool *have_error);
 static char *fill_str(char *str, int c, int max);
@@ -2457,7 +2460,7 @@ from_char_parse_int(int *dest, const char **src, FormatNode *node, bool *have_er
  * suitable for comparisons to ASCII strings.
  */
 static int
-seq_search(const char *name, const char *const *array, int *len)
+seq_search_ascii(const char *name, const char *const *array, int *len)
 {
 	unsigned char firstc;
 	const char *const *a;
@@ -2503,8 +2506,89 @@ seq_search(const char *name, const char *const *array, int *len)
 }
 
 /*
- * Perform a sequential search in 'array' for an entry matching the first
- * character(s) of the 'src' string case-insensitively.
+ * Sequentially search an array of possibly non-English words for
+ * a case-insensitive match to the initial character(s) of "name".
+ *
+ * This has the same API as seq_search_ascii(), but we use a more general
+ * case-folding transformation to achieve case-insensitivity.  Case folding
+ * is done per the rules of the collation identified by "collid".
+ *
+ * The array is treated as const, but we don't declare it that way because
+ * the arrays exported by pg_locale.c aren't const.
+ */
+static int
+seq_search_localized(const char *name, char **array, int *len, Oid collid)
+{
+	char	  **a;
+	char	   *upper_name;
+	char	   *lower_name;
+
+	*len = 0;
+
+	/* empty string can't match anything */
+	if (!*name)
+		return -1;
+
+	/*
+	 * The case-folding processing done below is fairly expensive, so before
+	 * doing that, make a quick pass to see if there is an exact match.
+	 */
+	for (a = array; *a != NULL; a++)
+	{
+		int			element_len = strlen(*a);
+
+		if (strncmp(name, *a, element_len) == 0)
+		{
+			*len = element_len;
+			return a - array;
+		}
+	}
+
+	/*
+	 * Fold to upper case, then to lower case, so that we can match reliably
+	 * even in languages in which case conversions are not injective.
+	 */
+	upper_name = str_toupper(unconstify(char *, name), strlen(name), collid);
+	lower_name = str_tolower(upper_name, strlen(upper_name), collid);
+	pfree(upper_name);
+
+	for (a = array; *a != NULL; a++)
+	{
+		char	   *upper_element;
+		char	   *lower_element;
+		int			element_len;
+
+		/* Likewise upper/lower-case array element */
+		upper_element = str_toupper(*a, strlen(*a), collid);
+		lower_element = str_tolower(upper_element, strlen(upper_element),
+									collid);
+		pfree(upper_element);
+		element_len = strlen(lower_element);
+
+		/* Match? */
+		if (strncmp(lower_name, lower_element, element_len) == 0)
+		{
+			*len = element_len;
+			pfree(lower_element);
+			pfree(lower_name);
+			return a - array;
+		}
+		pfree(lower_element);
+	}
+
+	pfree(lower_name);
+	return -1;
+}
+
+/*
+ * Perform a sequential search in 'array' (or 'localized_array', if that's
+ * not NULL) for an entry matching the first character(s) of the 'src'
+ * string case-insensitively.
+ *
+ * The 'array' is presumed to be English words (all-ASCII), but
+ * if 'localized_array' is supplied, that might be non-English
+ * so we need a more expensive case-folding transformation
+ * (which will follow the rules of the collation 'collid').
  *
  * If a match is found, copy the array index of the match into the integer
  * pointed to by 'dest', advance 'src' to the end of the part of the string
@@ -2518,11 +2602,15 @@ seq_search(const char *name, const char *const *array, int *len)
  */
 static int
 from_char_seq_search(int *dest, const char **src, const char *const *array,
+					 char **localized_array, Oid collid,
 					 FormatNode *node, bool *have_error)
 {
 	int			len;
 
-	*dest = seq_search(*src, array, &len);
+	if (localized_array == NULL)
+		*dest = seq_search_ascii(*src, array, &len);
+	else
+		*dest = seq_search_localized(*src, localized_array, &len, collid);
 
 	if (len <= 0)
 	{
@@ -3147,19 +3235,20 @@ DCH_to_char(FormatNode *node, bool is_interval, TmToChar *in, char *out, Oid collid)
 	*s = '\0';
 }
 
-/* ----------
- * Process a string as denoted by a list of FormatNodes.
+/*
+ * Process the string 'in' as denoted by the array of FormatNodes 'node[]'.
  * The TmFromChar struct pointed to by 'out' is populated with the results.
  *
+ * 'collid' identifies the collation to use, if needed.
+ * 'std' specifies standard parsing mode.
+ * If 'have_error' is NULL, then errors are thrown, else '*have_error' is set.
+ *
  * Note: we currently don't have any to_interval() function, so there
  * is no need here for INVALID_FOR_INTERVAL checks.
- *
- * If 'have_error' is NULL, then errors are thrown, else '*have_error' is set.
- * ----------
  */
 static void
-DCH_from_char(FormatNode *node, const char *in, TmFromChar *out, bool std,
-			  bool *have_error)
+DCH_from_char(FormatNode *node, const char *in, TmFromChar *out,
+			  Oid collid, bool std, bool *have_error)
 {
 	FormatNode *n;
 	const char *s;
@@ -3170,6 +3259,9 @@ DCH_from_char(FormatNode *node, const char *in, TmFromChar *out, bool std,
 	/* number of extra skipped characters (more than given in format string) */
 	int			extra_skip = 0;
 
+	/* cache localized days and months */
+	cache_locale_time();
+
 	for (n = node, s = in; n->type != NODE_TYPE_END && *s != '\0'; n++)
 	{
 		/*
@@ -3271,6 +3363,7 @@ DCH_from_char(FormatNode *node, const char *in, TmFromChar *out, bool std,
 			case DCH_a_m:
 			case DCH_p_m:
 				from_char_seq_search(&value, &s, ampm_strings_long,
+									 NULL, InvalidOid,
 									 n, have_error);
 				CHECK_ERROR;
 				from_char_set_int(&out->pm, value % 2, n, have_error);
@@ -3282,6 +3375,7 @@ DCH_from_char(FormatNode *node, const char *in, TmFromChar *out, bool std,
 			case DCH_am:
 			case DCH_pm:
 				from_char_seq_search(&value, &s, ampm_strings,
+									 NULL, InvalidOid,
 									 n, have_error);
 				CHECK_ERROR;
 				from_char_set_int(&out->pm, value % 2, n, have_error);
@@ -3395,6 +3489,7 @@ DCH_from_char(FormatNode *node, const char *in, TmFromChar *out, bool std,
 			case DCH_a_d:
 			case DCH_b_c:
 				from_char_seq_search(&value, &s, adbc_strings_long,
+									 NULL, InvalidOid,
 									 n, have_error);
 				CHECK_ERROR;
 				from_char_set_int(&out->bc, value % 2, n, have_error);
@@ -3405,6 +3500,7 @@ DCH_from_char(FormatNode *node, const char *in, TmFromChar *out, bool std,
 			case DCH_ad:
 			case DCH_bc:
 				from_char_seq_search(&value, &s, adbc_strings,
+									 NULL, InvalidOid,
 									 n, have_error);
 				CHECK_ERROR;
 				from_char_set_int(&out->bc, value % 2, n, have_error);
@@ -3414,6 +3510,8 @@ DCH_from_char(FormatNode *node, const char *in, TmFromChar *out, bool std,
 			case DCH_Month:
 			case DCH_month:
 				from_char_seq_search(&value, &s, months_full,
+									 S_TM(n->suffix) ? localized_full_months : NULL,
+									 collid,
 									 n, have_error);
 				CHECK_ERROR;
 				from_char_set_int(&out->mm, value + 1, n, have_error);
@@ -3423,6 +3521,8 @@ DCH_from_char(FormatNode *node, const char *in, TmFromChar *out, bool std,
 			case DCH_Mon:
 			case DCH_mon:
 				from_char_seq_search(&value, &s, months,
+									 S_TM(n->suffix) ? localized_abbrev_months : NULL,
+									 collid,
 									 n, have_error);
 				CHECK_ERROR;
 				from_char_set_int(&out->mm, value + 1, n, have_error);
@@ -3437,6 +3537,8 @@ DCH_from_char(FormatNode *node, const char *in, TmFromChar *out, bool std,
 			case DCH_Day:
 			case DCH_day:
 				from_char_seq_search(&value, &s, days,
+									 S_TM(n->suffix) ? localized_full_days : NULL,
+									 collid,
 									 n, have_error);
 				CHECK_ERROR;
 				from_char_set_int(&out->d, value, n, have_error);
@@ -3447,6 +3549,8 @@ DCH_from_char(FormatNode *node, const char *in, TmFromChar *out, bool std,
 			case DCH_Dy:
 			case DCH_dy:
 				from_char_seq_search(&value, &s, days_short,
+									 S_TM(n->suffix) ? localized_abbrev_days : NULL,
+									 collid,
 									 n, have_error);
 				CHECK_ERROR;
 				from_char_set_int(&out->d, value, n, have_error);
@@ -3565,6 +3669,7 @@ DCH_from_char(FormatNode *node, const char *in, TmFromChar *out, bool std,
 			case DCH_RM:
 			case DCH_rm:
 				from_char_seq_search(&value, &s, rm_months_lower,
+									 NULL, InvalidOid,
 									 n, have_error);
 				CHECK_ERROR;
 				from_char_set_int(&out->mm, MONTHS_PER_YEAR - value,
@@ -4031,13 +4136,15 @@ to_timestamp(PG_FUNCTION_ARGS)
 {
 	text	   *date_txt = PG_GETARG_TEXT_PP(0);
 	text	   *fmt = PG_GETARG_TEXT_PP(1);
+	Oid			collid = PG_GET_COLLATION();
 	Timestamp	result;
 	int			tz;
 	struct pg_tm tm;
 	fsec_t		fsec;
 	int			fprec;
 
-	do_to_timestamp(date_txt, fmt, false, &tm, &fsec, &fprec, NULL, NULL);
+	do_to_timestamp(date_txt, fmt, collid, false,
+					&tm, &fsec, &fprec, NULL, NULL);
 
 	/* Use the specified time zone, if any. */
 	if (tm.tm_zone)
@@ -4072,11 +4179,13 @@ to_date(PG_FUNCTION_ARGS)
 {
 	text	   *date_txt = PG_GETARG_TEXT_PP(0);
 	text	   *fmt = PG_GETARG_TEXT_PP(1);
+	Oid			collid = PG_GET_COLLATION();
 	DateADT		result;
 	struct pg_tm tm;
 	fsec_t		fsec;
 
-	do_to_timestamp(date_txt, fmt, false, &tm, &fsec, NULL, NULL, NULL);
+	do_to_timestamp(date_txt, fmt, collid, false,
+					&tm, &fsec, NULL, NULL, NULL);
 
 	/* Prevent overflow in Julian-day routines */
 	if (!IS_VALID_JULIAN(tm.tm_year, tm.tm_mon, tm.tm_mday))
@@ -4098,25 +4207,31 @@ to_date(PG_FUNCTION_ARGS)
 }
 
 /*
- * Convert the 'date_txt' input to a datetime type using argument 'fmt' as a format string.
+ * Convert the 'date_txt' input to a datetime type using argument 'fmt'
+ * as a format string.  The collation 'collid' may be used for case-folding
+ * rules in some cases.  'strict' specifies standard parsing mode.
+ *
  * The actual data type (returned in 'typid', 'typmod') is determined by
  * the presence of date/time/zone components in the format string.
  *
- * When timezone component is present, the corresponding offset is set to '*tz'.
+ * When timezone component is present, the corresponding offset is
+ * returned in '*tz'.
  *
  * If 'have_error' is NULL, then errors are thrown, else '*have_error' is set
  * and zero value is returned.
  */
 Datum
-parse_datetime(text *date_txt, text *fmt, bool strict, Oid *typid,
-			   int32 *typmod, int *tz, bool *have_error)
+parse_datetime(text *date_txt, text *fmt, Oid collid, bool strict,
+			   Oid *typid, int32 *typmod, int *tz,
+			   bool *have_error)
 {
 	struct pg_tm tm;
 	fsec_t		fsec;
 	int			fprec;
 	uint32		flags;
 
-	do_to_timestamp(date_txt, fmt, strict, &tm, &fsec, &fprec, &flags, have_error);
+	do_to_timestamp(date_txt, fmt, collid, strict,
+					&tm, &fsec, &fprec, &flags, have_error);
 	CHECK_ERROR;
 
 	*typmod = fprec ? fprec : -1;	/* fractional part precision */
@@ -4278,21 +4393,21 @@ on_error:
  * Parse the 'date_txt' according to 'fmt', return results as a struct pg_tm,
  * fractional seconds, and fractional precision.
  *
+ * 'collid' identifies the collation to use, if needed.
+ * 'std' specifies standard parsing mode.
+ * Bit mask of date/time/zone components found in 'fmt' is returned in 'flags',
+ * if that is not NULL.
+ * If 'have_error' is NULL, then errors are thrown, else '*have_error' is set.
+ *
 * We parse 'fmt' into a list of FormatNodes, which is then passed to
 * DCH_from_char to populate a TmFromChar with the parsed contents of
 * 'date_txt'.
 *
 * The TmFromChar is then analysed and converted into the final results in
- * struct 'tm' and 'fsec'.
- *
- * Bit mask of date/time/zone components found in 'fmt' is returned in 'flags'.
- *
- * 'std' specifies standard parsing mode.
- *
- * If 'have_error' is NULL, then errors are thrown, else '*have_error' is set.
+ * struct 'tm', 'fsec', and 'fprec'.
 */
 static void
-do_to_timestamp(text *date_txt, text *fmt, bool std,
+do_to_timestamp(text *date_txt, text *fmt, Oid collid, bool std,
 				struct pg_tm *tm, fsec_t *fsec, int *fprec,
 				uint32 *flags, bool *have_error)
 {
@@ -4352,7 +4467,7 @@ do_to_timestamp(text *date_txt, text *fmt, bool std,
 	/* dump_index(DCH_keywords, DCH_index); */
 #endif
 
-	DCH_from_char(format, date_str, &tmfc, std, have_error);
+	DCH_from_char(format, date_str, &tmfc, collid, std, have_error);
 	CHECK_ERROR;
 
 	pfree(fmt_str);
diff --git a/src/backend/utils/adt/jsonpath_exec.c b/src/backend/utils/adt/jsonpath_exec.c
index b6fdd47..1e4dd89 100644
--- a/src/backend/utils/adt/jsonpath_exec.c
+++ b/src/backend/utils/adt/jsonpath_exec.c
@@ -1781,6 +1781,7 @@ executeDateTimeMethod(JsonPathExecContext *cxt, JsonPathItem *jsp,
 	JsonbValue	jbvbuf;
 	Datum		value;
 	text	   *datetime;
+	Oid			collid;
 	Oid			typid;
 	int32		typmod = -1;
 	int			tz = 0;
@@ -1797,6 +1798,13 @@ executeDateTimeMethod(JsonPathExecContext *cxt, JsonPathItem *jsp,
 	datetime = cstring_to_text_with_len(jb->val.string.val,
 										jb->val.string.len);
 
+	/*
+	 * At some point we might wish to have callers supply the collation to
+	 * use, but right now it's unclear that they'd be able to do better than
+	 * DEFAULT_COLLATION_OID anyway.
+	 */
+	collid = DEFAULT_COLLATION_OID;
+
 	if (jsp->content.arg)
 	{
 		text	   *template;
@@ -1814,7 +1822,7 @@ executeDateTimeMethod(JsonPathExecContext *cxt, JsonPathItem *jsp,
 		template = cstring_to_text_with_len(template_str,
 											template_len);
 
-		value = parse_datetime(datetime, template, true,
+		value = parse_datetime(datetime, template, collid, true,
 							   &typid, &typmod, &tz,
 							   jspThrowErrors(cxt) ? NULL : &have_error);
@@ -1858,7 +1866,7 @@ executeDateTimeMethod(JsonPathExecContext *cxt, JsonPathItem *jsp,
 			MemoryContextSwitchTo(oldcxt);
 		}
 
-		value = parse_datetime(datetime, fmt_txt[i], true,
+		value = parse_datetime(datetime, fmt_txt[i], collid, true,
 							   &typid, &typmod, &tz,
 							   &have_error);
diff --git a/src/backend/utils/adt/pg_locale.c b/src/backend/utils/adt/pg_locale.c
index 25fb7e2..64fd3ae 100644
--- a/src/backend/utils/adt/pg_locale.c
+++ b/src/backend/utils/adt/pg_locale.c
@@ -96,11 +96,17 @@ char	   *locale_monetary;
 char	   *locale_numeric;
 char	   *locale_time;
 
-/* lc_time localization cache */
-char	   *localized_abbrev_days[7];
-char	   *localized_full_days[7];
-char	   *localized_abbrev_months[12];
-char	   *localized_full_months[12];
+/*
+ * lc_time localization cache.
+ *
+ * We use only the first 7 or 12 entries of these arrays.  The last array
+ * element is left as NULL for the convenience of outside code that wants
+ * to sequentially scan these arrays.
+ */
+char	   *localized_abbrev_days[7 + 1];
+char	   *localized_full_days[7 + 1];
+char	   *localized_abbrev_months[12 + 1];
+char	   *localized_full_months[12 + 1];
 
 /* indicates whether locale information cache is valid */
 static bool CurrentLocaleConvValid = false;
@@ -922,6 +928,8 @@ cache_locale_time(void)
 		cache_single_string(&localized_full_days[i], bufptr, encoding);
 		bufptr += MAX_L10N_DATA;
 	}
+	localized_abbrev_days[7] = NULL;
+	localized_full_days[7] = NULL;
 
 	/* localized months */
 	for (i = 0; i < 12; i++)
@@ -931,6 +939,8 @@ cache_locale_time(void)
 		cache_single_string(&localized_full_months[i], bufptr, encoding);
 		bufptr += MAX_L10N_DATA;
 	}
+	localized_abbrev_months[12] = NULL;
+	localized_full_months[12] = NULL;
 
 	CurrentLCTimeValid = true;
 }
diff --git a/src/include/utils/formatting.h b/src/include/utils/formatting.h
index 357deb9..8f62aa3 100644
--- a/src/include/utils/formatting.h
+++ b/src/include/utils/formatting.h
@@ -26,7 +26,7 @@ extern char *asc_tolower(const char *buff, size_t nbytes);
 extern char *asc_toupper(const char *buff, size_t nbytes);
 extern char *asc_initcap(const char *buff, size_t nbytes);
 
-extern Datum parse_datetime(text *date_txt, text *fmt, bool std,
+extern Datum parse_datetime(text *date_txt, text *fmt, Oid collid, bool strict,
 							Oid *typid, int32 *typmod, int *tz,
 							bool *have_error);
 
diff --git a/src/test/regress/expected/collate.linux.utf8.out b/src/test/regress/expected/collate.linux.utf8.out
index 37c6add..f06ae54 100644
--- a/src/test/regress/expected/collate.linux.utf8.out
+++ b/src/test/regress/expected/collate.linux.utf8.out
@@ -461,6 +461,22 @@ SELECT to_char(date '2010-04-01', 'DD TMMON YYYY' COLLATE "tr_TR");
  01 NİS 2010
 (1 row)
 
+-- to_date
+SELECT to_date('01 ŞUB 2010', 'DD TMMON YYYY');
+  to_date
+------------
+ 02-01-2010
+(1 row)
+
+SELECT to_date('01 Şub 2010', 'DD TMMON YYYY');
+  to_date
+------------
+ 02-01-2010
+(1 row)
+
+SELECT to_date('1234567890ab 2010', 'TMMONTH YYYY'); -- fail
+ERROR:  invalid value "1234567890ab" for "MONTH"
+DETAIL:  The given value did not match any of the allowed values for this field.
 -- backwards parsing
 CREATE VIEW collview1 AS SELECT * FROM collate_test1 WHERE b COLLATE "C" >= 'bbc';
 CREATE VIEW collview2 AS SELECT a, b FROM collate_test1 ORDER BY b COLLATE "C";
diff --git a/src/test/regress/sql/collate.linux.utf8.sql b/src/test/regress/sql/collate.linux.utf8.sql
index 8c26f16..cbbd220 100644
--- a/src/test/regress/sql/collate.linux.utf8.sql
+++ b/src/test/regress/sql/collate.linux.utf8.sql
@@ -182,6 +182,12 @@ SELECT to_char(date '2010-02-01', 'DD TMMON YYYY' COLLATE "tr_TR");
 SELECT to_char(date '2010-04-01', 'DD TMMON YYYY');
 SELECT to_char(date '2010-04-01', 'DD TMMON YYYY' COLLATE "tr_TR");
 
+-- to_date
+
+SELECT to_date('01 ŞUB 2010', 'DD TMMON YYYY');
+SELECT to_date('01 Şub 2010', 'DD TMMON YYYY');
+SELECT to_date('1234567890ab 2010', 'TMMONTH YYYY'); -- fail
+
 -- backwards parsing
Re: Allow to_date() and to_timestamp() to accept localized names
From
Juan José Santamaría Flecha
Date:
On Sun, Mar 1, 2020 at 11:25 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Barring objections, this seems
> committable to me.
> Going once, going twice ...
You have moved this to a better place, so none from me, and thank you.
Regards,
Juan José Santamaría Flecha
Juan José Santamaría Flecha <juanjo.santamaria@gmail.com> writes:
> On Sun, Mar 1, 2020 at 11:25 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Going once, going twice ...

> You have moved this to a better place, so none from me, and thank you.

Pushed then.  While looking over the patch one more time, I noticed some
pretty poor English in the docs and error messages for the related jsonb
code, so I took it on myself to fix that while at it.

			regards, tom lane
On Tue, Mar 3, 2020 at 11:09 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Juan José Santamaría Flecha <juanjo.santamaria@gmail.com> writes:
> On Sun, Mar 1, 2020 at 11:25 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Going once, going twice ...
> You have moved this to better place, so none from me, and thank you.
Pushed then.
On master with a clean build (and configure re-run) and a fresh initdb, I'm seeing the collate.linux.utf8 test fail with the attached diff.
Is there something I need to reconfigure, install on my machine (elementary os, so based on Ubuntu) for this to pass?
Thanks,
James
Attachment
James Coleman <jtc331@gmail.com> writes:
> On master with a clean build (and configure re-run) and a fresh initdb,
> I'm seeing the collate.linux.utf8 test fail with the attached diff.

 -- to_char
 SET lc_time TO 'tr_TR';
+ERROR:  invalid value for parameter "lc_time": "tr_TR"
 SELECT to_char(date '2010-02-01', 'DD TMMON YYYY');

Looks like you may not have Turkish locale installed?  Try

	locale -a | grep tr_TR

If you don't see "tr_TR.utf8" or some variant spelling of that, the
collate.linux.utf8 test is not gonna pass.  The required package is
probably some sub-package of glibc.

A workaround if you don't want to install more stuff is to run the
regression tests in C locale, so that that test script gets skipped.

			regards, tom lane
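The same check can be made from a script: the localized month names that PostgreSQL's cache_locale_time() reads come from the C library's LC_TIME data, which Python exposes through locale.nl_langinfo() on Unix-like systems. A small sketch (the locale names used here are assumptions about what a given machine has installed; nl_langinfo is unavailable on Windows):

```python
import locale

def localized_february(loc):
    """Return the LC_TIME full name of February for the given locale,
    or None if that locale is not installed on this system."""
    try:
        locale.setlocale(locale.LC_TIME, loc)
    except locale.Error:
        return None
    try:
        return locale.nl_langinfo(locale.MON_2)
    finally:
        # restore a known state so later code isn't affected
        locale.setlocale(locale.LC_TIME, "C")

print(localized_february("C"))            # the C locale always exists -> February
print(localized_february("tr_TR.UTF-8"))  # Şubat if installed, else None
```

If the second call returns None, the Turkish locale data is missing and the collate.linux.utf8 test will fail exactly as described above.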
On Sat, Mar 7, 2020 at 9:31 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
James Coleman <jtc331@gmail.com> writes:
> On master with a clean build (and configure re-run) and a fresh init-db,
> I'm seeing the collate.linux.utf8 test fail with the attached diff.
-- to_char
SET lc_time TO 'tr_TR';
+ERROR: invalid value for parameter "lc_time": "tr_TR"
SELECT to_char(date '2010-02-01', 'DD TMMON YYYY');
Looks like you may not have Turkish locale installed? Try
locale -a | grep tr_TR
If you don't see "tr_TR.utf8" or some variant spelling of that,
the collate.linux.utf8 test is not gonna pass. The required
package is probably some sub-package of glibc.
A workaround if you don't want to install more stuff is to run the
regression tests in C locale, so that that test script gets skipped.
regards, tom lane
Hmm, when I grep the locales I see `tr_TR.utf8` in the output. I assume the utf8 version is acceptable? Or is there a non-utf8 variant?
Thanks,
James
James Coleman <jtc331@gmail.com> writes:
> On Sat, Mar 7, 2020 at 9:31 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Looks like you may not have Turkish locale installed?  Try
>> 	locale -a | grep tr_TR

> Hmm, when I grep the locales I see `tr_TR.utf8` in the output. I assume the
> utf8 version is acceptable? Or is there a non-utf8 variant?

Hmm ... I'm far from an expert on the packaging of locale data, but the
simplest explanation I can think of is that the tr_TR locale exists to
some extent on your machine but the LC_TIME component of that is missing.

Do you get different results from "date" depending on the locale?
I get

$ LANG=C date
Sat Mar  7 21:44:24 EST 2020
$ LANG=tr_TR.utf8 date
Cts Mar  7 21:44:26 EST 2020

on my Fedora 30 box.

Another possibility perhaps is that you have partial locale settings in
your environment that are bollixing the test.  Try

$ env | grep ^LANG
$ env | grep ^LC_

If there's more than one relevant environment setting, and they don't all
agree, I'm not sure what would happen with our regression tests.

BTW, what platform are you using anyway?

			regards, tom lane
Re: Allow to_date() and to_timestamp() to accept localized names
From
Juan José Santamaría Flecha
Date:
On Sun, Mar 8, 2020 at 3:48 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
James Coleman <jtc331@gmail.com> writes:
> On Sat, Mar 7, 2020 at 9:31 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Looks like you may not have Turkish locale installed? Try
>> locale -a | grep tr_TR
> Hmm, when I grep the locales I see `tr_TR.utf8` in the output. I assume the
> utf8 version is acceptable? Or is there a non-utf8 variant?
Hmm ... I'm far from an expert on the packaging of locale data, but
the simplest explanation I can think of is that the tr_TR locale exists
to some extent on your machine but the LC_TIME component of that is
missing.
AFAICS, the locale 'tr_TR' uses the encoding ISO-8859-9 (LATIN5), which is not the same as 'tr_TR.utf8'.
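The encoding difference is easy to see by encoding a Turkish month name both ways (a small illustration; the byte values follow the standard ISO-8859-9 and UTF-8 tables):

```python
name = "Şubat"  # Turkish for February

latin5 = name.encode("iso-8859-9")  # single-byte Latin-5 encoding
utf8 = name.encode("utf-8")         # multi-byte UTF-8 encoding

print(latin5)  # b'\xdeubat'    -- 'Ş' is the single byte 0xDE in Latin-5
print(utf8)    # b'\xc5\x9eubat' -- 'Ş' is the two bytes 0xC5 0x9E in UTF-8

# Both byte strings decode back to the same text, but they are not
# interchangeable on the wire:
assert latin5.decode("iso-8859-9") == utf8.decode("utf-8") == name
```

So 'tr_TR' and 'tr_TR.utf8' supply differently-encoded LC_TIME data, and only the latter matches what a UTF-8 database (and the collate.linux.utf8 test) expects.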
BTW, what platform are you using anyway?
I have just checked on a Debian Stretch box.
Regards,
Juan José Santamaría Flecha
On Sat, Mar 7, 2020 at 9:48 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
James Coleman <jtc331@gmail.com> writes:
> On Sat, Mar 7, 2020 at 9:31 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Looks like you may not have Turkish locale installed? Try
>> locale -a | grep tr_TR
> Hmm, when I grep the locales I see `tr_TR.utf8` in the output. I assume the
> utf8 version is acceptable? Or is there a non-utf8 variant?
Hmm ... I'm far from an expert on the packaging of locale data, but
the simplest explanation I can think of is that the tr_TR locale exists
to some extent on your machine but the LC_TIME component of that is
missing.
Do you get different results from "date" depending on the locale?
I get
$ LANG=C date
Sat Mar 7 21:44:24 EST 2020
$ LANG=tr_TR.utf8 date
Cts Mar 7 21:44:26 EST 2020
$ LANG=C date
Sun Mar 8 09:27:50 EDT 2020
$ LANG=tr_TR.utf8 date
Paz Mar 8 09:27:54 EDT 2020
$ LANG=tr_TR date
Sun Mar 8 09:27:57 EDT 2020
I'm curious if you get something different for that last one (no utf8 qualifier).
on my Fedora 30 box.
Another possibility perhaps is that you have partial locale settings
in your environment that are bollixing the test. Try
$ env | grep ^LANG
This gives me:
LANG=en_US.UTF-8
LANGUAGE=en_US
$ env | grep ^LC_
Nothing for this.
If there's more than one relevant environment setting, and they
don't all agree, I'm not sure what would happen with our
regression tests.
There's LANG and LANGUAGE but they look like they effectively agree to me?
BTW, what platform are you using anyway?
ElementaryOS 5.1 Hera, which is built on top of Ubuntu 18.04.3 LTS (and Linux 4.15.0-72-generic).
This suggests I have the standard Ubuntu Turkish language packages installed:
$ dpkg -l | grep language-pack-tr
ii language-pack-tr 1:18.04+20180712 all translation updates for language Turkish
ii language-pack-tr-base 1:18.04+20180712 all translations for language Turkish
James
On Sun, Mar 8, 2020 at 7:17 AM Juan José Santamaría Flecha <juanjo.santamaria@gmail.com> wrote:
On Sun, Mar 8, 2020 at 3:48 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
James Coleman <jtc331@gmail.com> writes:
> On Sat, Mar 7, 2020 at 9:31 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Looks like you may not have Turkish locale installed? Try
>> locale -a | grep tr_TR
> Hmm, when I grep the locales I see `tr_TR.utf8` in the output. I assume the
> utf8 version is acceptable? Or is there a non-utf8 variant?
Hmm ... I'm far from an expert on the packaging of locale data, but
the simplest explanation I can think of is that the tr_TR locale exists
to some extent on your machine but the LC_TIME component of that is
missing.
AFAICS, the locale 'tr_TR' uses the encoding ISO-8859-9 (LATIN5), which is not the same as 'tr_TR.utf8'.
The test name implies it's about utf8, though, which makes me wonder if the test should be testing utf8 instead?
That being said, a bit more googling based on your note about the proper ISO encoding turned up this page: https://unix.stackexchange.com/a/446762
And I confirmed that the locale you mentioned is available:
$ grep "tr_TR" /usr/share/i18n/SUPPORTED
tr_TR.UTF-8 UTF-8
tr_TR ISO-8859-9
So I tried:
echo tr_TR.ISO-8859-9 >> /var/lib/locales/supported.d/local # In a root session
sudo dpkg-reconfigure locales
That didn't seem to fix it, though; `locale -a` still only lists tr_TR.utf8, so I'm still at a loss, and also unclear why a test named utf8 is actually relying on an ISO encoding.
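(As an aside, a locale-independent way to check which tr_TR spellings a machine will actually accept for LC_TIME is to probe with setlocale and restore afterwards. This is a hypothetical debugging helper sketched in Python, not part of the regression suite; the names are my own.)

```python
import locale

def probe(name, category=locale.LC_TIME):
    """Return True if setlocale() accepts `name` for the given
    category on this machine; always restores the previous locale."""
    old = locale.setlocale(category)
    try:
        locale.setlocale(category, name)
        return True
    except locale.Error:
        return False
    finally:
        locale.setlocale(category, old)

# Probe the spellings discussed in this thread; output depends on
# which locales are actually installed on the machine.
for name in ("tr_TR", "tr_TR.utf8", "tr_TR.ISO-8859-9"):
    print(name, probe(name))
```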
James
On Sun, Mar 8, 2020 at 10:42 AM James Coleman <jtc331@gmail.com> wrote:
On Sun, Mar 8, 2020 at 7:17 AM Juan José Santamaría Flecha <juanjo.santamaria@gmail.com> wrote:
On Sun, Mar 8, 2020 at 3:48 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
James Coleman <jtc331@gmail.com> writes:
> On Sat, Mar 7, 2020 at 9:31 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Looks like you may not have Turkish locale installed? Try
>> locale -a | grep tr_TR
> Hmm, when I grep the locales I see `tr_TR.utf8` in the output. I assume the
> utf8 version is acceptable? Or is there a non-utf8 variant?
Hmm ... I'm far from an expert on the packaging of locale data, but
the simplest explanation I can think of is that the tr_TR locale exists
to some extent on your machine but the LC_TIME component of that is
missing.
AFAICS, the locale 'tr_TR' uses the encoding ISO-8859-9 (LATIN5), which is not the same as 'tr_TR.utf8'.
The test name implies it's about utf8, though, which makes me wonder if the test should be testing utf8 instead?
That being said, a bit more googling based on your note about the proper ISO encoding turned up this page: https://unix.stackexchange.com/a/446762
And I confirmed that the locale you mentioned is available:
$ grep "tr_TR" /usr/share/i18n/SUPPORTED
tr_TR.UTF-8 UTF-8
tr_TR ISO-8859-9
So I tried:
echo tr_TR.ISO-8859-9 >> /var/lib/locales/supported.d/local # In a root session
sudo dpkg-reconfigure locales
That didn't seem to fix it, though; `locale -a` still only lists tr_TR.utf8, so I'm still at a loss, and also unclear why a test named utf8 is actually relying on an ISO encoding.
Another update:
Since sudo dpkg-reconfigure locales opens up an ncurses gui on my machine, I tried selecting the tr_TR.ISO-8859-9 option there and removed the /var/lib/locales/supported.d/local file. Now I get:
$ locale -a | grep tr_TR
tr_TR
tr_TR.iso88599
tr_TR.utf8
And now `make check` passes.
I'm still interested in understanding why we're using the ISO locale instead of the utf8 one in a utf8-labeled test though.
Thanks,
James
James Coleman <jtc331@gmail.com> writes:
> Since sudo dpkg-reconfigure locales opens up an ncurses gui on my machine,
> I tried selecting the tr_TR.ISO-8859-9 option there and removed the
> /var/lib/locales/supported.d/local file. Now I get:
> $ locale -a | grep tr_TR
> tr_TR
> tr_TR.iso88599
> tr_TR.utf8
> And now `make check` passes.
Hm.
> I'm still interested in understanding why we're using the ISO locale
> instead of the utf8 one in a utf8-labeled test though.
We are not. My understanding of the rules about this is that the
active LC_CTYPE setting determines the encoding that libc uses,
period. The encoding suffix on the locale name only makes a
difference when LC_CTYPE is being specified (or derived from LANG or
LC_ALL), not any other LC_XXX setting --- although for consistency
they'll let you include it in any LC_XXX value.
We could of course choose to spell LC_TIME as 'tr_TR.utf8' in this
test, but that would open up the question of whether that causes
problems on platforms where the canonical spelling of the encoding
suffix is different (eg "UTF-8"). I'm disinclined to take that risk
without positive proof that it helps on some platform.
My guess about what was actually happening on your machine is that the
underlying locale data only exists in iso-8859-9 form, and that glibc
normally translates that to utf8 on-the-fly when it's demanded ...
but if the data isn't installed at all, nothing happens. On my Red Hat
platforms, this installation choice seems pretty all-or-nothing, but
it sounds like Debian lets you chop it up more finely (and maybe slice
off your finger while at it :-().
regards, tom lane
I wrote:
> James Coleman <jtc331@gmail.com> writes:
>> I'm still interested in understanding why we're using the ISO locale
>> instead of the utf8 one in a utf8-labeled test though.
> We are not. My understanding of the rules about this is that the
> active LC_CTYPE setting determines the encoding that libc uses,
> period. The encoding suffix on the locale name only makes a
> difference when LC_CTYPE is being specified (or derived from LANG or
> LC_ALL), not any other LC_XXX setting --- although for consistency
> they'll let you include it in any LC_XXX value.
Oh wait --- I'm wrong about that. Looking at the code in pg_locale.c,
what actually happens is that we get data in the codeset implied by
the LC_TIME setting and then translate it to the database encoding
(cf commit 7ad1cd31b). So if bare "tr_TR" is taken as implying
iso-8859-9, which seems likely (it appears to work that way here,
anyway) then this test is exercising the codeset translation path.
We could change the test to say 'tr_TR.utf8' but that would give us
less test coverage.
regards, tom lane
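(The codeset-translation path described here can be illustrated outside PostgreSQL. The sketch below is Python, not the actual pg_locale.c code: month-name bytes come back from libc in the locale's codeset, here ISO-8859-9, and must be recoded into the database encoding, here UTF-8, before they can be compared.)

```python
# Sketch of the idea behind the LC_TIME codeset translation
# (cf commit 7ad1cd31b): strings arrive from libc in the locale's
# codeset and are recoded into the database encoding.

# "Şubat" (Turkish for "February") as libc would return it under a
# tr_TR locale whose codeset is ISO-8859-9: byte 0xDE is 'Ş' there.
raw_from_libc = b"\xdeubat"

# Step 1: interpret the bytes according to the locale's codeset.
decoded = raw_from_libc.decode("iso-8859-9")

# Step 2: re-encode into the database encoding (UTF-8), where the
# same character takes two bytes.
utf8_form = decoded.encode("utf-8")

print(decoded)    # Şubat
print(utf8_form)  # b'\xc5\x9eubat'
```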
On Sun, Mar 8, 2020 at 2:19 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
I wrote:
> James Coleman <jtc331@gmail.com> writes:
>> I'm still interested in understanding why we're using the ISO locale
>> instead of the utf8 one in a utf8-labeled test though.
> We are not. My understanding of the rules about this is that the
> active LC_CTYPE setting determines the encoding that libc uses,
> period. The encoding suffix on the locale name only makes a
> difference when LC_CTYPE is being specified (or derived from LANG or
> LC_ALL), not any other LC_XXX setting --- although for consistency
> they'll let you include it in any LC_XXX value.
Oh wait --- I'm wrong about that. Looking at the code in pg_locale.c,
what actually happens is that we get data in the codeset implied by
the LC_TIME setting and then translate it to the database encoding
(cf commit 7ad1cd31b). So if bare "tr_TR" is taken as implying
iso-8859-9, which seems likely (it appears to work that way here,
anyway) then this test is exercising the codeset translation path.
We could change the test to say 'tr_TR.utf8' but that would give us
less test coverage.
So just to confirm I understand, that implies that the issue is solely that only the utf8 tr_TR set is installed by default on this machine, and the iso-8859-9 set is a hard requirement (that is, the test is explicitly testing a codepath that generates utf8 results from a non-utf8 source)?
If so, I'm going to try a bare Ubuntu install on a VM and see what locales are installed by default for Turkish.
If in fact Ubuntu doesn't install this locale by default, then is this a caveat we should add to developer docs somewhere? It seems odd to me that I'd be the only one encountering it, but OTOH I would have thought this a fairly vanilla install too...
James
James Coleman <jtc331@gmail.com> writes:
> So just to confirm I understand, that implies that the issue is solely that
> only the utf8 tr_TR set is installed by default on this machine, and the
> iso-8859-9 set is a hard requirement (that is, the test is explicitly
> testing a codepath that generates utf8 results from a non-utf8 source)?
It's not "explicitly" testing that; the fact that "tr_TR" is treated
as "tr_TR.iso88599" is surely an implementation artifact. (On macOS,
some experimentation shows that "tr_TR" is treated as "tr_TR.UTF-8".)
But yeah, I think it's intentional that we want the codeset translation
path to be exercised here on at least some platforms.
> If in fact Ubuntu doesn't install this locale by default, then is this a
> caveat we should add to developer docs somewhere? It seems odd to me that
> I'd be the only one encountering it, but OTOH I would have thought this a
> fairly vanilla install too...
Not sure. The lack of prior complaints points to this not being a
common situation. It does seem weird that they'd set things up so
that "tr_TR.utf8" exists but not "tr_TR"; even if that's not an
outright packaging mistake, it seems like a POLA violation from here.
regards, tom lane
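(The point that the codeset implied by a bare "tr_TR" is an implementation artifact shows up in other software too. As an analogy only, and not PostgreSQL behavior, Python's locale module carries its own alias table that fills in a default codeset for a bare locale name:)

```python
import locale

# locale.normalize() expands a bare locale name using Python's
# builtin alias table, supplying that table's default codeset; the
# result it picks for bare "tr_TR" need not match what libc picks.
for name in ("tr_TR", "tr_TR.utf8"):
    print(name, "->", locale.normalize(name))
```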
On Sun, Mar 8, 2020 at 6:03 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
James Coleman <jtc331@gmail.com> writes:
> So just to confirm I understand, that implies that the issue is solely that
> only the utf8 tr_TR set is installed by default on this machine, and the
> iso-8859-9 set is a hard requirement (that is, the test is explicitly
> testing a codepath that generates utf8 results from a non-utf8 source)?
It's not "explicitly" testing that; the fact that "tr_TR" is treated
as "tr_TR.iso88599" is surely an implementation artifact. (On macOS,
some experimentation shows that "tr_TR" is treated as "tr_TR.UTF-8".)
But yeah, I think it's intentional that we want the codeset translation
path to be exercised here on at least some platforms.
I idly wonder if there could/should be some tests for the implicit case, some explicitly testing the codeset translation (if possible), and some testing the explicit utf8 case...but I don't know enough about this area to push for anything.
> If in fact Ubuntu doesn't install this locale by default, then is this a
> caveat we should add to developer docs somewhere? It seems odd to me that
> I'd be the only one encountering it, but OTOH I would have thought this a
> fairly vanilla install too...
Not sure. The lack of prior complaints points to this not being a
common situation. It does seem weird that they'd set things up so
that "tr_TR.utf8" exists but not "tr_TR"; even if that's not an
outright packaging mistake, it seems like a POLA violation from here.
regards, tom lane
All right. I downloaded an Ubuntu 18.04.3 server VM from OSBoxes, and it had very few locales installed by default...so that wasn't all that helpful.
I think at this point I'll just leave this as a mystery, much as I hate that.
James