Re: Allow to_date() and to_timestamp() to accept localized names - Mailing list pgsql-hackers
From | Tom Lane |
---|---|
Subject | Re: Allow to_date() and to_timestamp() to accept localized names |
Date | |
Msg-id | 29432.1579731087@sss.pgh.pa.us Whole thread Raw |
In response to | Re: Allow to_date() and to_timestamp() to accept localized names (Tom Lane <tgl@sss.pgh.pa.us>) |
Responses |
Re: Allow to_date() and to_timestamp() to accept localized names
|
List | pgsql-hackers |
I wrote: > One thing I completely don't understand is why it's sufficient for > seq_search_localized to do "initcap" semantics. Shouldn't it cover > the all-upper and all-lower cases as well, as the existing seq_search > code does? (That is, why is it okay to ignore the "type" argument?) I took a closer look at this and decided you were probably doing that just because the seq_search code uses initcap-style case folding to match month and day names, relying on the assumption that the constant arrays it's passed are cased that way. But we shouldn't (and the patch doesn't) assume that the localized names we'll get from pg_locale.c are cased that way. However ... it seems to me that the way seq_search is defined is plenty bogus. In the first place, it scribbles on its source string, which is surely not okay. You can see that in error messages that are thrown on match failures: regression=# select to_date('3 JANUZRY 1999', 'DD MONTH YYYY'); ERROR: invalid value "JanuzRY 1" for "MONTH" DETAIL: The given value did not match any of the allowed values for this field. Now, maybe somebody out there thinks this is a cute way of highlighting how much of the string we examined, but it looks more like a bug from where I sit. It's mere luck that what's being clobbered is a local string copy created by do_to_timestamp(), and not the original input data passed to to_date(). In the second place, there's really no need at all for seq_search to rely on any assumptions about the case state of the array it's searching. pg_toupper() is pretty cheap already, and we can make it guaranteed cheap if we use pg_ascii_toupper() instead. So I think we ought to just remove the "type" argument altogether and have seq_search dynamically convert all the strings it examines to upper case (or lower case would work as well, at least for English). > On the other hand, you probably *should* be ignoring the "max" > argument, because AFAICS the values that are passed for that are > specific to the English ASCII spellings of the day and month names. > It seems very likely that they could be too small for some sets of > non-English names. Closer examination shows that the "max" argument is pretty bogus as well. It doesn't do anything except confuse the reader, because there are no cases where the value passed is less than the maximum array entry length, so it never acts to change seq_search's behavior. So we should just drop that behavior from seq_search, too, and redefine "max" as having no purpose except to specify how much of the string to show in error messages. There's still a question of what that should be for non-English cases, but at least we now have a clear idea of what we need the value to do. I also noted while I was looking at it that from_char_seq_search() would truncate at "max" bytes even when dealing with multibyte input. That has a clear risk of generating an invalidly-encoded error message. The 0001 patch attached cleans up all the above complaints. I felt that given the business about scribbling on strings we shouldn't, it would also be wise to const-ify the string pointers if possible. That seems not too painful, per 0002 attached. I'm tempted to go a bit further than 0001 does, and remove the 'max' argument from from_char_seq_search() altogether. Since we only need 'max' in error cases, which don't need to be super-quick, we could drop the requirement for the caller to specify that and instead compute it when we do need it as the largest of the constant array's string lengths. That'd carry over into the localized-names case in a reasonably straightforward way (though we might need to count characters not bytes for that to work nicely). Because of the risk of incorrectly-encoded error messages, I'm rather tempted to claim that these patches (or at least 0001) are a bug fix and should be back-patched. In any case I think we should apply this to HEAD and then rebase the localized-names patch over it. It makes a whole lot more sense, IMO, for the localized-names comparison logic to match what this is doing than what seq_search does today. Comments? regards, tom lane diff --git a/src/backend/utils/adt/formatting.c b/src/backend/utils/adt/formatting.c index ca3c48d..3ef10dc 100644 --- a/src/backend/utils/adt/formatting.c +++ b/src/backend/utils/adt/formatting.c @@ -317,10 +317,6 @@ static const char *const numth[] = {"st", "nd", "rd", "th", NULL}; * Flags & Options: * ---------- */ -#define ONE_UPPER 1 /* Name */ -#define ALL_UPPER 2 /* NAME */ -#define ALL_LOWER 3 /* name */ - #define MAX_MONTH_LEN 9 #define MAX_MON_LEN 3 #define MAX_DAY_LEN 9 @@ -1068,9 +1064,9 @@ static int from_char_parse_int_len(int *dest, char **src, const int len, FormatNode *node, bool *have_error); static int from_char_parse_int(int *dest, char **src, FormatNode *node, bool *have_error); -static int seq_search(char *name, const char *const *array, int type, int max, int *len); +static int seq_search(const char *name, const char *const *array, int *len); static int from_char_seq_search(int *dest, char **src, - const char *const *array, int type, int max, + const char *const *array, int max, FormatNode *node, bool *have_error); static void do_to_timestamp(text *date_txt, text *fmt, bool std, struct pg_tm *tm, fsec_t *fsec, int *fprec, @@ -2454,70 +2450,57 @@ from_char_parse_int(int *dest, char **src, FormatNode *node, bool *have_error) } /* ---------- - * Sequential search with to upper/lower conversion + * Sequentially search null-terminated "array" for a case-insensitive match + * to the initial character(s) of "name". + * + * Returns array index of match, or -1 for no match. + * + * *len is set to the length of the match, or 0 for no match. + * + * Case-insensitivity is defined per pg_ascii_toupper, so this is only + * suitable for comparisons to ASCII strings. * ---------- */ static int -seq_search(char *name, const char *const *array, int type, int max, int *len) +seq_search(const char *name, const char *const *array, int *len) { - const char *p; + unsigned char firstc; const char *const *a; - char *n; - int last, - i; *len = 0; + /* empty string can't match anything */ if (!*name) return -1; - /* set first char */ - if (type == ONE_UPPER || type == ALL_UPPER) - *name = pg_toupper((unsigned char) *name); - else if (type == ALL_LOWER) - *name = pg_tolower((unsigned char) *name); + /* we handle first char specially to gain some speed */ + firstc = pg_ascii_toupper((unsigned char) *name); - for (last = 0, a = array; *a != NULL; a++) + for (a = array; *a != NULL; a++) { + int i; + const char *p; + const char *n; + /* compare first chars */ - if (*name != **a) + if (pg_ascii_toupper((unsigned char) **a) != firstc) continue; for (i = 1, p = *a + 1, n = name + 1;; n++, p++, i++) { - /* search fragment (max) only */ - if (max && i == max) - { - *len = i; - return a - array; - } - /* full size */ + /* return success if we matched whole array entry */ if (*p == '\0') { *len = i; return a - array; } - /* Not found in array 'a' */ + /* else, must have another character in "name" ... */ if (*n == '\0') break; - /* - * Convert (but convert new chars only) - */ - if (i > last) - { - if (type == ONE_UPPER || type == ALL_LOWER) - *n = pg_tolower((unsigned char) *n); - else if (type == ALL_UPPER) - *n = pg_toupper((unsigned char) *n); - last = i; - } - -#ifdef DEBUG_TO_FROM_CHAR - elog(DEBUG_elog_output, "N: %c, P: %c, A: %s (%s)", - *n, *p, *a, name); -#endif - if (*n != *p) + /* ... and it must match */ + if (pg_ascii_toupper((unsigned char) *p) != + pg_ascii_toupper((unsigned char) *n)) break; } } @@ -2526,8 +2509,8 @@ seq_search(char *name, const char *const *array, int type, int max, int *len) } /* - * Perform a sequential search in 'array' for text matching the first 'max' - * characters of the source string. + * Perform a sequential search in 'array' for an entry matching the first + * character(s) of the 'src' string case-insensitively. * * If a match is found, copy the array index of the match into the integer * pointed to by 'dest', advance 'src' to the end of the part of the string @@ -2535,19 +2518,28 @@ seq_search(char *name, const char *const *array, int type, int max, int *len) * * If the string doesn't match, throw an error if 'have_error' is NULL, * otherwise set '*have_error' and return -1. + * + * 'max' and 'node' are used only for error reports: 'max' is the max + * number of bytes of the string to include in the error report, and + * node->key->name identifies what we were searching for. */ static int -from_char_seq_search(int *dest, char **src, const char *const *array, int type, +from_char_seq_search(int *dest, char **src, const char *const *array, int max, FormatNode *node, bool *have_error) { int len; - *dest = seq_search(*src, array, type, max, &len); + /* Assert that no call gives us a 'max' too large for the error buffer */ + Assert(max <= DCH_MAX_ITEM_SIZ); + + *dest = seq_search(*src, array, &len); + if (len <= 0) { char copy[DCH_MAX_ITEM_SIZ + 1]; - Assert(max <= DCH_MAX_ITEM_SIZ); + /* Use multibyte-aware truncation to avoid generating a bogus string */ + max = pg_mbcliplen(*src, strlen(*src), max); strlcpy(copy, *src, max + 1); RETURN_ERROR(ereport(ERROR, @@ -3279,7 +3271,7 @@ DCH_from_char(FormatNode *node, char *in, TmFromChar *out, bool std, case DCH_a_m: case DCH_p_m: from_char_seq_search(&value, &s, ampm_strings_long, - ALL_UPPER, n->key->len, n, have_error); + n->key->len, n, have_error); CHECK_ERROR; from_char_set_int(&out->pm, value % 2, n, have_error); CHECK_ERROR; @@ -3290,7 +3282,7 @@ DCH_from_char(FormatNode *node, char *in, TmFromChar *out, bool std, case DCH_am: case DCH_pm: from_char_seq_search(&value, &s, ampm_strings, - ALL_UPPER, n->key->len, n, have_error); + n->key->len, n, have_error); CHECK_ERROR; from_char_set_int(&out->pm, value % 2, n, have_error); CHECK_ERROR; @@ -3403,7 +3395,7 @@ DCH_from_char(FormatNode *node, char *in, TmFromChar *out, bool std, case DCH_a_d: case DCH_b_c: from_char_seq_search(&value, &s, adbc_strings_long, - ALL_UPPER, n->key->len, n, have_error); + n->key->len, n, have_error); CHECK_ERROR; from_char_set_int(&out->bc, value % 2, n, have_error); CHECK_ERROR; @@ -3413,7 +3405,7 @@ DCH_from_char(FormatNode *node, char *in, TmFromChar *out, bool std, case DCH_ad: case DCH_bc: from_char_seq_search(&value, &s, adbc_strings, - ALL_UPPER, n->key->len, n, have_error); + n->key->len, n, have_error); CHECK_ERROR; from_char_set_int(&out->bc, value % 2, n, have_error); CHECK_ERROR; @@ -3421,7 +3413,7 @@ DCH_from_char(FormatNode *node, char *in, TmFromChar *out, bool std, case DCH_MONTH: case DCH_Month: case DCH_month: - from_char_seq_search(&value, &s, months_full, ONE_UPPER, + from_char_seq_search(&value, &s, months_full, MAX_MONTH_LEN, n, have_error); CHECK_ERROR; from_char_set_int(&out->mm, value + 1, n, have_error); @@ -3430,7 +3422,7 @@ DCH_from_char(FormatNode *node, char *in, TmFromChar *out, bool std, case DCH_MON: case DCH_Mon: case DCH_mon: - from_char_seq_search(&value, &s, months, ONE_UPPER, + from_char_seq_search(&value, &s, months, MAX_MON_LEN, n, have_error); CHECK_ERROR; from_char_set_int(&out->mm, value + 1, n, have_error); @@ -3444,7 +3436,7 @@ DCH_from_char(FormatNode *node, char *in, TmFromChar *out, bool std, case DCH_DAY: case DCH_Day: case DCH_day: - from_char_seq_search(&value, &s, days, ONE_UPPER, + from_char_seq_search(&value, &s, days, MAX_DAY_LEN, n, have_error); CHECK_ERROR; from_char_set_int(&out->d, value, n, have_error); @@ -3454,7 +3446,7 @@ DCH_from_char(FormatNode *node, char *in, TmFromChar *out, bool std, case DCH_DY: case DCH_Dy: case DCH_dy: - from_char_seq_search(&value, &s, days, ONE_UPPER, + from_char_seq_search(&value, &s, days, MAX_DY_LEN, n, have_error); CHECK_ERROR; from_char_set_int(&out->d, value, n, have_error); @@ -3572,7 +3564,7 @@ DCH_from_char(FormatNode *node, char *in, TmFromChar *out, bool std, break; case DCH_RM: from_char_seq_search(&value, &s, rm_months_upper, - ALL_UPPER, MAX_RM_LEN, n, have_error); + MAX_RM_LEN, n, have_error); CHECK_ERROR; from_char_set_int(&out->mm, MONTHS_PER_YEAR - value, n, have_error); @@ -3580,7 +3572,7 @@ DCH_from_char(FormatNode *node, char *in, TmFromChar *out, bool std, break; case DCH_rm: from_char_seq_search(&value, &s, rm_months_lower, - ALL_LOWER, MAX_RM_LEN, n, have_error); + MAX_RM_LEN, n, have_error); CHECK_ERROR; from_char_set_int(&out->mm, MONTHS_PER_YEAR - value, n, have_error); diff --git a/src/backend/utils/adt/formatting.c b/src/backend/utils/adt/formatting.c index 3ef10dc..ef21b7e 100644 --- a/src/backend/utils/adt/formatting.c +++ b/src/backend/utils/adt/formatting.c @@ -1044,7 +1044,7 @@ static void parse_format(FormatNode *node, const char *str, const KeyWord *kw, static void DCH_to_char(FormatNode *node, bool is_interval, TmToChar *in, char *out, Oid collid); -static void DCH_from_char(FormatNode *node, char *in, TmFromChar *out, +static void DCH_from_char(FormatNode *node, const char *in, TmFromChar *out, bool std, bool *have_error); #ifdef DEBUG_TO_FROM_CHAR @@ -1055,17 +1055,17 @@ static void dump_node(FormatNode *node, int max); static const char *get_th(char *num, int type); static char *str_numth(char *dest, char *num, int type); static int adjust_partial_year_to_2020(int year); -static int strspace_len(char *str); +static int strspace_len(const char *str); static void from_char_set_mode(TmFromChar *tmfc, const FromCharDateMode mode, bool *have_error); static void from_char_set_int(int *dest, const int value, const FormatNode *node, bool *have_error); -static int from_char_parse_int_len(int *dest, char **src, const int len, +static int from_char_parse_int_len(int *dest, const char **src, const int len, FormatNode *node, bool *have_error); -static int from_char_parse_int(int *dest, char **src, FormatNode *node, +static int from_char_parse_int(int *dest, const char **src, FormatNode *node, bool *have_error); static int seq_search(const char *name, const char *const *array, int *len); -static int from_char_seq_search(int *dest, char **src, +static int from_char_seq_search(int *dest, const char **src, const char *const *array, int max, FormatNode *node, bool *have_error); static void do_to_timestamp(text *date_txt, text *fmt, bool std, @@ -2255,7 +2255,7 @@ adjust_partial_year_to_2020(int year) static int -strspace_len(char *str) +strspace_len(const char *str) { int len = 0; @@ -2344,12 +2344,12 @@ on_error: * and -1 is returned. */ static int -from_char_parse_int_len(int *dest, char **src, const int len, FormatNode *node, +from_char_parse_int_len(int *dest, const char **src, const int len, FormatNode *node, bool *have_error) { long result; char copy[DCH_MAX_ITEM_SIZ + 1]; - char *init = *src; + const char *init = *src; int used; /* @@ -2366,8 +2366,11 @@ from_char_parse_int_len(int *dest, char **src, const int len, FormatNode *node, * This node is in Fill Mode, or the next node is known to be a * non-digit value, so we just slurp as many characters as we can get. */ + char *endptr; + errno = 0; - result = strtol(init, src, 10); + result = strtol(init, &endptr, 10); + *src = endptr; } else { @@ -2444,7 +2447,7 @@ on_error: * required length explicitly. */ static int -from_char_parse_int(int *dest, char **src, FormatNode *node, bool *have_error) +from_char_parse_int(int *dest, const char **src, FormatNode *node, bool *have_error) { return from_char_parse_int_len(dest, src, node->key->len, node, have_error); } @@ -2524,7 +2527,7 @@ seq_search(const char *name, const char *const *array, int *len) * node->key->name identifies what we were searching for. */ static int -from_char_seq_search(int *dest, char **src, const char *const *array, +from_char_seq_search(int *dest, const char **src, const char *const *array, int max, FormatNode *node, bool *have_error) { int len; @@ -3158,11 +3161,11 @@ DCH_to_char(FormatNode *node, bool is_interval, TmToChar *in, char *out, Oid col * ---------- */ static void -DCH_from_char(FormatNode *node, char *in, TmFromChar *out, bool std, +DCH_from_char(FormatNode *node, const char *in, TmFromChar *out, bool std, bool *have_error) { FormatNode *n; - char *s; + const char *s; int len, value; bool fx_mode = std;
pgsql-hackers by date: