Re: Allow to_date() and to_timestamp() to accept localized names - Mailing list pgsql-hackers

From Tom Lane
Subject Re: Allow to_date() and to_timestamp() to accept localized names
Date
Msg-id 29432.1579731087@sss.pgh.pa.us
Whole thread Raw
In response to Re: Allow to_date() and to_timestamp() to accept localized names  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: Allow to_date() and to_timestamp() to accept localized names  (Arthur Zakirov <zaartur@gmail.com>)
List pgsql-hackers
I wrote:
> One thing I completely don't understand is why it's sufficient for
> seq_search_localized to do "initcap" semantics.  Shouldn't it cover
> the all-upper and all-lower cases as well, as the existing seq_search
> code does?  (That is, why is it okay to ignore the "type" argument?)

I took a closer look at this and decided you were probably doing that
just because the seq_search code uses initcap-style case folding to
match month and day names, relying on the assumption that the constant
arrays it's passed are cased that way.  But we shouldn't (and the patch
doesn't) assume that the localized names we'll get from pg_locale.c are
cased that way.

However ... it seems to me that the way seq_search is defined is
plenty bogus.  In the first place, it scribbles on its source string,
which is surely not okay.  You can see that in error messages that
are thrown on match failures:

regression=# select to_date('3 JANUZRY 1999', 'DD MONTH YYYY');
ERROR:  invalid value "JanuzRY 1" for "MONTH"
DETAIL:  The given value did not match any of the allowed values for this field.

Now, maybe somebody out there thinks this is a cute way of highlighting
how much of the string we examined, but it looks more like a bug from
where I sit.  It's mere luck that what's being clobbered is a local
string copy created by do_to_timestamp(), and not the original input
data passed to to_date().

In the second place, there's really no need at all for seq_search to
rely on any assumptions about the case state of the array it's
searching.  pg_toupper() is pretty cheap already, and we can make it
guaranteed cheap if we use pg_ascii_toupper() instead.  So I think
we ought to just remove the "type" argument altogether and have
seq_search dynamically convert all the strings it examines to upper
case (or lower case would work as well, at least for English).

> On the other hand, you probably *should* be ignoring the "max"
> argument, because AFAICS the values that are passed for that are
> specific to the English ASCII spellings of the day and month names.
> It seems very likely that they could be too small for some sets of
> non-English names.

Closer examination shows that the "max" argument is pretty bogus as
well.  It doesn't do anything except confuse the reader, because there
are no cases where the value passed is less than the maximum array entry
length, so it never acts to change seq_search's behavior.  So we should
just drop that behavior from seq_search, too, and redefine "max" as
having no purpose except to specify how much of the string to show in
error messages.  There's still a question of what that should be for
non-English cases, but at least we now have a clear idea of what we
need the value to do.

I also noted while I was looking at it that from_char_seq_search()
would truncate at "max" bytes even when dealing with multibyte input.
That has a clear risk of generating an invalidly-encoded error message.

The 0001 patch attached cleans up all the above complaints.  I felt
that given the business about scribbling on strings we shouldn't,
it would also be wise to const-ify the string pointers if possible.
That seems not too painful, per 0002 attached.

I'm tempted to go a bit further than 0001 does, and remove the 'max'
argument from from_char_seq_search() altogether.  Since we only need
'max' in error cases, which don't need to be super-quick, we could drop
the requirement for the caller to specify that and instead compute it
when we do need it as the largest of the constant array's string
lengths.  That'd carry over into the localized-names case in a
reasonably straightforward way (though we might need to count characters
not bytes for that to work nicely).

Because of the risk of incorrectly-encoded error messages, I'm rather
tempted to claim that these patches (or at least 0001) are a bug fix
and should be back-patched.  In any case I think we should apply this
to HEAD and then rebase the localized-names patch over it.  It makes
a whole lot more sense, IMO, for the localized-names comparison logic
to match what this is doing than what seq_search does today.

Comments?

            regards, tom lane

diff --git a/src/backend/utils/adt/formatting.c b/src/backend/utils/adt/formatting.c
index ca3c48d..3ef10dc 100644
--- a/src/backend/utils/adt/formatting.c
+++ b/src/backend/utils/adt/formatting.c
@@ -317,10 +317,6 @@ static const char *const numth[] = {"st", "nd", "rd", "th", NULL};
  * Flags & Options:
  * ----------
  */
-#define ONE_UPPER        1        /* Name */
-#define ALL_UPPER        2        /* NAME */
-#define ALL_LOWER        3        /* name */
-
 #define MAX_MONTH_LEN    9
 #define MAX_MON_LEN        3
 #define MAX_DAY_LEN        9
@@ -1068,9 +1064,9 @@ static int    from_char_parse_int_len(int *dest, char **src, const int len,
                                     FormatNode *node, bool *have_error);
 static int    from_char_parse_int(int *dest, char **src, FormatNode *node,
                                 bool *have_error);
-static int    seq_search(char *name, const char *const *array, int type, int max, int *len);
+static int    seq_search(const char *name, const char *const *array, int *len);
 static int    from_char_seq_search(int *dest, char **src,
-                                 const char *const *array, int type, int max,
+                                 const char *const *array, int max,
                                  FormatNode *node, bool *have_error);
 static void do_to_timestamp(text *date_txt, text *fmt, bool std,
                             struct pg_tm *tm, fsec_t *fsec, int *fprec,
@@ -2454,70 +2450,57 @@ from_char_parse_int(int *dest, char **src, FormatNode *node, bool *have_error)
 }

 /* ----------
- * Sequential search with to upper/lower conversion
+ * Sequentially search null-terminated "array" for a case-insensitive match
+ * to the initial character(s) of "name".
+ *
+ * Returns array index of match, or -1 for no match.
+ *
+ * *len is set to the length of the match, or 0 for no match.
+ *
+ * Case-insensitivity is defined per pg_ascii_toupper, so this is only
+ * suitable for comparisons to ASCII strings.
  * ----------
  */
 static int
-seq_search(char *name, const char *const *array, int type, int max, int *len)
+seq_search(const char *name, const char *const *array, int *len)
 {
-    const char *p;
+    unsigned char firstc;
     const char *const *a;
-    char       *n;
-    int            last,
-                i;

     *len = 0;

+    /* empty string can't match anything */
     if (!*name)
         return -1;

-    /* set first char */
-    if (type == ONE_UPPER || type == ALL_UPPER)
-        *name = pg_toupper((unsigned char) *name);
-    else if (type == ALL_LOWER)
-        *name = pg_tolower((unsigned char) *name);
+    /* we handle first char specially to gain some speed */
+    firstc = pg_ascii_toupper((unsigned char) *name);

-    for (last = 0, a = array; *a != NULL; a++)
+    for (a = array; *a != NULL; a++)
     {
+        int            i;
+        const char *p;
+        const char *n;
+
         /* compare first chars */
-        if (*name != **a)
+        if (pg_ascii_toupper((unsigned char) **a) != firstc)
             continue;

         for (i = 1, p = *a + 1, n = name + 1;; n++, p++, i++)
         {
-            /* search fragment (max) only */
-            if (max && i == max)
-            {
-                *len = i;
-                return a - array;
-            }
-            /* full size */
+            /* return success if we matched whole array entry */
             if (*p == '\0')
             {
                 *len = i;
                 return a - array;
             }
-            /* Not found in array 'a' */
+            /* else, must have another character in "name" ... */
             if (*n == '\0')
                 break;

-            /*
-             * Convert (but convert new chars only)
-             */
-            if (i > last)
-            {
-                if (type == ONE_UPPER || type == ALL_LOWER)
-                    *n = pg_tolower((unsigned char) *n);
-                else if (type == ALL_UPPER)
-                    *n = pg_toupper((unsigned char) *n);
-                last = i;
-            }
-
-#ifdef DEBUG_TO_FROM_CHAR
-            elog(DEBUG_elog_output, "N: %c, P: %c, A: %s (%s)",
-                 *n, *p, *a, name);
-#endif
-            if (*n != *p)
+            /* ... and it must match */
+            if (pg_ascii_toupper((unsigned char) *p) !=
+                pg_ascii_toupper((unsigned char) *n))
                 break;
         }
     }
@@ -2526,8 +2509,8 @@ seq_search(char *name, const char *const *array, int type, int max, int *len)
 }

 /*
- * Perform a sequential search in 'array' for text matching the first 'max'
- * characters of the source string.
+ * Perform a sequential search in 'array' for an entry matching the first
+ * character(s) of the 'src' string case-insensitively.
  *
  * If a match is found, copy the array index of the match into the integer
  * pointed to by 'dest', advance 'src' to the end of the part of the string
@@ -2535,19 +2518,28 @@ seq_search(char *name, const char *const *array, int type, int max, int *len)
  *
  * If the string doesn't match, throw an error if 'have_error' is NULL,
  * otherwise set '*have_error' and return -1.
+ *
+ * 'max' and 'node' are used only for error reports: 'max' is the max
+ * number of bytes of the string to include in the error report, and
+ * node->key->name identifies what we were searching for.
  */
 static int
-from_char_seq_search(int *dest, char **src, const char *const *array, int type,
+from_char_seq_search(int *dest, char **src, const char *const *array,
                      int max, FormatNode *node, bool *have_error)
 {
     int            len;

-    *dest = seq_search(*src, array, type, max, &len);
+    /* Assert that no call gives us a 'max' too large for the error buffer */
+    Assert(max <= DCH_MAX_ITEM_SIZ);
+
+    *dest = seq_search(*src, array, &len);
+
     if (len <= 0)
     {
         char        copy[DCH_MAX_ITEM_SIZ + 1];

-        Assert(max <= DCH_MAX_ITEM_SIZ);
+        /* Use multibyte-aware truncation to avoid generating a bogus string */
+        max = pg_mbcliplen(*src, strlen(*src), max);
         strlcpy(copy, *src, max + 1);

         RETURN_ERROR(ereport(ERROR,
@@ -3279,7 +3271,7 @@ DCH_from_char(FormatNode *node, char *in, TmFromChar *out, bool std,
             case DCH_a_m:
             case DCH_p_m:
                 from_char_seq_search(&value, &s, ampm_strings_long,
-                                     ALL_UPPER, n->key->len, n, have_error);
+                                     n->key->len, n, have_error);
                 CHECK_ERROR;
                 from_char_set_int(&out->pm, value % 2, n, have_error);
                 CHECK_ERROR;
@@ -3290,7 +3282,7 @@ DCH_from_char(FormatNode *node, char *in, TmFromChar *out, bool std,
             case DCH_am:
             case DCH_pm:
                 from_char_seq_search(&value, &s, ampm_strings,
-                                     ALL_UPPER, n->key->len, n, have_error);
+                                     n->key->len, n, have_error);
                 CHECK_ERROR;
                 from_char_set_int(&out->pm, value % 2, n, have_error);
                 CHECK_ERROR;
@@ -3403,7 +3395,7 @@ DCH_from_char(FormatNode *node, char *in, TmFromChar *out, bool std,
             case DCH_a_d:
             case DCH_b_c:
                 from_char_seq_search(&value, &s, adbc_strings_long,
-                                     ALL_UPPER, n->key->len, n, have_error);
+                                     n->key->len, n, have_error);
                 CHECK_ERROR;
                 from_char_set_int(&out->bc, value % 2, n, have_error);
                 CHECK_ERROR;
@@ -3413,7 +3405,7 @@ DCH_from_char(FormatNode *node, char *in, TmFromChar *out, bool std,
             case DCH_ad:
             case DCH_bc:
                 from_char_seq_search(&value, &s, adbc_strings,
-                                     ALL_UPPER, n->key->len, n, have_error);
+                                     n->key->len, n, have_error);
                 CHECK_ERROR;
                 from_char_set_int(&out->bc, value % 2, n, have_error);
                 CHECK_ERROR;
@@ -3421,7 +3413,7 @@ DCH_from_char(FormatNode *node, char *in, TmFromChar *out, bool std,
             case DCH_MONTH:
             case DCH_Month:
             case DCH_month:
-                from_char_seq_search(&value, &s, months_full, ONE_UPPER,
+                from_char_seq_search(&value, &s, months_full,
                                      MAX_MONTH_LEN, n, have_error);
                 CHECK_ERROR;
                 from_char_set_int(&out->mm, value + 1, n, have_error);
@@ -3430,7 +3422,7 @@ DCH_from_char(FormatNode *node, char *in, TmFromChar *out, bool std,
             case DCH_MON:
             case DCH_Mon:
             case DCH_mon:
-                from_char_seq_search(&value, &s, months, ONE_UPPER,
+                from_char_seq_search(&value, &s, months,
                                      MAX_MON_LEN, n, have_error);
                 CHECK_ERROR;
                 from_char_set_int(&out->mm, value + 1, n, have_error);
@@ -3444,7 +3436,7 @@ DCH_from_char(FormatNode *node, char *in, TmFromChar *out, bool std,
             case DCH_DAY:
             case DCH_Day:
             case DCH_day:
-                from_char_seq_search(&value, &s, days, ONE_UPPER,
+                from_char_seq_search(&value, &s, days,
                                      MAX_DAY_LEN, n, have_error);
                 CHECK_ERROR;
                 from_char_set_int(&out->d, value, n, have_error);
@@ -3454,7 +3446,7 @@ DCH_from_char(FormatNode *node, char *in, TmFromChar *out, bool std,
             case DCH_DY:
             case DCH_Dy:
             case DCH_dy:
-                from_char_seq_search(&value, &s, days, ONE_UPPER,
+                from_char_seq_search(&value, &s, days,
                                      MAX_DY_LEN, n, have_error);
                 CHECK_ERROR;
                 from_char_set_int(&out->d, value, n, have_error);
@@ -3572,7 +3564,7 @@ DCH_from_char(FormatNode *node, char *in, TmFromChar *out, bool std,
                 break;
             case DCH_RM:
                 from_char_seq_search(&value, &s, rm_months_upper,
-                                     ALL_UPPER, MAX_RM_LEN, n, have_error);
+                                     MAX_RM_LEN, n, have_error);
                 CHECK_ERROR;
                 from_char_set_int(&out->mm, MONTHS_PER_YEAR - value,
                                   n, have_error);
@@ -3580,7 +3572,7 @@ DCH_from_char(FormatNode *node, char *in, TmFromChar *out, bool std,
                 break;
             case DCH_rm:
                 from_char_seq_search(&value, &s, rm_months_lower,
-                                     ALL_LOWER, MAX_RM_LEN, n, have_error);
+                                     MAX_RM_LEN, n, have_error);
                 CHECK_ERROR;
                 from_char_set_int(&out->mm, MONTHS_PER_YEAR - value,
                                   n, have_error);
diff --git a/src/backend/utils/adt/formatting.c b/src/backend/utils/adt/formatting.c
index 3ef10dc..ef21b7e 100644
--- a/src/backend/utils/adt/formatting.c
+++ b/src/backend/utils/adt/formatting.c
@@ -1044,7 +1044,7 @@ static void parse_format(FormatNode *node, const char *str, const KeyWord *kw,

 static void DCH_to_char(FormatNode *node, bool is_interval,
                         TmToChar *in, char *out, Oid collid);
-static void DCH_from_char(FormatNode *node, char *in, TmFromChar *out,
+static void DCH_from_char(FormatNode *node, const char *in, TmFromChar *out,
                           bool std, bool *have_error);

 #ifdef DEBUG_TO_FROM_CHAR
@@ -1055,17 +1055,17 @@ static void dump_node(FormatNode *node, int max);
 static const char *get_th(char *num, int type);
 static char *str_numth(char *dest, char *num, int type);
 static int    adjust_partial_year_to_2020(int year);
-static int    strspace_len(char *str);
+static int    strspace_len(const char *str);
 static void from_char_set_mode(TmFromChar *tmfc, const FromCharDateMode mode,
                                bool *have_error);
 static void from_char_set_int(int *dest, const int value, const FormatNode *node,
                               bool *have_error);
-static int    from_char_parse_int_len(int *dest, char **src, const int len,
+static int    from_char_parse_int_len(int *dest, const char **src, const int len,
                                     FormatNode *node, bool *have_error);
-static int    from_char_parse_int(int *dest, char **src, FormatNode *node,
+static int    from_char_parse_int(int *dest, const char **src, FormatNode *node,
                                 bool *have_error);
 static int    seq_search(const char *name, const char *const *array, int *len);
-static int    from_char_seq_search(int *dest, char **src,
+static int    from_char_seq_search(int *dest, const char **src,
                                  const char *const *array, int max,
                                  FormatNode *node, bool *have_error);
 static void do_to_timestamp(text *date_txt, text *fmt, bool std,
@@ -2255,7 +2255,7 @@ adjust_partial_year_to_2020(int year)


 static int
-strspace_len(char *str)
+strspace_len(const char *str)
 {
     int            len = 0;

@@ -2344,12 +2344,12 @@ on_error:
  * and -1 is returned.
  */
 static int
-from_char_parse_int_len(int *dest, char **src, const int len, FormatNode *node,
+from_char_parse_int_len(int *dest, const char **src, const int len, FormatNode *node,
                         bool *have_error)
 {
     long        result;
     char        copy[DCH_MAX_ITEM_SIZ + 1];
-    char       *init = *src;
+    const char *init = *src;
     int            used;

     /*
@@ -2366,8 +2366,11 @@ from_char_parse_int_len(int *dest, char **src, const int len, FormatNode *node,
          * This node is in Fill Mode, or the next node is known to be a
          * non-digit value, so we just slurp as many characters as we can get.
          */
+        char       *endptr;
+
         errno = 0;
-        result = strtol(init, src, 10);
+        result = strtol(init, &endptr, 10);
+        *src = endptr;
     }
     else
     {
@@ -2444,7 +2447,7 @@ on_error:
  * required length explicitly.
  */
 static int
-from_char_parse_int(int *dest, char **src, FormatNode *node, bool *have_error)
+from_char_parse_int(int *dest, const char **src, FormatNode *node, bool *have_error)
 {
     return from_char_parse_int_len(dest, src, node->key->len, node, have_error);
 }
@@ -2524,7 +2527,7 @@ seq_search(const char *name, const char *const *array, int *len)
  * node->key->name identifies what we were searching for.
  */
 static int
-from_char_seq_search(int *dest, char **src, const char *const *array,
+from_char_seq_search(int *dest, const char **src, const char *const *array,
                      int max, FormatNode *node, bool *have_error)
 {
     int            len;
@@ -3158,11 +3161,11 @@ DCH_to_char(FormatNode *node, bool is_interval, TmToChar *in, char *out, Oid col
  * ----------
  */
 static void
-DCH_from_char(FormatNode *node, char *in, TmFromChar *out, bool std,
+DCH_from_char(FormatNode *node, const char *in, TmFromChar *out, bool std,
               bool *have_error)
 {
     FormatNode *n;
-    char       *s;
+    const char *s;
     int            len,
                 value;
     bool        fx_mode = std;

pgsql-hackers by date:

Previous
From: Robert Haas
Date:
Subject: Re: Online checksums patch - once again
Next
From: Daniel Gustafsson
Date:
Subject: Re: Ought to use heap_multi_insert() for pg_attribute/dependinsertions?