Thread: Re: [BUGS] casting strings to multidimensional arrays yields strange results

[ cc'ing pghackers in case anyone wants to object ]

Joe Conway <mail@joeconway.com> writes:
> Tom Lane wrote:
>> Right now I think the sanest behavior would be to throw an error on
>> non-rectangular input.  Once we have support for null elements in
>> arrays, however, it would arguably be reasonable to pad with NULLs
>> where needed, so that the above would be read as
>> 
>>     {{1,2},{2,3},{4,NULL}}
>> 
>>    {{1,NULL},{2,3},{4,5}}
>> 
>> respectively.  If that's the direction we want to head in, it would
>> probably be best to leave array_in alone until we can do that; users
>> tend to get unhappy when we change behavior repeatedly.

> I think that even once we support NULL array elements, they should be 
> explicitly requested -- i.e. throwing an error on non-rectangular input 
> is still the right thing to do. I haven't suggested that in the past 
> because of the backward-compatibility issue, but maybe now is the time 
> to bite the bullet.

Okay with me.  Anyone on pghackers not happy?

> If you think this qualifies as a bug fix for 7.5, I can take a look at 
> it next week.

Yeah, we can call it a bug fix.
        regards, tom lane


Re: [PATCHES] [BUGS] casting strings to multidimensional arrays yields

From
Joe Conway
Date:
Joe Conway wrote:
> I'm hesitant to apply the attached this late before the beta without 
> review, but it seems to take care of the pathological cases I came up 
> with, doesn't break anything AFAICS, and passes all regression tests. I 
> guess it can go into beta 2.

I've continued to hack on array literal parsing today, and in doing so 
came up with a question regarding behavior. Currently the following works:

regression=# select '{0 second,0 second}'::interval[];      interval
--------------------- {00:00:00,00:00:00}
(1 row)

However, since we don't require an element with embedded spaces to be 
quoted, it also means that whitespace just before the delimiter is 
significant (even though leading whitespace is not) because there is no 
way to know when the element is complete:

regression=# select '{   0 second   ,0 second}'::text[];            text
---------------------------- {"0 second   ","0 second"}
(1 row)


I view the current behavior as a bug. While making changes, I'd like to 
require elements with embedded whitespace to be quoted so that the above 
does this:

regression=# select '{0 second,0 second}'::text[];
ERROR:  malformed array literal: "{0 second,0 second}"

Additionally I'd like to make whitespace before and after quoted or 
unquoted elements insignificant. Any comments?

Thanks,

Joe


Joe Conway <mail@joeconway.com> writes:
> ... whitespace just before the delimiter is 
> significant (even though leading whitespace is not)

Yeah.  This has been the documented behavior for quite some time, but
I can't say that I ever liked it.

> I view the current behavior as a bug. While making changes, I'd like to 
> require elements with embedded whitespace to be quoted so that the above 
> does this:

> regression=# select '{0 second,0 second}'::text[];
> ERROR:  malformed array literal: "{0 second,0 second}"

> Additionally I'd like to make whitespace before and after quoted or 
> unquoted elements insignificant. Any comments?

I think that suppressing unquoted trailing whitespace is probably
reasonable.  I'm much less enthusiastic about rejecting unquoted
embedded whitespace, though.  I think that's significantly likely
to break apps, and it doesn't seem like it's really needed to have
sane behavior.  The leading/trailing whitespace asymmetry is just
weird, but it doesn't follow that embedded whitespace is weird.

If we are going to make such changes, 8.0 is the right time to do it.
        regards, tom lane


Re: [PATCHES] [BUGS] casting strings to multidimensional arrays yields

From
Joe Conway
Date:
Tom Lane wrote:
> I think that suppressing unquoted trailing whitespace is probably
> reasonable.  I'm much less enthusiastic about rejecting unquoted
> embedded whitespace, though.  I think that's significantly likely
> to break apps, and it doesn't seem like it's really needed to have
> sane behavior.  The leading/trailing whitespace asymmetry is just
> weird, but it doesn't follow that embedded whitespace is weird.
>
> If we are going to make such changes, 8.0 is the right time to do it.

The attached patch suppresses trailing whitespace, but allows embedded
whitespace in unquoted elements as discussed above. It also rejects some
previously accepted cases that were just too strange to be correct:

-- Postgres 8.0, with the patch
-- none of these should be accepted
select '{{1,{2}},{2,3}}'::text[];
ERROR:  malformed array literal: "{{1,{2}},{2,3}}"
select '{{},{}}'::text[];
ERROR:  malformed array literal: "{{},{}}"
select '{{1,2},\\{2,3}}'::text[];
ERROR:  malformed array literal: "{{1,2},\{2,3}}"
select '{{"1 2" x},{3}}'::text[];
ERROR:  malformed array literal: "{{"1 2" x},{3}}"

The third case above actually does get an ERROR in 7.4 but it looks
suspicious itself:


-- in Postgres 7.4
select '{{1,2},\\{2,3}}'::text[];
ERROR:  malformed array literal: "{{1"


More examples:

-- Postgres 8.0, with the patch
-- all of these should be accepted
select '{}'::text[];
  text
------
  {}
(1 row)

select '{0 second  ,0 second}'::interval[];
       interval
---------------------
  {00:00:00,00:00:00}
(1 row)

select '{ { "," } , { 3 } }'::text[];
     text
-------------
  {{","},{3}}
(1 row)

select '  {   {  "  0 second  "   ,  0 second  }   }'::text[];
              text
-------------------------------
  {{"  0 second  ","0 second"}}
(1 row)


If there are no objections, I'll update the docs and commit tomorrow
sometime.

Thanks,

Joe
Index: src/backend/utils/adt/arrayfuncs.c
===================================================================
RCS file: /cvsroot/pgsql-server/src/backend/utils/adt/arrayfuncs.c,v
retrieving revision 1.106
diff -c -r1.106 arrayfuncs.c
*** src/backend/utils/adt/arrayfuncs.c    5 Aug 2004 03:29:37 -0000    1.106
--- src/backend/utils/adt/arrayfuncs.c    7 Aug 2004 01:34:02 -0000
***************
*** 351,368 ****
   *         The syntax for array input is C-like nested curly braces
   *-----------------------------------------------------------------------------
   */
  static int
  ArrayCount(char *str, int *dim, char typdelim)
  {
!     int            nest_level = 0,
!                 i;
!     int            ndim = 1,
!                 temp[MAXDIM],
!                 nelems[MAXDIM],
!                 nelems_last[MAXDIM];
!     bool        scanning_string = false;
!     bool        eoArray = false;
!     char       *ptr;

      for (i = 0; i < MAXDIM; ++i)
      {
--- 351,382 ----
   *         The syntax for array input is C-like nested curly braces
   *-----------------------------------------------------------------------------
   */
+ typedef enum
+ {
+     ARRAY_NO_LEVEL,
+     ARRAY_LEVEL_STARTED,
+     ARRAY_ELEM_STARTED,
+     ARRAY_ELEM_COMPLETED,
+     ARRAY_QUOTED_ELEM_STARTED,
+     ARRAY_QUOTED_ELEM_COMPLETED,
+     ARRAY_ELEM_DELIMITED,
+     ARRAY_LEVEL_COMPLETED,
+     ARRAY_LEVEL_DELIMITED
+ } ArrayParseState;
+
  static int
  ArrayCount(char *str, int *dim, char typdelim)
  {
!     int                nest_level = 0,
!                     i;
!     int                ndim = 1,
!                     temp[MAXDIM],
!                     nelems[MAXDIM],
!                     nelems_last[MAXDIM];
!     bool            scanning_string = false;
!     bool            eoArray = false;
!     char           *ptr;
!     ArrayParseState    parse_state = ARRAY_NO_LEVEL;

      for (i = 0; i < MAXDIM; ++i)
      {
***************
*** 370,375 ****
--- 384,390 ----
          nelems_last[i] = nelems[i] = 1;
      }

+     /* special case for an empty array */
      if (strncmp(str, "{}", 2) == 0)
          return 0;

***************
*** 389,394 ****
--- 404,423 ----
                          errmsg("malformed array literal: \"%s\"", str)));
                      break;
                  case '\\':
+                     /*
+                      * An escape must be after a level start, after an
+                      * element start, or after an element delimiter. In any
+                      * case we now must be past an element start.
+                      */
+                     if (parse_state != ARRAY_LEVEL_STARTED &&
+                         parse_state != ARRAY_ELEM_STARTED &&
+                         parse_state != ARRAY_QUOTED_ELEM_STARTED &&
+                         parse_state != ARRAY_ELEM_DELIMITED)
+                         ereport(ERROR,
+                             (errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
+                             errmsg("malformed array literal: \"%s\"", str)));
+                     if (parse_state != ARRAY_QUOTED_ELEM_STARTED)
+                         parse_state = ARRAY_ELEM_STARTED;
                      /* skip the escaped character */
                      if (*(ptr + 1))
                          ptr++;
***************
*** 398,408 ****
--- 427,464 ----
                          errmsg("malformed array literal: \"%s\"", str)));
                      break;
                  case '\"':
+                     /*
+                      * A quote must be after a level start, after a quoted
+                      * element start, or after an element delimiter. In any
+                      * case we now must be past an element start.
+                      */
+                     if (parse_state != ARRAY_LEVEL_STARTED &&
+                         parse_state != ARRAY_QUOTED_ELEM_STARTED &&
+                         parse_state != ARRAY_ELEM_DELIMITED)
+                         ereport(ERROR,
+                             (errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
+                             errmsg("malformed array literal: \"%s\"", str)));
                      scanning_string = !scanning_string;
+                     if (scanning_string)
+                         parse_state = ARRAY_QUOTED_ELEM_STARTED;
+                     else
+                         parse_state = ARRAY_QUOTED_ELEM_COMPLETED;
                      break;
                  case '{':
                      if (!scanning_string)
                      {
+                         /*
+                          * A left brace can occur if no nesting has
+                          * occurred yet, after a level start, or
+                          * after a level delimiter.
+                          */
+                         if (parse_state != ARRAY_NO_LEVEL &&
+                             parse_state != ARRAY_LEVEL_STARTED &&
+                             parse_state != ARRAY_LEVEL_DELIMITED)
+                             ereport(ERROR,
+                                 (errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
+                                 errmsg("malformed array literal: \"%s\"", str)));
+                         parse_state = ARRAY_LEVEL_STARTED;
                          if (nest_level >= MAXDIM)
                              ereport(ERROR,
                                  (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
***************
*** 417,422 ****
--- 473,491 ----
                  case '}':
                      if (!scanning_string)
                      {
+                         /*
+                          * A right brace can occur after an element start,
+                          * an element completion, a quoted element completion,
+                          * or a level completion.
+                          */
+                         if (parse_state != ARRAY_ELEM_STARTED &&
+                             parse_state != ARRAY_ELEM_COMPLETED &&
+                             parse_state != ARRAY_QUOTED_ELEM_COMPLETED &&
+                             parse_state != ARRAY_LEVEL_COMPLETED)
+                             ereport(ERROR,
+                                 (errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
+                                 errmsg("malformed array literal: \"%s\"", str)));
+                         parse_state = ARRAY_LEVEL_COMPLETED;
                          if (nest_level == 0)
                              ereport(ERROR,
                              (errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
***************
*** 445,454 ****
                      }
                      break;
                  default:
!                     if (*ptr == typdelim && !scanning_string)
                      {
!                         itemdone = true;
!                         nelems[nest_level - 1]++;
                      }
                      break;
              }
--- 514,558 ----
                      }
                      break;
                  default:
!                     if (!scanning_string)
                      {
!                         if (*ptr == typdelim)
!                         {
!                             /*
!                             * Delimiters can occur after an element start,
!                             * an element completion, a quoted element
!                             * completion, or a level completion.
!                             */
!                             if (parse_state != ARRAY_ELEM_STARTED &&
!                                 parse_state != ARRAY_ELEM_COMPLETED &&
!                                 parse_state != ARRAY_QUOTED_ELEM_COMPLETED &&
!                                 parse_state != ARRAY_LEVEL_COMPLETED)
!                                 ereport(ERROR,
!                                     (errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
!                                     errmsg("malformed array literal: \"%s\"", str)));
!                             if (parse_state == ARRAY_LEVEL_COMPLETED)
!                                 parse_state = ARRAY_LEVEL_DELIMITED;
!                             else
!                                 parse_state = ARRAY_ELEM_DELIMITED;
!                             itemdone = true;
!                             nelems[nest_level - 1]++;
!                         }
!                         else if (!isspace(*ptr))
!                         {
!                             /*
!                             * Other non-space characters must be after a level
!                             * start, after an element start, or after an element
!                             * delimiter. In any case we now must be past an
!                             * element start.
!                             */
!                             if (parse_state != ARRAY_LEVEL_STARTED &&
!                                 parse_state != ARRAY_ELEM_STARTED &&
!                                 parse_state != ARRAY_ELEM_DELIMITED)
!                                 ereport(ERROR,
!                                     (errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
!                                     errmsg("malformed array literal: \"%s\"", str)));
!                             parse_state = ARRAY_ELEM_STARTED;
!                         }
                      }
                      break;
              }
***************
*** 511,522 ****
--- 615,629 ----
      while (!eoArray)
      {
          bool        itemdone = false;
+         bool        itemquoted = false;
          int            i = -1;
          char       *itemstart;
+         char       *eptr;

          /* skip leading whitespace */
          while (isspace((unsigned char) *ptr))
              ptr++;
+
          itemstart = ptr;

          while (!itemdone)
***************
*** 547,557 ****
                          char       *cptr;

                          scanning_string = !scanning_string;
!                         /* Crunch the string on top of the quote. */
!                         for (cptr = ptr; *cptr != '\0'; cptr++)
!                             *cptr = *(cptr + 1);
!                         /* Back up to not miss following character. */
!                         ptr--;
                          break;
                      }
                  case '{':
--- 654,668 ----
                          char       *cptr;

                          scanning_string = !scanning_string;
!                         if (scanning_string)
!                         {
!                             itemquoted = true;
!                             /* Crunch the string on top of the first quote. */
!                             for (cptr = ptr; *cptr != '\0'; cptr++)
!                                 *cptr = *(cptr + 1);
!                             /* Back up to not miss following character. */
!                             ptr--;
!                         }
                          break;
                      }
                  case '{':
***************
*** 615,620 ****
--- 726,750 ----
                      (errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
                     errmsg("malformed array literal: \"%s\"", arrayStr)));

+         /*
+          * skip trailing whitespace
+          */
+         eptr = ptr - 1;
+         if (!itemquoted)
+         {
+             /* skip to last non-NULL, non-space, character */
+             while ((*eptr == '\0') || (isspace((unsigned char) *eptr)))
+                 eptr--;
+             *(++eptr) = '\0';
+         }
+         else
+         {
+             /* skip to last quote character */
+             while (*eptr != '"')
+                 eptr--;
+             *eptr = '\0';
+         }
+
          values[i] = FunctionCall3(inputproc,
                                    CStringGetDatum(itemstart),
                                    ObjectIdGetDatum(typioparam),

Joe Conway <mail@joeconway.com> writes:
> The attached patch suppresses trailing whitespace, but allows embedded 
> whitespace in unquoted elements as discussed above. It also rejects some 
> previously accepted cases that were just too strange to be correct:

> -- Postgres 8.0, with the patch
> -- none of these should be accepted
> select '{{1,{2}},{2,3}}'::text[];
> ERROR:  malformed array literal: "{{1,{2}},{2,3}}"

Okay, uneven nesting level of {} is clearly bogus.

> select '{{},{}}'::text[];
> ERROR:  malformed array literal: "{{},{}}"

Hm.   This seems like it would be a legitimate representation of a 2x0
array.  Why would you allow '{}' and reject this?

> select '{{1,2},\\{2,3}}'::text[];
> ERROR:  malformed array literal: "{{1,2},\{2,3}}"

Okay, junk outside the {} structure is bad.

> select '{{"1 2" x},{3}}'::text[];
> ERROR:  malformed array literal: "{{"1 2" x},{3}}"

So here you're rejecting the first data value because it's partially
quoted and partially not?  I guess this is arguably reasonable, but
I'm not sure that it's really necessary either.  Historically array_in
has taken this sort of thing.

> The third case above actually does get an ERROR in 7.4 but it looks 
> suspicious itself:

> select '{{1,2},\\{2,3}}'::text[];
> ERROR:  malformed array literal: "{{1"

This is something I was planning to fix myself: ReadArrayStr is using
arrayStr as the string to complain about in its error messages, but that
is the same string that it is scribbling on by overwriting with \0 where
it wants to terminate an individual data value.  So you get bogus
messages anytime a syntax error is detected after having absorbed at
least one item value.

The easiest way to fix it seems to be for array_in to pass an additional
parameter which is the original non-overwritable input string, and use
that in the ereport calls.

> More examples:

These look okay.  Didn't study the code though.
        regards, tom lane


Re: [PATCHES] [BUGS] casting strings to multidimensional arrays yields

From
Joe Conway
Date:
Tom Lane wrote:
> Joe Conway <mail@joeconway.com> writes:
>>select '{{},{}}'::text[];
>>ERROR:  malformed array literal: "{{},{}}"
> 
> Hm.   This seems like it would be a legitimate representation of a 2x0
> array.  Why would you allow '{}' and reject this?

Well, '{}' is a special case representing an empty array. One of the 
characteristics of an empty array is that is has 0 dimensions. An empty 
array can later be treated as an array of any dimension, e.g.:

regression=# select '{}'::int[] || ARRAY[1]; ?column?
---------- {1}
(1 row)

regression=# select '{}'::int[] || ARRAY[[1]]; ?column?
---------- {{1}}
(1 row)

In my mind, '{{},{}}' clearly defines a two dimensional array, and 
therefore needs elements, which in this case would have to be NULL (or 
empty strings -- see below). Once we can deal with NULL elements, I'd 
think '{{},{}}' ought to be accepted, and produce a 2d array with 2 NULL 
elements. Note how this works in 7.4:

select '{{},{}}'::text[];    text
------------- {{""},{""}}
(1 row)

I thought creation of empty strings was what we agreed the other day to 
eliminate?

>>select '{{"1 2" x},{3}}'::text[];
>>ERROR:  malformed array literal: "{{"1 2" x},{3}}"
> 
> So here you're rejecting the first data value because it's partially
> quoted and partially not?  I guess this is arguably reasonable, but
> I'm not sure that it's really necessary either.  Historically array_in
> has taken this sort of thing.

I know is has accepted this, but I always found it surprising and 
strange. And I believe we've had others complain about this behavior. It  seems to me that if you are going to the
troubleto quote the item, 
 
why would you want random garbage outside the quotes to get magically 
appended?

>>The third case above actually does get an ERROR in 7.4 but it looks 
>>suspicious itself:
> 
>>select '{{1,2},\\{2,3}}'::text[];
>>ERROR:  malformed array literal: "{{1"
> 
> This is something I was planning to fix myself: ReadArrayStr is using
> arrayStr as the string to complain about in its error messages, but that
> is the same string that it is scribbling on by overwriting with \0 where
> it wants to terminate an individual data value.  So you get bogus
> messages anytime a syntax error is detected after having absorbed at
> least one item value.
> 
> The easiest way to fix it seems to be for array_in to pass an additional
> parameter which is the original non-overwritable input string, and use
> that in the ereport calls.

Ah, thanks for the explanation. I'll fix that too.

Thanks,

Joe


Joe Conway <mail@joeconway.com> writes:
> Tom Lane wrote:
>> [ random griping ]

> Well, '{}' is a special case representing an empty array.

Hmm ... good point ... it is not obvious whether that is an empty array
(0x0) or a 1x1 array containing a single zero-length string.  I guess
if we want to make both these possibilities representable, we ought to
stipulate that '{}' means the first and '{""}' means the second.

> In my mind, '{{},{}}' clearly defines a two dimensional array, and 
> therefore needs elements, which in this case would have to be NULL (or 
> empty strings -- see below). Once we can deal with NULL elements, I'd 
> think '{{},{}}' ought to be accepted, and produce a 2d array with 2 NULL 
> elements. Note how this works in 7.4:

Urgh ... that makes for a third case to support.  How will you represent
0x0 vs 1x1 empty string vs 1x1 NULL?

> I thought creation of empty strings was what we agreed the other day to 
> eliminate?

I'm not sure that we had such an agreement, but in any case we've got
to understand how to distinguish empty-string from NULL.
        regards, tom lane


Re: [PATCHES] [BUGS] casting strings to multidimensional arrays yields

From
Joe Conway
Date:
Tom Lane wrote:
> Joe Conway <mail@joeconway.com> writes:
>>select '{{},{}}'::text[];
>>ERROR:  malformed array literal: "{{},{}}"
> 
> Hm.   This seems like it would be a legitimate representation of a 2x0
> array.  Why would you allow '{}' and reject this?

I just reread your comment and thought about it more. I think most of 
what I said in the last post holds, but in addition part of the problem 
lies in the fact that we don't implement multidimensional arrays as 
nested arrays, but rather as one monolithic structure. If we did the 
former, then 2x0 would be feasible.

Investigating, and possibly implementing, multidimensional arrays as a 
nested structure is another item on my wish list. I'm still not entirely 
sure how difficult or desirable it is, but my interpretation of SQL99 is 
that nested arrays is the requirement.

Joe


Re: [PATCHES] [BUGS] casting strings to multidimensional arrays yields

From
Joe Conway
Date:
Tom Lane wrote:
> I'm not sure that we had such an agreement, but in any case we've got
> to understand how to distinguish empty-string from NULL.
> 

I realized that in my followup post regarding nested arrays. In any 
case, since will not 8.0 support either NULL elements nor nested arrays, 
can we agree that '{{},{}}' and similar is not allowed? If someone 
really wants '{{""},{""}}', they will still be able to specify it in 
that precise way.

Joe


Re: [PATCHES] [BUGS] casting strings to multidimensional arrays yields

From
James William Pye
Date:
On 08/06/04:31/5, Joe Conway wrote:
> I just reread your comment and thought about it more. I think most of
> what I said in the last post holds, but in addition part of the problem
> lies in the fact that we don't implement multidimensional arrays as
> nested arrays, but rather as one monolithic structure. If we did the
> former, then 2x0 would be feasible.
>
> Investigating, and possibly implementing, multidimensional arrays as a
> nested structure is another item on my wish list. I'm still not entirely
> sure how difficult or desirable it is, but my interpretation of SQL99 is
> that nested arrays is the requirement.

+1

This would probably make MD array support in plpy almost trivial, so it is
definitely desirable to me.

--
Regards,       James William Pye