Thread: Some array semantics issues

Some array semantics issues

From
Tom Lane
Date:
While hacking on the nulls-in-arrays addition, I noticed a couple of
behaviors that seem a bit bogus to me.

First, array slicing returns NULL any time the requested slice falls
completely outside the array bounds.  For instance

regression=# select ('{1,2,3}'::int[])[2:4];int4
-------{2,3}
(1 row)

regression=# select ('{1,2,3}'::int[])[3:4];int4
------{3}
(1 row)

regression=# select ('{1,2,3}'::int[])[4:4];int4
------

(1 row)

I'm inclined to think that an empty array ('{}') would be a more
sensible result.

Second, array comparison compares the contained values but pays no
attention to the array dimensions.  Thus for example

regression=# select '[0:2]={1,2,3}'::int[] = '{1,2,3}'::int[];?column?
----------t
(1 row)

regression=# select '{1,2,3,4}'::int[] = '{{1,2},{3,4}}'::int[];?column?
----------t
(1 row)

This seems pretty bogus as well.  To maintain backwards compatibility as
much as possible, I'd be inclined to sort first on the contained values
as we do now, but if they are equal sort on the array dimension data.
I'm not too concerned about exactly what the sort order is for
different-shaped arrays, I just don't think the above cases should be
considered "equal".

Comments?
        regards, tom lane


Re: Some array semantics issues

From
Greg Stark
Date:
Tom Lane <tgl@sss.pgh.pa.us> writes:

> regression=# select '[0:2]={1,2,3}'::int[] = '{1,2,3}'::int[];
>  ?column?
> ----------
>  t
> (1 row)
> 
> regression=# select '{1,2,3,4}'::int[] = '{{1,2},{3,4}}'::int[];
>  ?column?
> ----------
>  t
> (1 row)
> 
> This seems pretty bogus as well.  

The second case seems utterly bogus. But the first case seems maybe
justifiable. maybe. 

In the past Postgres treated the array bounds as so insignificant they weren't
even worth preserving across a dump/restore. 

And changing that would make it harder to test just the contents of the array
without having to match bounds as well. That is, You couldn't say "list =
'{1,2}'" to test if the array contained 1,2. You would have to, well, I'm not
even sure how you would test it actually. Maybe something kludgy like
"'{}'::int[] || list = '{1,2}'" ?

I'm not entirely against the idea of making array bounds significant but I
guess we would need some convenient way of taking them out of the picture too.
Perhaps another equality operator.

-- 
greg



Re: Some array semantics issues

From
Tom Lane
Date:
Greg Stark <gsstark@mit.edu> writes:
> And changing that would make it harder to test just the contents of the array
> without having to match bounds as well.

Fair point, but currently it's impossible to make a comparison that
*does* consider the bounds, which one would think would be the ordinary
meaning of equality.

> I'm not entirely against the idea of making array bounds significant
> but I guess we would need some convenient way of taking them out of
> the picture too.  Perhaps another equality operator.

I could go for a separate operator that has the current behavior
(might as well ignore number of dimensions too, if we're going to
ignore bounds).  Any thoughts about the operator name?
        regards, tom lane


Re: Some array semantics issues

From
Greg Stark
Date:
Tom Lane <tgl@sss.pgh.pa.us> writes:

> I could go for a separate operator that has the current behavior
> (might as well ignore number of dimensions too, if we're going to
> ignore bounds).  Any thoughts about the operator name?

Well to me these are two different cases. At least the way I see it {1,2} is a
list of two numbers, and {{1,2,},{3,4}} is a list of two lists. They aren't
the same and they don't even contain the same thing.

It occurs to me that it would also make sense to have an operator that
considered the arrays in an order-insensitive comparison. 

It all depends on what you're using the arrays to represent.

If you're implementing something where each slot of the array corresponds to
some specific meaning then you need the array bounds included.

If you're representing stacks where the array bounds march up as they're used
then you don't really want to include the array bounds in your comparison.

If you're representing a denormalized one-to-many relationship (being aware of
all the associated pros and cons of denormalization of course) then you really
don't care about the order at all.

Personally I can't really think of any cases where the shape of the array
doesn't matter though.

-- 
greg



Re: Some array semantics issues

From
Tom Lane
Date:
Greg Stark <gsstark@mit.edu> writes:
> Tom Lane <tgl@sss.pgh.pa.us> writes:
>> I could go for a separate operator that has the current behavior
>> (might as well ignore number of dimensions too, if we're going to
>> ignore bounds).  Any thoughts about the operator name?

> Well to me these are two different cases. At least the way I see it {1,2} is a
> list of two numbers, and {{1,2,},{3,4}} is a list of two lists. They aren't
> the same and they don't even contain the same thing.

Well, in that case what do you think about{{1,2},{3,4},{5,6},{7,8}}
vs{{1,2,3,4},{5,6,7,8}}
These have the same physical contents and the same number of dimensions,
so unless you want to consider them equal, you have to consider the
dimension values.

I think what we may be dancing around here is that there are some cases
where it makes sense to ignore the lower bounds, as opposed to the axis
lengths.  I'm not convinced that there are any cases where it makes
sense to compare the number of dimensions without comparing the axis
lengths --- but I can see the argument that lower bounds might be
uninteresting, particularly seeing that array_push and and array_cat
do some not-necessarily-always-right things with them.
        regards, tom lane


Re: Some array semantics issues

From
Greg Stark
Date:
Tom Lane <tgl@sss.pgh.pa.us> writes:

> Well, in that case what do you think about
>     {{1,2},{3,4},{5,6},{7,8}}
> vs
>     {{1,2,3,4},{5,6,7,8}}

In the first case the first element is {1,2} and in the second case the first
element is {1,2,3,4} so from my point of view there's no way these are the
same.

None of the three use cases I conjured would want them considered equal, which
isn't to say there isn't some data structure somewhere out there which would,
but I haven't thought of it.

-- 
greg



Re: Some array semantics issues

From
Tom Lane
Date:
Greg Stark <gsstark@mit.edu> writes:
> Tom Lane <tgl@sss.pgh.pa.us> writes:
>> Well, in that case what do you think about
>> {{1,2},{3,4},{5,6},{7,8}}
>> vs
>> {{1,2,3,4},{5,6,7,8}}

> In the first case the first element is {1,2} and in the second case the first
> element is {1,2,3,4} so from my point of view there's no way these are the
> same.

Well, then I think we're converging on agreement that array comparison
should always take into account the number of dimensions and the axis
lengths.  What seems still in question is whether to compare or ignore
the axis lower bounds.

I'd argue that ordinary equality should include the lower bounds, but
I'm willing to provide a separate operator (or whole btree opclass
if people want it) that ignores the lower bounds.  We just need a name.
Maybe ~=, ~<, etc?
        regards, tom lane


Re: Some array semantics issues

From
David Fetter
Date:
On Wed, Nov 16, 2005 at 03:03:53PM -0500, Greg Stark wrote:
> 
> Tom Lane <tgl@sss.pgh.pa.us> writes:
> 
> > I could go for a separate operator that has the current behavior
> > (might as well ignore number of dimensions too, if we're going to
> > ignore bounds).  Any thoughts about the operator name?
> 
> Well to me these are two different cases. At least the way I see it
> {1,2} is a list of two numbers, and {{1,2,},{3,4}} is a list of two
> lists. They aren't the same and they don't even contain the same
> thing.

Right.

> It occurs to me that it would also make sense to have an operator
> that considered the arrays in an order-insensitive comparison. 

That sounds more like the SQL:2003 MULTISET, which is essentially
unordered.  Any plans for these?

Cheers,
D
-- 
David Fetter david@fetter.org http://fetter.org/
phone: +1 510 893 6100   mobile: +1 415 235 3778

Remember to vote!


Re: Some array semantics issues

From
Tom Lane
Date:
David Fetter <david@fetter.org> writes:
> On Wed, Nov 16, 2005 at 03:03:53PM -0500, Greg Stark wrote:
>> It occurs to me that it would also make sense to have an operator
>> that considered the arrays in an order-insensitive comparison. 

> That sounds more like the SQL:2003 MULTISET, which is essentially
> unordered.  Any plans for these?

Seems to me it would be really expensive to try to make such a
comparison directly with the present array representation.  The
only sensible way to do it would be to sort the elements of the
two arrays (using the comparison operator of the element data type)
and then compare the results.  So you don't actually need a variant
equality operator, you just need a generic array_sort() function,
and then go "array_sort(x) = array_sort(y)".  Such a function might
have other uses besides this, too.
        regards, tom lane


Re: Some array semantics issues

From
Joe Conway
Date:
Tom Lane wrote:
> Greg Stark <gsstark@mit.edu> writes:
>>Tom Lane <tgl@sss.pgh.pa.us> writes:
>>
>>>Well, in that case what do you think about
>>>{{1,2},{3,4},{5,6},{7,8}}
>>>vs
>>>{{1,2,3,4},{5,6,7,8}}
> 
>>In the first case the first element is {1,2} and in the second case the first
>>element is {1,2,3,4} so from my point of view there's no way these are the
>>same.
> 
> Well, then I think we're converging on agreement that array comparison
> should always take into account the number of dimensions and the axis
> lengths.  What seems still in question is whether to compare or ignore
> the axis lower bounds.
> 
> I'd argue that ordinary equality should include the lower bounds, but
> I'm willing to provide a separate operator (or whole btree opclass
> if people want it) that ignores the lower bounds.  We just need a name.
> Maybe ~=, ~<, etc?

A couple of thoughts based on the last time I read SQL2003 WRT arrays.

First, the spec only allows arrays to have a lower bound of 1. That 
requirement simplifies a whole lot of things. I don't think that many 
people actually depend on other than 1 as a lower bound (or at least if 
they do, they weren't dumping and reloading those databases prior to 
8.0) -- maybe given other possibly non-backward compatible changes for 
NULLs, we should also change this?

Second, the spec does not really directly allow for multidimensional 
arrays. What it does allow is nesting of arrays. So as Greg states, 
{1,2} is clearly a different array than {1,2,3,4}. I had been thinking 
that when (if?) the array literal parser and related infrastructure is 
rewritten, it should be done so that arrays-as-array-elements are 
processed similar to any scalar element (and perhaps tuples as array 
elements as well). My hope was that eventually anyarray I/O functions 
could eliminate the need to create an array type for every data type you 
wanted to use as an array element.

Joe






Re: Some array semantics issues

From
Tom Lane
Date:
Joe Conway <mail@joeconway.com> writes:
> First, the spec only allows arrays to have a lower bound of 1. That 
> requirement simplifies a whole lot of things. I don't think that many 
> people actually depend on other than 1 as a lower bound (or at least if 
> they do, they weren't dumping and reloading those databases prior to 
> 8.0) -- maybe given other possibly non-backward compatible changes for 
> NULLs, we should also change this?

I don't have a lot of use for arguments that go "we should remove any
functionality that's not in the spec" ... ISTM that variable lower
bounds are clearly useful for some applications, and even if they had
bugs in earlier releases that's not an argument for removing them.

> ... My hope was that eventually anyarray I/O functions 
> could eliminate the need to create an array type for every data type you 
> wanted to use as an array element.

Interesting thought, but then how do you declare the type of an array
column, or the type of a function argument that's not supposed to range
over every array type?  If we can't use an OID to identify a data type
completely, we're going to have lots of problems.
        regards, tom lane


Re: Some array semantics issues

From
Greg Stark
Date:
Tom Lane <tgl@sss.pgh.pa.us> writes:

> Joe Conway <mail@joeconway.com> writes:
> > First, the spec only allows arrays to have a lower bound of 1. That 
> > requirement simplifies a whole lot of things. I don't think that many 
> > people actually depend on other than 1 as a lower bound (or at least if 
> > they do, they weren't dumping and reloading those databases prior to 
> > 8.0) -- maybe given other possibly non-backward compatible changes for 
> > NULLs, we should also change this?
> 
> I don't have a lot of use for arguments that go "we should remove any
> functionality that's not in the spec" ... ISTM that variable lower
> bounds are clearly useful for some applications, and even if they had
> bugs in earlier releases that's not an argument for removing them.

Normally I don't either. But it's not just functionality that's not in the
spec. It's functionality that creates behaviour the spec specifies otherwise.

That is, if you have an array [1,2] the spec says you can get 1 by referring
to arr[1]. On Postgres you have to take more care. There could easily be code
out there that won't work on Postgres because of this difference.

The main reason for having non-zero lower bounds in the first place was
originally that NULLs weren't allowed in arrays. Otherwise you run into
problems when you try to set arr[5] = 1 when there isn't an arr[1]..arr[4].
But if we have NULLs in arrays then we could easily have all arrays have lower
bounds of 1. We don't even have to store the leading NULL elements.

I think having all arrays start at 1 would actually much simplify the
semantics and avoid a lot of strange corner cases and surprising behaviour
that will follow from having non-1 based arrays.

-- 
greg



Re: Some array semantics issues

From
Joe Conway
Date:
Greg Stark wrote:
> Tom Lane <tgl@sss.pgh.pa.us> writes:
>>Joe Conway <mail@joeconway.com> writes:
>>>First, the spec only allows arrays to have a lower bound of 1. That 
>>>requirement simplifies a whole lot of things. I don't think that many 
>>>people actually depend on other than 1 as a lower bound (or at least if 
>>>they do, they weren't dumping and reloading those databases prior to 
>>>8.0) -- maybe given other possibly non-backward compatible changes for 
>>>NULLs, we should also change this?
>>
>>I don't have a lot of use for arguments that go "we should remove any
>>functionality that's not in the spec" ... ISTM that variable lower
>>bounds are clearly useful for some applications, and even if they had
>>bugs in earlier releases that's not an argument for removing them.
> 
> Normally I don't either. But it's not just functionality that's not in the
> spec. It's functionality that creates behaviour the spec specifies otherwise.

This is an important point. SQL2003 doesn't leave this as undefined:

4.10.2 Arrays
An array is a collection A in which each element is associated with 
exactly one ordinal position in A. If n is the cardinality of A, then 
the ordinal position p of an element is an integer in the range 1 (one) 
<= p <= n. If EDT is the element type of A, then A can thus be 
considered as a function of the integers in the range 1 (one) to n into EDT.

Joe


Re: Some array semantics issues

From
Tom Lane
Date:
Greg Stark <gsstark@mit.edu> writes:
> Tom Lane <tgl@sss.pgh.pa.us> writes:
>> I don't have a lot of use for arguments that go "we should remove any
>> functionality that's not in the spec" ... ISTM that variable lower
>> bounds are clearly useful for some applications, and even if they had
>> bugs in earlier releases that's not an argument for removing them.

> Normally I don't either. But it's not just functionality that's not in the
> spec. It's functionality that creates behaviour the spec specifies otherwise.

AFAICS the only cases that give rise to arrays with lower bounds other
than one are:* direct entry of a literal with explicit lower bound;* assignment to a subscript or slice below 1;*
array_prepend(and the N/N+1-dimension case of array_cat).
 

I don't think "it's not in the spec" is a reason for rejecting #1 or #2.
But I agree that there is a reasonable case for modifying array_prepend
and array_cat so that they won't generate non-spec lower bounds where
none existed before.

How about changing them so that the lower bound of the right-hand array
is preserved, rather than decreased by one?
        regards, tom lane


Re: Some array semantics issues

From
Joe Conway
Date:
Tom Lane wrote:
> AFAICS the only cases that give rise to arrays with lower bounds other
> than one are:
>     * direct entry of a literal with explicit lower bound;
>     * assignment to a subscript or slice below 1;
>     * array_prepend (and the N/N+1-dimension case of array_cat).
> 
> I don't think "it's not in the spec" is a reason for rejecting #1 or #2.
> But I agree that there is a reasonable case for modifying array_prepend
> and array_cat so that they won't generate non-spec lower bounds where
> none existed before.
> 
> How about changing them so that the lower bound of the right-hand array
> is preserved, rather than decreased by one?
> 

That seems reasonable. I'll do it if you'd like...

Joe




Re: Some array semantics issues

From
Tom Lane
Date:
Joe Conway <mail@joeconway.com> writes:
> Tom Lane wrote:
>> How about changing them so that the lower bound of the right-hand array
>> is preserved, rather than decreased by one?

> That seems reasonable. I'll do it if you'd like...

I'll take care of it --- I'm still hacking in that general area anyway.
        regards, tom lane


Re: Some array semantics issues

From
Joe Conway
Date:
Tom Lane wrote:
> Joe Conway <mail@joeconway.com> writes:
>>... My hope was that eventually anyarray I/O functions 
>>could eliminate the need to create an array type for every data type you 
>>wanted to use as an array element.
> 
> Interesting thought, but then how do you declare the type of an array
> column, or the type of a function argument that's not supposed to range
> over every array type?  If we can't use an OID to identify a data type
> completely, we're going to have lots of problems.
> 

You only really need two pieces of information to uniquely identify an 
array data type -- the OID of the (leaf-node) scalar elements, and the 
fact that what you have is an array.  Even if it is a nested structure 
of arrays, by recursing (max 5 times), you can eventually find the 
scalar elements. Last year I played around with this and had it 
partially working, but then got too busy to pursue it further.

Joe




Re: Some array semantics issues

From
Tom Lane
Date:
Joe Conway <mail@joeconway.com> writes:
> Tom Lane wrote:
>> If we can't use an OID to identify a data type
>> completely, we're going to have lots of problems.

> You only really need two pieces of information to uniquely identify an 
> array data type -- the OID of the (leaf-node) scalar elements, and the 
> fact that what you have is an array.

Yes, but replacing one piece of data (type OID) with two (type OID plus
array flag bit) is going to be notationally horrible.  What's more it
will force knowledge about the existence of array types into a lot of
places that don't have to care right now.  For something that is
ultimately just a flange on the side of SQL, that doesn't seem like a
good tradeoff.
        regards, tom lane


Re: Some array semantics issues

From
Tom Lane
Date:
Greg Stark <gsstark@mit.edu> writes:
> [ this is back up-thread a bit ]
> And changing that would make it harder to test just the contents of the array
> without having to match bounds as well. That is, You couldn't say "list =
> '{1,2}'" to test if the array contained 1,2. You would have to, well, I'm not
> even sure how you would test it actually. Maybe something kludgy like
> "'{}'::int[] || list = '{1,2}'" ?

Given the just-committed changes to avoid having array_push/array_cat
generate non-spec lower bounds unnecessarily, do you still think it's
important to have a variant of array comparison that ignores lower
bounds?

ISTM that ignoring lower bounds is definitely something that violates
the principle of least surprise.  There was an ease-of-use argument
for it before, but now that we changed the other thing I think we don't
need such a kluge.
        regards, tom lane


Re: Some array semantics issues

From
Joe Conway
Date:
Tom Lane wrote:
> Given the just-committed changes to avoid having array_push/array_cat
> generate non-spec lower bounds unnecessarily, do you still think it's
> important to have a variant of array comparison that ignores lower
> bounds?
> 
> ISTM that ignoring lower bounds is definitely something that violates
> the principle of least surprise.  There was an ease-of-use argument
> for it before, but now that we changed the other thing I think we don't
> need such a kluge.

I agree. At this point, having an array with other than 1 as a lower 
bound takes a very conscious decision. I'd think that if you cared that 
much about the lower bound, you'd not want to ignore it when it comes to 
comparison.

Joe