Thread: Some array semantics issues
While hacking on the nulls-in-arrays addition, I noticed a couple of behaviors that seem a bit bogus to me. First, array slicing returns NULL any time the requested slice falls completely outside the array bounds. For instance regression=# select ('{1,2,3}'::int[])[2:4];int4 -------{2,3} (1 row) regression=# select ('{1,2,3}'::int[])[3:4];int4 ------{3} (1 row) regression=# select ('{1,2,3}'::int[])[4:4];int4 ------ (1 row) I'm inclined to think that an empty array ('{}') would be a more sensible result. Second, array comparison compares the contained values but pays no attention to the array dimensions. Thus for example regression=# select '[0:2]={1,2,3}'::int[] = '{1,2,3}'::int[];?column? ----------t (1 row) regression=# select '{1,2,3,4}'::int[] = '{{1,2},{3,4}}'::int[];?column? ----------t (1 row) This seems pretty bogus as well. To maintain backwards compatibility as much as possible, I'd be inclined to sort first on the contained values as we do now, but if they are equal sort on the array dimension data. I'm not too concerned about exactly what the sort order is for different-shaped arrays, I just don't think the above cases should be considered "equal". Comments? regards, tom lane
Tom Lane <tgl@sss.pgh.pa.us> writes: > regression=# select '[0:2]={1,2,3}'::int[] = '{1,2,3}'::int[]; > ?column? > ---------- > t > (1 row) > > regression=# select '{1,2,3,4}'::int[] = '{{1,2},{3,4}}'::int[]; > ?column? > ---------- > t > (1 row) > > This seems pretty bogus as well. The second case seems utterly bogus. But the first case seems maybe justifiable. maybe. In the past Postgres treated the array bounds as so insignificant they weren't even worth preserving across a dump/restore. And changing that would make it harder to test just the contents of the array without having to match bounds as well. That is, You couldn't say "list = '{1,2}'" to test if the array contained 1,2. You would have to, well, I'm not even sure how you would test it actually. Maybe something kludgy like "'{}'::int[] || list = '{1,2}'" ? I'm not entirely against the idea of making array bounds significant but I guess we would need some convenient way of taking them out of the picture too. Perhaps another equality operator. -- greg
Greg Stark <gsstark@mit.edu> writes: > And changing that would make it harder to test just the contents of the array > without having to match bounds as well. Fair point, but currently it's impossible to make a comparison that *does* consider the bounds, which one would think would be the ordinary meaning of equality. > I'm not entirely against the idea of making array bounds significant > but I guess we would need some convenient way of taking them out of > the picture too. Perhaps another equality operator. I could go for a separate operator that has the current behavior (might as well ignore number of dimensions too, if we're going to ignore bounds). Any thoughts about the operator name? regards, tom lane
Tom Lane <tgl@sss.pgh.pa.us> writes: > I could go for a separate operator that has the current behavior > (might as well ignore number of dimensions too, if we're going to > ignore bounds). Any thoughts about the operator name? Well to me these are two different cases. At least the way I see it {1,2} is a list of two numbers, and {{1,2,},{3,4}} is a list of two lists. They aren't the same and they don't even contain the same thing. It occurs to me that it would also make sense to have an operator that considered the arrays in an order-insensitive comparison. It all depends on what you're using the arrays to represent. If you're implementing something where each slot of the array corresponds to some specific meaning then you need the array bounds included. If you're representing stacks where the array bounds march up as they're used then you don't really want to include the array bounds in your comparison. If you're representing a denormalized one-to-many relationship (being aware of all the associated pros and cons of denormalization of course) then you really don't care about the order at all. Personally I can't really think of any cases where the shape of the array doesn't matter though. -- greg
Greg Stark <gsstark@mit.edu> writes: > Tom Lane <tgl@sss.pgh.pa.us> writes: >> I could go for a separate operator that has the current behavior >> (might as well ignore number of dimensions too, if we're going to >> ignore bounds). Any thoughts about the operator name? > Well to me these are two different cases. At least the way I see it {1,2} is a > list of two numbers, and {{1,2,},{3,4}} is a list of two lists. They aren't > the same and they don't even contain the same thing. Well, in that case what do you think about{{1,2},{3,4},{5,6},{7,8}} vs{{1,2,3,4},{5,6,7,8}} These have the same physical contents and the same number of dimensions, so unless you want to consider them equal, you have to consider the dimension values. I think what we may be dancing around here is that there are some cases where it makes sense to ignore the lower bounds, as opposed to the axis lengths. I'm not convinced that there are any cases where it makes sense to compare the number of dimensions without comparing the axis lengths --- but I can see the argument that lower bounds might be uninteresting, particularly seeing that array_push and and array_cat do some not-necessarily-always-right things with them. regards, tom lane
Tom Lane <tgl@sss.pgh.pa.us> writes: > Well, in that case what do you think about > {{1,2},{3,4},{5,6},{7,8}} > vs > {{1,2,3,4},{5,6,7,8}} In the first case the first element is {1,2} and in the second case the first element is {1,2,3,4} so from my point of view there's no way these are the same. None of the three use cases I conjured would want them considered equal, which isn't to say there isn't some data structure somewhere out there which would, but I haven't thought of it. -- greg
Greg Stark <gsstark@mit.edu> writes: > Tom Lane <tgl@sss.pgh.pa.us> writes: >> Well, in that case what do you think about >> {{1,2},{3,4},{5,6},{7,8}} >> vs >> {{1,2,3,4},{5,6,7,8}} > In the first case the first element is {1,2} and in the second case the first > element is {1,2,3,4} so from my point of view there's no way these are the > same. Well, then I think we're converging on agreement that array comparison should always take into account the number of dimensions and the axis lengths. What seems still in question is whether to compare or ignore the axis lower bounds. I'd argue that ordinary equality should include the lower bounds, but I'm willing to provide a separate operator (or whole btree opclass if people want it) that ignores the lower bounds. We just need a name. Maybe ~=, ~<, etc? regards, tom lane
On Wed, Nov 16, 2005 at 03:03:53PM -0500, Greg Stark wrote: > > Tom Lane <tgl@sss.pgh.pa.us> writes: > > > I could go for a separate operator that has the current behavior > > (might as well ignore number of dimensions too, if we're going to > > ignore bounds). Any thoughts about the operator name? > > Well to me these are two different cases. At least the way I see it > {1,2} is a list of two numbers, and {{1,2,},{3,4}} is a list of two > lists. They aren't the same and they don't even contain the same > thing. Right. > It occurs to me that it would also make sense to have an operator > that considered the arrays in an order-insensitive comparison. That sounds more like the SQL:2003 MULTISET, which is essentially unordered. Any plans for these? Cheers, D -- David Fetter david@fetter.org http://fetter.org/ phone: +1 510 893 6100 mobile: +1 415 235 3778 Remember to vote!
David Fetter <david@fetter.org> writes: > On Wed, Nov 16, 2005 at 03:03:53PM -0500, Greg Stark wrote: >> It occurs to me that it would also make sense to have an operator >> that considered the arrays in an order-insensitive comparison. > That sounds more like the SQL:2003 MULTISET, which is essentially > unordered. Any plans for these? Seems to me it would be really expensive to try to make such a comparison directly with the present array representation. The only sensible way to do it would be to sort the elements of the two arrays (using the comparison operator of the element data type) and then compare the results. So you don't actually need a variant equality operator, you just need a generic array_sort() function, and then go "array_sort(x) = array_sort(y)". Such a function might have other uses besides this, too. regards, tom lane
Tom Lane wrote: > Greg Stark <gsstark@mit.edu> writes: >>Tom Lane <tgl@sss.pgh.pa.us> writes: >> >>>Well, in that case what do you think about >>>{{1,2},{3,4},{5,6},{7,8}} >>>vs >>>{{1,2,3,4},{5,6,7,8}} > >>In the first case the first element is {1,2} and in the second case the first >>element is {1,2,3,4} so from my point of view there's no way these are the >>same. > > Well, then I think we're converging on agreement that array comparison > should always take into account the number of dimensions and the axis > lengths. What seems still in question is whether to compare or ignore > the axis lower bounds. > > I'd argue that ordinary equality should include the lower bounds, but > I'm willing to provide a separate operator (or whole btree opclass > if people want it) that ignores the lower bounds. We just need a name. > Maybe ~=, ~<, etc? A couple of thoughts based on the last time I read SQL2003 WRT arrays. First, the spec only allows arrays to have a lower bound of 1. That requirement simplifies a whole lot of things. I don't think that many people actually depend on other than 1 as a lower bound (or at least if they do, they weren't dumping and reloading those databases prior to 8.0) -- maybe given other possibly non-backward compatible changes for NULLs, we should also change this? Second, the spec does not really directly allow for multidimensional arrays. What it does allow is nesting of arrays. So as Greg states, {1,2} is clearly a different array than {1,2,3,4}. I had been thinking that when (if?) the array literal parser and related infrastructure is rewritten, it should be done so that arrays-as-array-elements are processed similar to any scalar element (and perhaps tuples as array elements as well). My hope was that eventually anyarray I/O functions could eliminate the need to create an array type for every data type you wanted to use as an array element. Joe
Joe Conway <mail@joeconway.com> writes: > First, the spec only allows arrays to have a lower bound of 1. That > requirement simplifies a whole lot of things. I don't think that many > people actually depend on other than 1 as a lower bound (or at least if > they do, they weren't dumping and reloading those databases prior to > 8.0) -- maybe given other possibly non-backward compatible changes for > NULLs, we should also change this? I don't have a lot of use for arguments that go "we should remove any functionality that's not in the spec" ... ISTM that variable lower bounds are clearly useful for some applications, and even if they had bugs in earlier releases that's not an argument for removing them. > ... My hope was that eventually anyarray I/O functions > could eliminate the need to create an array type for every data type you > wanted to use as an array element. Interesting thought, but then how do you declare the type of an array column, or the type of a function argument that's not supposed to range over every array type? If we can't use an OID to identify a data type completely, we're going to have lots of problems. regards, tom lane
Tom Lane <tgl@sss.pgh.pa.us> writes: > Joe Conway <mail@joeconway.com> writes: > > First, the spec only allows arrays to have a lower bound of 1. That > > requirement simplifies a whole lot of things. I don't think that many > > people actually depend on other than 1 as a lower bound (or at least if > > they do, they weren't dumping and reloading those databases prior to > > 8.0) -- maybe given other possibly non-backward compatible changes for > > NULLs, we should also change this? > > I don't have a lot of use for arguments that go "we should remove any > functionality that's not in the spec" ... ISTM that variable lower > bounds are clearly useful for some applications, and even if they had > bugs in earlier releases that's not an argument for removing them. Normally I don't either. But it's not just functionality that's not in the spec. It's functionality that creates behaviour the spec specifies otherwise. That is, if you have an array [1,2] the spec says you can get 1 by referring to arr[1]. On Postgres you have to take more care. There could easily be code out there that won't work on Postgres because of this difference. The main reason for having non-zero lower bounds in the first place was originally that NULLs weren't allowed in arrays. Otherwise you run into problems when you try to set arr[5] = 1 when there isn't an arr[1]..arr[4]. But if we have NULLs in arrays then we could easily have all arrays have lower bounds of 1. We don't even have to store the leading NULL elements. I think having all arrays start at 1 would actually much simplify the semantics and avoid a lot of strange corner cases and surprising behaviour that will follow from having non-1 based arrays. -- greg
Greg Stark wrote: > Tom Lane <tgl@sss.pgh.pa.us> writes: >>Joe Conway <mail@joeconway.com> writes: >>>First, the spec only allows arrays to have a lower bound of 1. That >>>requirement simplifies a whole lot of things. I don't think that many >>>people actually depend on other than 1 as a lower bound (or at least if >>>they do, they weren't dumping and reloading those databases prior to >>>8.0) -- maybe given other possibly non-backward compatible changes for >>>NULLs, we should also change this? >> >>I don't have a lot of use for arguments that go "we should remove any >>functionality that's not in the spec" ... ISTM that variable lower >>bounds are clearly useful for some applications, and even if they had >>bugs in earlier releases that's not an argument for removing them. > > Normally I don't either. But it's not just functionality that's not in the > spec. It's functionality that creates behaviour the spec specifies otherwise. This is an important point. SQL2003 doesn't leave this as undefined: 4.10.2 Arrays An array is a collection A in which each element is associated with exactly one ordinal position in A. If n is the cardinality of A, then the ordinal position p of an element is an integer in the range 1 (one) <= p <= n. If EDT is the element type of A, then A can thus be considered as a function of the integers in the range 1 (one) to n into EDT. Joe
Greg Stark <gsstark@mit.edu> writes: > Tom Lane <tgl@sss.pgh.pa.us> writes: >> I don't have a lot of use for arguments that go "we should remove any >> functionality that's not in the spec" ... ISTM that variable lower >> bounds are clearly useful for some applications, and even if they had >> bugs in earlier releases that's not an argument for removing them. > Normally I don't either. But it's not just functionality that's not in the > spec. It's functionality that creates behaviour the spec specifies otherwise. AFAICS the only cases that give rise to arrays with lower bounds other than one are:* direct entry of a literal with explicit lower bound;* assignment to a subscript or slice below 1;* array_prepend(and the N/N+1-dimension case of array_cat). I don't think "it's not in the spec" is a reason for rejecting #1 or #2. But I agree that there is a reasonable case for modifying array_prepend and array_cat so that they won't generate non-spec lower bounds where none existed before. How about changing them so that the lower bound of the right-hand array is preserved, rather than decreased by one? regards, tom lane
Tom Lane wrote: > AFAICS the only cases that give rise to arrays with lower bounds other > than one are: > * direct entry of a literal with explicit lower bound; > * assignment to a subscript or slice below 1; > * array_prepend (and the N/N+1-dimension case of array_cat). > > I don't think "it's not in the spec" is a reason for rejecting #1 or #2. > But I agree that there is a reasonable case for modifying array_prepend > and array_cat so that they won't generate non-spec lower bounds where > none existed before. > > How about changing them so that the lower bound of the right-hand array > is preserved, rather than decreased by one? > That seems reasonable. I'll do it if you'd like... Joe
Joe Conway <mail@joeconway.com> writes: > Tom Lane wrote: >> How about changing them so that the lower bound of the right-hand array >> is preserved, rather than decreased by one? > That seems reasonable. I'll do it if you'd like... I'll take care of it --- I'm still hacking in that general area anyway. regards, tom lane
Tom Lane wrote: > Joe Conway <mail@joeconway.com> writes: >>... My hope was that eventually anyarray I/O functions >>could eliminate the need to create an array type for every data type you >>wanted to use as an array element. > > Interesting thought, but then how do you declare the type of an array > column, or the type of a function argument that's not supposed to range > over every array type? If we can't use an OID to identify a data type > completely, we're going to have lots of problems. > You only really need two pieces of information to uniquely identify an array data type -- the OID of the (leaf-node) scalar elements, and the fact that what you have is an array. Even if it is a nested structure of arrays, by recursing (max 5 times), you can eventually find the scalar elements. Last year I played around with this and had it partially working, but then got too busy to pursue it further. Joe
Joe Conway <mail@joeconway.com> writes: > Tom Lane wrote: >> If we can't use an OID to identify a data type >> completely, we're going to have lots of problems. > You only really need two pieces of information to uniquely identify an > array data type -- the OID of the (leaf-node) scalar elements, and the > fact that what you have is an array. Yes, but replacing one piece of data (type OID) with two (type OID plus array flag bit) is going to be notationally horrible. What's more it will force knowledge about the existence of array types into a lot of places that don't have to care right now. For something that is ultimately just a flange on the side of SQL, that doesn't seem like a good tradeoff. regards, tom lane
Greg Stark <gsstark@mit.edu> writes: > [ this is back up-thread a bit ] > And changing that would make it harder to test just the contents of the array > without having to match bounds as well. That is, You couldn't say "list = > '{1,2}'" to test if the array contained 1,2. You would have to, well, I'm not > even sure how you would test it actually. Maybe something kludgy like > "'{}'::int[] || list = '{1,2}'" ? Given the just-committed changes to avoid having array_push/array_cat generate non-spec lower bounds unnecessarily, do you still think it's important to have a variant of array comparison that ignores lower bounds? ISTM that ignoring lower bounds is definitely something that violates the principle of least surprise. There was an ease-of-use argument for it before, but now that we changed the other thing I think we don't need such a kluge. regards, tom lane
Tom Lane wrote: > Given the just-committed changes to avoid having array_push/array_cat > generate non-spec lower bounds unnecessarily, do you still think it's > important to have a variant of array comparison that ignores lower > bounds? > > ISTM that ignoring lower bounds is definitely something that violates > the principle of least surprise. There was an ease-of-use argument > for it before, but now that we changed the other thing I think we don't > need such a kluge. I agree. At this point, having an array with other than 1 as a lower bound takes a very conscious decision. I'd think that if you cared that much about the lower bound, you'd not want to ignore it when it comes to comparison. Joe