Thread: Array comparison & prefix search

Array comparison & prefix search

From
Denes Daniel
Date:
Hi,
 
I have a table like this:
 
CREATE TABLE test (
    type text NOT NULL,
    ident text[] NOT NULL,
    ...
);
ALTER TABLE ONLY test ADD CONSTRAINT test_pkey PRIMARY KEY (type, ident);
 
and I would like to query rows that have a specific "type" and whose "ident" array starts with a some given constants.
I mean something like this:
 
INSERT INTO test VALUES ('one', ARRAY['string']);
INSERT INTO test VALUES ('two', ARRAY['tab', 'str1']);
INSERT INTO test VALUES ('two', ARRAY['test', 'str1']);
INSERT INTO test VALUES ('two', ARRAY['test', 'str2']);
INSERT INTO test VALUES ('two', ARRAY['try', 'str1']);
INSERT INTO test VALUES ('three', ARRAY['some', 'more', 'strings']);
 
SELECT * FROM test WHERE type = 'two' AND ident[1] = 'test';
 
But this query uses the primary key index only for the "type" field, and then filters for ident[1]. Is there a way to make it use the index for the array prefix search too, like with " textcol LIKE '123%' " ? The only way I can think of, is this:
 
SELECT * FROM test WHERE type = 'two' AND (ident >= ARRAY['test', ''] AND ident <= ARRAY['test', NULL]);
 
This uses the index as much as possible, so it's fast, and gives correct results. But something's strange, because it's based on the thing that all strings are greather than or equal to the empty string, and all are less than or equal to NULL... which is fine when ordering rows, so it's fine too in the B-tree (I think), but shouldn't it return no rows, because ('string' <= NULL) is NULL?
 
In fact, ('string' <= NULL) is NULL if I test it directly, or use row-wise comparison, but when I use array comparison, NULL is greather than 'string'.
SELECT 'string' <= NULL::text, ARRAY['string'] <= ARRAY[NULL::text];
This gives me a NULL and a TRUE.
Why? Can I rely on this? If I can't, is there another way to make the array prefix search use the index?
 
Regards,
Denes Daniel

Re: Array comparison & prefix search

From
Sam Mason
Date:
On Fri, Dec 04, 2009 at 06:58:21PM +0100, Denes Daniel wrote:
> SELECT * FROM test WHERE type = 'two' AND ident[1] = 'test';
>
> this query uses the primary key index only for the "type" field, and
> then filters for ident[1]. Is there a way to make it use the index for the
> array prefix search too, like with " textcol LIKE '123%' " ? The only way I
> can think of, is this:

I think you want to create a functional index on ident[1], something
like:

  CREATE INDEX test_my_idx ON test (type,(ident[1]));

> In fact, ('string' <= NULL) is NULL if I test it directly, or use row-wise
> comparison, but when I use array comparison, NULL is greather than 'string'.
> SELECT 'string' <= NULL::text, ARRAY['string'] <= ARRAY[NULL::text];
> This gives me a NULL and a TRUE.

The semantics of this are somewhat fuzzy; I think the behavior is
caused by the fact that the value "as a whole" isn't NULL, hence you get
a non-null result.  You only get a NULL result when the "whole" value is
null, hence values of integer type either have a value or they're null.
As you see, for values of non-atomic type it gets a bit more awkward and
there are various opinions about how they "should" be handled.

--
  Sam  http://samason.me.uk/

Re: Array comparison & prefix search

From
Denes Daniel
Date:
2009/12/4 Sam Mason <sam@samason.me.uk>
 
I think you want to create a functional index on ident[1], something
like:

 CREATE INDEX test_my_idx ON test (type,(ident[1]));
 
Sorry, but this approach is no good, since I may search like:
SELECT * FROM test WHERE type = 'three' AND (ident[1] = 'foo' AND ident[2] = 'bar');
or for the first 3 items in an array with 6 items, or any other prefix...
 
The arrays are all the same length for a given type, but for type 'twenty-three' they may be 23 items long, or even longer for another type, so I can't create an index for all possible cases that way. And yet, all the information needed is in the primary index, I just don't know how to get PostgeSQL to use it.
 
 
The semantics of this are somewhat fuzzy; I think the behavior is
caused by the fact that the value "as a whole" isn't NULL, hence you get
a non-null result.  You only get a NULL result when the "whole" value is
null, hence values of integer type either have a value or they're null.
As you see, for values of non-atomic type it gets a bit more awkward and
there are various opinions about how they "should" be handled.
 
I see, but the documentation says: "Array comparisons compare the array contents element-by-element, [...]". So, if we compare two arrays, where the first difference is this 'string' / NULL thing, then we will reach a point (after comparing all those items that are equal) where 'string' compares to NULL, and the result is that NULL is greater. At least that's the only way I can think of, how I'd get this TRUE result. So is NULL really greater than all other text?
And why is it this way when I'm using an ARRAY[], and the other way when using ROW()?
 
SELECT ARRAY['abc', 'string', 'z'] < ARRAY['abc', NULL::text, 'a'];
--> returns TRUE
SELECT ROW('abc', 'string', 'z') < ROW('abc', NULL::text, 'a');
--> returns NULL
 
 
Regards,
Denes Daniel

Re: Array comparison & prefix search

From
Merlin Moncure
Date:
On Fri, Dec 4, 2009 at 12:58 PM, Denes Daniel <panther-d@freemail.hu> wrote:
> Hi,
>
> I have a table like this:
>
> CREATE TABLE test (
>     type text NOT NULL,
>     ident text[] NOT NULL,
>     ...
> );
> ALTER TABLE ONLY test ADD CONSTRAINT test_pkey PRIMARY KEY (type, ident);
>
> and I would like to query rows that have a specific "type" and whose "ident"
> array starts with a some given constants.
> I mean something like this:
>
> INSERT INTO test VALUES ('one', ARRAY['string']);
> INSERT INTO test VALUES ('two', ARRAY['tab', 'str1']);
> INSERT INTO test VALUES ('two', ARRAY['test', 'str1']);
> INSERT INTO test VALUES ('two', ARRAY['test', 'str2']);
> INSERT INTO test VALUES ('two', ARRAY['try', 'str1']);
> INSERT INTO test VALUES ('three', ARRAY['some', 'more', 'strings']);
>
> SELECT * FROM test WHERE type = 'two' AND ident[1] = 'test';
>
> But this query uses the primary key index only for the "type" field, and
> then filters for ident[1]. Is there a way to make it use the index for the
> array prefix search too, like with " textcol LIKE '123%' " ? The only way I
> can think of, is this:
>
> SELECT * FROM test WHERE type = 'two' AND (ident >= ARRAY['test', ''] AND
> ident <= ARRAY['test', NULL]);
>
> This uses the index as much as possible, so it's fast, and gives correct
> results. But something's strange, because it's based on the thing that all
> strings are greather than or equal to the empty string, and all are less
> than or equal to NULL... which is fine when ordering rows, so it's fine too
> in the B-tree (I think), but shouldn't it return no rows, because ('string'
> <= NULL) is NULL?
>
> In fact, ('string' <= NULL) is NULL if I test it directly, or use row-wise
> comparison, but when I use array comparison, NULL is greather than 'string'.
> SELECT 'string' <= NULL::text, ARRAY['string'] <= ARRAY[NULL::text];
> This gives me a NULL and a TRUE.
> Why? Can I rely on this? If I can't, is there another way to make the array
> prefix search use the index?


AFAIK, your approach is the only solution given your requirements.  It
works well...I've used it often, but usually for integers.  Maybe
there is a missing operator for arrays kinda similar to the contains
operator that would be btree indexable.

merlin

Re: Array comparison & prefix search

From
Sam Mason
Date:
On Sat, Dec 05, 2009 at 02:23:13AM +0100, Denes Daniel wrote:
> 2009/12/4 Sam Mason <sam@samason.me.uk>
> >  CREATE INDEX test_my_idx ON test (type,(ident[1]));
>
> Sorry, but this approach is no good, since I may search like:
> SELECT * FROM test WHERE type = 'three' AND (ident[1] = 'foo' AND ident[2] =
> 'bar');
> or for the first 3 items in an array with 6 items, or any other prefix...

Would a GIN index help?  You'd be able to ask if a 'foo' appears
anywhere in the array (or some subset if you want).  You can then have a
subsequent filter that actually expresses the clause you want.  Not sure
what selectivity you're dealing with and if this would be a problem.

> The arrays are all the same length for a given type, but for type
> 'twenty-three' they may be 23 items long, or even longer for another type,
> so I can't create an index for all possible cases that way. And yet, all the
> information needed is in the primary index, I just don't know how to get
> PostgeSQL to use it.

Arrays and PG (not sure how well other databases handle this case
either) don't work too well.  Have you thought about normalising your
schema a bit to give the database more help?

> And why is it this way when I'm using an ARRAY[], and the other way when
> using ROW()?

I'd say ROW is doing the wrong thing here, but I think other people may
well disagree with me.  Composite/non-atomic types don't exist in the
SQL spec much (AFAIK) hence their behavior is somewhat ad-hoc and tends
to reflect the original use case rather than being too consistent.

--
  Sam  http://samason.me.uk/

Re: Array comparison & prefix search

From
Merlin Moncure
Date:
On Sat, Dec 5, 2009 at 4:54 AM, Sam Mason <sam@samason.me.uk> wrote:
> On Sat, Dec 05, 2009 at 02:23:13AM +0100, Denes Daniel wrote:
>> 2009/12/4 Sam Mason <sam@samason.me.uk>
>> >  CREATE INDEX test_my_idx ON test (type,(ident[1]));
>>
>> Sorry, but this approach is no good, since I may search like:
>> SELECT * FROM test WHERE type = 'three' AND (ident[1] = 'foo' AND ident[2] =
>> 'bar');
>> or for the first 3 items in an array with 6 items, or any other prefix...
>
> Would a GIN index help?  You'd be able to ask if a 'foo' appears
> anywhere in the array (or some subset if you want).  You can then have a
> subsequent filter that actually expresses the clause you want.  Not sure
> what selectivity you're dealing with and if this would be a problem.

GIN is a pretty heavy price to pay for something that should be btree
indexable.  Also note he is using a multi column index with array as
second column...that would be pretty awkward with GIN.

>> The arrays are all the same length for a given type, but for type
>> 'twenty-three' they may be 23 items long, or even longer for another type,
>> so I can't create an index for all possible cases that way. And yet, all the
>> information needed is in the primary index, I just don't know how to get
>> PostgeSQL to use it.
>
> Arrays and PG (not sure how well other databases handle this case
> either) don't work too well.  Have you thought about normalising your
> schema a bit to give the database more help?

Normalizing the data loses the nice property of being able to order
the entire structure using a single index.  He's using the array as if
it was a string...it's basically an optimization.

>> And why is it this way when I'm using an ARRAY[], and the other way when
>> using ROW()?
>
> I'd say ROW is doing the wrong thing here, but I think other people may
> well disagree with me.  Composite/non-atomic types don't exist in the
> SQL spec much (AFAIK) hence their behavior is somewhat ad-hoc and tends
> to reflect the original use case rather than being too consistent.

yeah, pg composite type handling with nulls is all over the place.
you can get just about everything to work though.

merlin

Re: Array comparison & prefix search

From
Denes Daniel
Date:
2009/12/5 Sam Mason <sam@samason.me.uk>
 
Would a GIN index help?  You'd be able to ask if a 'foo' appears
anywhere in the array (or some subset if you want).  You can then have a
subsequent filter that actually expresses the clause you want.  Not sure
what selectivity you're dealing with and if this would be a problem.
 
I think that wouldn't be good for me, since the table will be 2-3M rows large and will be updated very often, and GIN indices are too slow at that. (In fact, the whole table's goal is to avoid updating GIN indices so frequently.)
 
Arrays and PG (not sure how well other databases handle this case
either) don't work too well.  Have you thought about normalising your
schema a bit to give the database more help?
 
I don't have any idea how I could do that... except for creating separate tables for all "type"s. But I don't think that would be a better option. If you have any other idea, I'd really appreciate it.
 
I'd say ROW is doing the wrong thing here, but I think other people may
well disagree with me.  Composite/non-atomic types don't exist in the
SQL spec much (AFAIK) hence their behavior is somewhat ad-hoc and tends
to reflect the original use case rather than being too consistent.
 
According to the documentation,
"Note: Prior to PostgreSQL 8.2, the <, <=, > and >= cases were not handled per SQL specification."
I think the way ROW comparisons work now is per SQL specification.
 
But wait! Thank you for making me read this part of the docs, because I've just found what I was looking for, at the very end of the page:
 
Note: The SQL specification requires row-wise comparison to return NULL if the result depends on comparing two NULL values or a NULL and a non-NULL. PostgreSQL does this only when comparing the results of two row constructors or comparing a row constructor to the output of a subquery (as in Section 9.20). In other contexts where two composite-type values are compared, two NULL field values are considered equal, and a NULL is considered larger than a non-NULL. This is necessary in order to have consistent sorting and indexing behavior for composite types.
 
I was sure I've read this part of the docs a hundred times, so I've gone after why I didn't find this before: this note is new in the 8.4 docs, it wasn't there before (and I'm using 8.3).
But I'm pretty sure now that I can rely on this.
 
 
Thanks,
Denes Daniel

Re: Array comparison & prefix search

From
Denes Daniel
Date:
2009/12/5 Merlin Moncure <mmoncure@gmail.com>
 
AFAIK, your approach is the only solution given your requirements.  It
works well...I've used it often, but usually for integers.  Maybe
there is a missing operator for arrays kinda similar to the contains
operator that would be btree indexable.
 
Yes, I also think that an array prefix equality operator would be good to have... but now that I've found it in the documentation that this solution can be relied on (see my other mail), I can work around the lack of that.
 
Thanks,
Denes Daniel

Re: Array comparison & prefix search

From
Merlin Moncure
Date:
On Sat, Dec 5, 2009 at 10:31 AM, Denes Daniel <panther-d@freemail.hu> wrote:
> According to the documentation,
> http://www.postgresql.org/docs/8.4/static/functions-comparisons.html#ROW-WISE-COMPARISON
> "Note: Prior to PostgreSQL 8.2, the <, <=, > and >= cases were not handled
> per SQL specification."
> I think the way ROW comparisons work now is per SQL specification.
>
> But wait! Thank you for making me read this part of the docs, because I've
> just found what I was looking for, at the very end of the page:
>
>>
>> Note: The SQL specification requires row-wise comparison to return NULL if
>> the result depends on comparing two NULL values or a NULL and a non-NULL.
>> PostgreSQL does this only when comparing the results of two row constructors
>> or comparing a row constructor to the output of a subquery (as in Section
>> 9.20). In other contexts where two composite-type values are compared, two
>> NULL field values are considered equal, and a NULL is considered larger than
>> a non-NULL. This is necessary in order to have consistent sorting and
>> indexing behavior for composite types.
>
>
> I was sure I've read this part of the docs a hundred times, so I've gone
> after why I didn't find this before: this note is new in the 8.4 docs, it
> wasn't there before (and I'm using 8.3).
> http://www.postgresql.org/docs/8.3/static/functions-comparisons.html#ROW-WISE-COMPARISON
> But I'm pretty sure now that I can rely on this.

Note, composite types != arrays.  Being able to index composite types
is new to 8.4 (you have been able to do arrays for longer than that).
Postgres handling of nullls in composite types is pretty funky, but
nulls being highval in arrays for indexing purposes is probably pretty
safe to rely on.

merlin

Re: Array comparison & prefix search

From
Filip Rembiałkowski
Date:
without digging too much into the details - just a suggestion:

look at the ltree contrib. it actually provides an indexable array
prefix search.




2009/12/4 Denes Daniel <panther-d@freemail.hu>:
> Hi,
>
> I have a table like this:
>
> CREATE TABLE test (
>     type text NOT NULL,
>     ident text[] NOT NULL,
>     ...
> );
> ALTER TABLE ONLY test ADD CONSTRAINT test_pkey PRIMARY KEY (type, ident);
>
> and I would like to query rows that have a specific "type" and whose "ident"
> array starts with a some given constants.
> I mean something like this:
>
> INSERT INTO test VALUES ('one', ARRAY['string']);
> INSERT INTO test VALUES ('two', ARRAY['tab', 'str1']);
> INSERT INTO test VALUES ('two', ARRAY['test', 'str1']);
> INSERT INTO test VALUES ('two', ARRAY['test', 'str2']);
> INSERT INTO test VALUES ('two', ARRAY['try', 'str1']);
> INSERT INTO test VALUES ('three', ARRAY['some', 'more', 'strings']);
>
> SELECT * FROM test WHERE type = 'two' AND ident[1] = 'test';
>
> But this query uses the primary key index only for the "type" field, and
> then filters for ident[1]. Is there a way to make it use the index for the
> array prefix search too, like with " textcol LIKE '123%' " ? The only way I
> can think of, is this:
>
> SELECT * FROM test WHERE type = 'two' AND (ident >= ARRAY['test', ''] AND
> ident <= ARRAY['test', NULL]);
>
> This uses the index as much as possible, so it's fast, and gives correct
> results. But something's strange, because it's based on the thing that all
> strings are greather than or equal to the empty string, and all are less
> than or equal to NULL... which is fine when ordering rows, so it's fine too
> in the B-tree (I think), but shouldn't it return no rows, because ('string'
> <= NULL) is NULL?
>
> In fact, ('string' <= NULL) is NULL if I test it directly, or use row-wise
> comparison, but when I use array comparison, NULL is greather than 'string'.
> SELECT 'string' <= NULL::text, ARRAY['string'] <= ARRAY[NULL::text];
> This gives me a NULL and a TRUE.
> Why? Can I rely on this? If I can't, is there another way to make the array
> prefix search use the index?
>
> Regards,
> Denes Daniel



--
Filip Rembiałkowski
JID,mailto:filip.rembialkowski@gmail.com
http://filip.rembialkowski.net/

Re: Array comparison & prefix search

From
Sam Mason
Date:
On Sat, Dec 05, 2009 at 09:54:58AM -0500, Merlin Moncure wrote:
> GIN is a pretty heavy price to pay for something that should be btree
> indexable.  Also note he is using a multi column index with array as
> second column...that would be pretty awkward with GIN.

Yup, sounds as though it's not going to work here.  I was mainly
suggesting it as it's working now, as opposed to something that
could/should be made to work.

> Normalizing the data loses the nice property of being able to order
> the entire structure using a single index.  He's using the array as if
> it was a string...it's basically an optimization.

Hum, not sure why this didn't come up already: what about having an
index on (type,(array_to_string(ident,'##')) and relying on the already
existing optimizations for string prefixes.  Not sure what sort of
values can be used in "ident", but it could work.

--
  Sam  http://samason.me.uk/

Re: Array comparison & prefix search

From
Merlin Moncure
Date:
On Mon, Dec 7, 2009 at 7:01 AM, Sam Mason <sam@samason.me.uk> wrote:
> On Sat, Dec 05, 2009 at 09:54:58AM -0500, Merlin Moncure wrote:
>> GIN is a pretty heavy price to pay for something that should be btree
>> indexable.  Also note he is using a multi column index with array as
>> second column...that would be pretty awkward with GIN.
>
> Yup, sounds as though it's not going to work here.  I was mainly
> suggesting it as it's working now, as opposed to something that
> could/should be made to work.
>
>> Normalizing the data loses the nice property of being able to order
>> the entire structure using a single index.  He's using the array as if
>> it was a string...it's basically an optimization.
>
> Hum, not sure why this didn't come up already: what about having an
> index on (type,(array_to_string(ident,'##')) and relying on the already
> existing optimizations for string prefixes.  Not sure what sort of
> values can be used in "ident", but it could work.

hm, that's certainly an interesting idea, but I think unless you pad
all the strings out you are going to run into some odd ordering
issues.  If it works though I think you'll have a much tighter index
than the raw one over the array.

merlin

Re: Array comparison & prefix search

From
Denes Daniel
Date:
I tried making the "ident" column a text instead of text[] in the beginning, but searches were approximately of the same speed; so I voted for the array, because this way there isn't even a possibility for the separator ("##") to cause problems.
 
Anyway, the "ident BETWEEN ARRAY['foo', 'bar'] AND ARRAY['foo', 'bar', NULL]" approach works really fast (uses the index), and selects all arrays that are equal to or start with ['foo', 'bar'].
 
Thanks everybody,
Denes Daniel


 
2009/12/7 Sam Mason <sam@samason.me.uk>
On Sat, Dec 05, 2009 at 09:54:58AM -0500, Merlin Moncure wrote:
> GIN is a pretty heavy price to pay for something that should be btree
> indexable.  Also note he is using a multi column index with array as
> second column...that would be pretty awkward with GIN.

Yup, sounds as though it's not going to work here.  I was mainly
suggesting it as it's working now, as opposed to something that
could/should be made to work.

> Normalizing the data loses the nice property of being able to order
> the entire structure using a single index.  He's using the array as if
> it was a string...it's basically an optimization.

Hum, not sure why this didn't come up already: what about having an
index on (type,(array_to_string(ident,'##')) and relying on the already
existing optimizations for string prefixes.  Not sure what sort of
values can be used in "ident", but it could work.