Thread: Gist indexes on int arrays

Gist indexes on int arrays

From
Greg Stark
Date:
> What do you mean??
> GiST indexing just indexes columns of type *array* for the &&,=,@,~,@@,
> etc.. operators.

Hm, you're right of course. I wonder where I got the idea that it didn't
handle these operators.

This is fascinating and could be useful for something I'm working on. 

How do gist indexes interact with more normal data types to index? I have a
situation where I have a table with millions of records, and I'm mostly
operating on a subset of those records, usually 1k-10k of them. 

The queries would look like

WHERE foo_id = ? AND '{1}'::integer[] ~ attr_a AND '{2}'::integer[] ~ attr_b

Right now I'm using the contrib/array *= operator and I have an index on
foo_id. Having to scan through up to 10,000 records isn't great but isn't too
bad. I wonder whether having a gist index and using the ~ operator would be
worthwhile?

The contrib/array, contrib/intagg, and contrib/intarray directories seem to
all be aimed at handling the same thing and seem to provide mostly
complementary features. Perhaps they should all be merged into one package. I
guess it does show there's lots of demand for this type of datatype.

--
greg



Re: Gist indexes on int arrays

From
Achilleus Mantzios
Date:
On 3 Mar 2003, Greg Stark wrote:

>
> > What do you mean??
> > GiST indexing just indexes columns of type *array* for the &&,=,@,~,@@,
> > etc.. operators.
>
> Hm, you're right of course. I wonder where I got the idea that it didn't
> handle these operators.
>
> This is fascinating and could be useful for something I'm working on.
>
> How do gist indexes interact with more normal data types to index? I have a
> situation where I have a table with millions of records, and I'm mostly
> operating on a subset of those records, usually 1k-10k of them.
>
> The queries would look like
>
> WHERE foo_id = ?
>   AND '{1}'::integer[] ~ attr_a
>   AND '{2}'::integer[] ~ attr_b
>
> Right now I'm using the contrib/array *= operator and I have an index on
> foo_id. Having to scan through up to 10,000 records isn't great but isn't too
> bad. I wonder whether having a gist index and using the ~ operator would be
> worthwhile?

Absolutely.
Moreover if your array element positions that you want to compare
against(e.g attr_a[1], or attr_b[n], where n is the last element) are
known, then you could have a function "first" that returns
the first element (you must pay attention to nulls and out of bound
situations), and a function "last" that returns the last element.
Then you could have normal btree indexes on first(attr_a), and on
last(attr_b), but unfortunately not an index on both.


>
> The contrib/array, contrib/intagg, and contrib/intarray directories seem to
> all be aimed at handling the same thing and seem to provide mostly
> complementary features. Perhaps they should all be merged into one package. I
> guess it does show there's lots of demand for this type of datatype.
>
> --
> greg
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org
>

==================================================================
Achilleus Mantzios
S/W Engineer
IT dept
Dynacom Tankers Mngmt
Nikis 4, Glyfada
Athens 16610
Greece
tel:    +30-210-8981112
fax:    +30-210-8981877
email:  achill@matrix.gatewaynet.com       mantzios@softlab.ece.ntua.gr



Re: Gist indexes on int arrays

From
Greg Stark
Date:
Achilleus Mantzios <achill@matrix.gatewaynet.com> writes:

> Moreover if your array element positions that you want to compare
> against(e.g attr_a[1], or attr_b[n], where n is the last element) are
> known, then you could have a function "first" that returns
> the first element ...

Except that's precisely what I'm *not* doing. I'm treating the arrays as sets
and looking for records where the set contains a given value. This is a
denormalized table generated nightly from fully normalized raw data.

So to simplify it, the query might have clauses like:

WHERE foo_id = 900 AND '{5}'::integer[] ~ attribute_set_array

Right now I have a btree index on (foo_id).

Can I have a GiST index on (foo_id, attribute_set_array) and have it be just
as fast at narrowing the search to just foo_id = 900 but also speed up the ~
operation?

Incidentally, it seems odd that there isn't an operator like ~ but optimized
specifically for searching for a single item. It seems awkward and possibly
unnecessarily slow to have to construct an array for the search parameter
every time.

--
greg



Re: Gist indexes on int arrays

From
Achilleus Mantzios
Date:
On 4 Mar 2003, Greg Stark wrote:

> 
> Achilleus Mantzios <achill@matrix.gatewaynet.com> writes:
> 
> > Moreover if your array element positions that you want to compare
> > against(e.g attr_a[1], or attr_b[n], where n is the last element) are
> > known, then you could have a function "first" that returns
> > the first element ...
> 
> Except that's precisely what I'm *not* doing. I'm treating the arrays as sets
> and looking for records where the set contains a given value. This is a
> denormalized table generated nightly from fully normalized raw data.
> 
> So to simplify it, the query might have clauses like:
> 
> WHERE foo_id = 900
>   AND '{5}'::integer[] ~ attribute_set_array
> 
> Right now I have a btree index on (foo_id).
> 
> Can I have a GiST index on (foo_id, attribute_set_array) and have it be just
> as fast at narrowing the search to just foo_id = 900 but also speed up the ~
> operation?

I am afraid you cant do that easily. (but you can follow recent 
discussions on the matter).
int4 does not have an opclass that can cope with "gist".

> 
> Incidentally, it seems odd that there isn't an operator like ~ but optimized
> specifically for searching for a single item. It seems awkward and possibly
> unnecessarily slow to have to construct an array for the search parameter
> every time.
> 

What do you mean by "slow"?

> --
> greg
> 

-- 
==================================================================
Achilleus Mantzios
S/W Engineer
IT dept
Dynacom Tankers Mngmt
Nikis 4, Glyfada
Athens 16610
Greece
tel:    +30-210-8981112
fax:    +30-210-8981877
email:  achill@matrix.gatewaynet.com       mantzios@softlab.ece.ntua.gr



Re: Gist indexes on int arrays

From
Greg Stark
Date:
Greg Stark <gsstark@MIT.EDU> writes:

> Can I have a GiST index on (foo_id, attribute_set_array) and have it be just
> as fast at narrowing the search to just foo_id = 900 but also speed up the ~
> operation?

Hm, so if I understand what I'm reading I can do this if I load the btree_gist
contrib module as well. I'm still not sure whether it'll be worthwhile for
this application though.

I have a bit of a problem though. Is building GiST indexes supposed to take
much much longer than building btree indexes? It's been running nearly an hour
and it's still going. The hard drive is hardly moving so it seems to be all
cpu usage. I don't even see any pgsql_tmp usage.

db=# CREATE INDEX cache_gist_idx on cache using gist ( foo_id , attribute_set gist__int_ops);

postgres 30176 86.3 22.2 64896 57344 ?       R    11:08  40:32 postgres: postgres slo [local] CREATE INDEX


I don't remember exact numbers but building the normal btree index took on the
order of 15m. This will have to be rebuilt nightly, an hour long index build
won't be practical.

--
greg



Re: Gist indexes on int arrays

From
Oleg Bartunov
Date:
On Tue, 4 Mar 2003, Greg Stark wrote:

>
> Greg Stark <gsstark@MIT.EDU> writes:
>
> > Can I have a GiST index on (foo_id, attribute_set_array) and have it be just
> > as fast at narrowing the search to just foo_id = 900 but also speed up the ~
> > operation?
>
> Hm, so if I understand what I'm reading I can do this if I load the btree_gist
> contrib module as well. I'm still not sure whether it'll be worthwhile for
> this application though.
>
> I have a bit of a problem though. Is building GiST indexes supposed to take
> much much longer than building btree indexes? It's been running nearly an hour
> and it's still going. The hard drive is hardly moving so it seems to be all
> cpu usage. I don't even see any pgsql_tmp usage.
>
> db=# CREATE INDEX cache_gist_idx on cache using gist ( foo_id , attribute_set gist__int_ops);
>
> postgres 30176 86.3 22.2 64896 57344 ?       R    11:08  40:32 postgres: postgres slo [local] CREATE INDEX
>
>
> I don't remember exact numbers but building the normal btree index took on the
> order of 15m. This will have to be rebuilt nightly, an hour long index build
> won't be practical.

what's the time to create gist indices separately ? I suppose 15m is the time
to create btree index on *single* column ?

>
> --
> greg
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 3: if posting/reading through Usenet, please send an appropriate
> subscribe-nomail command to majordomo@postgresql.org so that your
> message can get through to the mailing list cleanly
>
Regards,    Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83


Re: Gist indexes on int arrays

From
Oleg Bartunov
Date:
On Tue, 4 Mar 2003, Achilleus Mantzios wrote:

> On 4 Mar 2003, Greg Stark wrote:
>
> >
> > Achilleus Mantzios <achill@matrix.gatewaynet.com> writes:
> >
> > > Moreover if your array element positions that you want to compare
> > > against(e.g attr_a[1], or attr_b[n], where n is the last element) are
> > > known, then you could have a function "first" that returns
> > > the first element ...
> >
> > Except that's precisely what I'm *not* doing. I'm treating the arrays as sets
> > and looking for records where the set contains a given value. This is a
> > denormalized table generated nightly from fully normalized raw data.
> >
> > So to simplify it, the query might have clauses like:
> >
> > WHERE foo_id = 900
> >   AND '{5}'::integer[] ~ attribute_set_array
> >
> > Right now I have a btree index on (foo_id).
> >
> > Can I have a GiST index on (foo_id, attribute_set_array) and have it be just
> > as fast at narrowing the search to just foo_id = 900 but also speed up the ~
> > operation?
>
> I am afraid you cant do that easily. (but you can follow recent
> discussions on the matter).
> int4 does not have an opclass that can cope with "gist".


no-no, just use contrib/btree_gist !

>
> >
> > Incidentally, it seems odd that there isn't an operator like ~ but optimized
> > specifically for searching for a single item. It seems awkward and possibly
> > unnecessarily slow to have to construct an array for the search parameter
> > every time.
> >
>
> What do you mean by "slow"?
>
> > --
> > greg
> >
>
>
Regards,    Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83


Re: Gist indexes on int arrays

From
Greg Stark
Date:
Oleg Bartunov <oleg@sai.msu.su> writes:

> > db=# CREATE INDEX cache_gist_idx on cache using gist ( foo_id , attribute_set gist__int_ops);

> what's the time to create gist indices separately ? I suppose 15m is the time
> to create btree index on *single* column ?

Unfortunately I can't run timing tests on it without invalidating whatever
timing results I get for the current index build which is still running.

postgres 30176 93.7 21.7 68964 56096 ?       R    11:08 133:39 postgres: postgres slo [local] CREATE INDEX

15m was what i remember as the time to build a unique (non-gist) btree index
on (foo_id,bar_id) where foo_id is the same integer column as the leading
column in the gist index, and bar_id is another integer column.

There are 1,161,435 records, with 384 distinct values of foo_id. The number of
records per value of foo_id ranges from 1 to 11,474.

The annoying thing here is that 99% of the attribute_set columns will have
exactly one value in them. All this work is for the 1% that can have multiple
values.

--
greg



Re: Gist indexes on int arrays

From
Guy Fraser
Date:
I can help with your first() and last() functions.

...snip...
> 
> 
> Absolutely.
> Moreover if your array element positions that you want to compare
> against(e.g attr_a[1], or attr_b[n], where n is the last element) are
> known, then you could have a function "first" that returns
> the first element (you must pay attention to nulls and out of bound
> situations), and a function "last" that returns the last element.
> Then you could have normal btree indexes on first(attr_a), and on
> last(attr_b), but unfortunately not an index on both.
> 
> 
> 
...snip...

Here is some code I wrote that works in 7.2 and 7.3 that helps.

This function current is designed for a single dimentional text array, but can 
be converted to work with integers very easily, I just dodn't have a proof 
right now.

--
-- Start of function
--
CREATE FUNCTION array_size (TEXT[]) RETURNS INT AS '
DECLARE  array ALIAS FOR $1;  dim INT;
BEGIN  SELECT INTO dim    rtrim(ltrim(ltrim(array_dims(array),''[012345679''),'':''),'']'')::INT ;    IF dim IS NULL
THEN     dim := 0 ;    END IF;  RETURN dim;
 
END;' LANGUAGE plpgsql;
--
-- End function
--
--Start of Proof
--
CREATE TABLE cruft(array TEXT[]);
INSERT INTO cruft VALUES('{data1,data2,data3}');
SELECT array,array_size(array) FROM cruft;
--
--        array        | array_size
-----------------------+------------
-- {data1,data2,data3} |          3
--(1 row)
--
-- End Proof
--

To get the first and last values :

SELECT array[1] as first,array[array_size(array)] as last FROM cruft;
--
-- first | last
---------+-------
-- data1 | data3
--(1 row)
--

I hope this helps.



Re: Gist indexes on int arrays

From
Oleg Bartunov
Date:
On Tue, 4 Mar 2003, Greg Stark wrote:

> Oleg Bartunov <oleg@sai.msu.su> writes:
>
> > > db=# CREATE INDEX cache_gist_idx on cache using gist ( foo_id , attribute_set gist__int_ops);
>
> > what's the time to create gist indices separately ? I suppose 15m is the time
> > to create btree index on *single* column ?
>
> Unfortunately I can't run timing tests on it without invalidating whatever
> timing results I get for the current index build which is still running.
>
> postgres 30176 93.7 21.7 68964 56096 ?       R    11:08 133:39 postgres: postgres slo [local] CREATE INDEX
>
> 15m was what i remember as the time to build a unique (non-gist) btree index
> on (foo_id,bar_id) where foo_id is the same integer column as the leading
> column in the gist index, and bar_id is another integer column.
>
> There are 1,161,435 records, with 384 distinct values of foo_id. The number of
> records per value of foo_id ranges from 1 to 11,474.

hmm, not so much. We might be interested to play with the data.
How big would be compressed archive of your data, so we could download
it and use as test data.

>
> The annoying thing here is that 99% of the attribute_set columns will have
> exactly one value in them. All this work is for the 1% that can have multiple
> values.
>
> --
> greg
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 6: Have you searched our list archives?
>
> http://archives.postgresql.org
>
Regards,    Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83


Re: Gist indexes on int arrays

From
Greg Stark
Date:
Oleg Bartunov <oleg@sai.msu.su> writes:

> On Tue, 4 Mar 2003, Greg Stark wrote:
> 
> > > > db=# CREATE INDEX cache_gist_idx on cache using gist ( foo_id , attribute_set gist__int_ops);
> >
> > There are 1,161,435 records, with 384 distinct values of foo_id. The number of
> > records per value of foo_id ranges from 1 to 11,474.
> 
> hmm, not so much. We might be interested to play with the data.
> How big would be compressed archive of your data, so we could download
> it and use as test data.

The -Fc archive is 20M. But I have to get permission from the client.

It's nearly 5h now. Surely something is wrong?

postgres 30176 92.2 21.4 77124 55092 ?       R    11:08 247:48 postgres: postgres slo [local] CREATE INDEX

This is 7.3 btw, would I have better success with CVS?

-- 
greg