Thread: Gist indexes on int arrays
> What do you mean?? > GiST indexing just indexes columns of type *array* for the &&,=,@,~,@@, > etc.. operators. Hm, you're right of course. I wonder where I got the idea that it didn't handle these operators. This is fascinating and could be useful for something I'm working on. How do gist indexes interact with more normal data types to index? I have a situation where I have a table with millions of records, and I'm mostly operating on a subset of those records, usually 1k-10k of them. The queries would look like WHERE foo_id = ? AND '{1}'::integer[] ~ attr_a AND '{2}'::integer[] ~ attr_b Right now I'm using the contrib/array *= operator and I have an index on foo_id. Having to scan through up to 10,000 records isn't great but isn't too bad. I wonder whether having a gist index and using the ~ operator would be worthwhile? The contrib/array, contrib/intagg, and contrib/intarray directories seem to all be aimed at handling the same thing and seem to provide mostly complementary features. Perhaps they should all be merged into one package. I guess it does show there's lots of demand for this type of datatype. -- greg
On 3 Mar 2003, Greg Stark wrote: > > > What do you mean?? > > GiST indexing just indexes columns of type *array* for the &&,=,@,~,@@, > > etc.. operators. > > Hm, you're right of course. I wonder where I got the idea that it didn't > handle these operators. > > This is fascinating and could be useful for something I'm working on. > > How do gist indexes interact with more normal data types to index? I have a > situation where I have a table with millions of records, and I'm mostly > operating on a subset of those records, usually 1k-10k of them. > > The queries would look like > > WHERE foo_id = ? > AND '{1}'::integer[] ~ attr_a > AND '{2}'::integer[] ~ attr_b > > Right now I'm using the contrib/array *= operator and I have an index on > foo_id. Having to scan through up to 10,000 records isn't great but isn't too > bad. I wonder whether having a gist index and using the ~ operator would be > worthwhile? Absolutely. Moreover if your array element positions that you want to compare against(e.g attr_a[1], or attr_b[n], where n is the last element) are known, then you could have a function "first" that returns the first element (you must pay attention to nulls and out of bound situations), and a function "last" that returns the last element. Then you could have normal btree indexes on first(attr_a), and on last(attr_b), but unfortunately not an index on both. > > The contrib/array, contrib/intagg, and contrib/intarray directories seem to > all be aimed at handling the same thing and seem to provide mostly > complementary features. Perhaps they should all be merged into one package. I > guess it does show there's lots of demand for this type of datatype. > > -- > greg > > > ---------------------------(end of broadcast)--------------------------- > TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org > ================================================================== Achilleus Mantzios S/W Engineer IT dept Dynacom Tankers Mngmt Nikis 4, Glyfada Athens 16610 Greece tel: +30-210-8981112 fax: +30-210-8981877 email: achill@matrix.gatewaynet.com mantzios@softlab.ece.ntua.gr
Achilleus Mantzios <achill@matrix.gatewaynet.com> writes: > Moreover if your array element positions that you want to compare > against(e.g attr_a[1], or attr_b[n], where n is the last element) are > known, then you could have a function "first" that returns > the first element ... Except that's precisely what I'm *not* doing. I'm treating the arrays as sets and looking for records where the set contains a given value. This is a denormalized table generated nightly from fully normalized raw data. So to simplify it, the query might have clauses like: WHERE foo_id = 900 AND '{5}'::integer[] ~ attribute_set_array Right now I have a btree index on (foo_id). Can I have a GiST index on (foo_id, attribute_set_array) and have it be just as fast at narrowing the search to just foo_id = 900 but also speed up the ~ operation? Incidentally, it seems odd that there isn't an operator like ~ but optimized specifically for searching for a single item. It seems awkward and possibly unnecessarily slow to have to construct an array for the search parameter every time. -- greg
On 4 Mar 2003, Greg Stark wrote: > > Achilleus Mantzios <achill@matrix.gatewaynet.com> writes: > > > Moreover if your array element positions that you want to compare > > against(e.g attr_a[1], or attr_b[n], where n is the last element) are > > known, then you could have a function "first" that returns > > the first element ... > > Except that's precisely what I'm *not* doing. I'm treating the arrays as sets > and looking for records where the set contains a given value. This is a > denormalized table generated nightly from fully normalized raw data. > > So to simplify it, the query might have clauses like: > > WHERE foo_id = 900 > AND '{5}'::integer[] ~ attribute_set_array > > Right now I have a btree index on (foo_id). > > Can I have a GiST index on (foo_id, attribute_set_array) and have it be just > as fast at narrowing the search to just foo_id = 900 but also speed up the ~ > operation? I am afraid you cant do that easily. (but you can follow recent discussions on the matter). int4 does not have an opclass that can cope with "gist". > > Incidentally, it seems odd that there isn't an operator like ~ but optimized > specifically for searching for a single item. It seems awkward and possibly > unnecessarily slow to have to construct an array for the search parameter > every time. > What do you mean by "slow"? > -- > greg > -- ================================================================== Achilleus Mantzios S/W Engineer IT dept Dynacom Tankers Mngmt Nikis 4, Glyfada Athens 16610 Greece tel: +30-210-8981112 fax: +30-210-8981877 email: achill@matrix.gatewaynet.com mantzios@softlab.ece.ntua.gr
Greg Stark <gsstark@MIT.EDU> writes: > Can I have a GiST index on (foo_id, attribute_set_array) and have it be just > as fast at narrowing the search to just foo_id = 900 but also speed up the ~ > operation? Hm, so if I understand what I'm reading I can do this if I load the btree_gist contrib module as well. I'm still not sure whether it'll be worthwhile for this application though. I have a bit of a problem though. Is building GiST indexes supposed to take much much longer than building btree indexes? It's been running nearly an hour and it's still going. The hard drive is hardly moving so it seems to be all cpu usage. I don't even see any pgsql_tmp usage. db=# CREATE INDEX cache_gist_idx on cache using gist ( foo_id , attribute_set gist__int_ops); postgres 30176 86.3 22.2 64896 57344 ? R 11:08 40:32 postgres: postgres slo [local] CREATE INDEX I don't remember exact numbers but building the normal btree index took on the order of 15m. This will have to be rebuilt nightly, an hour long index build won't be practical. -- greg
On Tue, 4 Mar 2003, Greg Stark wrote: > > Greg Stark <gsstark@MIT.EDU> writes: > > > Can I have a GiST index on (foo_id, attribute_set_array) and have it be just > > as fast at narrowing the search to just foo_id = 900 but also speed up the ~ > > operation? > > Hm, so if I understand what I'm reading I can do this if I load the btree_gist > contrib module as well. I'm still not sure whether it'll be worthwhile for > this application though. > > I have a bit of a problem though. Is building GiST indexes supposed to take > much much longer than building btree indexes? It's been running nearly an hour > and it's still going. The hard drive is hardly moving so it seems to be all > cpu usage. I don't even see any pgsql_tmp usage. > > db=# CREATE INDEX cache_gist_idx on cache using gist ( foo_id , attribute_set gist__int_ops); > > postgres 30176 86.3 22.2 64896 57344 ? R 11:08 40:32 postgres: postgres slo [local] CREATE INDEX > > > I don't remember exact numbers but building the normal btree index took on the > order of 15m. This will have to be rebuilt nightly, an hour long index build > won't be practical. what's the time to create gist indices separately ? I suppose 15m is the time to create btree index on *single* column ? > > -- > greg > > > ---------------------------(end of broadcast)--------------------------- > TIP 3: if posting/reading through Usenet, please send an appropriate > subscribe-nomail command to majordomo@postgresql.org so that your > message can get through to the mailing list cleanly > Regards, Oleg _____________________________________________________________ Oleg Bartunov, sci.researcher, hostmaster of AstroNet, Sternberg Astronomical Institute, Moscow University (Russia) Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/ phone: +007(095)939-16-83, +007(095)939-23-83
On Tue, 4 Mar 2003, Achilleus Mantzios wrote: > On 4 Mar 2003, Greg Stark wrote: > > > > > Achilleus Mantzios <achill@matrix.gatewaynet.com> writes: > > > > > Moreover if your array element positions that you want to compare > > > against(e.g attr_a[1], or attr_b[n], where n is the last element) are > > > known, then you could have a function "first" that returns > > > the first element ... > > > > Except that's precisely what I'm *not* doing. I'm treating the arrays as sets > > and looking for records where the set contains a given value. This is a > > denormalized table generated nightly from fully normalized raw data. > > > > So to simplify it, the query might have clauses like: > > > > WHERE foo_id = 900 > > AND '{5}'::integer[] ~ attribute_set_array > > > > Right now I have a btree index on (foo_id). > > > > Can I have a GiST index on (foo_id, attribute_set_array) and have it be just > > as fast at narrowing the search to just foo_id = 900 but also speed up the ~ > > operation? > > I am afraid you cant do that easily. (but you can follow recent > discussions on the matter). > int4 does not have an opclass that can cope with "gist". no-no, just use contrib/btree_gist ! > > > > > Incidentally, it seems odd that there isn't an operator like ~ but optimized > > specifically for searching for a single item. It seems awkward and possibly > > unnecessarily slow to have to construct an array for the search parameter > > every time. > > > > What do you mean by "slow"? > > > -- > > greg > > > > Regards, Oleg _____________________________________________________________ Oleg Bartunov, sci.researcher, hostmaster of AstroNet, Sternberg Astronomical Institute, Moscow University (Russia) Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/ phone: +007(095)939-16-83, +007(095)939-23-83
Oleg Bartunov <oleg@sai.msu.su> writes: > > db=# CREATE INDEX cache_gist_idx on cache using gist ( foo_id , attribute_set gist__int_ops); > what's the time to create gist indices separately ? I suppose 15m is the time > to create btree index on *single* column ? Unfortunately I can't run timing tests on it without invalidating whatever timing results I get for the current index build which is still running. postgres 30176 93.7 21.7 68964 56096 ? R 11:08 133:39 postgres: postgres slo [local] CREATE INDEX 15m was what i remember as the time to build a unique (non-gist) btree index on (foo_id,bar_id) where foo_id is the same integer column as the leading column in the gist index, and bar_id is another integer column. There are 1,161,435 records, with 384 distinct values of foo_id. The number of records per value of foo_id ranges from 1 to 11,474. The annoying thing here is that 99% of the attribute_set columns will have exactly one value in them. All this work is for the 1% that can have multiple values. -- greg
I can help with your first() and last() functions. ...snip... > > > Absolutely. > Moreover if your array element positions that you want to compare > against(e.g attr_a[1], or attr_b[n], where n is the last element) are > known, then you could have a function "first" that returns > the first element (you must pay attention to nulls and out of bound > situations), and a function "last" that returns the last element. > Then you could have normal btree indexes on first(attr_a), and on > last(attr_b), but unfortunately not an index on both. > > > ...snip... Here is some code I wrote that works in 7.2 and 7.3 that helps. This function current is designed for a single dimentional text array, but can be converted to work with integers very easily, I just dodn't have a proof right now. -- -- Start of function -- CREATE FUNCTION array_size (TEXT[]) RETURNS INT AS ' DECLARE array ALIAS FOR $1; dim INT; BEGIN SELECT INTO dim rtrim(ltrim(ltrim(array_dims(array),''[012345679''),'':''),'']'')::INT ; IF dim IS NULL THEN dim := 0 ; END IF; RETURN dim; END;' LANGUAGE plpgsql; -- -- End function -- --Start of Proof -- CREATE TABLE cruft(array TEXT[]); INSERT INTO cruft VALUES('{data1,data2,data3}'); SELECT array,array_size(array) FROM cruft; -- -- array | array_size -----------------------+------------ -- {data1,data2,data3} | 3 --(1 row) -- -- End Proof -- To get the first and last values : SELECT array[1] as first,array[array_size(array)] as last FROM cruft; -- -- first | last ---------+------- -- data1 | data3 --(1 row) -- I hope this helps.
On Tue, 4 Mar 2003, Greg Stark wrote: > Oleg Bartunov <oleg@sai.msu.su> writes: > > > > db=# CREATE INDEX cache_gist_idx on cache using gist ( foo_id , attribute_set gist__int_ops); > > > what's the time to create gist indices separately ? I suppose 15m is the time > > to create btree index on *single* column ? > > Unfortunately I can't run timing tests on it without invalidating whatever > timing results I get for the current index build which is still running. > > postgres 30176 93.7 21.7 68964 56096 ? R 11:08 133:39 postgres: postgres slo [local] CREATE INDEX > > 15m was what i remember as the time to build a unique (non-gist) btree index > on (foo_id,bar_id) where foo_id is the same integer column as the leading > column in the gist index, and bar_id is another integer column. > > There are 1,161,435 records, with 384 distinct values of foo_id. The number of > records per value of foo_id ranges from 1 to 11,474. hmm, not so much. We might be interested to play with the data. How big would be compressed archive of your data, so we could download it and use as test data. > > The annoying thing here is that 99% of the attribute_set columns will have > exactly one value in them. All this work is for the 1% that can have multiple > values. > > -- > greg > > > ---------------------------(end of broadcast)--------------------------- > TIP 6: Have you searched our list archives? > > http://archives.postgresql.org > Regards, Oleg _____________________________________________________________ Oleg Bartunov, sci.researcher, hostmaster of AstroNet, Sternberg Astronomical Institute, Moscow University (Russia) Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/ phone: +007(095)939-16-83, +007(095)939-23-83
Oleg Bartunov <oleg@sai.msu.su> writes: > On Tue, 4 Mar 2003, Greg Stark wrote: > > > > > db=# CREATE INDEX cache_gist_idx on cache using gist ( foo_id , attribute_set gist__int_ops); > > > > There are 1,161,435 records, with 384 distinct values of foo_id. The number of > > records per value of foo_id ranges from 1 to 11,474. > > hmm, not so much. We might be interested to play with the data. > How big would be compressed archive of your data, so we could download > it and use as test data. The -Fc archive is 20M. But I have to get permission from the client. It's nearly 5h now. Surely something is wrong? postgres 30176 92.2 21.4 77124 55092 ? R 11:08 247:48 postgres: postgres slo [local] CREATE INDEX This is 7.3 btw, would I have better success with CVS? -- greg