Re: Comparing user attributes with bitwise operators - Mailing list pgsql-performance
From | Patrick Clery |
---|---|
Subject | Re: Comparing user attributes with bitwise operators |
Date | |
Msg-id | 200409182126.13686.patrick@phpforhire.com Whole thread Raw |
In response to | Re: Comparing user attributes with bitwise operators (Greg Stark <gsstark@mit.edu>) |
Responses |
Re: Comparing user attributes with bitwise operators
|
List | pgsql-performance |
I have currently implemented a schema for my "Dating Site" that is storing user search preferences and user attributes in an int[] array using the contrib/intarray package (suggested by Greg Stark). But there are a few problems. a) query_int can't be cast to int4. b) query_int can't be indexed. datingsite=> alter table people_attributes add column bla query_int; ALTER TABLE datingsite=> create index idx_query_int on people_attributes (bla); ERROR: data type query_int has no default operator class for access method "btree" HINT: You must specify an operator class for the index or define a default operator class for the data type. datingsite=> create index idx_query_int on people_attributes (bla gist__int_ops); ERROR: operator class "gist__int_ops" does not exist for access method "btree" datingsite=> alter table people_attributes drop column bla; ALTER TABLE c) query_int can only be used in one operation against int[]: README.intarray: int[] @@ query_int - returns TRUE if array satisfies query (like '1&(2|3)') It is not possible to use >=, <=, =, etc. Also, this operator does not work like example says: datingsite=> select '{2,3}'::int[] @@ '1'::query_int; ?column? ---------- f (1 row) d) I can't find a way to simply check if an integer is an array without declaring it as an array; Therefore, I need to use an int[] type for a column that will only be storing one int4 if I want to compare it to an int[] array: README.intarray: int[] && int[] - overlap - returns TRUE if arrays has at least one common elements. e) int[] and query_int are somewhat ugly to deal with since query_int needs to be quoted as a string, and int[] is returned as '{1,2,3}'. Or maybe I'm just being anal :) Because of these limitations, I've chosen to declare the attribute columns as int[] arrays (even though they will only contain one value) so that I can use '{1,2,3}'::int[] && column_name: README.intarray: int[] && int[] - overlap - returns TRUE if arrays has at least one common elements. Here is the schema: create table people ( person_id serial, datecreated timestamp with time zone default now (), signup_ip cidr not null, username character varying(30) not null, password character varying(28) not null, email character varying(65) not null, dob date not null, primary key (person_id) ); create table people_attributes ( person_id int references people (person_id) on delete cascade initially deferred, askmecount int not null default 0, age int[] not null default '{1}'::int[], gender int[] not null default '{1}'::int[], orientation int[] not null default '{1}'::int[], bodytype int[] not null default '{1}'::int[], children int[] not null default '{1}'::int[], drinking int[] not null default '{1}'::int[], education int[] not null default '{1}'::int[], ethnicity int[] not null default '{1}'::int[], eyecolor int[] not null default '{1}'::int[], haircolor int[] not null default '{1}'::int[], hairstyle int[] not null default '{1}'::int[], height int[] not null default '{1}'::int[], income int[] not null default '{1}'::int[], occupation int[] not null default '{1}'::int[], relation int[] not null default '{1}'::int[], /* multiple answer */ religion int[] not null default '{1}'::int[], seeking int[] not null default '{1}'::int[], /* multiple answer */ smoking int[] not null default '{1}'::int[], want_children int[] not null default '{1}'::int[], weight int[] not null default '{1}'::int[], primary key (person_id) ) without oids; create index people_attributes_search on people_attributes using gist ( age gist__int_ops, gender gist__int_ops, orientation gist__int_ops, bodytype gist__int_ops, children gist__int_ops, drinking gist__int_ops, education gist__int_ops, ethnicity gist__int_ops, eyecolor gist__int_ops, haircolor gist__int_ops, hairstyle gist__int_ops, height gist__int_ops, income gist__int_ops, occupation gist__int_ops, relation gist__int_ops, religion gist__int_ops, seeking gist__int_ops, smoking gist__int_ops, want_children gist__int_ops, weight gist__int_ops ); /* These will be compared against the people_attributes table */ create table people_searchprefs ( person_id int references people (person_id) on delete cascade initially deferred, age int[] not null default '{18,19,20,21,22,23,24,25,26,27,28,29,30}'::int[], gender int[] not null default '{1,2,4}'::int[], orientation int[] not null default '{1,2,8}'::int[], bodytype int[] not null default '{1,2,3,4,5,6}'::int[], children int[] not null default '{0}'::int[], drinking int[] not null default '{0}'::int[], education int[] not null default '{0}'::int[], ethnicity int[] not null default '{0}'::int[], eyecolor int[] not null default '{0}'::int[], haircolor int[] not null default '{0}'::int[], hairstyle int[] not null default '{0}'::int[], height int[] not null default '{0}'::int[], income int[] not null default '{0}'::int[], occupation int[] not null default '{0}'::int[], relation int[] not null default '{0}'::int[], religion int[] not null default '{0}'::int[], seeking int[] not null default '{0}'::int[], smoking int[] not null default '{0}'::int[], want_children int[] not null default '{0}'::int[], weight int[] not null default '{0}'::int[], primary key (person_id) ) without oids; And now, the moment you've all been waiting for: performance! (Number of profiles) datingsite=> select count(*) from people_attributes ; count ------- 96146 (1 row) (age, gender and sexual orientation will always be a part of the query, and are necessary to be invoke the index. The query is to show females, age 30-40 of any orientation. But first, without the index) explain analyze select person_id, gender from people_attributes where '{30,31,32,33,34,35,36,37,38,39,40}'::int[] && age and '{2}'::int[] && gender and '{1,2,4}'::int[] && orientation; QUERY PLAN ------------------------------------------------------------------------------------------------------------------------------------------------- Seq Scan on people_attributes (cost=0.00..9078.56 rows=1 width=36) (actual time=0.044..299.537 rows=937 loops=1) Filter: (('{30,31,32,33,34,35,36,37,38,39,40}'::integer[] && age) AND ('{2}'::integer[] && gender) AND ('{1,2,4}'::integer[] && orientation)) Total runtime: 304.707 ms (3 rows) ( with the index ) explain analyze select person_id, gender from people_attributes where '{30,31,32,33,34,35,36,37,38,39,40}'::int[] && age and '{2}'::int[] && gender and '{1,2,4}'::int[] && orientation; QUERY PLAN ----------------------------------------------------------------------------------------------------------------------------------------------------- Index Scan using people_attributes_search on people_attributes (cost=0.00..6.02 rows=1 width=36) (actual time=0.064..52.383 rows=937 loops=1) Index Cond: (('{30,31,32,33,34,35,36,37,38,39,40}'::integer[] && age) AND ('{2}'::integer[] && gender) AND ('{1,2,4}'::integer[] && orientation)) Total runtime: 57.032 ms (3 rows) (more realistically, it will have a limit of 10) explain analyze select person_id, gender from people_attributes where '{30,31,32,33,34,35,36,37,38,39,40}'::int[] && age and '{2}'::int[] && gender and '{1,2,4}'::int[] && orientation limit 10; QUERY PLAN ----------------------------------------------------------------------------------------------------------------------------------------------------------- Limit (cost=0.00..6.02 rows=1 width=36) (actual time=0.235..0.651 rows=10 loops=1) -> Index Scan using people_attributes_search on people_attributes (cost=0.00..6.02 rows=1 width=36) (actual time=0.224..0.550 rows=10 loops=1) Index Cond: (('{30,31,32,33,34,35,36,37,38,39,40}'::integer[] && age) AND ('{2}'::integer[] && gender) AND ('{1,2,4}'::integer[] && orientation)) Total runtime: 0.817 ms (4 rows) (slower with an sort key) explain analyze select person_id, gender from people_attributes where '{30,31,32,33,34,35,36,37,38,39,40}'::int[] && age and '{2}'::int[] && gender and '{1,2,4}'::int[] && orientation order by age; QUERY PLAN ----------------------------------------------------------------------------------------------------------------------------------------------------------- Sort (cost=6.03..6.03 rows=1 width=68) (actual time=62.572..66.338 rows=937 loops=1) Sort Key: age -> Index Scan using people_attributes_search on people_attributes (cost=0.00..6.02 rows=1 width=68) (actual time=0.223..55.999 rows=937 loops=1) Index Cond: (('{30,31,32,33,34,35,36,37,38,39,40}'::integer[] && age) AND ('{2}'::integer[] && gender) AND ('{1,2,4}'::integer[] && orientation)) Total runtime: 71.206 ms (5 rows) (no better with a limit) explain analyze select person_id, gender from people_attributes where '{30,31,32,33,34,35,36,37,38,39,40}'::int[] && age and '{2}'::int[] && gender and '{1,2,4}'::int[] && orientation order by age limit 10; QUERY PLAN ----------------------------------------------------------------------------------------------------------------------------------------------------------------- Limit (cost=6.03..6.03 rows=1 width=68) (actual time=69.391..69.504 rows=10 loops=1) -> Sort (cost=6.03..6.03 rows=1 width=68) (actual time=69.381..69.418 rows=10 loops=1) Sort Key: age -> Index Scan using people_attributes_search on people_attributes (cost=0.00..6.02 rows=1 width=68) (actual time=0.068..61.648 rows=937 loops=1) Index Cond: (('{30,31,32,33,34,35,36,37,38,39,40}'::integer[] && age) AND ('{2}'::integer[] && gender) AND ('{1,2,4}'::integer[] && orientation)) Total runtime: 69.899 ms (6 rows) The last query is the most likely since I will need to be sorting by some key. If I wasn't sorting it looks like it wouldn't be too bad, but sorting is inevitable I think. I've only imported 96,146 of the 150,000 profiles. This seems a bit slow now, and it doesn't look like it will scale. My questions are: - Is there a way of speeding up the sort? - Will using queries like " WHERE orientation IN (1,2,4) " be any better/worse? - The queries with the GiST index are faster, but is it of any benefit when the int[] arrays all contain a single value? - Is there any hope for this structure? Thanks for the suggestion Greg, and thanks to those who responded to this thread. On Thursday 16 September 2004 02:44, Greg Stark wrote: > The only kind of index that is capable of indexing this type of data > structure for arbitrary searches would be a GiST index. I'm not aware of > any implementation for bitfields, though it would be an appropriate use. > > What there is now is the contrib/intarray package. You would have to store > more than just the bitfields, you would have to store an array of integer > flags. That might be denser actually if you end up with many flags few of > which are set. > > GiST indexes allow you to search arbitrary combinations of set and unset > flags. using the "@@" operator > > int[] @@ query_int - returns TRUE if array satisfies query (like > '1&(2|3)') > > You might be able to look at the code there and adapt it to apply to bit > fields. If so I think it would be a useful tool. But GiST indexing is > pretty esoteric stuff. primary key (person_id) ) without oids;
pgsql-performance by date: