Re: Comparing user attributes with bitwise operators - Mailing list pgsql-performance

From Patrick Clery
Subject Re: Comparing user attributes with bitwise operators
Date
Msg-id 200409182126.13686.patrick@phpforhire.com
Whole thread Raw
In response to Re: Comparing user attributes with bitwise operators  (Greg Stark <gsstark@mit.edu>)
Responses Re: Comparing user attributes with bitwise operators
List pgsql-performance
I have currently implemented a schema for my "Dating Site" that is storing
user search preferences and user attributes in an int[] array using the
contrib/intarray package (suggested by Greg Stark). But there are a few
problems.
    a) query_int can't be cast to int4.
    b) query_int can't be indexed.

datingsite=> alter table people_attributes add column bla query_int;
ALTER TABLE
datingsite=> create index idx_query_int on people_attributes (bla);
ERROR:  data type query_int has no default operator class for access method
"btree"
HINT:  You must specify an operator class for the index or define a default
operator class for the data type.
datingsite=> create index idx_query_int on people_attributes (bla
gist__int_ops);
ERROR:  operator class "gist__int_ops" does not exist for access method
"btree"
datingsite=> alter table people_attributes drop column bla;
ALTER TABLE

    c) query_int can only be used in one operation against int[]:

    README.intarray:
    int[] @@ query_int  - returns TRUE if array satisfies query (like '1&(2|3)')

     It is not possible to use >=, <=, =, etc. Also, this operator does not work
like example says:

datingsite=> select '{2,3}'::int[] @@ '1'::query_int;
 ?column?
----------
 f
(1 row)

    d) I can't find a way to simply check if an integer is an array without
declaring it as an array; Therefore, I need to use an int[] type for a column
that will only be storing one int4 if I want to compare it to an int[] array:

    README.intarray:
    int[] && int[]  - overlap - returns TRUE if arrays has at least one common
elements.

    e) int[] and query_int are somewhat ugly to deal with since query_int needs
to be quoted as a string, and int[] is returned as '{1,2,3}'. Or maybe I'm
just being anal :)


Because of these limitations, I've chosen to declare the attribute columns as
int[] arrays (even though they will only contain one value) so that I can use
'{1,2,3}'::int[] && column_name:

    README.intarray:
    int[] && int[]  - overlap - returns TRUE if arrays has at least one common
elements.

Here is the schema:

create table people (
    person_id serial,
    datecreated timestamp with time zone default now (),
    signup_ip cidr not null,
    username character varying(30) not null,
    password character varying(28) not null,
    email character varying(65) not null,
    dob date not null,
    primary key (person_id)
);

create table people_attributes (
    person_id int references people (person_id) on delete cascade initially
deferred,
    askmecount int not null default 0,
    age int[] not null default '{1}'::int[],
    gender int[] not null default '{1}'::int[],
    orientation int[] not null default '{1}'::int[],
    bodytype int[] not null default '{1}'::int[],
    children int[] not null default '{1}'::int[],
    drinking int[] not null default '{1}'::int[],
    education int[] not null default '{1}'::int[],
    ethnicity int[] not null default '{1}'::int[],
    eyecolor int[] not null default '{1}'::int[],
    haircolor int[] not null default '{1}'::int[],
    hairstyle int[] not null default '{1}'::int[],
    height int[] not null default '{1}'::int[],
    income int[] not null default '{1}'::int[],
    occupation int[] not null default '{1}'::int[],
    relation int[] not null default '{1}'::int[], /* multiple answer */
    religion int[] not null default '{1}'::int[],
    seeking int[] not null default '{1}'::int[], /* multiple answer */
    smoking int[] not null default '{1}'::int[],
    want_children int[] not null default '{1}'::int[],
    weight int[] not null default '{1}'::int[],

    primary key (person_id)
)
without oids;

create index people_attributes_search on people_attributes using gist (
    age gist__int_ops,
    gender gist__int_ops,
    orientation gist__int_ops,
    bodytype gist__int_ops,
    children gist__int_ops,
    drinking gist__int_ops,
    education gist__int_ops,
    ethnicity gist__int_ops,
    eyecolor gist__int_ops,
    haircolor gist__int_ops,
    hairstyle gist__int_ops,
    height gist__int_ops,
    income gist__int_ops,
    occupation gist__int_ops,
    relation gist__int_ops,
    religion gist__int_ops,
    seeking gist__int_ops,
    smoking gist__int_ops,
    want_children gist__int_ops,
    weight gist__int_ops
    );

/* These will be compared against the people_attributes table */
create table people_searchprefs (
    person_id int references people (person_id) on delete cascade initially
deferred,
    age int[] not null default
'{18,19,20,21,22,23,24,25,26,27,28,29,30}'::int[],
    gender int[] not null default '{1,2,4}'::int[],
    orientation int[] not null default '{1,2,8}'::int[],
    bodytype int[] not null default '{1,2,3,4,5,6}'::int[],
    children int[] not null default '{0}'::int[],
    drinking int[] not null default '{0}'::int[],
    education int[] not null default '{0}'::int[],
    ethnicity int[] not null default '{0}'::int[],
    eyecolor int[] not null default '{0}'::int[],
    haircolor int[] not null default '{0}'::int[],
    hairstyle int[] not null default '{0}'::int[],
    height int[] not null default '{0}'::int[],
    income int[] not null default '{0}'::int[],
    occupation int[] not null default '{0}'::int[],
    relation int[] not null default '{0}'::int[],
    religion int[] not null default '{0}'::int[],
    seeking int[] not null default '{0}'::int[],
    smoking int[] not null default '{0}'::int[],
    want_children int[] not null default '{0}'::int[],
    weight int[] not null default '{0}'::int[],

    primary key (person_id)
)
without oids;


And now, the moment you've all been waiting for: performance!


(Number of profiles)

datingsite=> select count(*) from people_attributes ;
 count
-------
 96146
(1 row)

(age, gender and sexual orientation will always be a part of the query, and
are necessary to be invoke the index. The query is to show females, age 30-40
of any orientation. But first, without the index)

explain analyze
select person_id, gender
from people_attributes
where '{30,31,32,33,34,35,36,37,38,39,40}'::int[] && age
and '{2}'::int[] && gender
and '{1,2,4}'::int[] && orientation;
                                                                   QUERY PLAN
                         

-------------------------------------------------------------------------------------------------------------------------------------------------
 Seq Scan on people_attributes  (cost=0.00..9078.56 rows=1 width=36) (actual
time=0.044..299.537 rows=937 loops=1)
   Filter: (('{30,31,32,33,34,35,36,37,38,39,40}'::integer[] && age) AND
('{2}'::integer[] && gender) AND ('{1,2,4}'::integer[] && orientation))
 Total runtime: 304.707 ms
(3 rows)


( with the index )

explain analyze
select person_id, gender
from people_attributes
where '{30,31,32,33,34,35,36,37,38,39,40}'::int[] && age
and '{2}'::int[] && gender
and '{1,2,4}'::int[] && orientation;
                                                                     QUERY
PLAN

-----------------------------------------------------------------------------------------------------------------------------------------------------
 Index Scan using people_attributes_search on people_attributes
(cost=0.00..6.02 rows=1 width=36) (actual time=0.064..52.383 rows=937
loops=1)
   Index Cond: (('{30,31,32,33,34,35,36,37,38,39,40}'::integer[] && age) AND
('{2}'::integer[] && gender) AND ('{1,2,4}'::integer[] && orientation))
 Total runtime: 57.032 ms
(3 rows)

(more realistically, it will have a limit of 10)

explain analyze
select person_id, gender
from people_attributes
where '{30,31,32,33,34,35,36,37,38,39,40}'::int[] && age
and '{2}'::int[] && gender
and '{1,2,4}'::int[] && orientation limit 10;

                                                                        QUERY
PLAN

-----------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..6.02 rows=1 width=36) (actual time=0.235..0.651 rows=10
loops=1)
   ->  Index Scan using people_attributes_search on people_attributes
(cost=0.00..6.02 rows=1 width=36) (actual time=0.224..0.550 rows=10 loops=1)
         Index Cond: (('{30,31,32,33,34,35,36,37,38,39,40}'::integer[] && age)
AND ('{2}'::integer[] && gender) AND ('{1,2,4}'::integer[] && orientation))
 Total runtime: 0.817 ms
(4 rows)

(slower with an sort key)

explain analyze
select person_id, gender
from people_attributes
where '{30,31,32,33,34,35,36,37,38,39,40}'::int[] && age
and '{2}'::int[] && gender
and '{1,2,4}'::int[] && orientation order by age;
                                                                        QUERY
PLAN

-----------------------------------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=6.03..6.03 rows=1 width=68) (actual time=62.572..66.338 rows=937
loops=1)
   Sort Key: age
   ->  Index Scan using people_attributes_search on people_attributes
(cost=0.00..6.02 rows=1 width=68) (actual time=0.223..55.999 rows=937
loops=1)
         Index Cond: (('{30,31,32,33,34,35,36,37,38,39,40}'::integer[] && age)
AND ('{2}'::integer[] && gender) AND ('{1,2,4}'::integer[] && orientation))
 Total runtime: 71.206 ms
(5 rows)

(no better with a limit)

explain analyze
select person_id, gender
from people_attributes
where '{30,31,32,33,34,35,36,37,38,39,40}'::int[] && age
and '{2}'::int[] && gender
and '{1,2,4}'::int[] && orientation order by age limit 10;
                                                                           QUERY
PLAN

-----------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=6.03..6.03 rows=1 width=68) (actual time=69.391..69.504 rows=10
loops=1)
   ->  Sort  (cost=6.03..6.03 rows=1 width=68) (actual time=69.381..69.418
rows=10 loops=1)
         Sort Key: age
         ->  Index Scan using people_attributes_search on people_attributes
(cost=0.00..6.02 rows=1 width=68) (actual time=0.068..61.648 rows=937
loops=1)
               Index Cond: (('{30,31,32,33,34,35,36,37,38,39,40}'::integer[]
&& age) AND ('{2}'::integer[] && gender) AND ('{1,2,4}'::integer[] &&
orientation))
 Total runtime: 69.899 ms
(6 rows)

The last query is the most likely since I will need to be sorting by some key.
If I wasn't sorting it looks like it wouldn't be too bad, but sorting is
inevitable I think.  I've only imported  96,146 of the 150,000 profiles. This
seems a bit slow now, and it doesn't look like it will scale.

My questions are:

- Is there a way of speeding up the sort?
- Will using queries like " WHERE orientation IN (1,2,4) " be any
better/worse?
- The queries with the GiST index are faster, but is it of any benefit when
the int[] arrays all contain a single value?
- Is there any hope for this structure?

Thanks for the suggestion Greg, and thanks to those who responded to this
thread.



On Thursday 16 September 2004 02:44, Greg Stark wrote:
> The only kind of index that is capable of indexing this type of data
> structure for arbitrary searches would be a GiST index. I'm not aware of
> any implementation for bitfields, though it would be an appropriate use.
>
> What there is now is the contrib/intarray package. You would have to store
> more than just the bitfields, you would have to store an array of integer
> flags. That might be denser actually if you end up with many flags few of
> which are set.
>
> GiST indexes allow you to search arbitrary combinations of set and unset
> flags. using the "@@" operator
>
>   int[] @@ query_int  - returns TRUE if array satisfies query (like
> '1&(2|3)')
>
> You might be able to look at the code there and adapt it to apply to bit
> fields. If so I think it would be a useful tool. But GiST indexing is
> pretty esoteric stuff.



    primary key (person_id)
)
without oids;


pgsql-performance by date:

Previous
From: "Steinar H. Gunderson"
Date:
Subject: Re: Planner having way wrong estimate for group aggregate
Next
From: Greg Stark
Date:
Subject: Re: Comparing user attributes with bitwise operators