Re: Comparing user attributes with bitwise operators - Mailing list pgsql-performance
From: Patrick Clery
Subject: Re: Comparing user attributes with bitwise operators
Date:
Msg-id: 200410050039.24230.patrick@phpforhire.com
In response to: Re: Comparing user attributes with bitwise operators (Greg Stark <gsstark@mit.edu>)
Responses: Re: Comparing user attributes with bitwise operators
List: pgsql-performance
Sorry I have taken this long to reply, Greg, but here are the results of the personals site done with contrib/intarray:

The first thing I did was add a serial column to the attributes table. So instead of having a unique constraint on (attribute_id, value_id), every row has a unique value:

datingsite=> \d attribute_names
                 Table "public.attribute_names"
     Column     |         Type          |                                 Modifiers
----------------+-----------------------+---------------------------------------------------------------------------
 attribute_id   | integer               | not null default nextval('public.attribute_names_attribute_id_seq'::text)
 attribute_name | character varying(50) | not null
Indexes:
    "attribute_names_pkey" PRIMARY KEY, btree (attribute_id)
    "attribute_names_attribute_id_key" UNIQUE, btree (attribute_id, attribute_name)

An example insert:

insert into attribute_names (attribute_name) values ('languages');

datingsite=> \d attribute_values
                 Table "public.attribute_values"
    Column    |          Type          |                               Modifiers
--------------+------------------------+------------------------------------------------------------------------
 attribute_id | integer                | not null
 order_id     | integer                | not null default (nextval('order_id_seq'::text) - 1)
 label        | character varying(255) | not null
 value_id     | integer                | not null default nextval('public.attribute_values_value_id_seq'::text)
Indexes:
    "attribute_values_pkey" PRIMARY KEY, btree (value_id)
Foreign-key constraints:
    "attribute_values_attribute_id_fkey" FOREIGN KEY (attribute_id) REFERENCES attribute_names(attribute_id)

An example insert (22 is the attribute_id of "languages"):

insert into attribute_values (attribute_id, label) values (22, 'English');

The "value_id" column is what the integers inside the int[] arrays will reference. Even age (between 18-99) and height (between 48-84) have rows for every possible choice, as well as "Ask me!" where a user could choose to leave that blank.
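To see how a label maps to the integer that ends up in the arrays, a lookup like the following would do it (a minimal sketch, assuming attribute_id 22 is "languages" as in the insert above; the actual ids depend on insertion order):

SELECT value_id
FROM attribute_values
WHERE attribute_id = 22      -- "languages"
  AND label = 'English';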
Here is "the int[] table": create table people_attributes ( person_id int references people (person_id) on delete cascade initially deferred, askmecount int not null default 0, age int not null references attribute_values(value_id) on delete restrict, gender int not null references attribute_values(value_id) on delete restrict, bodytype int not null references attribute_values(value_id) on delete restrict, children int not null references attribute_values(value_id) on delete restrict, drinking int not null references attribute_values(value_id) on delete restrict, education int not null references attribute_values(value_id) on delete restrict, ethnicity int not null references attribute_values(value_id) on delete restrict, eyecolor int not null references attribute_values(value_id) on delete restrict, haircolor int not null references attribute_values(value_id) on delete restrict, hairstyle int not null references attribute_values(value_id) on delete restrict, height int not null references attribute_values(value_id) on delete restrict, income int not null references attribute_values(value_id) on delete restrict, languages int[] not null, occupation int not null references attribute_values(value_id) on delete restrict, orientation int not null references attribute_values(value_id) on delete restrict, relation int not null references attribute_values(value_id) on delete restrict, religion int not null references attribute_values(value_id) on delete restrict, smoking int not null references attribute_values(value_id) on delete restrict, want_children int not null references attribute_values(value_id) on delete restrict, weight int not null references attribute_values(value_id) on delete restrict, seeking int[] not null, primary key (person_id) ) without oids; If you'll notice that "seeking" and "languages" are both int[] types. I did this because those will be multiple choice. The index was created like so: create index people_attributes_search on people_attributes using gist ( (array[ age, gender, orientation, children, drinking, education, ethnicity, eyecolor, haircolor, hairstyle, height, income, occupation, relation, religion, smoking, want_children, weight ] + seeking + languages) gist__int_ops ); seeking and languages are appended with the intarray + op. I'm not going to go too in depth on how this query was generated since that was mostly done with the PHP side of things, but from the structure it should be obvious. I did, however, have to generate a few SQL functions using Smarty templates since it would be way to tedious to map all these values by hand. There are 96,000 rows (people) in the people_attributes table. 
Here is what is going on in the following query: "Show me all single (48) females (89) who are heterosexual (92), age between 18 and 31 (95|96|97|98|99|100|101|102|103|104|105|106|107|108)":

EXPLAIN ANALYZE
SELECT *
FROM people_attributes pa
WHERE person_id <> 1
  AND (ARRAY[age, gender, orientation, children, drinking, education,
             ethnicity, eyecolor, haircolor, hairstyle, height, income,
             occupation, relation, religion, smoking, want_children, weight]
       + seeking + languages)
      @@ '48 & 89 & 92 & ( 95 | 96 | 97 | 98 | 99 | 100 | 101 | 102 | 103 | 104 | 105 | 106 | 107 | 108 )'

 Index Scan using people_attributes_search on people_attributes pa
     (cost=0.00..386.45 rows=96 width=140) (actual time=0.057..19.266 rows=516 loops=1)
   Index Cond: (((ARRAY[age, gender, orientation, children, drinking, education,
     ethnicity, eyecolor, haircolor, hairstyle, height, income, occupation,
     relation, religion, smoking, want_children, weight] + seeking) + languages)
     @@ '48 & 89 & 92 & ( ( ( ( ( ( ( ( ( ( ( ( ( 95 | 96 ) | 97 ) | 98 ) | 99 )
     | 100 ) | 101 ) | 102 ) | 103 ) | 104 ) | 105 ) | 106 ) | 107 ) | 108 )'::query_int)
   Filter: (person_id <> 1)
 Total runtime: 21.646 ms

The speed only seems to vary significantly on very broad searches, e.g. "all females". But once the threshold of about 2 or more attributes is met, the times are very acceptable. If we get a little more specific by adding "non-smokers and non-drinkers between 18 and 22", slight improvements:

EXPLAIN ANALYZE
SELECT *
FROM people_attributes pa
WHERE person_id <> 1
  AND (ARRAY[age, gender, orientation, children, drinking, education,
             ethnicity, eyecolor, haircolor, hairstyle, height, income,
             occupation, relation, religion, smoking, want_children, weight]
       + seeking + languages)
      @@ '48 & 89 & 92 & ( 95 | 96 | 97 | 98 ) & 67 & 2'

 Index Scan using people_attributes_search on people_attributes pa
     (cost=0.00..386.45 rows=96 width=140) (actual time=0.077..13.090 rows=32 loops=1)
   Index Cond: (((ARRAY[age, gender, orientation, children, drinking, education,
     ethnicity, eyecolor, haircolor, hairstyle, height, income, occupation,
     relation, religion, smoking, want_children, weight] + seeking) + languages)
     @@ '48 & 89 & 92 & ( ( ( 95 | 96 ) | 97 ) | 98 ) & 67 & 2'::query_int)
   Filter: (person_id <> 1)
 Total runtime: 13.393 ms

All in all, my final thoughts on this are that it is "hella faster" than the previous methods. Vertical tables for your user attributes will not work for a personals site -- there are just too many conditions to be able to efficiently use an index. Out of all the methods I have tried, the vertical table was not even remotely scalable on large amounts of data. A horizontal table is the way to go, but it wouldn't perform like this if not for the intarray module. The array method works quite nicely, especially for columns like "languages" and "seeking" that are multiple choice.

However, even though this method is fast, I still might opt for caching the results, because the "real world" search query involves a lot more and will be executed non-stop. But to have it run this fast the first time certainly helps.

The only drawback I can think of is that the attributes no longer have values like 1,2,3 -- instead they could be any integer value. This puts a spin on the programming side of things, which required me to write "code that writes code" on a few occasions during the attribute "mapping" process. For example, keeping an associative array of all the attributes without fetching that data from the database each time.
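If sorting ever does become the bottleneck on those very broad searches, the subquery/LIMIT trick Greg suggested (quoted below) maps directly onto this schema; a sketch, where the 1000-row cap is an arbitrary choice and trades exact ordering for a bounded sort:

SELECT *
FROM (
    SELECT *
    FROM people_attributes
    WHERE (ARRAY[age, gender, orientation, children, drinking, education,
                 ethnicity, eyecolor, haircolor, hairstyle, height, income,
                 occupation, relation, religion, smoking, want_children, weight]
           + seeking + languages)
          @@ '48 & 89 & 92'::query_int   -- an intentionally broad search
    LIMIT 1000                           -- never sort more than 1000 rows
) AS candidates
ORDER BY age
LIMIT 10;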
My advice: if you're not a masochist, use a template engine (or simply parse out a print_r()) to create these PHP arrays or SQL functions.

Greg, thanks a lot for the advice. I owe you a beer ;)

On Saturday 18 September 2004 23:07, you wrote:
> Patrick Clery <patrick@phpforhire.com> writes:
> > PLAN
> > ----------------------------------------------------------------------------
> >  Limit (cost=6.03..6.03 rows=1 width=68) (actual time=69.391..69.504 rows=10 loops=1)
> >    ->  Sort (cost=6.03..6.03 rows=1 width=68) (actual time=69.381..69.418 rows=10 loops=1)
> >          Sort Key: age
> >          ->  Index Scan using people_attributes_search on people_attributes
> >              (cost=0.00..6.02 rows=1 width=68) (actual time=0.068..61.648 rows=937 loops=1)
> >                Index Cond: (('{30,31,32,33,34,35,36,37,38,39,40}'::integer[] && age)
> >                  AND ('{2}'::integer[] && gender) AND ('{1,2,4}'::integer[] && orientation))
> >  Total runtime: 69.899 ms
> > (6 rows)
> ...
>
> > - Is there a way of speeding up the sort?
>
> The sort seems to have only taken 8ms out of 69ms or just over 10%. As long
> as the index scan doesn't match too many records the sort should never be
> any slower so it shouldn't be the performance bottleneck. You might
> consider putting a subquery inside the order by with a limit to ensure that
> the sort never gets more than some safe maximum. Something like:
>
> select * from (select * from people_attributes where ... limit 1000) order
> by age limit 10
>
> This means if the query matches more than 1000 it won't be sorted properly
> by age; you'll get the top 10 out of some random subset. But you're
> protected against ever having to sort more than 1000 records.
>
> > - Will using queries like " WHERE orientation IN (1,2,4) " be any
> > better/worse?
>
> Well they won't use the GiST index, so no. If there was a single column
> with a btree index then this would be the cleanest way to go.
>
> > - The queries with the GiST index are faster, but is it of any benefit
> > when the int[] arrays all contain a single value?
>
> Well you've gone from 5 minutes to 60ms. You'll have to do more than one
> test to be sure but it sure seems like it's of some benefit.
>
> If they're always a single value you could make it an expression index
> instead and not have to change your data model.
>
> Just have the fields be integers individually and make an index as:
>
> create index idx on people_attributes using gist (
>     (array[age]) gist__int_ops,
>     (array[gender]) gist__int_ops,
>     ...
> )
>
> However I would go one step further. I would make the index simply:
>
> create index idx on people_attributes using gist (
>     (array[age,gender,orientation,...]) gist__int_ops
> )
>
> And ensure that all of these attributes have distinct domains. Ie, that
> they don't share any values. There are 4 billion integer values available
> so that shouldn't be an issue.
>
> Then you could use query_int to compare them the way you want. You
> misunderstood how query_int is supposed to work. You index an array column
> and then you can check it against a query_int just as you're currently
> checking for overlap. Think of @@ as a more powerful version of the overlap
> operator that can do complex logical expressions.
> The equivalent of
>
>     where '{30,31,32,33,34,35,36,37,38,39,40}'::int[] && age
>       and '{2}'::int[] && gender
>       and '{1,2,4}'::int[] && orientation
>
> would then become:
>
>     WHERE array[age,gender,orientation] @@
>       '(30|31|32|33|34|35|36|37|38|39|40)&(2)&(1|2|4)'
>
> except you would have to change orientation and gender to not both have a
> value of 2.
>
> You might consider the expression index a bit of overkill, actually.
> You might consider just storing a column "attributes" with an integer array
> directly in the table.
>
> You would also want a table that lists the valid attributes to be sure not
> to have any overlaps:
>
>     1   age          1
>     2   age          2
>     ...
>     101 gender       male
>     102 gender       female
>     103 orientation  straight
>     104 orientation  gay
>     105 orientation  bi
>     106 bodytype     scrawny
>     ...
>
> > - Is there any hope for this structure?
>
> You'll have to test this carefully. I tried using GiST indexes for my
> project and found that I couldn't load the data and build the GiST indexes
> fast enough. You have to test the costs of building and maintaining this
> index, especially since it has so many columns in it.
>
> But it looks like your queries are in trouble without it so hopefully it'll
> be ok on the insert/update side for you.