Thread: Comparing user attributes with bitwise operators
I'm working on a dating/personals/match-making site that has used many
different methods of "match-making", all of which seem to be very slow. One
I am attempting now, which seems to be an efficient method of storage but
not the best for indexing, is using bitwise operators to compare one
person's profile to another's.

This method seems to be very fast on a small scale, but I am dealing with a
large user base here -- in excess of 200,000 users -- that will be executing
this search function every time they log in (the search results for their
profile will appear on the main page after they have logged in). I've opted
to use "label tables" for each possible set of answers (e.g., marital
status).

For this table, the set of bits -- bit(5) -- is represented as such:

+-----+-----------+
| Bit | Meaning   |
+-----+-----------+
|  1  | single    |
|  2  | separated |
|  3  | married   |
|  4  | divorced  |
|  5  | widowed   |
+-----+-----------+

Here's the structure of the marital status table:

# \d marital_matrix
                           Table "public.marital_matrix"
  Column   |      Type      |                        Modifiers
-----------+----------------+------------------------------------------------------------------------
 member_id | integer        | not null default nextval('public.marital_matrix_member_id_seq'::text)
 status    | bit varying(5) | not null default (0)::bit(5)
 p_status  | bit varying(5) | not null default (0)::bit(5)
Indexes:
    "marital_matrix_pkey" PRIMARY KEY, btree (member_id)
    "idx_marital_matrix" btree ((status::"bit" & p_status::"bit"))
    "idx_marital_matrix_single" btree ((status::"bit" & p_status::"bit"))
    "idx_marital_p_status" btree (p_status)
    "idx_marital_status" btree (status)
Foreign-key constraints:
    "$1" FOREIGN KEY (member_id) REFERENCES members(member_id) ON DELETE CASCADE DEFERRABLE INITIALLY DEFERRED

To give you an idea of the selectivity (NOTE: there are only 50,000 rows, a
smaller sample than what I will actually be using):

datingsite=> select count(*),status,p_status from marital_matrix group by status,p_status;
 count | status | p_status
-------+--------+----------
    89 | 00001  | 00000
  1319 | 00010  | 00000
  2465 | 00100  | 00000
     1 | 00100  | 11111
 46117 | 10000  | 00000

Here is the user I'll be comparing against, who has selected that he be
matched with anyone but married people:

datingsite=> SELECT * FROM marital_matrix WHERE member_id = 21;
 member_id | status | p_status
-----------+--------+----------
        21 | 10000  | 11011
(1 row)

Here are a few possible comparison methods I can think of (NOTE: tests were
run on a 2.4GHz Intel CPU w/ 256M RAM on FreeBSD 4.10):

METHOD 1: Any bit that is set in p_status (preferred marital status) of the
searching user should be set in the potential match's marital status. This
is the method I'd like to improve, if possible. Running the query twice
didn't produce a different runtime.
EXPLAIN ANALYZE
SELECT m2.member_id
FROM marital_matrix m1, marital_matrix m2
WHERE m1.member_id = 21
  AND m2.status & m1.p_status != B'00000';

                              QUERY PLAN
---------------------------------------------------------------------------
 Nested Loop  (cost=0.00..2357.79 rows=49742 width=4) (actual time=18.062..708.469 rows=47525 loops=1)
   Join Filter: ((("inner".status)::"bit" & ("outer".p_status)::"bit") <> B'00000'::"bit")
   ->  Index Scan using marital_matrix_pkey on marital_matrix m1  (cost=0.00..5.01 rows=1 width=9) (actual time=0.035..0.045 rows=1 loops=1)
         Index Cond: (member_id = 21)
   ->  Seq Scan on marital_matrix m2  (cost=0.00..1602.91 rows=49991 width=13) (actual time=17.966..255.529 rows=49991 loops=1)
 Total runtime: 905.694 ms
(6 rows)

METHOD 2: Specifying the value directly (I don't think this will make a
difference, but I'll post it anyway):

EXPLAIN ANALYZE
SELECT member_id
FROM marital_matrix
WHERE status & B'11011' != B'00000';

                              QUERY PLAN
---------------------------------------------------------------------------
 Seq Scan on marital_matrix  (cost=0.00..1852.87 rows=49742 width=4) (actual time=18.113..281.101 rows=47525 loops=1)
   Filter: (((status)::"bit" & B'11011'::"bit") <> B'00000'::"bit")
 Total runtime: 480.836 ms
(3 rows)

METHOD 3: Checking for one bit only. This is definitely not a "real world"
example and is unacceptable, since the p_status column can and will have
multiple bits set. For categories other than marital status, such as
"preferred hair color", users are likely to select multiple bits (they
choose all that apply). This query does use the index, but is still not very
fast at all:

EXPLAIN ANALYZE
SELECT member_id
FROM marital_matrix m1
WHERE status & B'10000' = B'10000';

                              QUERY PLAN
---------------------------------------------------------------------------
 Index Scan using idx_marital_matrix_single on marital_matrix m1  (cost=0.00..903.59 rows=250 width=4) (actual time=0.042..258.907 rows=46117 loops=1)
   Index Cond: (((status)::"bit" & B'10000'::"bit") = B'10000'::"bit")
 Total runtime: 451.162 ms
(3 rows)

METHOD 4: Using an IN statement. This method seems to be very fussy about
using the index, and I have at some point made it use the index when there
are fewer than 3 possibilities. Also, for fields other than marital status,
users will be able to select many bits for their own profile, which means
there would be many permutations:

EXPLAIN ANALYZE
SELECT member_id
FROM marital_matrix
WHERE status & B'11011' IN (B'10000',B'01000');

                              QUERY PLAN
---------------------------------------------------------------------------
 Seq Scan on marital_matrix  (cost=0.00..2602.73 rows=993 width=4) (actual time=17.845..288.279 rows=47525 loops=1)
   Filter: ((((status)::"bit" & B'11011'::"bit") = B'10000'::"bit") OR (((status)::"bit" & B'11011'::"bit") = B'01000'::"bit") OR (((status)::"bit" & B'11011'::"bit") = B'00010'::"bit") OR (((status)::"bit" & B'11011'::"bit") = B'00001'::"bit"))
 Total runtime: 488.651 ms
(3 rows)

Method 3 is the only one that used the index, but the only really acceptable
method here is Method 1.

My questions are...

- Is there any hope in getting this to use an efficient index?
- Do any mathematicians know if there is a way to reorder my bitwise
  comparison so that the operator uses = and not != (perhaps to force an
  index)? (AFAIK, the answer to the second question is no.)

If anyone could offer any performance tips here I'd really appreciate it. I
imagine this schema wouldn't last an hour with the number of CPU cycles it
would be consuming on math operations.

Also, I have read the thread that was posted here by Daniel in August:
http://archives.postgresql.org/pgsql-performance/2004-08/msg00328.php

I have spoken with Daniel on this issue and we both agree it's very
difficult to find a solution that can scale to very large sites.

I would very much appreciate any advice that some experienced users may have
to offer for such a situation. TIA

Patrick
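For concreteness, here is the example row above expressed as an insert -- a
minimal illustration using only values already shown in the thread: bit 1
(single) set in status, and every bit except bit 3 (married) set in
p_status:

  insert into marital_matrix (member_id, status, p_status)
  values (21, B'10000', B'11011');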
Sounds like you want a many-to-many table that maps user_ids to match_ids.

Then you can put an index over (user_id, match_id) and the search will be
very fast.

Chris
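A minimal sketch of what Chris describes, with assumed table and column
names -- a precomputed pair table whose primary key doubles as the
(user_id, match_id) index:

  create table matches (
      user_id  int not null,
      match_id int not null,
      primary key (user_id, match_id)  -- serves as the (user_id, match_id) index
  );

  -- the login-page lookup then becomes a plain index scan:
  select match_id from matches where user_id = 21;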
Patrick Clery <patrick@phpforhire.com> writes:

> Method 3 is the only one that used the index, but the only really
> acceptable method here is Method 1.
>
> My questions are...
> - Is there any hope in getting this to use an efficient index?
> - Do any mathematicians know if there is a way to reorder my bitwise
>   comparison so that the operator uses = and not != (perhaps to force an
>   index)? (AFAIK, the answer to the second question is no.)

The only kind of index capable of indexing this type of data structure for
arbitrary searches would be a GiST index. I'm not aware of any
implementation for bitfields, though it would be an appropriate use.

What there is now is the contrib/intarray package. You would have to store
more than just the bitfields; you would have to store an array of integer
flags. That might actually be denser if you end up with many flags, few of
which are set.

GiST indexes allow you to search arbitrary combinations of set and unset
flags, using the "@@" operator:

  int[] @@ query_int - returns TRUE if array satisfies query (like '1&(2|3)')

You might be able to look at the code there and adapt it to apply to bit
fields. If so, I think it would be a useful tool. But GiST indexing is
pretty esoteric stuff.

-- 
greg
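A minimal sketch of the intarray approach Greg outlines, with assumed names
and invented flag numbers; the flags live in an int[] column and @@ does the
arbitrary and/or search against a GiST index:

  create table member_flags (
      member_id int primary key,
      flags     int[] not null             -- integer flags instead of bits
  );
  create index member_flags_idx
      on member_flags using gist (flags gist__int_ops);

  -- true for rows having flag 1 set, plus flag 2 or flag 3
  -- (the README's own example query)
  select member_id from member_flags where flags @@ '1&(2|3)'::query_int;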
Patrick Clery <patrick@phpforhire.com> writes:

> Here's the structure of the marital status table:

I also find it very odd that you have a "marital status table". Marital
status is just one attribute of a member. Do you expect to have more than
one marital status bitfield per member? How would you distinguish which one
to use?

It's also going to make it very hard to combine criteria against other
attributes. Even if you do manage to get a GiST index to work against
marital status, and you do the same with the others, Postgres will have to
do some sort of merge join between them. It also means you'll have to write
the same code over and over for each of these tables.

I think you're much more likely to want to merge all these attributes into a
single "member_attributes" table, or even into the member table itself. Then
your goal would be to match all the member_attribute bits against all the
member_preferred bits in the right way.

The more conventional approach is to break them out into a separate fact
table:

  member_id, attribute_id

and then just have a list of the pairs that apply. This kind of normalized
data is much more flexible for writing all kinds of queries against. But as
you've found, it's hard to optimize this to be fast enough for transactional
use. I think the normal approach with dating sites is to leave this for a
batch job that populates a match table for everyone, and just have the web
site display the contents out of that table.

-- 
greg
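A hedged sketch of the conventional layout Greg describes (names assumed);
the pair table is trivial to query, and a batch job would fold its contents
into a precomputed match table like the one Chris suggested earlier:

  create table member_attributes (
      member_id    int not null,
      attribute_id int not null,
      primary key (member_id, attribute_id)
  );
  -- each applicable (member, attribute) pair is simply one row here;
  -- a nightly batch job scans it to populate the per-user match table.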
Christopher Kings-Lynne wrote:

> Sounds like you want a many-to-many table that maps user_ids to match_ids.
>
> Then you can put an index over (user_id, match_id) and the search will
> be very fast.
>
> Chris

If I understand you correctly, I believe I've tried this approach. While
matching on a single attribute and a single value was indeed very fast and
used an index, as soon as I tried to match on more than one value (where
valueid in (1, 2, 3)) the index was no longer used. Since my approach used
ints, I used in(), which is effectively "or", which is presumably why the
index is no longer used. With the bit, one would do a bitwise "or" (where
value & search = value). This cannot be easily indexed, afaik.

The other problem I had with a 1:many table, where there was a row for every
person's attributes (~20M rows), was that somehow time was lost in either
sorting or somewhere else. Individual queries against a single attribute
would be very fast, but as soon as I tried to join another attribute, the
query times got really bad. See http://sh.nu/w/email.txt line 392 (don't
worry, there are line numbers in the page).

So far I've stuck with my original plan, which is to maintain a 1:1 table of
people:attributes where each attribute is in its own column. Still, no index
is used, but it's been the best performer up to now. I'm still looking for a
better plan though.

Daniel

-- 
Daniel Ceregatti - Programmer
Omnis Network, LLC

You are fighting for survival in your own sweet and gentle way.
I have currently implemented a schema for my "dating site" that stores user
search preferences and user attributes in an int[] array using the
contrib/intarray package (suggested by Greg Stark). But there are a few
problems.

a) query_int can't be cast to int4.

b) query_int can't be indexed:

datingsite=> alter table people_attributes add column bla query_int;
ALTER TABLE
datingsite=> create index idx_query_int on people_attributes (bla);
ERROR:  data type query_int has no default operator class for access method "btree"
HINT:  You must specify an operator class for the index or define a default operator class for the data type.
datingsite=> create index idx_query_int on people_attributes (bla gist__int_ops);
ERROR:  operator class "gist__int_ops" does not exist for access method "btree"
datingsite=> alter table people_attributes drop column bla;
ALTER TABLE

c) query_int can only be used in one operation against int[]:

README.intarray: int[] @@ query_int - returns TRUE if array satisfies query (like '1&(2|3)')

It is not possible to use >=, <=, =, etc. Also, this operator does not work
the way the example says:

datingsite=> select '{2,3}'::int[] @@ '1'::query_int;
 ?column?
----------
 f
(1 row)

d) I can't find a way to simply check whether an integer is in an array
without declaring it as an array. Therefore, I need to use an int[] type for
a column that will only be storing one int4 if I want to compare it to an
int[] array:

README.intarray: int[] && int[] - overlap - returns TRUE if arrays has at least one common elements.

e) int[] and query_int are somewhat ugly to deal with, since query_int needs
to be quoted as a string, and int[] is returned as '{1,2,3}'. Or maybe I'm
just being anal. :)

Because of these limitations, I've chosen to declare the attribute columns
as int[] arrays (even though they will only contain one value) so that I can
use '{1,2,3}'::int[] && column_name:

README.intarray: int[] && int[] - overlap - returns TRUE if arrays has at least one common elements.
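For concreteness, a minimal demonstration of the overlap operator that this
workaround relies on:

  select '{1,2,3}'::int[] && '{2}'::int[];   -- t: 2 appears in both arrays
  select '{1,2,3}'::int[] && '{7}'::int[];   -- f: no common element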
Here is the schema:

create table people (
    person_id serial,
    datecreated timestamp with time zone default now(),
    signup_ip cidr not null,
    username character varying(30) not null,
    password character varying(28) not null,
    email character varying(65) not null,
    dob date not null,
    primary key (person_id)
);

create table people_attributes (
    person_id int references people (person_id) on delete cascade initially deferred,
    askmecount int not null default 0,
    age int[] not null default '{1}'::int[],
    gender int[] not null default '{1}'::int[],
    orientation int[] not null default '{1}'::int[],
    bodytype int[] not null default '{1}'::int[],
    children int[] not null default '{1}'::int[],
    drinking int[] not null default '{1}'::int[],
    education int[] not null default '{1}'::int[],
    ethnicity int[] not null default '{1}'::int[],
    eyecolor int[] not null default '{1}'::int[],
    haircolor int[] not null default '{1}'::int[],
    hairstyle int[] not null default '{1}'::int[],
    height int[] not null default '{1}'::int[],
    income int[] not null default '{1}'::int[],
    occupation int[] not null default '{1}'::int[],
    relation int[] not null default '{1}'::int[], /* multiple answer */
    religion int[] not null default '{1}'::int[],
    seeking int[] not null default '{1}'::int[], /* multiple answer */
    smoking int[] not null default '{1}'::int[],
    want_children int[] not null default '{1}'::int[],
    weight int[] not null default '{1}'::int[],
    primary key (person_id)
) without oids;

create index people_attributes_search on people_attributes using gist (
    age gist__int_ops,
    gender gist__int_ops,
    orientation gist__int_ops,
    bodytype gist__int_ops,
    children gist__int_ops,
    drinking gist__int_ops,
    education gist__int_ops,
    ethnicity gist__int_ops,
    eyecolor gist__int_ops,
    haircolor gist__int_ops,
    hairstyle gist__int_ops,
    height gist__int_ops,
    income gist__int_ops,
    occupation gist__int_ops,
    relation gist__int_ops,
    religion gist__int_ops,
    seeking gist__int_ops,
    smoking gist__int_ops,
    want_children gist__int_ops,
    weight gist__int_ops
);

/* These will be compared against the people_attributes table */
create table people_searchprefs (
    person_id int references people (person_id) on delete cascade initially deferred,
    age int[] not null default '{18,19,20,21,22,23,24,25,26,27,28,29,30}'::int[],
    gender int[] not null default '{1,2,4}'::int[],
    orientation int[] not null default '{1,2,8}'::int[],
    bodytype int[] not null default '{1,2,3,4,5,6}'::int[],
    children int[] not null default '{0}'::int[],
    drinking int[] not null default '{0}'::int[],
    education int[] not null default '{0}'::int[],
    ethnicity int[] not null default '{0}'::int[],
    eyecolor int[] not null default '{0}'::int[],
    haircolor int[] not null default '{0}'::int[],
    hairstyle int[] not null default '{0}'::int[],
    height int[] not null default '{0}'::int[],
    income int[] not null default '{0}'::int[],
    occupation int[] not null default '{0}'::int[],
    relation int[] not null default '{0}'::int[],
    religion int[] not null default '{0}'::int[],
    seeking int[] not null default '{0}'::int[],
    smoking int[] not null default '{0}'::int[],
    want_children int[] not null default '{0}'::int[],
    weight int[] not null default '{0}'::int[],
    primary key (person_id)
) without oids;

And now, the moment you've all been waiting for: performance!

(Number of profiles:)

datingsite=> select count(*) from people_attributes;
 count
-------
 96146
(1 row)

(Age, gender, and sexual orientation will always be part of the query, and
are necessary to invoke the index. The query is to show females, age 30-40,
of any orientation. But first, without the index:)

explain analyze
select person_id, gender
from people_attributes
where '{30,31,32,33,34,35,36,37,38,39,40}'::int[] && age
  and '{2}'::int[] && gender
  and '{1,2,4}'::int[] && orientation;

                              QUERY PLAN
---------------------------------------------------------------------------
 Seq Scan on people_attributes  (cost=0.00..9078.56 rows=1 width=36) (actual time=0.044..299.537 rows=937 loops=1)
   Filter: (('{30,31,32,33,34,35,36,37,38,39,40}'::integer[] && age) AND ('{2}'::integer[] && gender) AND ('{1,2,4}'::integer[] && orientation))
 Total runtime: 304.707 ms
(3 rows)

(With the index:)

explain analyze
select person_id, gender
from people_attributes
where '{30,31,32,33,34,35,36,37,38,39,40}'::int[] && age
  and '{2}'::int[] && gender
  and '{1,2,4}'::int[] && orientation;

                              QUERY PLAN
---------------------------------------------------------------------------
 Index Scan using people_attributes_search on people_attributes  (cost=0.00..6.02 rows=1 width=36) (actual time=0.064..52.383 rows=937 loops=1)
   Index Cond: (('{30,31,32,33,34,35,36,37,38,39,40}'::integer[] && age) AND ('{2}'::integer[] && gender) AND ('{1,2,4}'::integer[] && orientation))
 Total runtime: 57.032 ms
(3 rows)

(More realistically, it will have a limit of 10:)

explain analyze
select person_id, gender
from people_attributes
where '{30,31,32,33,34,35,36,37,38,39,40}'::int[] && age
  and '{2}'::int[] && gender
  and '{1,2,4}'::int[] && orientation
limit 10;

                              QUERY PLAN
---------------------------------------------------------------------------
 Limit  (cost=0.00..6.02 rows=1 width=36) (actual time=0.235..0.651 rows=10 loops=1)
   ->  Index Scan using people_attributes_search on people_attributes  (cost=0.00..6.02 rows=1 width=36) (actual time=0.224..0.550 rows=10 loops=1)
         Index Cond: (('{30,31,32,33,34,35,36,37,38,39,40}'::integer[] && age) AND ('{2}'::integer[] && gender) AND ('{1,2,4}'::integer[] && orientation))
 Total runtime: 0.817 ms
(4 rows)

(Slower with a sort key:)

explain analyze
select person_id, gender
from people_attributes
where '{30,31,32,33,34,35,36,37,38,39,40}'::int[] && age
  and '{2}'::int[] && gender
  and '{1,2,4}'::int[] && orientation
order by age;

                              QUERY PLAN
---------------------------------------------------------------------------
 Sort  (cost=6.03..6.03 rows=1 width=68) (actual time=62.572..66.338 rows=937 loops=1)
   Sort Key: age
   ->  Index Scan using people_attributes_search on people_attributes  (cost=0.00..6.02 rows=1 width=68) (actual time=0.223..55.999 rows=937 loops=1)
         Index Cond: (('{30,31,32,33,34,35,36,37,38,39,40}'::integer[] && age) AND ('{2}'::integer[] && gender) AND ('{1,2,4}'::integer[] && orientation))
 Total runtime: 71.206 ms
(5 rows)

(No better with a limit:)

explain analyze
select person_id, gender
from people_attributes
where '{30,31,32,33,34,35,36,37,38,39,40}'::int[] && age
  and '{2}'::int[] && gender
  and '{1,2,4}'::int[] && orientation
order by age limit 10;

                              QUERY PLAN
---------------------------------------------------------------------------
 Limit  (cost=6.03..6.03 rows=1 width=68) (actual time=69.391..69.504 rows=10 loops=1)
   ->  Sort  (cost=6.03..6.03 rows=1 width=68) (actual time=69.381..69.418 rows=10 loops=1)
         Sort Key: age
         ->  Index Scan using people_attributes_search on people_attributes  (cost=0.00..6.02 rows=1 width=68) (actual time=0.068..61.648 rows=937 loops=1)
               Index Cond: (('{30,31,32,33,34,35,36,37,38,39,40}'::integer[] && age) AND ('{2}'::integer[] && gender) AND ('{1,2,4}'::integer[] && orientation))
 Total runtime: 69.899 ms
(6 rows)

The last query is the most likely, since I will need to be sorting by some
key. If I weren't sorting, it looks like it wouldn't be too bad, but sorting
is inevitable, I think. I've only imported 96,146 of the 150,000 profiles.
This seems a bit slow now, and it doesn't look like it will scale.

My questions are:

- Is there a way of speeding up the sort?
- Will using queries like "WHERE orientation IN (1,2,4)" be any
  better/worse?
- The queries with the GiST index are faster, but is it of any benefit when
  the int[] arrays all contain a single value?
- Is there any hope for this structure?

Thanks for the suggestion, Greg, and thanks to those who responded to this
thread.

On Thursday 16 September 2004 02:44, Greg Stark wrote:
> What there is now is the contrib/intarray package. You would have to store
> more than just the bitfields; you would have to store an array of integer
> flags. [...]
Patrick Clery <patrick@phpforhire.com> writes:

>  Limit  (cost=6.03..6.03 rows=1 width=68) (actual time=69.391..69.504 rows=10 loops=1)
>    ->  Sort  (cost=6.03..6.03 rows=1 width=68) (actual time=69.381..69.418 rows=10 loops=1)
>          Sort Key: age
>          ->  Index Scan using people_attributes_search on people_attributes  (cost=0.00..6.02 rows=1 width=68) (actual time=0.068..61.648 rows=937 loops=1)
>                Index Cond: (('{30,31,32,33,34,35,36,37,38,39,40}'::integer[] && age) AND ('{2}'::integer[] && gender) AND ('{1,2,4}'::integer[] && orientation))
>  Total runtime: 69.899 ms
> (6 rows)
...
> - Is there a way of speeding up the sort?

The sort seems to have taken only 8ms out of 69ms, or just over 10%. As long
as the index scan doesn't match too many records, the sort should never be
any slower, so it shouldn't be the performance bottleneck. You might
consider wrapping the query in a subselect with a limit, to ensure that the
sort never gets more than some safe maximum. Something like:

select * from (select * from people_attributes where ... limit 1000) as x
order by age limit 10;

This means if the query matches more than 1000 records it won't be sorted
properly by age; you'll get the top 10 out of some random subset. But you're
protected against ever having to sort more than 1000 records.

> - Will using queries like "WHERE orientation IN (1,2,4)" be any
>   better/worse?

Well, they won't use the GiST index, so no. If there were a single column
with a btree index, then this would be the cleanest way to go.

> - The queries with the GiST index are faster, but is it of any benefit
>   when the int[] arrays all contain a single value?

Well, you've gone from 5 minutes to 60ms. You'll have to do more than one
test to be sure, but it certainly seems to be of some benefit.

If they're always a single value, you could make it an expression index
instead and not have to change your data model. Just have the fields be
individual integers and make an index as:

create index idx on people_attributes using gist (
    (array[age]) gist__int_ops,
    (array[gender]) gist__int_ops,
    ...
);

However, I would go one step further. I would make the index simply:

create index idx on people_attributes using gist (
    (array[age, gender, orientation, ...]) gist__int_ops
);

and ensure that all of these attributes have distinct domains, i.e., that
they don't share any values. There are 4 billion integer values available,
so that shouldn't be an issue.

Then you could use query_int to compare them the way you want. You
misunderstood how query_int is supposed to work. You index an array column,
and then you can check it against a query_int just as you're currently
checking for overlap. Think of @@ as a more powerful version of the overlap
operator that can do complex logical expressions.

The equivalent of

  where '{30,31,32,33,34,35,36,37,38,39,40}'::int[] && age
    and '{2}'::int[] && gender
    and '{1,2,4}'::int[] && orientation

would then become:

  WHERE array[age,gender,orientation] @@ '(30|31|32|33|34|35|36|37|38|39|40)&(2)&(1|2|4)'

except you would have to change orientation and gender to not both have a
value of 2.

You might consider the expression index a bit of overkill, actually. You
might consider just storing a column "attributes" with an integer array
directly in the table.

You would also want a table that lists the valid attributes, to be sure not
to have any overlaps:

  1    age          1
  2    age          2
  ...
  101  gender       male
  102  gender       female
  103  orientation  straight
  104  orientation  gay
  105  orientation  bi
  106  bodytype     scrawny
  ...

> - Is there any hope for this structure?

You'll have to test this carefully. I tried using GiST indexes for my
project and found that I couldn't load the data and build the GiST indexes
fast enough. You have to test the costs of building and maintaining this
index, especially since it has so many columns in it. But it looks like your
queries are in trouble without it, so hopefully it'll be ok on the
insert/update side for you.

-- 
greg
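A hedged sketch of the simplest variant Greg mentions -- a single int[]
"attributes" column indexed once -- using the illustrative value ids from
his list above (table and column names assumed):

  create table member_attrs (
      member_id int primary key,
      attrs     int[] not null  -- e.g. '{31,102,103}': age 31, female, straight
  );
  create index member_attrs_idx
      on member_attrs using gist (attrs gist__int_ops);

  -- females aged 30-32, any of the three orientations:
  select member_id from member_attrs
  where attrs @@ '(30|31|32) & 102 & (103|104|105)'::query_int;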
Sorry I have taken this long to reply, Greg, but here are the results of the
personals site done with contrib/intarray:

The first thing I did was add a serial column to the attributes table. So
instead of having a unique constraint on (attribute_id, value_id), every row
has a unique value:

datingsite=> \d attribute_names
                        Table "public.attribute_names"
     Column     |         Type          |                         Modifiers
----------------+-----------------------+----------------------------------------------------------------------------
 attribute_id   | integer               | not null default nextval('public.attribute_names_attribute_id_seq'::text)
 attribute_name | character varying(50) | not null
Indexes:
    "attribute_names_pkey" PRIMARY KEY, btree (attribute_id)
    "attribute_names_attribute_id_key" UNIQUE, btree (attribute_id, attribute_name)

An example insert:

insert into attribute_names (attribute_name) values ('languages');

datingsite=> \d attribute_values
                        Table "public.attribute_values"
    Column    |          Type          |                        Modifiers
--------------+------------------------+-------------------------------------------------------------------------
 attribute_id | integer                | not null
 order_id     | integer                | not null default (nextval('order_id_seq'::text) - 1)
 label        | character varying(255) | not null
 value_id     | integer                | not null default nextval('public.attribute_values_value_id_seq'::text)
Indexes:
    "attribute_values_pkey" PRIMARY KEY, btree (value_id)
Foreign-key constraints:
    "attribute_values_attribute_id_fkey" FOREIGN KEY (attribute_id) REFERENCES attribute_names(attribute_id)

An example insert (22 is the attribute_id of "languages"):

insert into attribute_values (attribute_id, label) values (22, 'English');

The "value_id" column is what the integers inside the int[] arrays
reference. Even age (between 18 and 99) and height (between 48 and 84) have
rows for every possible choice, as well as "Ask me!" where a user could
choose to leave that blank.
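As a quick illustration (using an arbitrary value_id), resolving a stored
integer back to its attribute and label is a straightforward join over the
two tables above:

  select an.attribute_name, av.label
  from attribute_values av
  join attribute_names an using (attribute_id)
  where av.value_id = 88;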
Here is "the int[] table": create table people_attributes ( person_id int references people (person_id) on delete cascade initially deferred, askmecount int not null default 0, age int not null references attribute_values(value_id) on delete restrict, gender int not null references attribute_values(value_id) on delete restrict, bodytype int not null references attribute_values(value_id) on delete restrict, children int not null references attribute_values(value_id) on delete restrict, drinking int not null references attribute_values(value_id) on delete restrict, education int not null references attribute_values(value_id) on delete restrict, ethnicity int not null references attribute_values(value_id) on delete restrict, eyecolor int not null references attribute_values(value_id) on delete restrict, haircolor int not null references attribute_values(value_id) on delete restrict, hairstyle int not null references attribute_values(value_id) on delete restrict, height int not null references attribute_values(value_id) on delete restrict, income int not null references attribute_values(value_id) on delete restrict, languages int[] not null, occupation int not null references attribute_values(value_id) on delete restrict, orientation int not null references attribute_values(value_id) on delete restrict, relation int not null references attribute_values(value_id) on delete restrict, religion int not null references attribute_values(value_id) on delete restrict, smoking int not null references attribute_values(value_id) on delete restrict, want_children int not null references attribute_values(value_id) on delete restrict, weight int not null references attribute_values(value_id) on delete restrict, seeking int[] not null, primary key (person_id) ) without oids; If you'll notice that "seeking" and "languages" are both int[] types. I did this because those will be multiple choice. The index was created like so: create index people_attributes_search on people_attributes using gist ( (array[ age, gender, orientation, children, drinking, education, ethnicity, eyecolor, haircolor, hairstyle, height, income, occupation, relation, religion, smoking, want_children, weight ] + seeking + languages) gist__int_ops ); seeking and languages are appended with the intarray + op. I'm not going to go too in depth on how this query was generated since that was mostly done with the PHP side of things, but from the structure it should be obvious. I did, however, have to generate a few SQL functions using Smarty templates since it would be way to tedious to map all these values by hand. There are 96,000 rows (people) in the people_attributes table. 
Here is what is going on in the following query: "Show me all single (48)
females (88) who are heterosexual (92), age between 18 and 31
(95|96|97|98|99|100|101|102|103|104|105|106|107|108)."

EXPLAIN ANALYZE
SELECT *
FROM people_attributes pa
WHERE person_id <> 1
  AND (ARRAY[age, gender, orientation, children, drinking, education,
             ethnicity, eyecolor, haircolor, hairstyle, height, income,
             occupation, relation, religion, smoking, want_children, weight]
       + seeking + languages)
      @@ '48 & 89 & 92 & ( 95 | 96 | 97 | 98 | 99 | 100 | 101 | 102 | 103 | 104 | 105 | 106 | 107 | 108 )';

 Index Scan using people_attributes_search on people_attributes pa  (cost=0.00..386.45 rows=96 width=140) (actual time=0.057..19.266 rows=516 loops=1)
   Index Cond: (((ARRAY[age, gender, orientation, children, drinking, education, ethnicity, eyecolor, haircolor, hairstyle, height, income, occupation, relation, religion, smoking, want_children, weight] + seeking) + languages) @@ '48 & 89 & 92 & ( ( ( ( ( ( ( ( ( ( ( ( ( 95 | 96 ) | 97 ) | 98 ) | 99 ) | 100 ) | 101 ) | 102 ) | 103 ) | 104 ) | 105 ) | 106 ) | 107 ) | 108 )'::query_int)
   Filter: (person_id <> 1)
 Total runtime: 21.646 ms

The speed only seems to vary significantly on very broad searches, e.g.,
"all females". But once the threshold of about two or more attributes is
met, the times are very acceptable. If we get a little more specific by
adding "non-smokers and non-drinkers between 18 and 22", there are slight
improvements:

EXPLAIN ANALYZE
SELECT *
FROM people_attributes pa
WHERE person_id <> 1
  AND (ARRAY[age, gender, orientation, children, drinking, education,
             ethnicity, eyecolor, haircolor, hairstyle, height, income,
             occupation, relation, religion, smoking, want_children, weight]
       + seeking + languages)
      @@ '48 & 89 & 92 & ( 95 | 96 | 97 | 98 ) & 67 & 2';

 Index Scan using people_attributes_search on people_attributes pa  (cost=0.00..386.45 rows=96 width=140) (actual time=0.077..13.090 rows=32 loops=1)
   Index Cond: (((ARRAY[age, gender, orientation, children, drinking, education, ethnicity, eyecolor, haircolor, hairstyle, height, income, occupation, relation, religion, smoking, want_children, weight] + seeking) + languages) @@ '48 & 89 & 92 & ( ( ( 95 | 96 ) | 97 ) | 98 ) & 67 & 2'::query_int)
   Filter: (person_id <> 1)
 Total runtime: 13.393 ms

All in all, my final thoughts on this are that it is "hella faster" than the
previous methods. Vertical tables for your user attributes will not work for
a personals site -- there are just too many conditions to be able to
efficiently use an index. Out of all the methods I have tried, the vertical
table was not even remotely scalable on large amounts of data. A horizontal
table is the way to go, but it wouldn't perform like this if not for the
intarray module. The array method works quite nicely, especially for columns
like "languages" and "seeking" that are multiple choice. However, even
though this method is fast, I still might opt for caching the results,
because the "real world" search query involves a lot more and will be
executed non-stop. But to have it run this fast the first time certainly
helps.

The only drawback I can think of is that the attributes no longer have
values like 1, 2, 3 -- instead they could be any integer value. This puts a
spin on the programming side of things, which required me to write "code
that writes code" on a few occasions during the attribute "mapping" process
-- for example, keeping an associative array of all the attributes without
fetching that data from the database each time.
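One hedged way to trim down that generated code: the OR-groups can be
assembled in SQL itself, since array_to_string() joins a list of value ids
with any separator. A minimal sketch reusing the age ids from the first
query above:

  select '48 & 89 & 92 & ('
         || array_to_string('{95,96,97,98,99,100,101,102,103,104,105,106,107,108}'::int[], '|')
         || ')' as query_text;
  -- builds the same text as the hand-written query_int literal above,
  -- ready to be interpolated into the search query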
My advice: if you're not a masochist, use a template engine (or simply parse
the output of a print_r()) to create these PHP arrays or SQL functions.

Greg, thanks a lot for the advice. I owe you a beer ;)

On Saturday 18 September 2004 23:07, you wrote:
> However I would go one step further. I would make the index simply:
>
> create index idx on people_attributes using gist (
>     (array[age, gender, orientation, ...]) gist__int_ops
> );
>
> and ensure that all of these attributes have distinct domains. [...]
Patrick,

First off, thanks for posting this solution! I love to see a new demo of The
Power of Postgres(tm), and have been wondering about this particular problem
since it came up on IRC.

> The array method works quite nicely, especially for the columns like
> "languages" and "seeking" that are multiple choice. However, even though
> this method is fast, I still might opt for caching the results, because
> the "real world" search query involves a lot more and will be executed
> non-stop. But to have it run this fast the first time certainly helps.

Now, for the bad news: you need to test having a large load of users
updating their data. The drawback to GiST indexes is that they are
low-concurrency, because the updating process needs to lock the whole index
(this has been on our TODO list for about a decade, but it's a hard
problem).

-- 
Josh Berkus
Aglio Database Solutions
San Francisco
Another problem I should note is that when I first insert all the data into
the people_attributes table ("the int[] table"), the GiST index is not used:

THE INDEX:

"people_attributes_search" gist ((ARRAY[age, gender, orientation, children, drinking, education, ethnicity, eyecolor, haircolor, hairstyle, height, income, occupation, relation, religion, smoking, want_children, weight] + seeking + languages))

PART OF THE QUERY PLAN:

 Seq Scan on people_attributes pa  (cost=0.00..0.00 rows=1 width=20)
   Filter: (((ARRAY[age, gender, orientation, children, drinking, education, ethnicity, eyecolor, haircolor, hairstyle, height, income, occupation, relation, religion, smoking, want_children, weight] + seeking) + languages) @@ '( ( 4 | 5 ) | 6 ) & 88 & 48 & ( 69 | 70 ) & 92 & ( ( ( ( ( ( ( ( ( ( ( ( ( 95 | 96 ) | 97 ) | 98 ) | 99 ) | 100 ) | 101 ) | 102 ) | 103 ) | 104 ) | 105 ) | 106 ) | 107 ) | 108 ) & ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( 190 | 191 ) | 192 ) | 193 ) | 194 ) | 195 ) | 196 ) | 197 ) | 198 ) | 199 ) | 200 ) | 201 ) | 202 ) | 203 ) | 204 ) | 205 ) | 206 ) | 207 ) | 208 ) | 209 ) | 210 ) | 211 ) | 212 ) | 213 ) | 214 ) | 215 ) | 216 ) | 217 ) | 218 ) | 219 ) | 220 ) | 221 ) | 222 ) | 223 ) | 224 ) | 225 ) | 226 ) | 227 ) | 228 ) | 229 ) | 230 ) | 231 ) | 232 ) | 233 ) | 234 ) | 235 ) | 236 ) | 237 ) | 238 ) | 239 ) | 240 ) | 241 ) | 242 ) | 243 )'::query_int)

So I run "VACUUM ANALYZE people_attributes", then run it again:

PART OF THE QUERY PLAN:

 Index Scan using people_attributes_pkey on people_attributes pa  (cost=0.00..5.32 rows=1 width=20)
   Index Cond: (pa.person_id = "outer".person_id)
   Filter: (((ARRAY[age, gender, orientation, children, drinking, education, ethnicity, eyecolor, haircolor, hairstyle, height, income, occupation, relation, religion, smoking, want_children, weight] + seeking) + languages) @@ '( ( 4 | 5 ) | 6 ) & 88 & 48 & ( 69 | 70 ) & 92 & ( ( ( ( ( ( ( ( ( ( ( ( ( 95 | 96 ) | 97 ) | 98 ) | 99 ) | 100 ) | 101 ) | 102 ) | 103 ) | 104 ) | 105 ) | 106 ) | 107 ) | 108 ) & ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( 190 | 191 ) | 192 ) | 193 ) | 194 ) | 195 ) | 196 ) | 197 ) | 198 ) | 199 ) | 200 ) | 201 ) | 202 ) | 203 ) | 204 ) | 205 ) | 206 ) | 207 ) | 208 ) | 209 ) | 210 ) | 211 ) | 212 ) | 213 ) | 214 ) | 215 ) | 216 ) | 217 ) | 218 ) | 219 ) | 220 ) | 221 ) | 222 ) | 223 ) | 224 ) | 225 ) | 226 ) | 227 ) | 228 ) | 229 ) | 230 ) | 231 ) | 232 ) | 233 ) | 234 ) | 235 ) | 236 ) | 237 ) | 238 ) | 239 ) | 240 ) | 241 ) | 242 ) | 243 )'::query_int)

Still not using the index. I'm trying to DROP INDEX and recreate it, but the
query just stalls. I remember that the last time this happened, I just
dropped and recreated the index and, voila, it was using the index again.
Now I can't seem to get this index to drop.
Here's the table structure:

    Column     |   Type    |     Modifiers
---------------+-----------+--------------------
 person_id     | integer   | not null
 askmecount    | integer   | not null default 0
 age           | integer   | not null
 gender        | integer   | not null
 bodytype      | integer   | not null
 children      | integer   | not null
 drinking      | integer   | not null
 education     | integer   | not null
 ethnicity     | integer   | not null
 eyecolor      | integer   | not null
 haircolor     | integer   | not null
 hairstyle     | integer   | not null
 height        | integer   | not null
 income        | integer   | not null
 languages     | integer[] | not null
 occupation    | integer   | not null
 orientation   | integer   | not null
 relation      | integer   | not null
 religion      | integer   | not null
 smoking       | integer   | not null
 want_children | integer   | not null
 weight        | integer   | not null
 seeking       | integer[] | not null
Indexes:
    "people_attributes_pkey" PRIMARY KEY, btree (person_id)
    "people_attributes_search" gist ((ARRAY[age, gender, orientation, children, drinking, education, ethnicity, eyecolor, haircolor, hairstyle, height, income, occupation, relation, religion, smoking, want_children, weight] + seeking + languages))
Foreign-key constraints:
    "people_attributes_weight_fkey" FOREIGN KEY (weight) REFERENCES attribute_values(value_id) ON DELETE RESTRICT
    "people_attributes_person_id_fkey" FOREIGN KEY (person_id) REFERENCES people(person_id) ON DELETE CASCADE DEFERRABLE INITIALLY DEFERRED
    "people_attributes_age_fkey" FOREIGN KEY (age) REFERENCES attribute_values(value_id) ON DELETE RESTRICT
    "people_attributes_gender_fkey" FOREIGN KEY (gender) REFERENCES attribute_values(value_id) ON DELETE RESTRICT
    "people_attributes_bodytype_fkey" FOREIGN KEY (bodytype) REFERENCES attribute_values(value_id) ON DELETE RESTRICT
    "people_attributes_children_fkey" FOREIGN KEY (children) REFERENCES attribute_values(value_id) ON DELETE RESTRICT
    "people_attributes_drinking_fkey" FOREIGN KEY (drinking) REFERENCES attribute_values(value_id) ON DELETE RESTRICT
    "people_attributes_education_fkey" FOREIGN KEY (education) REFERENCES attribute_values(value_id) ON DELETE RESTRICT
    "people_attributes_ethnicity_fkey" FOREIGN KEY (ethnicity) REFERENCES attribute_values(value_id) ON DELETE RESTRICT
    "people_attributes_eyecolor_fkey" FOREIGN KEY (eyecolor) REFERENCES attribute_values(value_id) ON DELETE RESTRICT
    "people_attributes_haircolor_fkey" FOREIGN KEY (haircolor) REFERENCES attribute_values(value_id) ON DELETE RESTRICT
    "people_attributes_hairstyle_fkey" FOREIGN KEY (hairstyle) REFERENCES attribute_values(value_id) ON DELETE RESTRICT
    "people_attributes_height_fkey" FOREIGN KEY (height) REFERENCES attribute_values(value_id) ON DELETE RESTRICT
    "people_attributes_income_fkey" FOREIGN KEY (income) REFERENCES attribute_values(value_id) ON DELETE RESTRICT
    "people_attributes_occupation_fkey" FOREIGN KEY (occupation) REFERENCES attribute_values(value_id) ON DELETE RESTRICT
    "people_attributes_orientation_fkey" FOREIGN KEY (orientation) REFERENCES attribute_values(value_id) ON DELETE RESTRICT
    "people_attributes_relation_fkey" FOREIGN KEY (relation) REFERENCES attribute_values(value_id) ON DELETE RESTRICT
    "people_attributes_religion_fkey" FOREIGN KEY (religion) REFERENCES attribute_values(value_id) ON DELETE RESTRICT
    "people_attributes_smoking_fkey" FOREIGN KEY (smoking) REFERENCES attribute_values(value_id) ON DELETE RESTRICT
    "people_attributes_want_children_fkey" FOREIGN KEY (want_children) REFERENCES attribute_values(value_id) ON DELETE RESTRICT

Is it all the foreign keys that are stalling the drop? I have done VACUUM
ANALYZE on the entire db. Could anyone offer some insight as to why this
index is not being used, or why the index is not dropping easily?

On Tuesday 05 October 2004 10:32, you wrote:
> Now, for the bad news: you need to test having a large load of users
> updating their data. The drawback to GiST indexes is that they are
> low-concurrency, because the updating process needs to lock the whole
> index [...]
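For what it's worth, a hedged way to check whether another session's lock is
what stalls the DROP INDEX (pg_locks is a standard system view, though its
exact columns vary a little between versions):

  select l.pid, l.mode, l.granted, c.relname
  from pg_locks l
  join pg_class c on c.oid = l.relation
  where c.relname in ('people_attributes', 'people_attributes_search');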
Err... I REINDEX'ed it and it is now using the index. :)

I'd still appreciate it if anyone could tell me why this needs to be
reindexed. Is the index not updated when the records are inserted?
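For reference, the two maintenance commands in play here -- REINDEX rebuilds
the index from the table's current contents, and ANALYZE refreshes the
statistics the planner consults when deciding whether to use it:

  REINDEX INDEX people_attributes_search;
  ANALYZE people_attributes;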
> > Here's the table structure:
> >
> >      Column     |   Type    |       Modifiers
> > ----------------+-----------+-----------------------
> >  person_id      | integer   | not null
> >  askmecount     | integer   | not null default 0
> >  age            | integer   | not null
> >  gender         | integer   | not null
> >  bodytype       | integer   | not null
> >  children       | integer   | not null
> >  drinking       | integer   | not null
> >  education      | integer   | not null
> >  ethnicity      | integer   | not null
> >  eyecolor       | integer   | not null
> >  haircolor      | integer   | not null
> >  hairstyle      | integer   | not null
> >  height         | integer   | not null
> >  income         | integer   | not null
> >  languages      | integer[] | not null
> >  occupation     | integer   | not null
> >  orientation    | integer   | not null
> >  relation       | integer   | not null
> >  religion       | integer   | not null
> >  smoking        | integer   | not null
> >  want_children  | integer   | not null
> >  weight         | integer   | not null
> >  seeking        | integer[] | not null
> > Indexes:
> >     "people_attributes_pkey" PRIMARY KEY, btree (person_id)
> >     "people_attributes_search" gist ((ARRAY[age, gender, orientation,
> >         children, drinking, education, ethnicity, eyecolor, haircolor,
> >         hairstyle, height, income, occupation, relation, religion,
> >         smoking, want_children, weight] + seeking + languages))
> > Foreign-key constraints:
> >     "people_attributes_weight_fkey" FOREIGN KEY (weight) REFERENCES
> >         attribute_values(value_id) ON DELETE RESTRICT
> >     "people_attributes_person_id_fkey" FOREIGN KEY (person_id)
> >         REFERENCES people(person_id) ON DELETE CASCADE DEFERRABLE
> >         INITIALLY DEFERRED
> >     "people_attributes_age_fkey" FOREIGN KEY (age) REFERENCES
> >         attribute_values(value_id) ON DELETE RESTRICT
> >     "people_attributes_gender_fkey" FOREIGN KEY (gender) REFERENCES
> >         attribute_values(value_id) ON DELETE RESTRICT
> >     "people_attributes_bodytype_fkey" FOREIGN KEY (bodytype) REFERENCES
> >         attribute_values(value_id) ON DELETE RESTRICT
> >     "people_attributes_children_fkey" FOREIGN KEY (children) REFERENCES
> >         attribute_values(value_id) ON DELETE RESTRICT
> >     "people_attributes_drinking_fkey" FOREIGN KEY (drinking) REFERENCES
> >         attribute_values(value_id) ON DELETE RESTRICT
> >     "people_attributes_education_fkey" FOREIGN KEY (education)
> >         REFERENCES attribute_values(value_id) ON DELETE RESTRICT
> >     "people_attributes_ethnicity_fkey" FOREIGN KEY (ethnicity)
> >         REFERENCES attribute_values(value_id) ON DELETE RESTRICT
> >     "people_attributes_eyecolor_fkey" FOREIGN KEY (eyecolor) REFERENCES
> >         attribute_values(value_id) ON DELETE RESTRICT
> >     "people_attributes_haircolor_fkey" FOREIGN KEY (haircolor)
> >         REFERENCES attribute_values(value_id) ON DELETE RESTRICT
> >     "people_attributes_hairstyle_fkey" FOREIGN KEY (hairstyle)
> >         REFERENCES attribute_values(value_id) ON DELETE RESTRICT
> >     "people_attributes_height_fkey" FOREIGN KEY (height) REFERENCES
> >         attribute_values(value_id) ON DELETE RESTRICT
> >     "people_attributes_income_fkey" FOREIGN KEY (income) REFERENCES
> >         attribute_values(value_id) ON DELETE RESTRICT
> >     "people_attributes_occupation_fkey" FOREIGN KEY (occupation)
> >         REFERENCES attribute_values(value_id) ON DELETE RESTRICT
> >     "people_attributes_orientation_fkey" FOREIGN KEY (orientation)
> >         REFERENCES attribute_values(value_id) ON DELETE RESTRICT
> >     "people_attributes_relation_fkey" FOREIGN KEY (relation) REFERENCES
> >         attribute_values(value_id) ON DELETE RESTRICT
> >     "people_attributes_religion_fkey" FOREIGN KEY (religion) REFERENCES
> >         attribute_values(value_id) ON DELETE RESTRICT
> >     "people_attributes_smoking_fkey" FOREIGN KEY (smoking) REFERENCES
> >         attribute_values(value_id) ON DELETE RESTRICT
> >     "people_attributes_want_children_fkey" FOREIGN KEY (want_children)
> >         REFERENCES attribute_values(value_id) ON DELETE RESTRICT
> >
> > Is it all the foreign keys that are stalling the drop? I have done
> > VACUUM ANALYZE on the entire db. Could anyone offer some insight as to
> > why this index is not being used, or why the index is not dropping
> > easily?
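Stripped of the long attribute lists, the technique in the dumps above is
contrib/intarray's query_int matching over a concatenated int[] expression.
A condensed sketch of the same pattern, using the column names from the
thread and made-up attribute value ids (4, 5, 88):

-- contrib/intarray provides the int[] '+' (concatenation) and '@@'
-- (boolean query match) operators, plus the query_int type.
CREATE INDEX people_attributes_search ON people_attributes
    USING gist ((ARRAY[age, gender, orientation, children, drinking,
                       education, ethnicity, eyecolor, haircolor,
                       hairstyle, height, income, occupation, relation,
                       religion, smoking, want_children, weight]
                 + seeking + languages));

-- A profile matches when the concatenated array satisfies the boolean
-- query, e.g. "(value 4 or value 5) and value 88":
SELECT person_id
  FROM people_attributes
 WHERE (ARRAY[age, gender, orientation, children, drinking, education,
              ethnicity, eyecolor, haircolor, hairstyle, height, income,
              occupation, relation, religion, smoking, want_children,
              weight]
        + seeking + languages) @@ '(4|5) & 88'::query_int;

Note that the WHERE expression has to match the indexed expression exactly
for the planner even to consider the GiST index, which is one thing worth
ruling out whenever a plan unexpectedly falls back to a seq scan.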
Patrick Clery <patrick@phpforhire.com> writes:
> PART OF THE QUERY PLAN:
> Index Scan using people_attributes_pkey on people_attributes pa
>     (cost=0.00..5.32 rows=1 width=20)
>   Index Cond: (pa.person_id = "outer".person_id)
>   Filter: (((ARRAY[age, gender, orientation, children, drinking,

You'll probably have to show the rest of the plan for anyone to have much
idea of what's going on. It seems to be part of a join of some sort, and
the planner is choosing to drive the join from the wrong table. That may
make it awkward to force the right plan using enable_seqscan or anything
like that. And since GiST indexes don't have very good selectivity
estimates, I'm not sure you can hope for the optimizer to guess right on
its own.

> Is it all the foreign keys that are stalling the drop? I have done VACUUM
> ANALYZE on the entire db. Could anyone offer some insight as to why this
> index is not being used or why the index is not dropping easily?

I don't think foreign keys cause problems with dropping indexes. Foreign
key constraints are only checked when there's an insert, update, or
delete. Perhaps you're underestimating the size of this index and the
amount of time it will take to delete it? Or are there queries actively
using the index while you're trying to delete it? Or a vacuum running?

--
greg
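Greg's last two questions can be checked directly from the system views.
A sketch, assuming the table name from this thread and the
pg_locks/pg_stat_activity column names of the 7.4/8.0 era (procpid,
current_query):

-- Who holds, or is waiting on, locks on the table? DROP INDEX needs an
-- exclusive lock on both the index and people_attributes itself, so any
-- long-running reader or a vacuum will make the drop appear to stall.
SELECT l.pid, l.mode, l.granted
  FROM pg_locks l
 WHERE l.relation = 'people_attributes'::regclass;

-- What are those backends currently running?
-- (stats_command_string must be on for current_query to be visible.)
SELECT procpid, current_query FROM pg_stat_activity;

-- And to see whether the planner can use the GiST index at all when
-- sequential scans are discouraged:
SET enable_seqscan = off;
-- ... re-run the EXPLAIN ANALYZE of the search query here ...
RESET enable_seqscan;

If the first query shows ungranted locks for the backend running DROP
INDEX, the drop is queued behind whichever pid holds the granted lock, not
stalled on the foreign keys.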