Re: Comparing user attributes with bitwise operators - Mailing list pgsql-performance
From | Christopher Kings-Lynne |
---|---|
Subject | Re: Comparing user attributes with bitwise operators |
Date | |
Msg-id | 41494A94.5080800@familyhealth.com.au Whole thread Raw |
In response to | Comparing user attributes with bitwise operators (Patrick Clery <patrick@phpforhire.com>) |
Responses |
Re: Comparing user attributes with bitwise operators
|
List | pgsql-performance |
Sounds like you want a many-to-many table that maps user_ids to match_ids Then you can put an index over (user_id, match_id) and the search will be very fast. Chris Patrick Clery wrote: > I'm working on a dating/personals/match-making site, that has used many > different methods of "match-making", that all seem to be very slow. One I am > attempting now that seems to be an efficient method of storage, but not the > best for indexing, is using bitwise operators to compare one person's profile > to another's. > > This method seems to be very fast on a small scale, but I am dealing with a > large user-base here, in excess of 200,000 users that will be executing this > search function every time they login (the search results of their profile > will appear on the main page after they have logged in). I've opted to use > "label tables" for each possible set of answers. (i.e: Marital Status) > > For this table, the set of bits -- bit(5) -- are represented as such: > > +-----+------------+ > | Bit | Meaning | > +-----+------------+ > | 1 | single | > | 2 | separated | > | 3 | married | > | 4 | divorced | > | 5 | widowed | > +-----+------------+ > > Here's the structure of the marital status table: > > # \d marital_matrix > Table "public.marital_matrix" > Column | Type | Modifiers > -----------+----------------+----------------------------------------------------------------------- > member_id | integer | not null default > nextval('public.marital_matrix_member_id_seq'::text) > status | bit varying(5) | not null default (0)::bit(5) > p_status | bit varying(5) | not null default (0)::bit(5) > Indexes: > "marital_matrix_pkey" PRIMARY KEY, btree (member_id) > "idx_marital_matrix" btree ((status::"bit" & p_status::"bit")) > "idx_marital_matrix_single" btree ((status::"bit" & p_status::"bit")) > "idx_marital_p_status" btree (p_status) > "idx_marital_status" btree (status) > Foreign-key constraints: > "$1" FOREIGN KEY (member_id) REFERENCES members(member_id) ON DELETE > CASCADE DEFERRABLE INITIALLY DEFERRED > > To give you an idea of the selectivity (NOTE: there are only 50,000 rows, a > smaller sample than what I will actually be using): > > datingsite=> select count(*),status,p_status from marital_matrix group by > status,p_status; > count | status | p_status > -------+--------+---------- > 89 | 00001 | 00000 > 1319 | 00010 | 00000 > 2465 | 00100 | 00000 > 1 | 00100 | 11111 > 46117 | 10000 | 00000 > > here is the user I'll be comparing against, which has selected that he be > matched with any but married people: > > datingsite=> SELECT * FROM marital_matrix WHERE member_id = 21; > member_id | status | p_status > -----------+--------+---------- > 21 | 10000 | 11011 > (1 row) > > > > > Here are a few possible comparison methods I can think of (NOTE: tests were > run on a 2.4Ghz Intel CPU w/ 256M RAM on FreeBSD 4.10: > > > METHOD 1: Any bit that is set in p_status (prefered marital status) of the > searching user should be set in the potential match's marital status. This is > the method I'd like to improve, if possible. Running the query twice didn't > produce a different runtime. > > EXPLAIN ANALYZE > SELECT > m2.member_id > FROM > marital_matrix m1, marital_matrix m2 > WHERE > m1.member_id = 21 AND > m2.status & m1.p_status != B'00000'; > QUERY PLAN > --------------------------------------------------------------------------------------------------------------------------------------------- > Nested Loop (cost=0.00..2357.79 rows=49742 width=4) (actual > time=18.062..708.469 rows=47525 loops=1) > Join Filter: ((("inner".status)::"bit" & ("outer".p_status)::"bit") <> > B'00000'::"bit") > -> Index Scan using marital_matrix_pkey on marital_matrix m1 > (cost=0.00..5.01 rows=1 width=9) (actual time=0.035..0.045 rows=1 loops=1) > Index Cond: (member_id = 21) > -> Seq Scan on marital_matrix m2 (cost=0.00..1602.91 rows=49991 width=13) > (actual time=17.966..255.529 rows=49991 loops=1) > Total runtime: 905.694 ms > (6 rows) > > > METHOD 2: Specifying the value (I don't think this would make a difference, > but I'll post anyways): > > EXPLAIN ANALYZE > SELECT > member_id > FROM > marital_matrix > WHERE > status & B'11011' != B'00000'; > > QUERY PLAN > ---------------------------------------------------------------------------------------------------------------------- > Seq Scan on marital_matrix (cost=0.00..1852.87 rows=49742 width=4) (actual > time=18.113..281.101 rows=47525 loops=1) > Filter: (((status)::"bit" & B'11011'::"bit") <> B'00000'::"bit") > Total runtime: 480.836 ms > (3 rows) > > > METHOD 3: Checking for one bit only. This is definitely not a "real world" > example and unacceptable since the p_status column can and will have multiple > bits. For categories other than "Marital Status", such as "Prefered Hair > Color", the users are likely to select multiple bits (they choose all that > apply). This query does use the index, but is still not very fast at all: > > EXPLAIN ANALYZE > SELECT > member_id > FROM > marital_matrix m1 > WHERE > status & B'10000' = B'10000'; > QUERY > PLAN > ------------------------------------------------------------------------------------------------------------------------------------------------------- > Index Scan using idx_marital_matrix_single on marital_matrix m1 > (cost=0.00..903.59 rows=250 width=4) (actual time=0.042..258.907 rows=46117 > loops=1) > Index Cond: (((status)::"bit" & B'10000'::"bit") = B'10000'::"bit") > Total runtime: 451.162 ms > (3 rows) > > METHOD 4: Using an IN statement. This method seems to be very fussy about > using the index, and I have at some point made it use the index when there > are less than 3 possibilites. Also, for fields other than Marital Status, > users will be able to select many bits for their own profile, which means > there would be many permutations: > > EXPLAIN ANALYZE > SELECT > member_id > FROM > marital_matrix > WHERE > status & B'11011' IN (B'10000',B'01000'); > QUERY PLAN > ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ > Seq Scan on marital_matrix (cost=0.00..2602.73 rows=993 width=4) (actual > time=17.845..288.279 rows=47525 loops=1) > Filter: ((((status)::"bit" & B'11011'::"bit") = B'10000'::"bit") OR > (((status)::"bit" & B'11011'::"bit") = B'01000'::"bit") OR (((status)::"bit" > & B'11011'::"bit") = B'00010'::"bit") OR (((status)::"bit" & B'11011'::"bit") > = B'00001'::"bit")) > Total runtime: 488.651 ms > (3 rows) > > > Method 3 is the only one that used the index, but the only really acceptable > method here is Method 1. > > My questions are... > - Is there any hope in getting this to use an efficient index? > - Any mathmaticians know if there is a way to reorder my bitwise comparison to > have the operator use = and not an != (perhaps to force an index)? (AFAIK, > the answer to the second question is no) > > If anyone could offer any performance tips here I'd really appreciate it. I > imagine that having this schema wouldn't last an hour with the amount of CPU > cycles it would be consuming on math operations. > > Also, I have read the thread that was posted here by Daniel in August: > http://archives.postgresql.org/pgsql-performance/2004-08/msg00328.php > > I have spoke with Daniel on this issue and we both agree it's very difficult > to find a solution that can scale to very large sites. > > I would very much appreciate any advice that some experienced users may have > to offer me for such a situation. TIA > > Patrick > > ---------------------------(end of broadcast)--------------------------- > TIP 4: Don't 'kill -9' the postmaster
pgsql-performance by date: