Re: Comparing user attributes with bitwise operators - Mailing list pgsql-performance

From Christopher Kings-Lynne
Subject Re: Comparing user attributes with bitwise operators
Date
Msg-id 41494A94.5080800@familyhealth.com.au
Whole thread Raw
In response to Comparing user attributes with bitwise operators  (Patrick Clery <patrick@phpforhire.com>)
Responses Re: Comparing user attributes with bitwise operators
List pgsql-performance
Sounds like you want a many-to-many table that maps user_ids to match_ids

Then you can put an index over (user_id, match_id) and the search will
be very fast.

Chris

Patrick Clery wrote:
> I'm working on a dating/personals/match-making site, that has used many
> different methods of "match-making", that all seem to be very slow. One I am
> attempting now that seems to be an efficient method of storage, but not the
> best for indexing, is using bitwise operators to compare one person's profile
> to another's.
>
> This method seems to be very fast on a small scale, but I am dealing with a
> large user-base here, in excess of 200,000 users that will be executing this
> search function every time they login (the search results of their profile
> will appear on the main page after they have logged in). I've opted to use
> "label tables" for each possible set of answers. (i.e: Marital Status)
>
> For this table, the set of bits -- bit(5) -- are represented as such:
>
> +-----+------------+
> | Bit | Meaning    |
> +-----+------------+
> | 1   | single     |
> | 2   | separated  |
> | 3   | married    |
> | 4   | divorced   |
> | 5   | widowed    |
> +-----+------------+
>
> Here's the structure of the marital status table:
>
> # \d marital_matrix
              
> Table "public.marital_matrix"
>   Column   |      Type      |                               Modifiers
> -----------+----------------+-----------------------------------------------------------------------
>  member_id | integer        | not null default
> nextval('public.marital_matrix_member_id_seq'::text)
>  status    | bit varying(5) | not null default (0)::bit(5)
>  p_status  | bit varying(5) | not null default (0)::bit(5)
> Indexes:
>     "marital_matrix_pkey" PRIMARY KEY, btree (member_id)
>     "idx_marital_matrix" btree ((status::"bit" & p_status::"bit"))
>     "idx_marital_matrix_single" btree ((status::"bit" & p_status::"bit"))
>     "idx_marital_p_status" btree (p_status)
>     "idx_marital_status" btree (status)
> Foreign-key constraints:
>     "$1" FOREIGN KEY (member_id) REFERENCES members(member_id) ON DELETE
> CASCADE DEFERRABLE INITIALLY DEFERRED
>
> To give you an idea of the selectivity (NOTE: there are only 50,000 rows, a
> smaller sample than what I will actually be using):
>
> datingsite=> select count(*),status,p_status from marital_matrix group by
> status,p_status;
>  count | status | p_status
> -------+--------+----------
>     89 | 00001  | 00000
>   1319 | 00010  | 00000
>   2465 | 00100  | 00000
>      1 | 00100  | 11111
>  46117 | 10000  | 00000
>
> here is the user I'll be comparing against, which has selected that he be
> matched with any but married people:
>
> datingsite=> SELECT * FROM marital_matrix WHERE member_id = 21;
>  member_id | status | p_status
> -----------+--------+----------
>         21 | 10000  | 11011
> (1 row)
>
>
>
>
> Here are a few possible comparison methods I can think of (NOTE: tests were
> run on a 2.4Ghz Intel CPU w/ 256M RAM on FreeBSD 4.10:
>
>
> METHOD 1: Any bit that is set in p_status (prefered marital status) of the
> searching user should be set in the potential match's marital status. This is
> the method I'd like to improve, if possible. Running the query twice didn't
> produce a different runtime.
>
> EXPLAIN ANALYZE
> SELECT
>     m2.member_id
> FROM
>     marital_matrix m1, marital_matrix m2
> WHERE
>     m1.member_id = 21 AND
>     m2.status & m1.p_status != B'00000';
>                                                                 QUERY PLAN
                      
>
---------------------------------------------------------------------------------------------------------------------------------------------
>  Nested Loop  (cost=0.00..2357.79 rows=49742 width=4) (actual
> time=18.062..708.469 rows=47525 loops=1)
>    Join Filter: ((("inner".status)::"bit" & ("outer".p_status)::"bit") <>
> B'00000'::"bit")
>    ->  Index Scan using marital_matrix_pkey on marital_matrix m1
> (cost=0.00..5.01 rows=1 width=9) (actual time=0.035..0.045 rows=1 loops=1)
>          Index Cond: (member_id = 21)
>    ->  Seq Scan on marital_matrix m2  (cost=0.00..1602.91 rows=49991 width=13)
> (actual time=17.966..255.529 rows=49991 loops=1)
>  Total runtime: 905.694 ms
> (6 rows)
>
>
> METHOD 2: Specifying the value (I don't think this would make a difference,
> but I'll post anyways):
>
> EXPLAIN ANALYZE
> SELECT
>     member_id
> FROM
>     marital_matrix
> WHERE
>     status & B'11011' != B'00000';
>
>                                                       QUERY PLAN

>
----------------------------------------------------------------------------------------------------------------------
>  Seq Scan on marital_matrix  (cost=0.00..1852.87 rows=49742 width=4) (actual
> time=18.113..281.101 rows=47525 loops=1)
>    Filter: (((status)::"bit" & B'11011'::"bit") <> B'00000'::"bit")
>  Total runtime: 480.836 ms
> (3 rows)
>
>
> METHOD 3: Checking for one bit only. This is definitely not a "real world"
> example and unacceptable since the p_status column can and will have multiple
> bits. For categories other than "Marital Status", such as "Prefered Hair
> Color", the users are likely to select multiple bits (they choose all that
> apply). This query does use the index, but is still not very fast at all:
>
> EXPLAIN ANALYZE
> SELECT
>     member_id
> FROM
>     marital_matrix m1
> WHERE
>     status & B'10000' = B'10000';
>                                                                       QUERY
> PLAN
>
-------------------------------------------------------------------------------------------------------------------------------------------------------
>  Index Scan using idx_marital_matrix_single on marital_matrix m1
> (cost=0.00..903.59 rows=250 width=4) (actual time=0.042..258.907 rows=46117
> loops=1)
>    Index Cond: (((status)::"bit" & B'10000'::"bit") = B'10000'::"bit")
>  Total runtime: 451.162 ms
> (3 rows)
>
> METHOD 4: Using an IN statement. This method seems to be very fussy about
> using the index, and I have at some point made it use the index when there
> are less than 3 possibilites. Also, for fields other than Marital Status,
> users will be able to select many bits for their own profile, which means
> there would be many permutations:
>
> EXPLAIN ANALYZE
> SELECT
>     member_id
> FROM
>     marital_matrix
> WHERE
>     status & B'11011' IN (B'10000',B'01000');
>           QUERY PLAN
                    
>
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>  Seq Scan on marital_matrix  (cost=0.00..2602.73 rows=993 width=4) (actual
> time=17.845..288.279 rows=47525 loops=1)
>    Filter: ((((status)::"bit" & B'11011'::"bit") = B'10000'::"bit") OR
> (((status)::"bit" & B'11011'::"bit") = B'01000'::"bit") OR (((status)::"bit"
> & B'11011'::"bit") = B'00010'::"bit") OR (((status)::"bit" & B'11011'::"bit")
> = B'00001'::"bit"))
>  Total runtime: 488.651 ms
> (3 rows)
>
>
> Method 3 is the only one that used the index, but the only really acceptable
> method here is Method 1.
>
> My questions are...
> - Is there any hope in getting this to use an efficient index?
> - Any mathmaticians know if there is a way to reorder my bitwise comparison to
> have the operator use = and not an != (perhaps to force an index)? (AFAIK,
> the answer to the second question is no)
>
> If anyone could offer any performance tips here I'd really appreciate it. I
> imagine that having this schema wouldn't last an hour with the amount of CPU
> cycles it would be consuming on math operations.
>
> Also, I have read the thread that was posted here by Daniel in August:
> http://archives.postgresql.org/pgsql-performance/2004-08/msg00328.php
>
> I have spoke with Daniel on this issue and we both agree it's very difficult
> to find a solution that can scale to very large sites.
>
> I would very much appreciate any advice that some experienced users may have
> to offer me for such a situation. TIA
>
> Patrick
>
> ---------------------------(end of broadcast)---------------------------
> TIP 4: Don't 'kill -9' the postmaster

pgsql-performance by date:

Previous
From: Patrick Clery
Date:
Subject: Comparing user attributes with bitwise operators
Next
From: Greg Stark
Date:
Subject: Re: Comparing user attributes with bitwise operators