> We use bitfields on our large user table. It is becoming unworkable to
> scan for matches, since overall most people have very few selections
> made.
>
> We are moving it to a model like your favorite_colors table which just
> links the option and the user.
>
> We find that doing joins on large tables which can be indexed to avoid
> full table scans are very fast in postgres, since the index can do much
> of your culling of potential matching rows. With bitfields, you are
> more or less forced into doing a sequence scan to find everyone who
> likes the color red.
>
> Of course, if you're playing with only a few thousand users, either
> approach works well.
Things you could try :
* Use an integer array instead of a bitfield
(for instance, in users table, column favourites would contain { 1,2 } if
the user selected items 1 and 2 )
Then, you can make a Gist index on it and use the indexed intersection
operator
This is likely the optimal solution if the maximum number of items is
small (say, 100 is good, 100000 is not)
* keep your bitfields and create conditional indexes :
CREATE INDEX ... WHERE bitfield_column & 1;
CREATE INDEX ... WHERE bitfield_column & 2;
CREATE INDEX ... WHERE bitfield_column & 4;
CREATE INDEX ... WHERE bitfield_column & 8;
CREATE INDEX ... WHERE bitfield_column & 16;
etc...
Obviously this will only work if you have, say, 10 favouritess. 100
indexes on a table would really suck.
Then, when looking for users who chose bits 1 and 2, do :
SELECT WHERE (bitfield_column & 1) AND (bitfield_column & 2)
postgres will do a bitmap-and using the two indexes
(note : when we get bitmap indexes, this will be even better)
* use tsearch2 :
favourites = 'red blue'
and fulltext-search it
* use a favourites table :
This makes queries hard to optimize. Consider the table (user_id,
item_id) meaning user selected this item as favourite.
If you want to know which users did select both items 1 and 2, you have
to do a self-join, something like :
SELECT... FROM favourites a, favourites b WHERE a.user_id = b.user_id AND
a.item_id=1 AND b.item_id = 2
This is likely to be "not very fast" if 1 million users check each option
but only 100 check both.