Re: Could use some advice on search architecture - Mailing list pgsql-general

From Robin
Subject Re: Could use some advice on search architecture
Date
Msg-id BLU0-SMTP534CA46A4EDAC0E2BBE22EE25C0@phx.gbl
Whole thread Raw
In response to Re: Could use some advice on search architecture  (Andy Colson <andy@squeakycode.net>)
Responses Re: Could use some advice on search architecture  (Marc Mamin <M.Mamin@intershop.de>)
List pgsql-general

On 18/04/2014 21:24, Andy Colson wrote:
> On 4/18/2014 8:59 AM, Ron Pasch wrote:
>> Hello,
>>
>> I'm contemplating what architecture I should use to make searching as
>> fast as possible given the information available and the search
>> requirements. Let me give some background first;
>>
>> - The database contains products of can potentially have a lot of them
>> (up to about 3 to 5 million)
>> - Each product has about 30 different properties defined about them.
>> Things like what color they are etc. All these properties are enumerated
>> choices, so for instance for color there is a list of available static
>> never changing options of which one can be chosen for that product. This
>> is the same for all those 30 properties. Currently they are stored as
>> enumerated types (CREATE TYPE propertyvalue AS ENUM ('option1',
>> 'option2', etc..)
>> - It should be possible to search for products and provide properties
>> that the product SHOULD have, not must have. For instance, for color,
>> the search could specify that it should return products that are either
>> red, blue or green.
>> - The products that match with the most properties should be in the top
>> of the search results
>> - If different products match with the same amount of properties, the
>> ordering should then be on the product that is most popular. There is
>> information in the database (and if need be also in the same table)
>> about how many times a product is sold.
>> - The results will be paginated per 15 products
>>
>> The requirement is that these searches should be as fast as possible,
>> with a maximum of about 200 ms time taken for a search query.
>>
>> What would be the best approach to this if I were to do this in the
>> database only? Should/can this be done with postgresql only or should I
>> look into other types of technology? (Lucene? Sphinx? others?)
>>
>> Any advice on this would be greatly appreciated.
>>
>> Thx in advance!
>>
>> Ron
>>
>>
>>
>
>
> As we are PG users, on a PG list, we are gonna recommend PG,
> obviously. :-)
>
> Actually though, I recommend PG.
>
>
> > - The products that match with the most properties should be in the top
> > of the search results
>
>
> That kinda query is going to be really difficult, I think, regardless
> of what you use.  To find that you'll have to measure every product
> (all 5 million) and then sort the results.
>
>
> > enumerated types (CREATE TYPE propertyvalue AS ENUM ('option1',
> > 'option2', etc..)
>
> So, you have something like:
>
> CREATE TYPE colortype AS ENUM ('red', 'green', 'blue');
> CREATE TYPE sizetype AS ENUM ('small', 'medium', 'large');
>
> create table product
> (
>   id serial,
>   name text,
>   color colortype,
>   size  sizetype,
>    ...
> );
>
>
> I assume the problem is you don't want to index all 30 properties?
> That makes sense.
>
> > - It should be possible to search for products and provide properties
> > that the product SHOULD have, not must have.
>
> I don't understand this.  Say you have a sprocket in red and green.
> Do you want to search for:
>
> select * from product where name = 'sprocket' and (color = 'red' or
> color = 'green')
>
> Or do you want something else?  Does the user say they'd "prefer"
> blue, but will take whatever else you have?
>
> Do you search for some properties exactly and some "preferred"?
>
> Perhaps you could describe a little more how you want to query the
> database?  Or, maybe, what your user's are searching for?
>
> -Andy
>
>
Well, given that there are known limited attributes, this is the type of
application that really really suits a column oriented database, such as
Sybase IQ (now sold by SAP). Its a neat product that scales. Great
performance with drag'n'drop analytics.

Unless you can charm IQ out of SAP (it has been known to happen), you
might have to look at some other techniques

So consider some binary data representation
Red - 1 (0000 0001)
Orange - 2 (0000 0010)
Yellow - 4 (0000 0100)
Green - 8 (0000 1000)
Blue - 16 (0001 0000)
Indigo - 32 (0010 0000)
Violet - 64 (0100 0000)

This way, you can encode several colours in 1 value
Red or Green or Indigo = 1 + 8  + 32 = 41 = 0010 1001


Robin


pgsql-general by date:

Previous
From: Bruce Momjian
Date:
Subject: Re: pg_upgrade & tablespaces
Next
From: "Gunnar \"Nick\" Bluth"
Date:
Subject: Patched odbc_dfw for PG >= 9.2