Re: OT: Data structure design question: How do they count so fast? - Mailing list pgsql-performance
From | Brendan Duddridge |
---|---|
Subject | Re: OT: Data structure design question: How do they count so fast? |
Date | |
Msg-id | C5D7737E-9D01-4450-8C0B-C57B18C9D8FB@clickspace.com Whole thread Raw |
In response to | Re: OT: Data structure design question: How do they count (Richard Huxton <dev@archonet.com>) |
List | pgsql-performance |
Hi Richard (and anyone else who want's to comment!), I'm not sure it will really work pre-computed. At least not in an obvious way (for me! :-)) It's fine to display a pre-computed list of product counts for the initial set of attribute and attribute values, but we need to be able to display the counts of products for any combination of selected attribute values. Once an attribute value is picked that reduces the set of products to a small enough set, the queries are fast, so, perhaps we just need to pre-compute the counts for combinations of attribute values that lead to a high count of products. Hence, our thought on building a decision-tree type data structure. This would store the various combinations of attributes and attribute values, along with the product count, along with a list of attributes and attribute values that are applicable for the current set of selected attribute values. This sounds like it could get rather complicated, so we were hoping someone might have an idea on a much simpler solution. Thanks, ____________________________________________________________________ Brendan Duddridge | CTO | 403-277-5591 x24 | brendan@clickspace.com ClickSpace Interactive Inc. Suite L100, 239 - 10th Ave. SE Calgary, AB T2G 0V9 http://www.clickspace.com On Apr 10, 2006, at 3:23 AM, Richard Huxton wrote: > Brendan Duddridge wrote: >> Now, initially I thought they would just pre-compute these counts, >> but the problem is, when you click on any of the above attribute >> values, they reduce the remaining possible set of matching >> products (and set of possible remaining attributes and attribute >> values) by the amount displayed next to the attribute value >> selected. You can click on any combination of attribute values to >> filter down the remaining set of matching products, so there's a >> large combination of paths you can take to arrive at a set of >> products you might be interested in. >> Do you think they are pre-computed? Or do you think they might use >> a query similar to the following?: > > Pre-computed almost certainly, but at what level of granularity? > And with application-level caching? > >> select pav.attribute_value_id, count(p.product_id) >> from product_attribute_value pav, >> attribute a, >> product p >> where a.attribute_id in (some set of attribute ids) and >> pav.product_id = p.product_id and >> pav.attribute_id = a.attribute_id and p.product_id in >> (select product_id >> from category_product >> where category_id = some category id) and >> p.is_active = 'true' >> group by pav.attribute_value_id; >> It would seem to me that although the above query suggests a >> normalized database structure, that joining with 3 tables plus a >> 4th table in the sub-query with an IN qualifier and grouping to >> get the product counts would take a VERY long time, especially on >> a possible result set of 1,260,658 products. > > Hmm - I'm not sure I'd say this was necessarily normalised. In the > example you gave there were three definite types of attribute: > 1. Price range (< 20, 20-50, ...) > 2. Product type (lighting, rugs, ...) > 3. Store (art.com, homeannex, ...) > Your example discards this type information. > > I'm also not sure it lets store A sell widgets for 19.99 and B for > 25.99 > > So - let's look at how we might break this down into simple relations: > product_types (product_id, prod_type, prod_subtype) > product_availability (product_id, store_id, price_range) > and so on for each set of parameters. > > Then, if PG isn't calculating fast enough I'd be tempted to throw > in a summary table: > product_counts(store_id, price_range, prod_type, > prod_subtype, ..., num_products) > Then total over this for the top-level queries. > > I'd also cache common top-level queries at the applicaton level > anyway. > > -- > Richard Huxton > Archonet Ltd >
pgsql-performance by date: