Re: OT: Data structure design question: How do they count - Mailing list pgsql-performance

From Richard Huxton
Subject Re: OT: Data structure design question: How do they count
Date
Msg-id 443A2424.4020702@archonet.com
Whole thread Raw
In response to OT: Data structure design question: How do they count so fast?  (Brendan Duddridge <brendan@clickspace.com>)
Responses Re: OT: Data structure design question: How do they count so fast?
List pgsql-performance
Brendan Duddridge wrote:
>
> Now, initially I thought they would just pre-compute these counts, but
> the problem is, when you click on any of the above attribute values,
> they reduce the remaining possible set of matching products (and set of
> possible remaining attributes and attribute values) by the amount
> displayed next to the attribute value selected. You can click on any
> combination of attribute values to filter down the remaining set of
> matching products, so there's a large combination of paths you can take
> to arrive at a set of products you might be interested in.
>
> Do you think they are pre-computed? Or do you think they might use a
> query similar to the following?:

Pre-computed almost certainly, but at what level of granularity? And
with application-level caching?

> select pav.attribute_value_id, count(p.product_id)
> from product_attribute_value pav,
>      attribute a,
>      product p
> where a.attribute_id in (some set of attribute ids) and
> pav.product_id = p.product_id and
> pav.attribute_id = a.attribute_id and p.product_id in
>     (select product_id
>      from category_product
>      where category_id = some category id) and
> p.is_active = 'true'
> group by pav.attribute_value_id;
>
> It would seem to me that although the above query suggests a normalized
> database structure, that joining with 3 tables plus a 4th table in the
> sub-query with an IN qualifier and grouping to get the product counts
> would take a VERY long time, especially on a possible result set of
> 1,260,658 products.

Hmm - I'm not sure I'd say this was necessarily normalised. In the
example you gave there were three definite types of attribute:
  1. Price range (< 20, 20-50, ...)
  2. Product type (lighting, rugs, ...)
  3. Store (art.com, homeannex, ...)
Your example discards this type information.

I'm also not sure it lets store A sell widgets for 19.99 and B for 25.99

So - let's look at how we might break this down into simple relations:
  product_types (product_id, prod_type, prod_subtype)
  product_availability (product_id, store_id, price_range)
and so on for each set of parameters.

Then, if PG isn't calculating fast enough I'd be tempted to throw in a
summary table:
  product_counts(store_id, price_range, prod_type, prod_subtype, ...,
num_products)
Then total over this for the top-level queries.

I'd also cache common top-level queries at the applicaton level anyway.

--
   Richard Huxton
   Archonet Ltd

pgsql-performance by date:

Previous
From: Richard Huxton
Date:
Subject: Re: pg 8.1.3, AIX, huge box, painfully slow.
Next
From: Simon Riggs
Date:
Subject: Re: pg 8.1.3, AIX, huge box, painfully slow.