Re: people who buy A, also buy C, D, E - Mailing list pgsql-sql

From Dan Langille
Subject Re: people who buy A, also buy C, D, E
Date
Msg-id 426DFD48.16933.D44C1F2@localhost
Whole thread Raw
In response to Re: people who buy A, also buy C, D, E  (Christoph Haller <ch@rodos.fzk.de>)
List pgsql-sql
On 26 Apr 2005 at 14:24, Christoph Haller wrote:

> Dan Langille wrote:
> > 
> > The goal of my query is: given a book, what did other people who
> > bought this book also buy?  I plan the list the 5 most popular such
> > books.  In reality, this isn't about books, but that makes it easier
> > to understand I think.
> > 
> > We have a table of customer_id (watch_list_id) and book_id
> > (element_id).
> > 
> > freshports.org=# \d watch_list_element
> >   Table "public.watch_list_element"
> >     Column     |  Type   | Modifiers
> > ---------------+---------+-----------
> >  watch_list_id | integer | not null
> >  element_id    | integer | not null
> > Indexes:
> >     "watch_list_element_pkey" primary key, btree (watch_list_id,
> > element_id)
> >     "watch_list_element_element_id" btree (element_id)
> > Foreign-key constraints:
> >     "$2" FOREIGN KEY (watch_list_id) REFERENCES watch_list(id) ON
> > UPDATE CASCADE ON DELETE CASCADE
> >     "$1" FOREIGN KEY (element_id) REFERENCES element(id) ON UPDATE
> > CASCADE ON DELETE CASCADE
> > 
> > freshports.org=#
> > 
> > I have a query which returns the needed results:
> > 
> >      SELECT W.element_id
> >        FROM watch_list_element W
> >       WHERE w.watch_list_id in (select watch_list_id from
> > watch_list_element where element_id = 54968)
> >    GROUP BY W.element_id
> >    ORDER BY count(W.watch_list_id) DESC
> >       LIMIT 5;
> > 
> > But performance is an issue here.  So I'm planning to calculate all
> > the possible values and cache them. That is, given each element_id
> > in a watch_list, what are the top 5 element_id values on all the
> > lists on which the original element_id appears?
> > 
> > I'm having trouble constructing the query.  I'm not even sure I can
> > do this in one select, but that would be nice.  Examples and clues
> > are appreciated.
> > 
> > Any ideas?
> > 
> > Thank you.
> > --
> 
> Just two ideas. 
> 
> 1) Older Postgres versions are notorious for being slow 
> on "IN" clauses. 
> Does this one (untested) perform better: 
> 
> SELECT W.element_id, count(W.watch_list_id)
>   FROM watch_list_element W
> WHERE EXISTS
> (SELECT * FROM watch_list_element E
>  WHERE E.element_id = 54968 AND W.watch_list_id = E.watch_list_id)
> GROUP BY W.element_id ORDER BY 2 DESC LIMIT 5;

I'm on PostgreSQL 7.4.7:

freshports.org=# explain analyse
freshports.org-# SELECT W.element_id, count(W.watch_list_id)
freshports.org-#   FROM watch_list_element W
freshports.org-# WHERE EXISTS
freshports.org-# (SELECT * FROM watch_list_element E
freshports.org(#  WHERE E.element_id = 54968 AND W.watch_list_id = 
E.watch_list_id)
freshports.org-# GROUP BY W.element_id
freshports.org-# ORDER BY 2 DESC
freshports.org-# LIMIT 5;                                                                               QUERY PLAN
----------------------------------------------------------------------
----------------------------------------------------------------------
---------------------------------Limit  (cost=417905.49..417905.51 rows=5 width=8) (actual 
time=3142.480..3142.528 rows=5 loops=1)  ->  Sort  (cost=417905.49..417908.08 rows=1033 width=8) (actual 
time=3142.471..3142.486 rows=5 loops=1)        Sort Key: count(watch_list_id)        ->  HashAggregate
(cost=417851.20..417853.78rows=1033 
 
width=8) (actual time=3074.170..3112.294 rows=7338 loops=1)              ->  Seq Scan on watch_list_element w  
(cost=0.00..417506.76 rows=68888 width=8) (actual 
time=0.129..2619.989 rows=94018 loops=1)                    Filter: (subplan)                    SubPlan
     ->  Index Scan using watch_list_element_pkey 
 
on watch_list_element e  (cost=0.00..3.02 rows=1 width=8) (actual 
time=0.011..0.011 rows=1 loops=137776)                            Index Cond: (($0 = watch_list_id) AND 
(element_id = 54968))Total runtime: 3143.304 ms
(10 rows)

freshports.org=#

Compare that to the original query:

freshports.org=# explain analyse
freshports.org-#      SELECT W.element_id
freshports.org-#        FROM watch_list_element W
freshports.org-#       WHERE w.watch_list_id in (select watch_list_id 
from
freshports.org(# watch_list_element where element_id = 54968)
freshports.org-#    GROUP BY W.element_id
freshports.org-#    ORDER BY count(W.watch_list_id) DESC
freshports.org-#       LIMIT 5;
QUERYPLAN
 
----------------------------------------------------------------------
----------------------------------------------------------------------
--------------------------------------------Limit  (cost=1174.26..1174.28 rows=5 width=8) (actual 
time=1786.628..1786.675 rows=5 loops=1)  ->  Sort  (cost=1174.26..1176.16 rows=758 width=8) (actual 
time=1786.618..1786.634 rows=5 loops=1)        Sort Key: count(w.watch_list_id)        ->  HashAggregate
(cost=1136.11..1138.01rows=758 width=8) 
 
(actual time=1718.439..1756.290 rows=7338 loops=1)              ->  Nested Loop  (cost=465.98..1132.32 rows=758 
width=8) (actual time=44.372..1292.270 rows=94018 loops=1)                    ->  HashAggregate  (cost=465.98..465.98
rows=6
 
width=4) (actual time=44.332..47.136 rows=599 loops=1)                          ->  Index Scan using 
watch_list_element_element_id on watch_list_element  
(cost=0.00..464.15 rows=735 width=4) (actual time=37.957..41.789 
rows=599 loops=1)                                Index Cond: (element_id = 54968)                    ->  Index Scan
usingwatch_list_element_pkey on 
 
watch_list_element w  (cost=0.00..109.48 rows=126 width=8) (actual 
time=0.049..0.863 rows=157 loops=599)                          Index Cond: (w.watch_list_id = 
"outer".watch_list_id)Total runtime: 1787.466 ms
(11 rows)

freshports.org=#

> 2) I suspect calculating all possible values would require time and an
> enormous cache buffer in size as well as re-calculating pretty often.
> So my approach would be trying to tune the query before introducing
> cached results. 

Yes, it's pretty extensive: for each element (there are 9000), get me 

all the watch lists it appears upon, then get me the top 5 elements 
on
*those* watch lists.
-- 
Dan Langille : http://www.langille.org/
BSDCan - The Technical BSD Conference - http://www.bsdcan.org/  NEW brochure available at
http://www.bsdcan.org/2005/advocacy/



pgsql-sql by date:

Previous
From: Christoph Haller
Date:
Subject: Re: people who buy A, also buy C, D, E
Next
From: Achilleus Mantzios
Date:
Subject: Re: people who buy A, also buy C, D, E