Re: people who buy A, also buy C, D, E - Mailing list pgsql-sql

From Christoph Haller
Subject Re: people who buy A, also buy C, D, E
Date
Msg-id 426E41D0.D4103DEF@rodos.fzk.de
Whole thread Raw
In response to Re: people who buy A, also buy C, D, E  ("Dan Langille" <dan@langille.org>)
List pgsql-sql
Dan Langille wrote:
> 
> On 26 Apr 2005 at 14:24, Christoph Haller wrote:
> 
> > Dan Langille wrote:
> > >
> > > The goal of my query is: given a book, what did other people who
> > > bought this book also buy?  I plan the list the 5 most popular such
> > > books.  In reality, this isn't about books, but that makes it easier
> > > to understand I think.
> > >
> > > We have a table of customer_id (watch_list_id) and book_id
> > > (element_id).
> > >
> > > freshports.org=# \d watch_list_element
> > >   Table "public.watch_list_element"
> > >     Column     |  Type   | Modifiers
> > > ---------------+---------+-----------
> > >  watch_list_id | integer | not null
> > >  element_id    | integer | not null
> > > Indexes:
> > >     "watch_list_element_pkey" primary key, btree (watch_list_id,
> > > element_id)
> > >     "watch_list_element_element_id" btree (element_id)
> > > Foreign-key constraints:
> > >     "$2" FOREIGN KEY (watch_list_id) REFERENCES watch_list(id) ON
> > > UPDATE CASCADE ON DELETE CASCADE
> > >     "$1" FOREIGN KEY (element_id) REFERENCES element(id) ON UPDATE
> > > CASCADE ON DELETE CASCADE
> > >
> > > freshports.org=#
> > >
> > > I have a query which returns the needed results:
> > >
> > >      SELECT W.element_id
> > >        FROM watch_list_element W
> > >       WHERE w.watch_list_id in (select watch_list_id from
> > > watch_list_element where element_id = 54968)
> > >    GROUP BY W.element_id
> > >    ORDER BY count(W.watch_list_id) DESC
> > >       LIMIT 5;
> > >
> > > But performance is an issue here.  So I'm planning to calculate all
> > > the possible values and cache them. That is, given each element_id
> > > in a watch_list, what are the top 5 element_id values on all the
> > > lists on which the original element_id appears?
> > >
> > > I'm having trouble constructing the query.  I'm not even sure I can
> > > do this in one select, but that would be nice.  Examples and clues
> > > are appreciated.
> > >
> > > Any ideas?
> > >
> > > Thank you.
> > > --
> >
> > Just two ideas.
> >
> > 1) Older Postgres versions are notorious for being slow
> > on "IN" clauses.
> > Does this one (untested) perform better:
> >
> > SELECT W.element_id, count(W.watch_list_id)
> >   FROM watch_list_element W
> > WHERE EXISTS
> > (SELECT * FROM watch_list_element E
> >  WHERE E.element_id = 54968 AND W.watch_list_id = E.watch_list_id)
> > GROUP BY W.element_id ORDER BY 2 DESC LIMIT 5;
> 
> I'm on PostgreSQL 7.4.7:
> 
> freshports.org=# explain analyse
> freshports.org-# SELECT W.element_id, count(W.watch_list_id)
> freshports.org-#   FROM watch_list_element W
> freshports.org-# WHERE EXISTS
> freshports.org-# (SELECT * FROM watch_list_element E
> freshports.org(#  WHERE E.element_id = 54968 AND W.watch_list_id =
> E.watch_list_id)
> freshports.org-# GROUP BY W.element_id
> freshports.org-# ORDER BY 2 DESC
> freshports.org-# LIMIT 5;
> 
>            QUERY PLAN
> ----------------------------------------------------------------------
> ----------------------------------------------------------------------
> ---------------------------------
>  Limit  (cost=417905.49..417905.51 rows=5 width=8) (actual
> time=3142.480..3142.528 rows=5 loops=1)
>    ->  Sort  (cost=417905.49..417908.08 rows=1033 width=8) (actual
> time=3142.471..3142.486 rows=5 loops=1)
>          Sort Key: count(watch_list_id)
>          ->  HashAggregate  (cost=417851.20..417853.78 rows=1033
> width=8) (actual time=3074.170..3112.294 rows=7338 loops=1)
>                ->  Seq Scan on watch_list_element w
> (cost=0.00..417506.76 rows=68888 width=8) (actual
> time=0.129..2619.989 rows=94018 loops=1)
>                      Filter: (subplan)
>                      SubPlan
>                        ->  Index Scan using watch_list_element_pkey
> on watch_list_element e  (cost=0.00..3.02 rows=1 width=8) (actual
> time=0.011..0.011 rows=1 loops=137776)
>                              Index Cond: (($0 = watch_list_id) AND
> (element_id = 54968))
>  Total runtime: 3143.304 ms
> (10 rows)
> 
> freshports.org=#
> 
> Compare that to the original query:
> 
> freshports.org=# explain analyse
> freshports.org-#      SELECT W.element_id
> freshports.org-#        FROM watch_list_element W
> freshports.org-#       WHERE w.watch_list_id in (select watch_list_id
> from
> freshports.org(# watch_list_element where element_id = 54968)
> freshports.org-#    GROUP BY W.element_id
> freshports.org-#    ORDER BY count(W.watch_list_id) DESC
> freshports.org-#       LIMIT 5;
> 
>                  QUERY PLAN
> ----------------------------------------------------------------------
> ----------------------------------------------------------------------
> --------------------------------------------
>  Limit  (cost=1174.26..1174.28 rows=5 width=8) (actual
> time=1786.628..1786.675 rows=5 loops=1)
>    ->  Sort  (cost=1174.26..1176.16 rows=758 width=8) (actual
> time=1786.618..1786.634 rows=5 loops=1)
>          Sort Key: count(w.watch_list_id)
>          ->  HashAggregate  (cost=1136.11..1138.01 rows=758 width=8)
> (actual time=1718.439..1756.290 rows=7338 loops=1)
>                ->  Nested Loop  (cost=465.98..1132.32 rows=758
> width=8) (actual time=44.372..1292.270 rows=94018 loops=1)
>                      ->  HashAggregate  (cost=465.98..465.98 rows=6
> width=4) (actual time=44.332..47.136 rows=599 loops=1)
>                            ->  Index Scan using
> watch_list_element_element_id on watch_list_element
> (cost=0.00..464.15 rows=735 width=4) (actual time=37.957..41.789
> rows=599 loops=1)
>                                  Index Cond: (element_id = 54968)
>                      ->  Index Scan using watch_list_element_pkey on
> watch_list_element w  (cost=0.00..109.48 rows=126 width=8) (actual
> time=0.049..0.863 rows=157 loops=599)
>                            Index Cond: (w.watch_list_id =
> "outer".watch_list_id)
>  Total runtime: 1787.466 ms
> (11 rows)
> 
> freshports.org=#
> 
> > 2) I suspect calculating all possible values would require time and an
> > enormous cache buffer in size as well as re-calculating pretty often.
> > So my approach would be trying to tune the query before introducing
> > cached results.
> 
> Yes, it's pretty extensive: for each element (there are 9000), get me
> all the watch lists it appears upon, then get me the top 5 elements
> on *those* watch lists.
> --

So your original query outruns my "EXISTS" version 
by a factor of 2. I did not realize the "IN" has 
become so much faster. 
I am running out of clues. 
May be the [PERFORMANCE] list has more 
qualified approaches. 
Regards, Christoph


pgsql-sql by date:

Previous
From: Achilleus Mantzios
Date:
Subject: Re: people who buy A, also buy C, D, E
Next
From: Rodrigo Carvalhaes
Date:
Subject: Re: UPDATE WITH ORDER BY