Re: people who buy A, also buy C, D, E - Mailing list pgsql-sql
From | Christoph Haller |
---|---|
Subject | Re: people who buy A, also buy C, D, E |
Date | |
Msg-id | 426E41D0.D4103DEF@rodos.fzk.de Whole thread Raw |
In response to | Re: people who buy A, also buy C, D, E ("Dan Langille" <dan@langille.org>) |
List | pgsql-sql |
Dan Langille wrote: > > On 26 Apr 2005 at 14:24, Christoph Haller wrote: > > > Dan Langille wrote: > > > > > > The goal of my query is: given a book, what did other people who > > > bought this book also buy? I plan the list the 5 most popular such > > > books. In reality, this isn't about books, but that makes it easier > > > to understand I think. > > > > > > We have a table of customer_id (watch_list_id) and book_id > > > (element_id). > > > > > > freshports.org=# \d watch_list_element > > > Table "public.watch_list_element" > > > Column | Type | Modifiers > > > ---------------+---------+----------- > > > watch_list_id | integer | not null > > > element_id | integer | not null > > > Indexes: > > > "watch_list_element_pkey" primary key, btree (watch_list_id, > > > element_id) > > > "watch_list_element_element_id" btree (element_id) > > > Foreign-key constraints: > > > "$2" FOREIGN KEY (watch_list_id) REFERENCES watch_list(id) ON > > > UPDATE CASCADE ON DELETE CASCADE > > > "$1" FOREIGN KEY (element_id) REFERENCES element(id) ON UPDATE > > > CASCADE ON DELETE CASCADE > > > > > > freshports.org=# > > > > > > I have a query which returns the needed results: > > > > > > SELECT W.element_id > > > FROM watch_list_element W > > > WHERE w.watch_list_id in (select watch_list_id from > > > watch_list_element where element_id = 54968) > > > GROUP BY W.element_id > > > ORDER BY count(W.watch_list_id) DESC > > > LIMIT 5; > > > > > > But performance is an issue here. So I'm planning to calculate all > > > the possible values and cache them. That is, given each element_id > > > in a watch_list, what are the top 5 element_id values on all the > > > lists on which the original element_id appears? > > > > > > I'm having trouble constructing the query. I'm not even sure I can > > > do this in one select, but that would be nice. Examples and clues > > > are appreciated. > > > > > > Any ideas? > > > > > > Thank you. > > > -- > > > > Just two ideas. > > > > 1) Older Postgres versions are notorious for being slow > > on "IN" clauses. > > Does this one (untested) perform better: > > > > SELECT W.element_id, count(W.watch_list_id) > > FROM watch_list_element W > > WHERE EXISTS > > (SELECT * FROM watch_list_element E > > WHERE E.element_id = 54968 AND W.watch_list_id = E.watch_list_id) > > GROUP BY W.element_id ORDER BY 2 DESC LIMIT 5; > > I'm on PostgreSQL 7.4.7: > > freshports.org=# explain analyse > freshports.org-# SELECT W.element_id, count(W.watch_list_id) > freshports.org-# FROM watch_list_element W > freshports.org-# WHERE EXISTS > freshports.org-# (SELECT * FROM watch_list_element E > freshports.org(# WHERE E.element_id = 54968 AND W.watch_list_id = > E.watch_list_id) > freshports.org-# GROUP BY W.element_id > freshports.org-# ORDER BY 2 DESC > freshports.org-# LIMIT 5; > > QUERY PLAN > ---------------------------------------------------------------------- > ---------------------------------------------------------------------- > --------------------------------- > Limit (cost=417905.49..417905.51 rows=5 width=8) (actual > time=3142.480..3142.528 rows=5 loops=1) > -> Sort (cost=417905.49..417908.08 rows=1033 width=8) (actual > time=3142.471..3142.486 rows=5 loops=1) > Sort Key: count(watch_list_id) > -> HashAggregate (cost=417851.20..417853.78 rows=1033 > width=8) (actual time=3074.170..3112.294 rows=7338 loops=1) > -> Seq Scan on watch_list_element w > (cost=0.00..417506.76 rows=68888 width=8) (actual > time=0.129..2619.989 rows=94018 loops=1) > Filter: (subplan) > SubPlan > -> Index Scan using watch_list_element_pkey > on watch_list_element e (cost=0.00..3.02 rows=1 width=8) (actual > time=0.011..0.011 rows=1 loops=137776) > Index Cond: (($0 = watch_list_id) AND > (element_id = 54968)) > Total runtime: 3143.304 ms > (10 rows) > > freshports.org=# > > Compare that to the original query: > > freshports.org=# explain analyse > freshports.org-# SELECT W.element_id > freshports.org-# FROM watch_list_element W > freshports.org-# WHERE w.watch_list_id in (select watch_list_id > from > freshports.org(# watch_list_element where element_id = 54968) > freshports.org-# GROUP BY W.element_id > freshports.org-# ORDER BY count(W.watch_list_id) DESC > freshports.org-# LIMIT 5; > > QUERY PLAN > ---------------------------------------------------------------------- > ---------------------------------------------------------------------- > -------------------------------------------- > Limit (cost=1174.26..1174.28 rows=5 width=8) (actual > time=1786.628..1786.675 rows=5 loops=1) > -> Sort (cost=1174.26..1176.16 rows=758 width=8) (actual > time=1786.618..1786.634 rows=5 loops=1) > Sort Key: count(w.watch_list_id) > -> HashAggregate (cost=1136.11..1138.01 rows=758 width=8) > (actual time=1718.439..1756.290 rows=7338 loops=1) > -> Nested Loop (cost=465.98..1132.32 rows=758 > width=8) (actual time=44.372..1292.270 rows=94018 loops=1) > -> HashAggregate (cost=465.98..465.98 rows=6 > width=4) (actual time=44.332..47.136 rows=599 loops=1) > -> Index Scan using > watch_list_element_element_id on watch_list_element > (cost=0.00..464.15 rows=735 width=4) (actual time=37.957..41.789 > rows=599 loops=1) > Index Cond: (element_id = 54968) > -> Index Scan using watch_list_element_pkey on > watch_list_element w (cost=0.00..109.48 rows=126 width=8) (actual > time=0.049..0.863 rows=157 loops=599) > Index Cond: (w.watch_list_id = > "outer".watch_list_id) > Total runtime: 1787.466 ms > (11 rows) > > freshports.org=# > > > 2) I suspect calculating all possible values would require time and an > > enormous cache buffer in size as well as re-calculating pretty often. > > So my approach would be trying to tune the query before introducing > > cached results. > > Yes, it's pretty extensive: for each element (there are 9000), get me > all the watch lists it appears upon, then get me the top 5 elements > on *those* watch lists. > -- So your original query outruns my "EXISTS" version by a factor of 2. I did not realize the "IN" has become so much faster. I am running out of clues. May be the [PERFORMANCE] list has more qualified approaches. Regards, Christoph