Thread: people who buy A, also buy C, D, E
The goal of my query is: given a book, what did other people who bought this book also buy? I plan to list the 5 most popular such books. In reality, this isn't about books, but that makes it easier to understand, I think.

We have a table of customer_id (watch_list_id) and book_id (element_id):

freshports.org=# \d watch_list_element
  Table "public.watch_list_element"
    Column     |  Type   | Modifiers
---------------+---------+-----------
 watch_list_id | integer | not null
 element_id    | integer | not null
Indexes:
    "watch_list_element_pkey" primary key, btree (watch_list_id, element_id)
    "watch_list_element_element_id" btree (element_id)
Foreign-key constraints:
    "$2" FOREIGN KEY (watch_list_id) REFERENCES watch_list(id) ON UPDATE CASCADE ON DELETE CASCADE
    "$1" FOREIGN KEY (element_id) REFERENCES element(id) ON UPDATE CASCADE ON DELETE CASCADE

freshports.org=#

I have a query which returns the needed results:

SELECT W.element_id
  FROM watch_list_element W
 WHERE W.watch_list_id IN (SELECT watch_list_id
                             FROM watch_list_element
                            WHERE element_id = 54968)
 GROUP BY W.element_id
 ORDER BY count(W.watch_list_id) DESC
 LIMIT 5;

But performance is an issue here, so I'm planning to calculate all the possible values and cache them. That is, given each element_id in a watch_list, what are the top 5 element_id values on all the lists on which the original element_id appears?

I'm having trouble constructing the query. I'm not even sure I can do this in one select, but that would be nice. Examples and clues are appreciated.

Any ideas? Thank you.

--
Dan Langille : http://www.langille.org/
BSDCan - The Technical BSD Conference - http://www.bsdcan.org/
NEW brochure available at http://www.bsdcan.org/2005/advocacy/
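[Editor's note: for readers who want to poke at the shape of this query without a PostgreSQL instance handy, here is a minimal sketch using Python's stdlib sqlite3. The table and column names follow the thread; the toy data is invented.]

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE watch_list_element (
    watch_list_id INTEGER NOT NULL,
    element_id    INTEGER NOT NULL,
    PRIMARY KEY (watch_list_id, element_id))""")

# Three watch lists; element 1 co-occurs with element 2 on two lists
# and with element 3 on one list.  Element 4 never shares a list with 1.
conn.executemany("INSERT INTO watch_list_element VALUES (?, ?)",
                 [(10, 1), (10, 2), (10, 3),
                  (20, 1), (20, 2),
                  (30, 3), (30, 4)])

# Same shape as the posted query, parameterised on the element id:
# every element on any list that contains the given element, ranked
# by how many of those lists it appears on.
also = conn.execute("""
    SELECT W.element_id
      FROM watch_list_element W
     WHERE W.watch_list_id IN (SELECT watch_list_id
                                 FROM watch_list_element
                                WHERE element_id = ?)
     GROUP BY W.element_id
     ORDER BY count(W.watch_list_id) DESC
     LIMIT 5""", (1,)).fetchall()
print([e for (e,) in also])
```

Note that the element you start from (here, 1) appears in its own result, a point raised later in the thread.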
Dan Langille wrote:
> The goal of my query is: given a book, what did other people who
> bought this book also buy? I plan to list the 5 most popular such
> books.
> [snip]
> But performance is an issue here. So I'm planning to calculate all
> the possible values and cache them.

Just two ideas.

1) Older Postgres versions are notorious for being slow on "IN" clauses. Does this one (untested) perform better:

SELECT W.element_id, count(W.watch_list_id)
  FROM watch_list_element W
 WHERE EXISTS
       (SELECT * FROM watch_list_element E
         WHERE E.element_id = 54968 AND W.watch_list_id = E.watch_list_id)
 GROUP BY W.element_id
 ORDER BY 2 DESC
 LIMIT 5;

2) I suspect calculating all possible values would require time and an enormous cache buffer, as well as re-calculating pretty often. So my approach would be trying to tune the query before introducing cached results.

HTH

Regards, Christoph
On 26 Apr 2005 at 14:24, Christoph Haller wrote:

> Just two ideas.
>
> 1) Older Postgres versions are notorious for being slow
> on "IN" clauses.
> Does this one (untested) perform better:
>
> SELECT W.element_id, count(W.watch_list_id)
> FROM watch_list_element W
> WHERE EXISTS
> (SELECT * FROM watch_list_element E
> WHERE E.element_id = 54968 AND W.watch_list_id = E.watch_list_id)
> GROUP BY W.element_id ORDER BY 2 DESC LIMIT 5;

I'm on PostgreSQL 7.4.7:

freshports.org=# explain analyse
freshports.org-# SELECT W.element_id, count(W.watch_list_id)
freshports.org-# FROM watch_list_element W
freshports.org-# WHERE EXISTS
freshports.org-# (SELECT * FROM watch_list_element E
freshports.org(# WHERE E.element_id = 54968 AND W.watch_list_id = E.watch_list_id)
freshports.org-# GROUP BY W.element_id
freshports.org-# ORDER BY 2 DESC
freshports.org-# LIMIT 5;

                              QUERY PLAN
------------------------------------------------------------------------------
 Limit  (cost=417905.49..417905.51 rows=5 width=8) (actual time=3142.480..3142.528 rows=5 loops=1)
   ->  Sort  (cost=417905.49..417908.08 rows=1033 width=8) (actual time=3142.471..3142.486 rows=5 loops=1)
         Sort Key: count(watch_list_id)
         ->  HashAggregate  (cost=417851.20..417853.78 rows=1033 width=8) (actual time=3074.170..3112.294 rows=7338 loops=1)
               ->  Seq Scan on watch_list_element w  (cost=0.00..417506.76 rows=68888 width=8) (actual time=0.129..2619.989 rows=94018 loops=1)
                     Filter: (subplan)
                     SubPlan
                       ->  Index Scan using watch_list_element_pkey on watch_list_element e  (cost=0.00..3.02 rows=1 width=8) (actual time=0.011..0.011 rows=1 loops=137776)
                             Index Cond: (($0 = watch_list_id) AND (element_id = 54968))
 Total runtime: 3143.304 ms
(10 rows)

freshports.org=#

Compare that to the original query:

freshports.org=# explain analyse
freshports.org-# SELECT W.element_id
freshports.org-# FROM watch_list_element W
freshports.org-# WHERE w.watch_list_id in (select watch_list_id from
freshports.org(# watch_list_element where element_id = 54968)
freshports.org-# GROUP BY W.element_id
freshports.org-# ORDER BY count(W.watch_list_id) DESC
freshports.org-# LIMIT 5;

                              QUERY PLAN
------------------------------------------------------------------------------
 Limit  (cost=1174.26..1174.28 rows=5 width=8) (actual time=1786.628..1786.675 rows=5 loops=1)
   ->  Sort  (cost=1174.26..1176.16 rows=758 width=8) (actual time=1786.618..1786.634 rows=5 loops=1)
         Sort Key: count(w.watch_list_id)
         ->  HashAggregate  (cost=1136.11..1138.01 rows=758 width=8) (actual time=1718.439..1756.290 rows=7338 loops=1)
               ->  Nested Loop  (cost=465.98..1132.32 rows=758 width=8) (actual time=44.372..1292.270 rows=94018 loops=1)
                     ->  HashAggregate  (cost=465.98..465.98 rows=6 width=4) (actual time=44.332..47.136 rows=599 loops=1)
                           ->  Index Scan using watch_list_element_element_id on watch_list_element  (cost=0.00..464.15 rows=735 width=4) (actual time=37.957..41.789 rows=599 loops=1)
                                 Index Cond: (element_id = 54968)
                     ->  Index Scan using watch_list_element_pkey on watch_list_element w  (cost=0.00..109.48 rows=126 width=8) (actual time=0.049..0.863 rows=157 loops=599)
                           Index Cond: (w.watch_list_id = "outer".watch_list_id)
 Total runtime: 1787.466 ms
(11 rows)

freshports.org=#

> 2) I suspect calculating all possible values would require time and an
> enormous cache buffer in size as well as re-calculating pretty often.
> So my approach would be trying to tune the query before introducing
> cached results.

Yes, it's pretty extensive: for each element (there are 9000), get me all the watch lists it appears upon, then get me the top 5 elements on *those* watch lists.

--
Dan Langille : http://www.langille.org/
BSDCan - The Technical BSD Conference - http://www.bsdcan.org/
NEW brochure available at http://www.bsdcan.org/2005/advocacy/
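[Editor's note: the "all pairs" precomputation Dan describes can be phrased as a single self-join that counts co-occurrences for every pair of elements at once; taking the top 5 per element then needs either a per-group limit (window functions, which 7.4 does not have) or a client-side pass. A hedged sketch of the idea, again with stdlib sqlite3 and invented toy data:]

```python
import sqlite3
from collections import defaultdict

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE watch_list_element (
    watch_list_id INTEGER NOT NULL,
    element_id    INTEGER NOT NULL)""")
conn.executemany("INSERT INTO watch_list_element VALUES (?, ?)",
                 [(10, 1), (10, 2), (10, 3),
                  (20, 1), (20, 2),
                  (30, 2), (30, 3)])

# One self-join yields co-occurrence counts for every element pair.
pairs = conn.execute("""
    SELECT A.element_id, B.element_id, count(*) AS n
      FROM watch_list_element A
      JOIN watch_list_element B ON A.watch_list_id = B.watch_list_id
     WHERE A.element_id <> B.element_id
     GROUP BY A.element_id, B.element_id""").fetchall()

# Client-side top-5 per element (in SQL this would be a window
# function, which is not available in 7.4).
by_element = defaultdict(list)
for a, b, n in pairs:
    by_element[a].append((n, b))
cache = {a: [b for n, b in sorted(v, reverse=True)[:5]]
         for a, v in by_element.items()}
```

The `cache` dict is the precomputed answer table; whether it is worth materialising is exactly the trade-off discussed in the thread.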
On Apr 26, 2005, Christoph Haller wrote:
> [snip]
> 2) I suspect calculating all possible values would require time and
> an enormous cache buffer in size as well as re-calculating pretty often.
> So my approach would be trying to tune the query before introducing
> cached results.

AFAIK, problems like this fall into the "Data Mining" field, and often their solutions go beyond some DB arrangements. A little research wouldn't hurt, IMO.

--
-Achilleus
Dan Langille wrote:
> [snip: EXPLAIN ANALYZE output quoted in full above]
>
> > 2) I suspect calculating all possible values would require time and an
> > enormous cache buffer in size as well as re-calculating pretty often.
> > So my approach would be trying to tune the query before introducing
> > cached results.
>
> Yes, it's pretty extensive: for each element (there are 9000), get me
> all the watch lists it appears upon, then get me the top 5 elements
> on *those* watch lists.

So your original query outruns my "EXISTS" version by a factor of 2.
I did not realize "IN" has become so much faster. I am running out of clues. Maybe the [PERFORMANCE] list has more qualified approaches.

Regards, Christoph
Hi,

I am a bit confused. If you want to display the first 5, the following query will fetch the top 5 book ids. I am not able to understand why there is a sub-query.

SELECT ELEMENT_ID, COUNT(*)
  FROM WATCH_LIST_ELEMENT
 GROUP BY ELEMENT_ID
 ORDER BY COUNT(*) DESC
 LIMIT 5

Regards,
R.Muralidharan

-----Original Message-----
From: pgsql-sql-owner@postgresql.org [mailto:pgsql-sql-owner@postgresql.org] On Behalf Of Dan Langille
Sent: Tuesday, April 26, 2005 7:52 AM
To: pgsql-sql@postgresql.org
Cc: dan@langille.org
Subject: [SQL] people who buy A, also buy C, D, E

[snip]
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

> The goal of my query is: given a book, what did other people who
> bought this book also buy? I plan to list the 5 most popular such
> books.

I've been playing with this a little bit, and I don't think you are going to get better than you already have. Certainly, the caching won't work either, as any insert into watch_list_element has the potential to change a very large number of pre-compiled lists. However, there are some minor optimizations that can be made to speed up the existing query quite a bit.

One note first: the LIMIT should be 6, not 5, if you really want the five *other* "books", as the book itself will more than likely appear in the list. Picking it out is something the client app can do.

* Make sure the tables are freshly analyzed. Might want to bump up the default stats a bit too.

* Looks like you already have indexes on the watch_list_element table. The watch_list_element_element_id index could be broken into multiple conditional indexes, but your explain shows this would not really gain us much:

  actual time=37.957..41.789

* One big gain would be to cluster the table on watch_list_id:

  CREATE INDEX watch_index ON watch_list_element (watch_list_id);
  CLUSTER watch_index ON watch_list_element;

  I got about a 25% speedup on my queries by doing this. YMMV, as I don't know enough about your conditions to do more than make an approximate test database. But it should help this query out.

* Finally, you should upgrade if at all possible. Going from 7.4.7 to 8.0.1 gave me a 10% speed increase, while going from 8.0.1 to 8.1.0 (e.g. the upcoming version) gave me an additional 25% speed boost, mostly due to the new bitmap stuff. So, making the jump to 8.0.1 will be good practice for the 8.1.0 jump, right? :)

Overall, I was able to get the query to go about a third faster than when I started. Hope this helps.
- --
Greg Sabino Mullane greg@turnstep.com
PGP Key: 0x14964AC8 200506242328
http://biglumber.com/x/web?pk=2529DF6AB8F79407E94445B4BC9B906714964AC8

-----BEGIN PGP SIGNATURE-----

iD8DBQFCvNCrvJuQZxSWSsgRAmkDAJ44z/Ei27HuEBqx/htmCRHJZWi8VQCfV2mm
upeE0p3z4h11NJzl5aOqCkc=
=LVqI
-----END PGP SIGNATURE-----
On 6/24/2005 11:35 PM, Greg Sabino Mullane wrote:
>> The goal of my query is: given a book, what did other people who
>> bought this book also buy? I plan to list the 5 most popular such
>> books.

This sounds very much like one of the congestion problems given in the TPC-W benchmark. You might want to take a look at some of the published full disclosure reports or the PHP TPC-W implementation on pgfoundry to get some hints.

Jan

> [snip]

--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck@Yahoo.com #
>>> The goal of my query is: given a book, what did other people who
>>> bought this book also buy? I plan to list the 5 most popular such
>>> books.

You can use the table listing ordered products directly. For example:

table ordered_products: order_id, product_id, quantity

SELECT b.product_id, sum(quantity) AS rank
  FROM ordered_products a, ordered_products b
 WHERE a.product_id = (the product id)
   AND b.order_id = a.order_id
   AND b.product_id != a.product_id
 GROUP BY b.product_id
 ORDER BY rank DESC
 LIMIT 6;

This will need indexes on order_id and product_id that you probably already have. It will also be slow.

You can also have a cache table:

table cache: prod_id_a, prod_id_b, quantity

with a constraint that prod_id_a < prod_id_b. You add a trigger on insert, update or delete to ordered_products to insert or update rows in this table, modifying the quantity according to the purchase. To select you do:

SELECT * FROM (
    (SELECT prod_id_b AS pid, quantity FROM cache
      WHERE prod_id_a = (your id)
      ORDER BY prod_id_a DESC, quantity DESC LIMIT 5)
    UNION ALL
    (SELECT prod_id_a AS pid, quantity FROM cache
      WHERE prod_id_b = (your id)
      ORDER BY prod_id_b DESC, quantity DESC LIMIT 5)
) AS foo
ORDER BY quantity DESC LIMIT 5;

It will probably be very fast, but the table will grow huge and need various indexes: (prod_id_a, quantity), (prod_id_b, quantity), and (prod_id_a, prod_id_b) (the primary key). You'll get 1/2 * N * (N-1) rows, N being the number of products on your site. If you remove the constraint prod_id_a < prod_id_b you'll get N^2 rows, which is worse.

Another solution:

table cache: product_id integer, also_purchased integer[]

After every order, update also_purchased with the results of the query using the self-join on ordered_products above. That query might not be fast enough to use in a product webpage, but it should be fine to run like this, only when orders are made. To get the "also purchased" products, all you have to do is read a line from this table.
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

>>> The goal of my query is: given a book, what did other people who
>>> bought this book also buy? I plan to list the 5 most popular such
>>> books.
>
> SELECT b.product_id, sum(quantity) as rank FROM ordered_products a,
> ordered_products b WHERE a.product_id=(the product id) AND
> b.order_id=a.order_id AND b.product_id != a.product_id GROUP BY
> b.product_id ORDER BY rank DESC LIMIT 6;

I don't think this is exactly what the original poster had in mind: we want a ranking of a dynamically generated subset of all possible products (e.g. books). So if someone buys "Harry Potter and the Proprietary Database", then only the books bought by people who also bought /that/ book are considered, ranked, and ordered. There's not a lot of caching that can be effectively done, due to the high number of combinations and large potential for change.

> table ordered_products: order_id, product_id, quantity

I'm not sure where you are getting "quantity" from: as near as I can tell, this will always be a quantity of 1: one person ordering one item.

- --
Greg Sabino Mullane greg@turnstep.com
PGP Key: 0x14964AC8 200506281946
http://biglumber.com/x/web?pk=2529DF6AB8F79407E94445B4BC9B906714964AC8

-----BEGIN PGP SIGNATURE-----

iD8DBQFCweKavJuQZxSWSsgRAkmHAJ9fQ+Degs6jSrGRozEoI35F8nlyBACfYm2u
QgawxHOij5FHVd0FopW25IU=
=r5eo
-----END PGP SIGNATURE-----