Thread: people who buy A, also buy C, D, E

people who buy A, also buy C, D, E

From
"Dan Langille"
Date:
The goal of my query is: given a book, what did other people who
bought this book also buy?  I plan to list the 5 most popular such
books.  In reality, this isn't about books, but I think that makes it
easier to understand.

We have a table of customer_id (watch_list_id) and book_id 
(element_id).

freshports.org=# \d watch_list_element
  Table "public.watch_list_element"
    Column     |  Type   | Modifiers
---------------+---------+-----------
 watch_list_id | integer | not null
 element_id    | integer | not null
Indexes:
    "watch_list_element_pkey" primary key, btree (watch_list_id, element_id)
    "watch_list_element_element_id" btree (element_id)
Foreign-key constraints:
    "$2" FOREIGN KEY (watch_list_id) REFERENCES watch_list(id) ON UPDATE CASCADE ON DELETE CASCADE
    "$1" FOREIGN KEY (element_id) REFERENCES element(id) ON UPDATE CASCADE ON DELETE CASCADE

freshports.org=#

I have a query which returns the needed results:
     SELECT W.element_id
       FROM watch_list_element W
      WHERE w.watch_list_id in (select watch_list_id from
            watch_list_element where element_id = 54968)
   GROUP BY W.element_id
   ORDER BY count(W.watch_list_id) DESC
      LIMIT 5;

But performance is an issue here.  So I'm planning to calculate all 
the possible values and cache them. That is, given each element_id in 
a watch_list, what are the top 5 element_id values on all the lists 
on which the original element_id appears?
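
For concreteness, the cache could be a table shaped something like
this; the table and column names are only illustrative, not part of
the schema above:

-- an illustrative cache table, not part of the existing schema
CREATE TABLE element_recommendation (
    element_id         integer NOT NULL,
    related_element_id integer NOT NULL,
    list_count         integer NOT NULL,
    PRIMARY KEY (element_id, related_element_id)
);

-- filling it for a single element is just the query above with the
-- starting element_id added as a constant
INSERT INTO element_recommendation (element_id, related_element_id, list_count)
SELECT 54968, W.element_id, count(W.watch_list_id)
  FROM watch_list_element W
 WHERE W.watch_list_id IN (SELECT watch_list_id
                             FROM watch_list_element
                            WHERE element_id = 54968)
 GROUP BY W.element_id
 ORDER BY count(W.watch_list_id) DESC
 LIMIT 5;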

I'm having trouble constructing the query.  I'm not even sure I can 
do this in one select, but that would be nice.  Examples and clues 
are appreciated.

Any ideas?

Thank you.
-- 
Dan Langille : http://www.langille.org/
BSDCan - The Technical BSD Conference - http://www.bsdcan.org/  NEW brochure available at
http://www.bsdcan.org/2005/advocacy/



Re: people who buy A, also buy C, D, E

From
Christoph Haller
Date:
Dan Langille wrote:
> 
> The goal of my query is: given a book, what did other people who
> bought this book also buy?  I plan the list the 5 most popular such
> books.  In reality, this isn't about books, but that makes it easier
> to understand I think.
> 
> We have a table of customer_id (watch_list_id) and book_id
> (element_id).
> 
> freshports.org=# \d watch_list_element
>   Table "public.watch_list_element"
>     Column     |  Type   | Modifiers
> ---------------+---------+-----------
>  watch_list_id | integer | not null
>  element_id    | integer | not null
> Indexes:
>     "watch_list_element_pkey" primary key, btree (watch_list_id,
> element_id)
>     "watch_list_element_element_id" btree (element_id)
> Foreign-key constraints:
>     "$2" FOREIGN KEY (watch_list_id) REFERENCES watch_list(id) ON
> UPDATE CASCADE ON DELETE CASCADE
>     "$1" FOREIGN KEY (element_id) REFERENCES element(id) ON UPDATE
> CASCADE ON DELETE CASCADE
> 
> freshports.org=#
> 
> I have a query which returns the needed results:
> 
>      SELECT W.element_id
>        FROM watch_list_element W
>       WHERE w.watch_list_id in (select watch_list_id from
> watch_list_element where element_id = 54968)
>    GROUP BY W.element_id
>    ORDER BY count(W.watch_list_id) DESC
>       LIMIT 5;
> 
> But performance is an issue here.  So I'm planning to calculate all
> the possible values and cache them. That is, given each element_id in
> a watch_list, what are the top 5 element_id values on all the lists
> on which the original element_id appears?
> 
> I'm having trouble constructing the query.  I'm not even sure I can
> do this in one select, but that would be nice.  Examples and clues
> are appreciated.
> 
> Any ideas?
> 
> Thank you.
> --

Just two ideas. 

1) Older Postgres versions are notorious for being slow 
on "IN" clauses. 
Does this one (untested) perform better: 

SELECT W.element_id, count(W.watch_list_id)
  FROM watch_list_element W
WHERE EXISTS
(SELECT * FROM watch_list_element E
 WHERE E.element_id = 54968 AND W.watch_list_id = E.watch_list_id)
GROUP BY W.element_id
ORDER BY 2 DESC
LIMIT 5;

2) I suspect calculating all possible values would take a lot of time
and an enormous amount of cache space, and would need re-calculating
pretty often. So my approach would be to try to tune the query before
introducing cached results.

HTH

Regards, Christoph


Re: people who buy A, also buy C, D, E

From
"Dan Langille"
Date:
On 26 Apr 2005 at 14:24, Christoph Haller wrote:

> Dan Langille wrote:
> > 
> > The goal of my query is: given a book, what did other people who
> > bought this book also buy?  I plan the list the 5 most popular such
> > books.  In reality, this isn't about books, but that makes it easier
> > to understand I think.
> > 
> > We have a table of customer_id (watch_list_id) and book_id
> > (element_id).
> > 
> > freshports.org=# \d watch_list_element
> >   Table "public.watch_list_element"
> >     Column     |  Type   | Modifiers
> > ---------------+---------+-----------
> >  watch_list_id | integer | not null
> >  element_id    | integer | not null
> > Indexes:
> >     "watch_list_element_pkey" primary key, btree (watch_list_id,
> > element_id)
> >     "watch_list_element_element_id" btree (element_id)
> > Foreign-key constraints:
> >     "$2" FOREIGN KEY (watch_list_id) REFERENCES watch_list(id) ON
> > UPDATE CASCADE ON DELETE CASCADE
> >     "$1" FOREIGN KEY (element_id) REFERENCES element(id) ON UPDATE
> > CASCADE ON DELETE CASCADE
> > 
> > freshports.org=#
> > 
> > I have a query which returns the needed results:
> > 
> >      SELECT W.element_id
> >        FROM watch_list_element W
> >       WHERE w.watch_list_id in (select watch_list_id from
> > watch_list_element where element_id = 54968)
> >    GROUP BY W.element_id
> >    ORDER BY count(W.watch_list_id) DESC
> >       LIMIT 5;
> > 
> > But performance is an issue here.  So I'm planning to calculate all
> > the possible values and cache them. That is, given each element_id
> > in a watch_list, what are the top 5 element_id values on all the
> > lists on which the original element_id appears?
> > 
> > I'm having trouble constructing the query.  I'm not even sure I can
> > do this in one select, but that would be nice.  Examples and clues
> > are appreciated.
> > 
> > Any ideas?
> > 
> > Thank you.
> > --
> 
> Just two ideas. 
> 
> 1) Older Postgres versions are notorious for being slow 
> on "IN" clauses. 
> Does this one (untested) perform better: 
> 
> SELECT W.element_id, count(W.watch_list_id)
>   FROM watch_list_element W
> WHERE EXISTS
> (SELECT * FROM watch_list_element E
>  WHERE E.element_id = 54968 AND W.watch_list_id = E.watch_list_id)
> GROUP BY W.element_id ORDER BY 2 DESC LIMIT 5;

I'm on PostgreSQL 7.4.7:

freshports.org=# explain analyse
freshports.org-# SELECT W.element_id, count(W.watch_list_id)
freshports.org-#   FROM watch_list_element W
freshports.org-# WHERE EXISTS
freshports.org-# (SELECT * FROM watch_list_element E
freshports.org(#  WHERE E.element_id = 54968 AND W.watch_list_id = 
E.watch_list_id)
freshports.org-# GROUP BY W.element_id
freshports.org-# ORDER BY 2 DESC
freshports.org-# LIMIT 5;
                                                                               QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=417905.49..417905.51 rows=5 width=8) (actual time=3142.480..3142.528 rows=5 loops=1)
   ->  Sort  (cost=417905.49..417908.08 rows=1033 width=8) (actual time=3142.471..3142.486 rows=5 loops=1)
         Sort Key: count(watch_list_id)
         ->  HashAggregate  (cost=417851.20..417853.78 rows=1033 width=8) (actual time=3074.170..3112.294 rows=7338 loops=1)
               ->  Seq Scan on watch_list_element w  (cost=0.00..417506.76 rows=68888 width=8) (actual time=0.129..2619.989 rows=94018 loops=1)
                     Filter: (subplan)
                     SubPlan
                       ->  Index Scan using watch_list_element_pkey on watch_list_element e  (cost=0.00..3.02 rows=1 width=8) (actual time=0.011..0.011 rows=1 loops=137776)
                             Index Cond: (($0 = watch_list_id) AND (element_id = 54968))
 Total runtime: 3143.304 ms
(10 rows)

freshports.org=#

Compare that to the original query:

freshports.org=# explain analyse
freshports.org-#      SELECT W.element_id
freshports.org-#        FROM watch_list_element W
freshports.org-#       WHERE w.watch_list_id in (select watch_list_id 
from
freshports.org(# watch_list_element where element_id = 54968)
freshports.org-#    GROUP BY W.element_id
freshports.org-#    ORDER BY count(W.watch_list_id) DESC
freshports.org-#       LIMIT 5;
                                                                                  QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=1174.26..1174.28 rows=5 width=8) (actual time=1786.628..1786.675 rows=5 loops=1)
   ->  Sort  (cost=1174.26..1176.16 rows=758 width=8) (actual time=1786.618..1786.634 rows=5 loops=1)
         Sort Key: count(w.watch_list_id)
         ->  HashAggregate  (cost=1136.11..1138.01 rows=758 width=8) (actual time=1718.439..1756.290 rows=7338 loops=1)
               ->  Nested Loop  (cost=465.98..1132.32 rows=758 width=8) (actual time=44.372..1292.270 rows=94018 loops=1)
                     ->  HashAggregate  (cost=465.98..465.98 rows=6 width=4) (actual time=44.332..47.136 rows=599 loops=1)
                           ->  Index Scan using watch_list_element_element_id on watch_list_element  (cost=0.00..464.15 rows=735 width=4) (actual time=37.957..41.789 rows=599 loops=1)
                                 Index Cond: (element_id = 54968)
                     ->  Index Scan using watch_list_element_pkey on watch_list_element w  (cost=0.00..109.48 rows=126 width=8) (actual time=0.049..0.863 rows=157 loops=599)
                           Index Cond: (w.watch_list_id = "outer".watch_list_id)
 Total runtime: 1787.466 ms
(11 rows)

freshports.org=#

> 2) I suspect calculating all possible values would require time and an
> enormous cache buffer in size as well as re-calculating pretty often.
> So my approach would be trying to tune the query before introducing
> cached results. 

Yes, it's pretty extensive: for each element (there are 9000), get me
all the watch lists it appears upon, then get me the top 5 elements on
*those* watch lists.
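
One brute-force way to precompute that is a PL/pgSQL loop over the
distinct element ids, filling the illustrative element_recommendation
table sketched in the first message (on 7.4 the function body has to
be an ordinary quoted string):

-- a sketch only: rebuild the top-5 list for every element
CREATE OR REPLACE FUNCTION refresh_recommendations() RETURNS void AS '
DECLARE
    e RECORD;
BEGIN
    DELETE FROM element_recommendation;
    FOR e IN SELECT DISTINCT element_id FROM watch_list_element LOOP
        INSERT INTO element_recommendation (element_id, related_element_id, list_count)
        SELECT e.element_id, W.element_id, count(*)
          FROM watch_list_element W
         WHERE W.watch_list_id IN (SELECT watch_list_id
                                     FROM watch_list_element
                                    WHERE element_id = e.element_id)
           AND W.element_id <> e.element_id   -- skip the element itself
         GROUP BY W.element_id
         ORDER BY count(*) DESC
         LIMIT 5;
    END LOOP;
    RETURN;
END;
' LANGUAGE plpgsql;

SELECT refresh_recommendations();
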
-- 
Dan Langille : http://www.langille.org/
BSDCan - The Technical BSD Conference - http://www.bsdcan.org/  NEW brochure available at
http://www.bsdcan.org/2005/advocacy/



Re: people who buy A, also buy C, D, E

From
Achilleus Mantzios
Date:
Christoph Haller wrote on Apr 26, 2005:

> Dan Langille wrote:
> > 
> > The goal of my query is: given a book, what did other people who
> > bought this book also buy?  I plan the list the 5 most popular such
> > books.  In reality, this isn't about books, but that makes it easier
> > to understand I think.
> > 
> > We have a table of customer_id (watch_list_id) and book_id
> > (element_id).
> > 
> > freshports.org=# \d watch_list_element
> >   Table "public.watch_list_element"
> >     Column     |  Type   | Modifiers
> > ---------------+---------+-----------
> >  watch_list_id | integer | not null
> >  element_id    | integer | not null
> > Indexes:
> >     "watch_list_element_pkey" primary key, btree (watch_list_id,
> > element_id)
> >     "watch_list_element_element_id" btree (element_id)
> > Foreign-key constraints:
> >     "$2" FOREIGN KEY (watch_list_id) REFERENCES watch_list(id) ON
> > UPDATE CASCADE ON DELETE CASCADE
> >     "$1" FOREIGN KEY (element_id) REFERENCES element(id) ON UPDATE
> > CASCADE ON DELETE CASCADE
> > 
> > freshports.org=#
> > 
> > I have a query which returns the needed results:
> > 
> >      SELECT W.element_id
> >        FROM watch_list_element W
> >       WHERE w.watch_list_id in (select watch_list_id from
> > watch_list_element where element_id = 54968)
> >    GROUP BY W.element_id
> >    ORDER BY count(W.watch_list_id) DESC
> >       LIMIT 5;
> > 
> > But performance is an issue here.  So I'm planning to calculate all
> > the possible values and cache them. That is, given each element_id in
> > a watch_list, what are the top 5 element_id values on all the lists
> > on which the original element_id appears?
> > 
> > I'm having trouble constructing the query.  I'm not even sure I can
> > do this in one select, but that would be nice.  Examples and clues
> > are appreciated.
> > 
> > Any ideas?
> > 
> > Thank you.
> > --
> 
> Just two ideas. 
> 
> 1) Older Postgres versions are notorious for being slow 
> on "IN" clauses. 
> Does this one (untested) perform better: 
> 
> SELECT W.element_id, count(W.watch_list_id)
>   FROM watch_list_element W
> WHERE EXISTS
> (SELECT * FROM watch_list_element E
>  WHERE E.element_id = 54968 AND W.watch_list_id = E.watch_list_id)
> GROUP BY W.element_id
> ORDER BY 2 DESC
> LIMIT 5;
> 
> 2) I suspect calculating all possible values would require time and 
> an enormous cache buffer in size as well as re-calculating pretty often. 
> So my approach would be trying to tune the query before introducing 
> cached results. 

AFAIK, problems like this fall into the "Data Mining" field,
and often their solutions go beyond simple DB arrangements.
A little research wouldn't hurt, IMO.

> 
> HTH
> 
> Regards, Christoph
> 

-- 
-Achilleus



Re: people who buy A, also buy C, D, E

From
Christoph Haller
Date:
Dan Langille wrote:
> 
> On 26 Apr 2005 at 14:24, Christoph Haller wrote:
> 
> > Dan Langille wrote:
> > >
> > > The goal of my query is: given a book, what did other people who
> > > bought this book also buy?  I plan the list the 5 most popular such
> > > books.  In reality, this isn't about books, but that makes it easier
> > > to understand I think.
> > >
> > > We have a table of customer_id (watch_list_id) and book_id
> > > (element_id).
> > >
> > > freshports.org=# \d watch_list_element
> > >   Table "public.watch_list_element"
> > >     Column     |  Type   | Modifiers
> > > ---------------+---------+-----------
> > >  watch_list_id | integer | not null
> > >  element_id    | integer | not null
> > > Indexes:
> > >     "watch_list_element_pkey" primary key, btree (watch_list_id,
> > > element_id)
> > >     "watch_list_element_element_id" btree (element_id)
> > > Foreign-key constraints:
> > >     "$2" FOREIGN KEY (watch_list_id) REFERENCES watch_list(id) ON
> > > UPDATE CASCADE ON DELETE CASCADE
> > >     "$1" FOREIGN KEY (element_id) REFERENCES element(id) ON UPDATE
> > > CASCADE ON DELETE CASCADE
> > >
> > > freshports.org=#
> > >
> > > I have a query which returns the needed results:
> > >
> > >      SELECT W.element_id
> > >        FROM watch_list_element W
> > >       WHERE w.watch_list_id in (select watch_list_id from
> > > watch_list_element where element_id = 54968)
> > >    GROUP BY W.element_id
> > >    ORDER BY count(W.watch_list_id) DESC
> > >       LIMIT 5;
> > >
> > > But performance is an issue here.  So I'm planning to calculate all
> > > the possible values and cache them. That is, given each element_id
> > > in a watch_list, what are the top 5 element_id values on all the
> > > lists on which the original element_id appears?
> > >
> > > I'm having trouble constructing the query.  I'm not even sure I can
> > > do this in one select, but that would be nice.  Examples and clues
> > > are appreciated.
> > >
> > > Any ideas?
> > >
> > > Thank you.
> > > --
> >
> > Just two ideas.
> >
> > 1) Older Postgres versions are notorious for being slow
> > on "IN" clauses.
> > Does this one (untested) perform better:
> >
> > SELECT W.element_id, count(W.watch_list_id)
> >   FROM watch_list_element W
> > WHERE EXISTS
> > (SELECT * FROM watch_list_element E
> >  WHERE E.element_id = 54968 AND W.watch_list_id = E.watch_list_id)
> > GROUP BY W.element_id ORDER BY 2 DESC LIMIT 5;
> 
> I'm on PostgreSQL 7.4.7:
> 
> freshports.org=# explain analyse
> freshports.org-# SELECT W.element_id, count(W.watch_list_id)
> freshports.org-#   FROM watch_list_element W
> freshports.org-# WHERE EXISTS
> freshports.org-# (SELECT * FROM watch_list_element E
> freshports.org(#  WHERE E.element_id = 54968 AND W.watch_list_id =
> E.watch_list_id)
> freshports.org-# GROUP BY W.element_id
> freshports.org-# ORDER BY 2 DESC
> freshports.org-# LIMIT 5;
> 
>            QUERY PLAN
> ----------------------------------------------------------------------
> ----------------------------------------------------------------------
> ---------------------------------
>  Limit  (cost=417905.49..417905.51 rows=5 width=8) (actual
> time=3142.480..3142.528 rows=5 loops=1)
>    ->  Sort  (cost=417905.49..417908.08 rows=1033 width=8) (actual
> time=3142.471..3142.486 rows=5 loops=1)
>          Sort Key: count(watch_list_id)
>          ->  HashAggregate  (cost=417851.20..417853.78 rows=1033
> width=8) (actual time=3074.170..3112.294 rows=7338 loops=1)
>                ->  Seq Scan on watch_list_element w
> (cost=0.00..417506.76 rows=68888 width=8) (actual
> time=0.129..2619.989 rows=94018 loops=1)
>                      Filter: (subplan)
>                      SubPlan
>                        ->  Index Scan using watch_list_element_pkey
> on watch_list_element e  (cost=0.00..3.02 rows=1 width=8) (actual
> time=0.011..0.011 rows=1 loops=137776)
>                              Index Cond: (($0 = watch_list_id) AND
> (element_id = 54968))
>  Total runtime: 3143.304 ms
> (10 rows)
> 
> freshports.org=#
> 
> Compare that to the original query:
> 
> freshports.org=# explain analyse
> freshports.org-#      SELECT W.element_id
> freshports.org-#        FROM watch_list_element W
> freshports.org-#       WHERE w.watch_list_id in (select watch_list_id
> from
> freshports.org(# watch_list_element where element_id = 54968)
> freshports.org-#    GROUP BY W.element_id
> freshports.org-#    ORDER BY count(W.watch_list_id) DESC
> freshports.org-#       LIMIT 5;
> 
>                  QUERY PLAN
> ----------------------------------------------------------------------
> ----------------------------------------------------------------------
> --------------------------------------------
>  Limit  (cost=1174.26..1174.28 rows=5 width=8) (actual
> time=1786.628..1786.675 rows=5 loops=1)
>    ->  Sort  (cost=1174.26..1176.16 rows=758 width=8) (actual
> time=1786.618..1786.634 rows=5 loops=1)
>          Sort Key: count(w.watch_list_id)
>          ->  HashAggregate  (cost=1136.11..1138.01 rows=758 width=8)
> (actual time=1718.439..1756.290 rows=7338 loops=1)
>                ->  Nested Loop  (cost=465.98..1132.32 rows=758
> width=8) (actual time=44.372..1292.270 rows=94018 loops=1)
>                      ->  HashAggregate  (cost=465.98..465.98 rows=6
> width=4) (actual time=44.332..47.136 rows=599 loops=1)
>                            ->  Index Scan using
> watch_list_element_element_id on watch_list_element
> (cost=0.00..464.15 rows=735 width=4) (actual time=37.957..41.789
> rows=599 loops=1)
>                                  Index Cond: (element_id = 54968)
>                      ->  Index Scan using watch_list_element_pkey on
> watch_list_element w  (cost=0.00..109.48 rows=126 width=8) (actual
> time=0.049..0.863 rows=157 loops=599)
>                            Index Cond: (w.watch_list_id =
> "outer".watch_list_id)
>  Total runtime: 1787.466 ms
> (11 rows)
> 
> freshports.org=#
> 
> > 2) I suspect calculating all possible values would require time and an
> > enormous cache buffer in size as well as re-calculating pretty often.
> > So my approach would be trying to tune the query before introducing
> > cached results.
> 
> Yes, it's pretty extensive: for each element (there are 9000), get me
> all the watch lists it appears upon, then get me the top 5 elements
> on *those* watch lists.
> --

So your original query outruns my "EXISTS" version
by a factor of 2. I did not realize the "IN" had
become so much faster.
I am running out of clues.
Maybe the [PERFORMANCE] list has more
qualified approaches.
Regards, Christoph


Re: people who buy A, also buy C, D, E

From
"Ramakrishnan Muralidharan"
Date:
Hi
 I am a bit confused. If you want to display the first 5, the following query will fetch the top 5 book ids. I am not able to understand why there is a sub-query.

 SELECT ELEMENT_ID, COUNT(*)
   FROM WATCH_LIST_ELEMENT
  GROUP BY ELEMENT_ID
  ORDER BY COUNT(*) DESC
  LIMIT 5;

Regards,
R.Muralidharan

-----Original Message-----
From: pgsql-sql-owner@postgresql.org
[mailto:pgsql-sql-owner@postgresql.org]On Behalf Of Dan Langille
Sent: Tuesday, April 26, 2005 7:52 AM
To: pgsql-sql@postgresql.org
Cc: dan@langille.org
Subject: [SQL] people who buy A, also buy C, D, E


The goal of my query is: given a book, what did other people who
bought this book also buy?  I plan to list the 5 most popular such
books.  In reality, this isn't about books, but I think that makes it
easier to understand.

We have a table of customer_id (watch_list_id) and book_id
(element_id).

freshports.org=# \d watch_list_element
  Table "public.watch_list_element"
    Column     |  Type   | Modifiers
---------------+---------+-----------
 watch_list_id | integer | not null
 element_id    | integer | not null
Indexes:
    "watch_list_element_pkey" primary key, btree (watch_list_id, element_id)
    "watch_list_element_element_id" btree (element_id)
Foreign-key constraints:
    "$2" FOREIGN KEY (watch_list_id) REFERENCES watch_list(id) ON UPDATE CASCADE ON DELETE CASCADE
    "$1" FOREIGN KEY (element_id) REFERENCES element(id) ON UPDATE CASCADE ON DELETE CASCADE

freshports.org=#

I have a query which returns the needed results:
     SELECT W.element_id
       FROM watch_list_element W
      WHERE w.watch_list_id in (select watch_list_id from
            watch_list_element where element_id = 54968)
   GROUP BY W.element_id
   ORDER BY count(W.watch_list_id) DESC
      LIMIT 5;

But performance is an issue here.  So I'm planning to calculate all
the possible values and cache them. That is, given each element_id in
a watch_list, what are the top 5 element_id values on all the lists
on which the original element_id appears?

I'm having trouble constructing the query.  I'm not even sure I can
do this in one select, but that would be nice.  Examples and clues
are appreciated.

Any ideas?

Thank you.
--
Dan Langille : http://www.langille.org/
BSDCan - The Technical BSD Conference - http://www.bsdcan.org/  NEW brochure available at
http://www.bsdcan.org/2005/advocacy/




Re: people who buy A, also buy C, D, E

From
"Greg Sabino Mullane"
Date:
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1


> The goal of my query is: given a book, what did other people who
> bought this book also buy?  I plan the list the 5 most popular such
> books.

I've been playing with this a little bit, and I don't think you are
going to get better than you already have. Certainly, the caching
won't work either, as any insert into watch_list_element has the
potential to change a very large number of pre-compiled lists.
However, there are some minor optimizations that can be made to
speed up the existing query quite a bit. One note first: the LIMIT
should be 6, not 5, if you really want the five other "books", since
the book itself will more than likely appear in the list. Picking it
out is something the client app can do.
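
If you would rather keep LIMIT 5, a sketch of the same query with the
starting book filtered out directly (reusing the element id from
earlier in the thread):

SELECT W.element_id
  FROM watch_list_element W
 WHERE W.watch_list_id IN (SELECT watch_list_id
                             FROM watch_list_element
                            WHERE element_id = 54968)
   AND W.element_id <> 54968   -- drop the book you started from
 GROUP BY W.element_id
 ORDER BY count(W.watch_list_id) DESC
 LIMIT 5;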

* Make sure the tables are freshly analyzed. Might want to bump
up the default stats a bit too (a sketch of this follows below).

* Looks like you already have indexes on the watch_list_element
table. The watch_list_element_element_id index could be broken
into multiple conditional indexes, but your explain shows this
would not really gain us much:

actual time=37.957..41.789

* One big gain would be to cluster the table on watch_list_id:

CREATE INDEX watch_index ON watch_list_element (watch_list_id);
CLUSTER watch_index ON watch_list_element;

I got about a 25% speedup on my queries by doing this. YMMV, as I
don't know enough about your conditions to do more than make an
approximate test database. But it should help this query out.

* Finally, you should upgrade if at all possible. Going from
7.4.7 to 8.0.1 gave me a 10% speed increase, while going from
8.0.1 to 8.1.0 (i.e. the upcoming version) gave me an additional
25% speed boost, mostly due to the new bitmap stuff. So, making
the jump to 8.0.1 will be good practice for the 8.1.0 jump, right? :)
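
For the first point, a minimal sketch of bumping the per-column
statistics target and re-analyzing (the target of 100 is only an
example; 7.4's default is 10):

ALTER TABLE watch_list_element ALTER COLUMN watch_list_id SET STATISTICS 100;
ALTER TABLE watch_list_element ALTER COLUMN element_id    SET STATISTICS 100;
ANALYZE watch_list_element;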

Overall, I was able to get the query to go about a third faster
than when I started. Hope this helps.

- --
Greg Sabino Mullane greg@turnstep.com
PGP Key: 0x14964AC8 200506242328
http://biglumber.com/x/web?pk=2529DF6AB8F79407E94445B4BC9B906714964AC8

-----BEGIN PGP SIGNATURE-----

iD8DBQFCvNCrvJuQZxSWSsgRAmkDAJ44z/Ei27HuEBqx/htmCRHJZWi8VQCfV2mm
upeE0p3z4h11NJzl5aOqCkc=
=LVqI
-----END PGP SIGNATURE-----




Re: people who buy A, also buy C, D, E

From
Jan Wieck
Date:
On 6/24/2005 11:35 PM, Greg Sabino Mullane wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
> 
> 
>> The goal of my query is: given a book, what did other people who
>> bought this book also buy?  I plan the list the 5 most popular such
>> books.

This sounds very much like one of the congestion problems given in the 
TPC-W benchmark. You might want to take a look at some of the published 
full disclosure reports or the PHP TPC-W implementation on pgfoundry to 
get some hints.


Jan

> 
> I've been playing with this a little bit, and I don't think you are
> going to get better than you already have. Certainly, the caching
> won't work either as any insert into the watch_list_element has
> the potential to change a very large number of pre-compiled lists.
> However, there are some minor optimizations that can be made to
> speed up the existing query quite a bit. One note first: the LIMIT
> should be 6 not 5 if you really want the five other "books" and the
> book itself will more than likely appear in the list. Picking it
> out is something the client app can do.
> 
> * Make sure the tables are freshly analyzed. Might want to bump
> up the default stats a bit too.
> 
> * Looks like you already have indexes on the watch_list_element
> table. The watch_list_element_element_id index could be broken
> into multiple conditional indexes, but your explain shows this
> would not really gain us much:
> 
> actual time=37.957..41.789
> 
> * One big gain would be to cluster the table on watch_list_id:
> 
> CREATE INDEX watch_index ON watch_list_element (watch_list_id);
> CLUSTER watch_index ON watch_list_element;
> 
> I got about a 25% speedup on my queries by doing this. YMMV, as I
> don't know enough about your conditions to do more than make an
> approximate test database. But it should help this query out.
> 
> * Finally, you should upgrade if at all possible. Going from
> 7.4.7 to 8.0.1 gave me a 10% speed increase, while going from
> 8.0.1 to 8.1.0 (e.g. the upcoming version) gave me an additional
> 25% speed boost, mostly due to the new bitmap stuff. So, making
> the jump to 8.0.1 will be good practice for the 8.1.0 jump, right? :)
> 
> Overall, I was able to get the query to go about a third faster
> than when I started. Hope this helps.
> 
> - --
> Greg Sabino Mullane greg@turnstep.com
> PGP Key: 0x14964AC8 200506242328
> http://biglumber.com/x/web?pk=2529DF6AB8F79407E94445B4BC9B906714964AC8
> 
> -----BEGIN PGP SIGNATURE-----
> 
> iD8DBQFCvNCrvJuQZxSWSsgRAmkDAJ44z/Ei27HuEBqx/htmCRHJZWi8VQCfV2mm
> upeE0p3z4h11NJzl5aOqCkc=
> =LVqI
> -----END PGP SIGNATURE-----
> 
> 
> 


-- 
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck@Yahoo.com #



Re: people who buy A, also buy C, D, E

From
PFC
Date:

>>> The goal of my query is: given a book, what did other people who
>>> bought this book also buy?  I plan the list the 5 most popular such
>>> books.
You can use the table listing ordered products directly, for example :

table ordered_products:    order_id,  product_id,  quantity

SELECT b.product_id, sum(quantity) as rank
  FROM ordered_products a, ordered_products b
 WHERE a.product_id = (the product id)
   AND b.order_id = a.order_id
   AND b.product_id != a.product_id
 GROUP BY b.product_id
 ORDER BY rank DESC
 LIMIT 6;

This will need indexes on order_id and product_id that you probably  
already have.
It will also be slow.
You can also have a cache table :

cache    prod_id_a, prod_id_b, quantity
With a constraint that prod_id_a < prod_id_b

You add a trigger on insert, update or delete to ordered_products to  
insert or update rows in this table, modifying the quantity according to  
the purchase.
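
A minimal sketch of such a trigger, handling only INSERTs on
ordered_products and written with a plain quoted function body so it
also runs on 7.4 (all names are illustrative):

-- keep the pair counts up to date as order rows are inserted;
-- UPDATEs and DELETEs on ordered_products are not handled here
CREATE OR REPLACE FUNCTION update_pair_cache() RETURNS trigger AS '
DECLARE
    other RECORD;
    a integer;
    b integer;
BEGIN
    FOR other IN SELECT product_id
                   FROM ordered_products
                  WHERE order_id = NEW.order_id
                    AND product_id <> NEW.product_id LOOP
        -- keep the prod_id_a < prod_id_b convention
        IF NEW.product_id < other.product_id THEN
            a := NEW.product_id;  b := other.product_id;
        ELSE
            a := other.product_id;  b := NEW.product_id;
        END IF;
        UPDATE cache SET quantity = quantity + NEW.quantity
         WHERE prod_id_a = a AND prod_id_b = b;
        IF NOT FOUND THEN
            INSERT INTO cache (prod_id_a, prod_id_b, quantity)
            VALUES (a, b, NEW.quantity);
        END IF;
    END LOOP;
    RETURN NEW;
END;
' LANGUAGE plpgsql;

CREATE TRIGGER ordered_products_pair_cache
    AFTER INSERT ON ordered_products
    FOR EACH ROW EXECUTE PROCEDURE update_pair_cache();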

To select you do :

SELECT * FROM
(
(SELECT prod_id_b as pid, quantity FROM cache WHERE prod_id_a=(your id)  
ORDER BY prod_id_a DESC, quantity DESC LIMIT 5)
UNION ALL
(SELECT prod_id_a as pid, quantity FROM cache WHERE prod_id_b=(your id)  
ORDER BY prod_id_b DESC, quantity DESC LIMIT 5)
) as foo
ORDER BY quantity DESC
LIMIT 5;

It will probably be very fast, but the table will grow huge and need
various indexes:
(prod_id_a, quantity)
(prod_id_b quantity)
(prod_id_a, prod_id_b)    (the primary key)

You'll get 1/2 * N * (N-1) rows, N being the number of products on your  
site. If you remove the constraint  prod_id_a < prod_id_b you'll get N^2 rows which is worse.
Another solution :

Table cache : product_id integer, also_purchased integer[]

After every order, update also_purchased with the results of the query
using the self join on the ordered_products table above.
This query is probably not fast enough to use in a product webpage, but
it should not be too slow to use like this, i.e. only when orders are made.

To get the "also purchased products" all you have to do is read a line in  
this table.
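
A sketch of that refresh for one product, using an ARRAY(...)
subquery (the cache table name here is illustrative, and 12345 stands
for whichever product the new order touched):

-- rebuild the cached "also purchased" list for one product
UPDATE product_cache
   SET also_purchased = ARRAY(
        SELECT b.product_id
          FROM ordered_products a, ordered_products b
         WHERE a.product_id = product_cache.product_id
           AND b.order_id   = a.order_id
           AND b.product_id <> a.product_id
         GROUP BY b.product_id
         ORDER BY sum(b.quantity) DESC
         LIMIT 5)
 WHERE product_id = 12345;   -- the product just ordered (illustrative)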


Re: people who buy A, also buy C, D, E

From
"Greg Sabino Mullane"
Date:
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1



>>> The goal of my query is: given a book, what did other people who
>>> bought this book also buy?  I plan the list the 5 most popular such
>>> books.
>
> SELECT b.product_id, sum(quantity) as rank FROM ordered_products a,
> ordered_products b WHERE a.product_id=(the product id) AND
> b.order_id=a.order_id AND b.product_id != a.product_id GROUP BY
> b.product_id ORDER BY rank DESC LIMIT 6;

I don't think this is exactly what the original poster had in mind:
we want a ranking of a dynamically generated subset of all possible
products (e.g. books). So if someone buys "Harry Potter and the Proprietary
Database", then only the books bought by people who also bought /that/
book are considered, ranked, and ordered. There's not a lot of caching that
can be effectively done, due to the high number of combinations and large
potential for change.

> table ordered_products: order_id,  product_id,  quantity

I'm not sure where you are getting "quantity" from: as near as I
can tell, this will always be a quantity of 1: one person ordering
one item.
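
Translated back to the schema at the top of the thread, where every
row effectively carries a quantity of 1, the self-join reduces to a
plain row count (a sketch, reusing the element id from the earlier
examples):

SELECT b.element_id, count(*) AS rank
  FROM watch_list_element a, watch_list_element b
 WHERE a.element_id = 54968
   AND b.watch_list_id = a.watch_list_id
   AND b.element_id <> a.element_id
 GROUP BY b.element_id
 ORDER BY rank DESC
 LIMIT 5;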

- --
Greg Sabino Mullane greg@turnstep.com
PGP Key: 0x14964AC8 200506281946
http://biglumber.com/x/web?pk=2529DF6AB8F79407E94445B4BC9B906714964AC8


-----BEGIN PGP SIGNATURE-----

iD8DBQFCweKavJuQZxSWSsgRAkmHAJ9fQ+Degs6jSrGRozEoI35F8nlyBACfYm2u
QgawxHOij5FHVd0FopW25IU=
=r5eo
-----END PGP SIGNATURE-----