Re: Removing redundant itemsets - Mailing list pgsql-sql

From Craig Ringer
Subject Re: Removing redundant itemsets
Date
Msg-id 47F0CB12.5010506@postnewspapers.com.au
Whole thread Raw
In response to Re: Removing redundant itemsets  (Craig Ringer <craig@postnewspapers.com.au>)
Responses Re: Removing redundant itemsets  (Craig Ringer <craig@postnewspapers.com.au>)
List pgsql-sql
Craig Ringer wrote:
> Allan Kamau wrote:
>> Hi all,
>> I have a list of purchases (market basket) and I would like to select
>> non redundant longest possible patterns by eliminating
>> (creating/populating other table to contain only non redandant itemsets)
>> purchases having item lists which are fully included in at least one
>> other purchase.
> 
> Here's a possibly slow and surely ugly solution (I think it's right,
> though I haven't done more than passing testing):
> 
> 
> 
> CREATE VIEW togo_as_arr AS
>   SELECT a.tid,
>     ARRAY(SELECT item FROM togo b WHERE b.tid = a.tid ORDER BY item)
>     AS items
>   FROM togo a GROUP BY tid;
> 
> SELECT arr_a.tid AS redundant_tid, arr_b.tid AS contained_by
> FROM togo_as_arr arr_a CROSS JOIN togo_as_arr arr_b
> WHERE arr_a.tid <> arr_b.tid AND arr_a.items <@ arr_b.items;

Alternately:

-- Helps with the massively repeated subquery below
CREATE INDEX togo_by_tid_and_item ON togo(tid,item);

-- Find any `a' for which `item_from_a_is_in_b' is
-- true for all items in `a'
SELECT a_tid AS is_redundant, b_tid AS contained_by
FROM ( -- For every item in every pair of purchases, -- determine whether the item in purchase `a' -- was also in
purchase`b'. SELECT   a.tid AS a_tid,   b.tid AS b_tid,   a.item AS item,   EXISTS(     -- Was this item from `a' also
inthe `b' purchase?     SELECT 1 FROM togo x WHERE x.tid = b.tid AND x.item = a.item   ) AS item_from_a_is_in_b FROM
togoa INNER JOIN togo b ON (a.tid <> b.tid) GROUP BY a.tid, b.tid, a.item) AS item_containment
 
GROUP BY a_tid, b_tid
HAVING every(item_from_a_is_in_b);


... which avoids the array building, but is actually considerably slower
on the trivial test data. That's not too surprising given that this
approach requires a subquery:
  SELECT 1 FROM togo x WHERE x.tid = b.tid AND x.item = a.item

for EVERY item to be tested. Twice, actually, as each record appears in
both `a' and `b' positions.

I'd be very interested to see what happened on real world test data,
especially compared to doing the array accumulation based query off a
temporary table instead of a view.

I suspect it'll depend on the average number of items per purchase -
lots of items per purchase and the array building cost will dominate.
That's really just a guess, though.

I'm sure there's a properly smart way to do this that I just can't
figure out, but this is the best I've come up with so far.

--
Craig Ringer


pgsql-sql by date:

Previous
From: Craig Ringer
Date:
Subject: Re: Removing redundant itemsets
Next
From: Craig Ringer
Date:
Subject: Re: Removing redundant itemsets