separating improperly grouped page views - Mailing list pgsql-sql

From Jeff Frost
Subject separating improperly grouped page views
Date
Msg-id Pine.LNX.4.64.0706172157410.14506@discord.home.frostconsultingllc.com
Whole thread Raw
Responses Re: separating improperly grouped page views  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-sql
It seems a user visit cookie's expiration has been improperly set and so the 
tables which represent a user's clickstream through the website are grouped 
improperly.  I'd like to run a set-wise query to regroup them based on periods 
of inactivity greather than 1 hour.  So, what we want to do is adjust the 
visit_id's in the page_view table to point to separate visits whenever the 
user has been idle for more than 1 hour. The query I've come up with looks 
like this:

CREATE TABLE visit_clean AS
SELECT id     ,       CASE       WHEN stamp - last_stamp > INTERVAL '1 hour'    OR last_stamp IS NULL THEN
nextval('tracking_cleanup_seq')      ELSE currval('tracking_cleanup_seq')       END AS visit_id     , visit_id AS
old_visit_id    , uri     , params     , stamp     , cindex     , tindex     , method     , source_address     ,
server_name FROM (    SELECT id        , visit_id        , uri        , params        , stamp        , cindex        ,
tindex       , method        , source_address        , (        SELECT max(pv2.stamp)          FROM page_view pv2
WHERE pv2.visit_id            = pv1.visit_id          AND pv2.stamp               < pv1.stamp          ) AS last_stamp
     , server_name      FROM page_view pv1       ) x;
 

It works reasonably well on a small data set.  But, the plan for this is 
horrible on the real data.
                                                          QUERY PLAN

-------------------------------------------------------------------------------------------------------------------------------
SeqScan on page_view pv1  (cost=0.00..2634763281.70 rows=3588374 width=239)   SubPlan     ->  Result
(cost=367.09..367.10rows=1 width=0)           InitPlan             ->  Limit  (cost=0.00..367.09 rows=1 width=8)
          ->  Index Scan Backward using page_view_stamp_idx on page_view pv2  (cost=0.00..158215.86 rows=431 width=8)
                     Index Cond: (stamp < $1)                         Filter: ((stamp IS NOT NULL) AND (visit_id = $0))
   ->  Result  (cost=367.09..367.10 rows=1 width=0)           InitPlan             ->  Limit  (cost=0.00..367.09 rows=1
width=8)                  ->  Index Scan Backward using page_view_stamp_idx on page_view pv2  (cost=0.00..158215.86
rows=431width=8)                         Index Cond: (stamp < $1)                         Filter: ((stamp IS NOT NULL)
AND(visit_id = $0))
 

The page_view table is 829MB in size and has 3,590,185 rows.

I let the query run on my test server for about 26hrs without finishing, so I 
can't provide explain analyze output yet.

Does anyone have a better method of separating this data out?

---
Jeff Frost, Owner       <jeff@frostconsultingllc.com>
Frost Consulting, LLC   http://www.frostconsultingllc.com/
Phone: 650-780-7908     FAX: 650-649-1954



pgsql-sql by date:

Previous
From: Andrew Sullivan
Date:
Subject: Re: joining a table whose name is stored in the primary record
Next
From: Tom Lane
Date:
Subject: Re: separating improperly grouped page views