It seems a user visit cookie's expiration has been improperly set and so the
tables which represent a user's clickstream through the website are grouped
improperly. I'd like to run a set-wise query to regroup them based on periods
of inactivity greather than 1 hour. So, what we want to do is adjust the
visit_id's in the page_view table to point to separate visits whenever the
user has been idle for more than 1 hour. The query I've come up with looks
like this:
CREATE TABLE visit_clean AS
SELECT id , CASE WHEN stamp - last_stamp > INTERVAL '1 hour' OR last_stamp IS NULL THEN
nextval('tracking_cleanup_seq') ELSE currval('tracking_cleanup_seq') END AS visit_id , visit_id AS
old_visit_id , uri , params , stamp , cindex , tindex , method , source_address ,
server_name FROM ( SELECT id , visit_id , uri , params , stamp , cindex ,
tindex , method , source_address , ( SELECT max(pv2.stamp) FROM page_view pv2
WHERE pv2.visit_id = pv1.visit_id AND pv2.stamp < pv1.stamp ) AS last_stamp
, server_name FROM page_view pv1 ) x;
It works reasonably well on a small data set. But, the plan for this is
horrible on the real data.
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------
SeqScan on page_view pv1 (cost=0.00..2634763281.70 rows=3588374 width=239) SubPlan -> Result
(cost=367.09..367.10rows=1 width=0) InitPlan -> Limit (cost=0.00..367.09 rows=1 width=8)
-> Index Scan Backward using page_view_stamp_idx on page_view pv2 (cost=0.00..158215.86 rows=431 width=8)
Index Cond: (stamp < $1) Filter: ((stamp IS NOT NULL) AND (visit_id = $0))
-> Result (cost=367.09..367.10 rows=1 width=0) InitPlan -> Limit (cost=0.00..367.09 rows=1
width=8) -> Index Scan Backward using page_view_stamp_idx on page_view pv2 (cost=0.00..158215.86
rows=431width=8) Index Cond: (stamp < $1) Filter: ((stamp IS NOT NULL)
AND(visit_id = $0))
The page_view table is 829MB in size and has 3,590,185 rows.
I let the query run on my test server for about 26hrs without finishing, so I
can't provide explain analyze output yet.
Does anyone have a better method of separating this data out?
---
Jeff Frost, Owner <jeff@frostconsultingllc.com>
Frost Consulting, LLC http://www.frostconsultingllc.com/
Phone: 650-780-7908 FAX: 650-649-1954