Thread: separating improperly grouped page views
It seems a user visit cookie's expiration has been improperly set and so the tables which represent a user's clickstream through the website are grouped improperly. I'd like to run a set-wise query to regroup them based on periods of inactivity greather than 1 hour. So, what we want to do is adjust the visit_id's in the page_view table to point to separate visits whenever the user has been idle for more than 1 hour. The query I've come up with looks like this: CREATE TABLE visit_clean AS SELECT id , CASE WHEN stamp - last_stamp > INTERVAL '1 hour' OR last_stamp IS NULL THEN nextval('tracking_cleanup_seq') ELSE currval('tracking_cleanup_seq') END AS visit_id , visit_id AS old_visit_id , uri , params , stamp , cindex , tindex , method , source_address , server_name FROM ( SELECT id , visit_id , uri , params , stamp , cindex , tindex , method , source_address , ( SELECT max(pv2.stamp) FROM page_view pv2 WHERE pv2.visit_id = pv1.visit_id AND pv2.stamp < pv1.stamp ) AS last_stamp , server_name FROM page_view pv1 ) x; It works reasonably well on a small data set. But, the plan for this is horrible on the real data. QUERY PLAN ------------------------------------------------------------------------------------------------------------------------------- SeqScan on page_view pv1 (cost=0.00..2634763281.70 rows=3588374 width=239) SubPlan -> Result (cost=367.09..367.10rows=1 width=0) InitPlan -> Limit (cost=0.00..367.09 rows=1 width=8) -> Index Scan Backward using page_view_stamp_idx on page_view pv2 (cost=0.00..158215.86 rows=431 width=8) Index Cond: (stamp < $1) Filter: ((stamp IS NOT NULL) AND (visit_id = $0)) -> Result (cost=367.09..367.10 rows=1 width=0) InitPlan -> Limit (cost=0.00..367.09 rows=1 width=8) -> Index Scan Backward using page_view_stamp_idx on page_view pv2 (cost=0.00..158215.86 rows=431width=8) Index Cond: (stamp < $1) Filter: ((stamp IS NOT NULL) AND(visit_id = $0)) The page_view table is 829MB in size and has 3,590,185 rows. I let the query run on my test server for about 26hrs without finishing, so I can't provide explain analyze output yet. Does anyone have a better method of separating this data out? --- Jeff Frost, Owner <jeff@frostconsultingllc.com> Frost Consulting, LLC http://www.frostconsultingllc.com/ Phone: 650-780-7908 FAX: 650-649-1954
Jeff Frost <jeff@frostconsultingllc.com> writes: > [ umpteen million iterations of: ] > -> Limit (cost=0.00..367.09 rows=1 width=8) > -> Index Scan Backward using page_view_stamp_idx on page_view pv2 (cost=0.00..158215.86 rows=431width=8) > Index Cond: (stamp < $1) > Filter: ((stamp IS NOT NULL) AND (visit_id = $0)) Perhaps an index on (visit_id, stamp) would help. This one is doing the best it can, but if the visit_id's you want are thinly scattered, it'll still suck... regards, tom lane
On Mon, 18 Jun 2007, Tom Lane wrote: > Jeff Frost <jeff@frostconsultingllc.com> writes: >> [ umpteen million iterations of: ] >> -> Limit (cost=0.00..367.09 rows=1 width=8) >> -> Index Scan Backward using page_view_stamp_idx on page_view pv2 (cost=0.00..158215.86 rows=431width=8) >> Index Cond: (stamp < $1) >> Filter: ((stamp IS NOT NULL) AND (visit_id = $0)) > > Perhaps an index on (visit_id, stamp) would help. This one is doing the > best it can, but if the visit_id's you want are thinly scattered, it'll > still suck... Good idea Tom! In fact the planner seems to like that much better: Seq Scan on page_view pv1 (cost=0.00..11529031.34 rows=3580205 width=239) vs Seq Scan on page_view pv1 (cost=0.00..2622541458.55 rows=3596473 width=237) QUERY PLAN ------------------------------------------------------------------------------------------------------------------------------------- SeqScan on page_view pv1 (cost=0.00..11529031.34 rows=3580205 width=239) SubPlan -> Result (cost=1.58..1.59 rows=1width=0) InitPlan -> Limit (cost=0.00..1.58 rows=1 width=8) -> Index ScanBackward using page_view_visit_id_stamp_idx on page_view pv2 (cost=0.00..625.39 rows=397 width=8) Index Cond: ((visit_id = $0) AND (stamp < $1)) Filter: (stamp IS NOT NULL) -> Result (cost=1.58..1.59 rows=1 width=0) InitPlan -> Limit (cost=0.00..1.58 rows=1 width=8) -> Index Scan Backward using page_view_visit_id_stamp_idx on page_view pv2 (cost=0.00..625.39 rows=397 width=8) Index Cond: ((visit_id = $0) AND (stamp < $1)) Filter: (stamp IS NOT NULL) (14 rows) Compared to: ------------------------------------------------------------------------------------------------------------------------------- SeqScan on page_view pv1 (cost=0.00..2622541458.55 rows=3596473 width=237) SubPlan -> Result (cost=364.56..364.57rows=1 width=0) InitPlan -> Limit (cost=0.00..364.56 rows=1 width=8) -> Index Scan Backward using page_view_stamp_idx on page_view pv2 (cost=0.00..153481.58 rows=421 width=8) Index Cond: (stamp < $1) Filter: ((stamp IS NOT NULL) AND (visit_id = $0)) -> Result (cost=364.56..364.57 rows=1 width=0) InitPlan -> Limit (cost=0.00..364.56 rows=1 width=8) -> Index Scan Backward using page_view_stamp_idx on page_view pv2 (cost=0.00..153481.58 rows=421width=8) Index Cond: (stamp < $1) Filter: ((stamp IS NOT NULL) AND(visit_id = $0)) (14 rows) -- Jeff Frost, Owner <jeff@frostconsultingllc.com> Frost Consulting, LLC http://www.frostconsultingllc.com/ Phone: 650-780-7908 FAX: 650-649-1954
On Sun, 17 Jun 2007, Jeff Frost wrote: > ------------------------------------------------------------------------------------------------------------------------------------- > Seq Scan on page_view pv1 (cost=0.00..11529031.34 rows=3580205 width=239) > SubPlan > -> Result (cost=1.58..1.59 rows=1 width=0) > InitPlan > -> Limit (cost=0.00..1.58 rows=1 width=8) > -> Index Scan Backward using page_view_visit_id_stamp_idx > on page_view pv2 (cost=0.00..625.39 rows=397 width=8) > Index Cond: ((visit_id = $0) AND (stamp < $1)) > Filter: (stamp IS NOT NULL) > -> Result (cost=1.58..1.59 rows=1 width=0) > InitPlan > -> Limit (cost=0.00..1.58 rows=1 width=8) > -> Index Scan Backward using page_view_visit_id_stamp_idx > on page_view pv2 (cost=0.00..625.39 rows=397 width=8) > Index Cond: ((visit_id = $0) AND (stamp < $1)) > Filter: (stamp IS NOT NULL) > (14 rows) > > Compared to: > > > ------------------------------------------------------------------------------------------------------------------------------- > Seq Scan on page_view pv1 (cost=0.00..2622541458.55 rows=3596473 width=237) > SubPlan > -> Result (cost=364.56..364.57 rows=1 width=0) > InitPlan > -> Limit (cost=0.00..364.56 rows=1 width=8) > -> Index Scan Backward using page_view_stamp_idx on > page_view pv2 (cost=0.00..153481.58 rows=421 width=8) > Index Cond: (stamp < $1) > Filter: ((stamp IS NOT NULL) AND (visit_id = $0)) > -> Result (cost=364.56..364.57 rows=1 width=0) > InitPlan > -> Limit (cost=0.00..364.56 rows=1 width=8) > -> Index Scan Backward using page_view_stamp_idx on > page_view pv2 (cost=0.00..153481.58 rows=421 width=8) > Index Cond: (stamp < $1) > Filter: ((stamp IS NOT NULL) AND (visit_id = $0)) > (14 rows) And throwing the ORDER BY back in reduces the cost even more! QUERY PLAN ------------------------------------------------------------------------------------------------------------------------------------------- SubqueryScan x (cost=0.00..5815824.15 rows=3629753 width=1186) -> Index Scan using page_view_visit_idx on page_view pv1 (cost=0.00..5743229.09 rows=3629753 width=237) SubPlan -> Result (cost=1.51..1.52 rows=1 width=0) InitPlan -> Limit (cost=0.00..1.51 rows=1 width=8) -> Index Scan Backward using page_view_visit_id_stamp_idx on page_view pv2 (cost=0.00..608.41 rows=402 width=8) Index Cond: ((visit_id = $0) AND (stamp < $1)) Filter: (stamp IS NOTNULL) (9 rows) Now we only have to do that index scan once. :-) I had foolishly taken that out to see if the sort was killing me and forgot to put it back in. So now it's: Subquery Scan x (cost=0.00..5815824.15 rows=3629753 width=1186) vs Seq Scan on page_view pv1 (cost=0.00..11529031.34 rows=3580205 width=239) vs Seq Scan on page_view pv1 (cost=0.00..2622541458.55 rows=3596473 width=237) -- Jeff Frost, Owner <jeff@frostconsultingllc.com> Frost Consulting, LLC http://www.frostconsultingllc.com/ Phone: 650-780-7908 FAX: 650-649-1954
On Sun, 17 Jun 2007, Jeff Frost wrote: > On Mon, 18 Jun 2007, Tom Lane wrote: > >> Jeff Frost <jeff@frostconsultingllc.com> writes: >>> [ umpteen million iterations of: ] >>> -> Limit (cost=0.00..367.09 rows=1 width=8) >>> -> Index Scan Backward using page_view_stamp_idx on >>> page_view pv2 (cost=0.00..158215.86 rows=431 width=8) >>> Index Cond: (stamp < $1) >>> Filter: ((stamp IS NOT NULL) AND (visit_id = >>> $0)) >> >> Perhaps an index on (visit_id, stamp) would help. This one is doing the >> best it can, but if the visit_id's you want are thinly scattered, it'll >> still suck... > > Good idea Tom! In fact the planner seems to like that much better: > > Seq Scan on page_view pv1 (cost=0.00..11529031.34 rows=3580205 width=239) > vs > Seq Scan on page_view pv1 (cost=0.00..2622541458.55 rows=3596473 width=237) Now that this is working in a reasonable amount of time to process the entire data set, I need to work out how to process the new information that comes in every so often and still assign it a correct visit_id until the fix for this can get through QA. The cleanup query looks like this: CREATE TEMP TABLE tmpmaxpvid AS SELECT MAX(id) AS id FROM reporting.page_view_clean; INSERT INTO reporting.page_view_clean SELECT id , CASE WHEN stamp - last_stamp > INTERVAL '1 hour' OR last_stamp IS NULL THEN nextval('reporting.tracking_cleanup_seq') ELSE currval('reporting.tracking_cleanup_seq') END AS visit_id ,uri , params , stamp , visit_id AS old_visit_id FROM ( SELECT id , visit_id , uri ,params , stamp , ( SELECT MAX(pv2.stamp) FROM page_view pv2 WHERE pv2.visit_id = pv1.visit_id AND pv2.stamp < pv1.stamp ) AS last_stamp FROM page_view pv1 WHERE pv1.stamp < now() - INTERVAL '1 hour' AND pv1.id > ( SELECT CASE WHEN id IS NULL THEN 0 ELSE id END AS id FROM tmpmaxpvid ) ORDER BY pv1.visit_id , pv1.stamp ) x; The problem is the page_views that straddle the time before and after the query get an artificially different visit_id. This case statement is what kills me: CASE WHEN stamp - last_stamp > INTERVAL '1 hour' OR last_stamp IS NULL THEN nextval('reporting.tracking_cleanup_seq') ELSE currval('reporting.tracking_cleanup_seq') END AS visit_id If I change it to this: CASE WHEN last_stamp IS NULL THEN visit_id WHEN stamp - last_stamp > INTERVAL '1 hour' THEN nextval('reporting.tracking_cleanup_seq') ELSE currval('reporting.tracking_cleanup_seq') END AS visit_id Then it gives the first page_view in the sequence the correct visit_id, but then continues on with a different visit_id for the rest. I've also tried changing how I select which data to act on like so: WHERE pv1.id > ( SELECT CASE WHEN id IS NULL THEN 0 ELSE id END AS id FROM tmpmaxpvid ) AND pv1.visit_id IN ( SELECT visit_id FROM page_viewpv3 GROUP BY visit_id HAVING max(stamp) < now() - INTERVAL '1 hour' ) ORDERBY pv1.visit_id , pv1.stamp ) x; But that leaves me skipping some page views because they haven't reached their conclusion and because of the id > clause, I'll never go back to see them. Anyone have any suggestions? -- Jeff Frost, Owner <jeff@frostconsultingllc.com> Frost Consulting, LLC http://www.frostconsultingllc.com/ Phone: 650-780-7908 FAX: 650-649-1954