Re: separating improperly grouped page views - Mailing list pgsql-sql

From Jeff Frost
Subject Re: separating improperly grouped page views
Date
Msg-id Pine.LNX.4.64.0707051007410.31414@discord.home.frostconsultingllc.com
Whole thread Raw
In response to Re: separating improperly grouped page views  (Jeff Frost <jeff@frostconsultingllc.com>)
List pgsql-sql
On Sun, 17 Jun 2007, Jeff Frost wrote:

> On Mon, 18 Jun 2007, Tom Lane wrote:
>
>> Jeff Frost <jeff@frostconsultingllc.com> writes:
>>> [ umpteen million iterations of: ]
>>>               ->  Limit  (cost=0.00..367.09 rows=1 width=8)
>>>                     ->  Index Scan Backward using page_view_stamp_idx on 
>>> page_view pv2  (cost=0.00..158215.86 rows=431 width=8)
>>>                           Index Cond: (stamp < $1)
>>>                           Filter: ((stamp IS NOT NULL) AND (visit_id = 
>>> $0))
>> 
>> Perhaps an index on (visit_id, stamp) would help.  This one is doing the
>> best it can, but if the visit_id's you want are thinly scattered, it'll
>> still suck...
>
> Good idea Tom!  In fact the planner seems to like that much better:
>
> Seq Scan on page_view pv1  (cost=0.00..11529031.34 rows=3580205 width=239)
> vs
> Seq Scan on page_view pv1  (cost=0.00..2622541458.55 rows=3596473 width=237)

Now that this is working in a reasonable amount of time to process the entire 
data set, I need to work out how to process the new information that comes in 
every so often and still assign it a correct visit_id until the fix for this 
can get through QA.  The cleanup query looks like this:

CREATE TEMP TABLE tmpmaxpvid AS SELECT MAX(id) AS id FROM 
reporting.page_view_clean;

INSERT INTO reporting.page_view_clean
SELECT id     ,       CASE       WHEN stamp - last_stamp > INTERVAL '1 hour'    OR last_stamp IS NULL THEN
nextval('reporting.tracking_cleanup_seq')      ELSE currval('reporting.tracking_cleanup_seq')       END AS visit_id
,uri     , params     , stamp     , visit_id AS old_visit_id  FROM (    SELECT id        , visit_id        , uri
,params        , stamp        , (        SELECT MAX(pv2.stamp)          FROM page_view pv2        WHERE pv2.visit_id
       = pv1.visit_id          AND pv2.stamp               < pv1.stamp          ) AS last_stamp      FROM page_view pv1
  WHERE pv1.stamp               < now() - INTERVAL '1 hour'      AND pv1.id                  > (        SELECT CASE
                WHEN id IS NULL THEN 0                      ELSE id                      END AS id          FROM
tmpmaxpvid         )    ORDER BY pv1.visit_id        , pv1.stamp       ) x;
 


The problem is the page_views that straddle the time before and after the 
query get an artificially different visit_id.

This case statement is what kills me:
       CASE       WHEN stamp - last_stamp > INTERVAL '1 hour'    OR last_stamp IS NULL THEN
nextval('reporting.tracking_cleanup_seq')      ELSE currval('reporting.tracking_cleanup_seq')       END AS visit_id
 

If I change it to this:
       CASE       WHEN last_stamp IS NULL         THEN visit_id       WHEN stamp - last_stamp > INTERVAL '1 hour'
 THEN nextval('reporting.tracking_cleanup_seq')       ELSE currval('reporting.tracking_cleanup_seq')       END AS
visit_id

Then it gives the first page_view in the sequence the correct visit_id, but 
then continues on with a different visit_id for the rest.

I've also tried changing how I select which data to act on like so:
    WHERE pv1.id                  > (        SELECT CASE              WHEN id IS NULL THEN 0              ELSE id
      END AS id          FROM tmpmaxpvid          )      AND pv1.visit_id IN (        SELECT visit_id          FROM
page_viewpv3        GROUP BY visit_id        HAVING max(stamp)              < now() - INTERVAL '1 hour'          )
ORDERBY pv1.visit_id        , pv1.stamp       ) x;
 

But that leaves me skipping some page views because they haven't reached their 
conclusion and because of the id > clause, I'll never go back to see them.

Anyone have any suggestions?

-- 
Jeff Frost, Owner     <jeff@frostconsultingllc.com>
Frost Consulting, LLC     http://www.frostconsultingllc.com/
Phone: 650-780-7908    FAX: 650-649-1954


pgsql-sql by date:

Previous
From: "Michael Holzman"
Date:
Subject: Re: 'comp.databases.theory'
Next
From: Ranieri Mazili
Date:
Subject: Working with dates