Thread: get distinct + group by then filter
I've a web application and I'm trying to do some reporting on affiliate commission create table tracky_hit ( hitid serial, esid varchar(32), -- related to browser session track_time timestamp, aid varchar(32),-- affiliate code -- some other tracking stuff ); create table tracky_event ( eventid serial, esid varchar(32) references tracky_hit (esid) ); create table tracky_ordergroup_event ( ordergroupid int references ..., eventid int references tracky_event (eventid) ); Now I'd like to pick up the first hit for each esid in a given interval of time for a given aid and relate them with ordergroupid. aid may change across the same esid. Getting the first hit for each esid can be done: select min(hitid) as h from tracky_hit group by esid; or select distinct on (esid) hitid from tracky_hit order by esid, track_time; If I put a where aid='somestuff' right in the above query... I'm not picking up the first hit in an esid. The only way that comes to my mind to solve the problem is applying the condition later in a subquery, but no conditions means a lot of data returned. I've a similar problem with the interval: if I chop in the middle of a session I may not pick up the beginning of each session. Furthermore I've to count session just once even if they cross the boundary of an interval. I could do something like: select oe.ordergroupid from tracky_ordergroup_event oe join tracky_event e on e.eventid=oe.eventid join tracky_hit th onth.esid=e.esid where th.hitid in (select distinct on (esid) hitid from tracky_hit where track_time between ('2009-12-01'::timestamp - interval '1 days') and ('2009-12-01'::timestamp + interval '1 months' + interval'1 days') order by esid, track_time ) and th.aid='someaid' and th.track_time between ('2009-12-01'::timestamp) and ('2009-12-01'::timestamp + interval '1 months'); but this looks awful. Any better way? I'm on 8.3 and no short term plan to move to 8.4 thanks -- Ivan Sergio Borgonovo http://www.webthatworks.it
2009/12/17 Ivan Sergio Borgonovo <mail@webthatworks.it>
This imples that tracky_hit.esid is at least UNIQUE.
If tracky_hit.esid is unique, then why same esid can have many aids?
Can you specify more complete schema (at least PKeys would be nice)?
DISTINCT ON seems a good aproach tu such queries.
I've a web application and I'm trying to do some reporting on
affiliate commission
create table tracky_hit (
hitid serial,
esid varchar(32), -- related to browser session
track_time timestamp,
aid varchar(32), -- affiliate code
-- some other tracking stuff
);
create table tracky_event (
eventid serial,
esid varchar(32) references tracky_hit (esid)
This imples that tracky_hit.esid is at least UNIQUE.
);
create table tracky_ordergroup_event (
ordergroupid int references ...,
eventid int references tracky_event (eventid)
);
Now I'd like to pick up the first hit for each esid in a given
interval of time for a given aid and relate them with ordergroupid.
aid may change across the same esid.
If tracky_hit.esid is unique, then why same esid can have many aids?
Can you specify more complete schema (at least PKeys would be nice)?
Getting the first hit for each esid can be done:
select min(hitid) as h
from tracky_hit
group by esid;
or
select distinct on (esid) hitid
from tracky_hit
order by esid, track_time;
DISTINCT ON seems a good aproach tu such queries.
If I put a where aid='somestuff' right in the above query... I'm not
picking up the first hit in an esid.
The only way that comes to my mind to solve the problem is applying
the condition later in a subquery, but no conditions means a lot of
data returned.
I've a similar problem with the interval: if I chop in the middle of
a session I may not pick up the beginning of each session.
Furthermore I've to count session just once even if they cross the
boundary of an interval.
I could do something like:
select oe.ordergroupid from
tracky_ordergroup_event oe
join tracky_event e on e.eventid=oe.eventid
join tracky_hit th on th.esid=e.esid
where th.hitid in
(select distinct on (esid) hitid
from tracky_hit
where track_time between
('2009-12-01'::timestamp - interval '1 days')
and
('2009-12-01'::timestamp + interval '1 months' + interval '1
days')
order by esid, track_time
)
and th.aid='someaid'
and th.track_time between
('2009-12-01'::timestamp)
and
('2009-12-01'::timestamp + interval '1 months');
but this looks awful. Any better way?
I'm on 8.3 and no short term plan to move to 8.4
thanks
--
Ivan Sergio Borgonovo
http://www.webthatworks.it
--
Sent via pgsql-sql mailing list (pgsql-sql@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-sql
--
Filip Rembiałkowski
JID,mailto:filip.rembialkowski@gmail.com
http://filip.rembialkowski.net/
On Fri, 18 Dec 2009 13:20:00 +0100 Filip Rembiałkowski <plk.zuber@gmail.com> wrote: > 2009/12/17 Ivan Sergio Borgonovo <mail@webthatworks.it> > > > I've a web application and I'm trying to do some reporting on > > affiliate commission > > > > create table tracky_hit ( > > hitid serial, > > esid varchar(32), -- related to browser session > > track_time timestamp, > > aid varchar(32), -- affiliate code > > -- some other tracking stuff > > ); > > > > create table tracky_event ( > > eventid serial, > > esid varchar(32) references tracky_hit (esid) > This imples that tracky_hit.esid is at least UNIQUE. > ); my fault esid is not unique and that reference was added just to imply that tracky_hit.esid and tracky_event.esid are related. I'm still thinking if it's worth for extra safety to have a create table tracky_esid( esid varchar(32) primary key ); but I can't see any use of it other than recording the session starting time... and I'm not sure it could come handy, since the hit is recorded later, so I couldn't relate hit time and session time. > > create table tracky_ordergroup_event ( > > ordergroupid int references ..., > > eventid int references tracky_event (eventid) > > ); > > > > Now I'd like to pick up the first hit for each esid in a given > > interval of time for a given aid and relate them with > > ordergroupid. > > > > aid may change across the same esid. > > > > If tracky_hit.esid is unique, then why same esid can have many > aids? > > Can you specify more complete schema (at least PKeys would be > nice)? I just made a pk out of (hitid, track_time) The rest of the schema is the same and I have the query below working... I was just wondering if there was a "cleaner" way. > > Getting the first hit for each esid can be done: > > > > select min(hitid) as h > > from tracky_hit > > group by esid; > > > > or > > > > select distinct on (esid) hitid > > from tracky_hit > > order by esid, track_time; > > > DISTINCT ON seems a good aproach tu such queries. Yeah, they both seems to work... but is there any difference? These are the 2 plans: 1)HashAggregate (cost=10.60..11.10 rows=40 width=72) (actualtime=0.041..0.041 rows=4 loops=1) -> Seq Scan on tracky_hit (cost=0.00..10.40rows=40 width=72) (actual time=0.017..0.019 rows=7 loops=1) Total runtime: 0.111 ms (3 rows) 2)Unique (cost=11.46..11.66 rows=40 width=80) (actual time=0.061..0.066 rows=4 loops=1) -> Sort (cost=11.46..11.56 rows=40width=80) (actual time=0.060..0.063 rows=7 loops=1) Sort Key: esid, track_time Sort Method: quicksort Memory:25kB -> Seq Scan on tracky_hit (cost=0.00..10.40 rows=40 width=80) (actual time=0.007..0.010 rows=7 loops=1) Total runtime: 0.102 ms (6 rows) But a) I'm not that skilled in reading plans b) there are nearli no record to test on But considering the plan is pretty different I'd say one should have a major performance advantage over the other. BTW this is working: > > select oe.ordergroupid from > > tracky_ordergroup_event oe > > join tracky_event e on e.eventid=oe.eventid > > join tracky_hit th on th.esid=e.esid > > where th.hitid in > > (select distinct on (esid) hitid > > from tracky_hit > > where track_time between > > ('2009-12-01'::timestamp - interval '1 days') > > and > > ('2009-12-01'::timestamp + interval '1 months' + interval > > '1 days') > > order by esid, track_time > > ) > > and th.aid='someaid' > > and th.track_time between > > ('2009-12-01'::timestamp) > > and > > ('2009-12-01'::timestamp + interval '1 months'); > > > > but this looks awful. Any better way? thanks -- Ivan Sergio Borgonovo http://www.webthatworks.it