Thread: Selecting distinct records

Selecting distinct records

From
"Dave [Hawk-Systems]"
Date:
Have a table that is populated by external radius reporting which contains
    name, ipaddress, sessionid, refid, and others

None of the fields are unique, but the combination of those 4 listed should
reasonably be unique. There is a certain type of interaction with external
hardware that results in multple entries into the table, resulting in multiple
duplicate entries.

What we need to to is build a select that omits any duplicate records from the
output, but I can't get a distinct to work given the way we need the data
sorted.

For example,
Name    IPaddress    sessionid    refid
User1    201.201    1234        5678    first user
User2    201.201    1235        5679    same ip as first user
User3    201.202    3234        5670    same sessid as first user
User4    201.203    3236        5678    same refid as first user
User1    201.202    4234        5678    first user new entry
User1    201.202    4234        5678    DUPLICATE
User1    201.202    4234        5678    DUPLICATE
User2    201.203    1234        5671    same ip as user 4
User3    201.204    2234        5672    unique

if we do distinct on sessionid then put a where clause for the user=User1 we end
up missing some of user1's entries where another users identical sessionid
appears first.

What I am looking to do is
- grab every record for $user
- remove any records that have identical ipaddress+sessionid+refid
    ie: turn
        user1        201.102    1234        5678
        user1        201.102    1234        5678
        user1        201.102    1234        5678
    into
        user1        201.102    1234        5678
- then sort the results by date_time or something else

Just can't get it to do all those things at the same time.  Any thoughts, or am
I not making sense?

thanks

Dave



Re: Selecting distinct records

From
"Dave"
Date:
>What I am looking to do is
>- grab every record for $user
>- remove any records that have identical ipaddress+sessionid+refid
>- then sort the results by date_time or something else

this last requirement is where the problem is...

is you do a sum() or order by in the select statement you get and error,  for
example;

select distinct on (sessionid) sum(sessiontime) from logs where
name='joeblowuser' and datetime > 1036040400;

ERROR:  Attribute logs.sessionid must be GROUPed or used in an aggregate
function

Same if you have to order by datetime or something...

Dave



Re: Selecting distinct records

From
"David Busby"
Date:
Dave,
    Try:
select distinct on (sessionid) sum(sessiontime) from logs
where name='joeblowuser' and datetime > 1036040400
group by sessionid.

sum() is an aggerate (sp?) function, that means it munges a field
(sessiontime) from multiple records into one field in one record.  Since you
are also selecting sessionid (from mulitple records) you need to munge it
some how, that munge is accomplished via 'group by'.  From your previous
e-mail it seems that (IMHO) the real problem is that duplicates are getting
inserted via external hardware interaction, this select might be a bandage
on a wound whose true size isn't known...

PostgreSQL docs:
Distinct:
  http://www.postgresql.org/idocs/index.php?queries-select-lists.html
Group By:
  http://www.postgresql.org/idocs/index.php?sql-select.html

/B



----- Original Message -----
From: "Dave" <dave@hawk-systems.com>
To: <pgsql-php@postgresql.org>
Sent: Thursday, November 21, 2002 10:43
Subject: Re: [PHP] Selecting distinct records


> >What I am looking to do is
> >- grab every record for $user
> >- remove any records that have identical ipaddress+sessionid+refid
> >- then sort the results by date_time or something else
>
> this last requirement is where the problem is...
>
> is you do a sum() or order by in the select statement you get and error,
for
> example;
>
> select distinct on (sessionid) sum(sessiontime) from logs where
> name='joeblowuser' and datetime > 1036040400;
>
> ERROR:  Attribute logs.sessionid must be GROUPed or used in an aggregate
> function
>
> Same if you have to order by datetime or something...
>
> Dave
>
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org


Re: Selecting distinct records

From
"Dave [Hawk-Systems]"
Date:
>    Try:
>select distinct on (sessionid) sum(sessiontime) from logs
>where name='joeblowuser' and datetime > 1036040400
>group by sessionid.

sorry, I wasn't clear...  the problem arises when we need to sort the data for
output.  postgres demands that the distinct items be first in the ordering
process, which would not allow display by session, name etc...

the obvious result would be to nest the select statements but can't seem to get
that to work either (working with version 7.0.3)

eg/  select * from (select distinct on (sessionid) from logs where....) as
tempname order by sessiondate desc

gives me an error on the second select, so not sure that is a workable solution
either.

>sum() is an aggerate (sp?) function, that means it munges a field
>(sessiontime) from multiple records into one field in one record.  Since you
>are also selecting sessionid (from mulitple records) you need to munge it
>some how, that munge is accomplished via 'group by'.  From your previous
>e-mail it seems that (IMHO) the real problem is that duplicates are getting
>inserted via external hardware interaction, this select might be a bandage
>on a wound whose true size isn't known...

agreed, and that problem has been corrected, but we are dealing with close to a
million records which have these duplicates strewn about within...  rather
annoying, looking for a bandaid in teh select to avoid intensive post-select
processing of the output.

Dave



Re: Selecting distinct records

From
"David Busby"
Date:
Dave,
    How about a job that runs once that cleans the whole thing?
Or:
    [Warning, pseudo PHPish code]

    $rs_sid = select distinct (sessionid) from "logs";
    while ($s = pg_fetch_object($rs_sid)
    {
        $rs_stuff = select * from "logs" where .... and "sessionid" = $s->id
        for (loop of $rs_stuff)
        {
            $total+=$rs_stuff->sessiontime
        }
        echo "Session $s->id has $total seconds used";
    }

This method involves a select for each session (might be slow on millions of
records :) ).  This might work once or twice but I don't see it as a
suitable solution if this has to happen on a daily basis.

    I know that the latest PostgreSQL does nested selects just fine (I've
got one paticular statement that has three nested selects ) and it's
performance isn't as bad as you'd think, well at least after the query
analyzer has seen it once and it gets cached.

/B

----- Original Message -----
From: "Dave [Hawk-Systems]" <dave@hawk-systems.com>
To: "David Busby" <busby@pnts.com>; <pgsql-php@postgresql.org>
Sent: Thursday, November 21, 2002 13:26
Subject: RE: Selecting distinct records


> >    Try:
> >select distinct on (sessionid) sum(sessiontime) from logs
> >where name='joeblowuser' and datetime > 1036040400
> >group by sessionid.
>
> sorry, I wasn't clear...  the problem arises when we need to sort the data
for
> output.  postgres demands that the distinct items be first in the ordering
> process, which would not allow display by session, name etc...
>
> the obvious result would be to nest the select statements but can't seem
to get
> that to work either (working with version 7.0.3)
>
> eg/  select * from (select distinct on (sessionid) from logs where....) as
> tempname order by sessiondate desc
>
> gives me an error on the second select, so not sure that is a workable
solution
> either.
>
> >sum() is an aggerate (sp?) function, that means it munges a field
> >(sessiontime) from multiple records into one field in one record.  Since
you
> >are also selecting sessionid (from mulitple records) you need to munge it
> >some how, that munge is accomplished via 'group by'.  From your previous
> >e-mail it seems that (IMHO) the real problem is that duplicates are
getting
> >inserted via external hardware interaction, this select might be a
bandage
> >on a wound whose true size isn't known...
>
> agreed, and that problem has been corrected, but we are dealing with close
to a
> million records which have these duplicates strewn about within...  rather
> annoying, looking for a bandaid in teh select to avoid intensive
post-select
> processing of the output.
>
> Dave
>