Thread: Selecting distinct records
Have a table that is populated by external radius reporting which contains name, ipaddress, sessionid, refid, and others None of the fields are unique, but the combination of those 4 listed should reasonably be unique. There is a certain type of interaction with external hardware that results in multple entries into the table, resulting in multiple duplicate entries. What we need to to is build a select that omits any duplicate records from the output, but I can't get a distinct to work given the way we need the data sorted. For example, Name IPaddress sessionid refid User1 201.201 1234 5678 first user User2 201.201 1235 5679 same ip as first user User3 201.202 3234 5670 same sessid as first user User4 201.203 3236 5678 same refid as first user User1 201.202 4234 5678 first user new entry User1 201.202 4234 5678 DUPLICATE User1 201.202 4234 5678 DUPLICATE User2 201.203 1234 5671 same ip as user 4 User3 201.204 2234 5672 unique if we do distinct on sessionid then put a where clause for the user=User1 we end up missing some of user1's entries where another users identical sessionid appears first. What I am looking to do is - grab every record for $user - remove any records that have identical ipaddress+sessionid+refid ie: turn user1 201.102 1234 5678 user1 201.102 1234 5678 user1 201.102 1234 5678 into user1 201.102 1234 5678 - then sort the results by date_time or something else Just can't get it to do all those things at the same time. Any thoughts, or am I not making sense? thanks Dave
>What I am looking to do is >- grab every record for $user >- remove any records that have identical ipaddress+sessionid+refid >- then sort the results by date_time or something else this last requirement is where the problem is... is you do a sum() or order by in the select statement you get and error, for example; select distinct on (sessionid) sum(sessiontime) from logs where name='joeblowuser' and datetime > 1036040400; ERROR: Attribute logs.sessionid must be GROUPed or used in an aggregate function Same if you have to order by datetime or something... Dave
Dave, Try: select distinct on (sessionid) sum(sessiontime) from logs where name='joeblowuser' and datetime > 1036040400 group by sessionid. sum() is an aggerate (sp?) function, that means it munges a field (sessiontime) from multiple records into one field in one record. Since you are also selecting sessionid (from mulitple records) you need to munge it some how, that munge is accomplished via 'group by'. From your previous e-mail it seems that (IMHO) the real problem is that duplicates are getting inserted via external hardware interaction, this select might be a bandage on a wound whose true size isn't known... PostgreSQL docs: Distinct: http://www.postgresql.org/idocs/index.php?queries-select-lists.html Group By: http://www.postgresql.org/idocs/index.php?sql-select.html /B ----- Original Message ----- From: "Dave" <dave@hawk-systems.com> To: <pgsql-php@postgresql.org> Sent: Thursday, November 21, 2002 10:43 Subject: Re: [PHP] Selecting distinct records > >What I am looking to do is > >- grab every record for $user > >- remove any records that have identical ipaddress+sessionid+refid > >- then sort the results by date_time or something else > > this last requirement is where the problem is... > > is you do a sum() or order by in the select statement you get and error, for > example; > > select distinct on (sessionid) sum(sessiontime) from logs where > name='joeblowuser' and datetime > 1036040400; > > ERROR: Attribute logs.sessionid must be GROUPed or used in an aggregate > function > > Same if you have to order by datetime or something... > > Dave > > > > ---------------------------(end of broadcast)--------------------------- > TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org
> Try: >select distinct on (sessionid) sum(sessiontime) from logs >where name='joeblowuser' and datetime > 1036040400 >group by sessionid. sorry, I wasn't clear... the problem arises when we need to sort the data for output. postgres demands that the distinct items be first in the ordering process, which would not allow display by session, name etc... the obvious result would be to nest the select statements but can't seem to get that to work either (working with version 7.0.3) eg/ select * from (select distinct on (sessionid) from logs where....) as tempname order by sessiondate desc gives me an error on the second select, so not sure that is a workable solution either. >sum() is an aggerate (sp?) function, that means it munges a field >(sessiontime) from multiple records into one field in one record. Since you >are also selecting sessionid (from mulitple records) you need to munge it >some how, that munge is accomplished via 'group by'. From your previous >e-mail it seems that (IMHO) the real problem is that duplicates are getting >inserted via external hardware interaction, this select might be a bandage >on a wound whose true size isn't known... agreed, and that problem has been corrected, but we are dealing with close to a million records which have these duplicates strewn about within... rather annoying, looking for a bandaid in teh select to avoid intensive post-select processing of the output. Dave
Dave, How about a job that runs once that cleans the whole thing? Or: [Warning, pseudo PHPish code] $rs_sid = select distinct (sessionid) from "logs"; while ($s = pg_fetch_object($rs_sid) { $rs_stuff = select * from "logs" where .... and "sessionid" = $s->id for (loop of $rs_stuff) { $total+=$rs_stuff->sessiontime } echo "Session $s->id has $total seconds used"; } This method involves a select for each session (might be slow on millions of records :) ). This might work once or twice but I don't see it as a suitable solution if this has to happen on a daily basis. I know that the latest PostgreSQL does nested selects just fine (I've got one paticular statement that has three nested selects ) and it's performance isn't as bad as you'd think, well at least after the query analyzer has seen it once and it gets cached. /B ----- Original Message ----- From: "Dave [Hawk-Systems]" <dave@hawk-systems.com> To: "David Busby" <busby@pnts.com>; <pgsql-php@postgresql.org> Sent: Thursday, November 21, 2002 13:26 Subject: RE: Selecting distinct records > > Try: > >select distinct on (sessionid) sum(sessiontime) from logs > >where name='joeblowuser' and datetime > 1036040400 > >group by sessionid. > > sorry, I wasn't clear... the problem arises when we need to sort the data for > output. postgres demands that the distinct items be first in the ordering > process, which would not allow display by session, name etc... > > the obvious result would be to nest the select statements but can't seem to get > that to work either (working with version 7.0.3) > > eg/ select * from (select distinct on (sessionid) from logs where....) as > tempname order by sessiondate desc > > gives me an error on the second select, so not sure that is a workable solution > either. > > >sum() is an aggerate (sp?) function, that means it munges a field > >(sessiontime) from multiple records into one field in one record. Since you > >are also selecting sessionid (from mulitple records) you need to munge it > >some how, that munge is accomplished via 'group by'. From your previous > >e-mail it seems that (IMHO) the real problem is that duplicates are getting > >inserted via external hardware interaction, this select might be a bandage > >on a wound whose true size isn't known... > > agreed, and that problem has been corrected, but we are dealing with close to a > million records which have these duplicates strewn about within... rather > annoying, looking for a bandaid in teh select to avoid intensive post-select > processing of the output. > > Dave >