Thread: DISTINCT to get distinct *substrings*?
Hello to the list,
here's an SQL question, I hope it's not off topic. From a list of URLs I want to get only the distinct values of the *web sites* these URLs belong to, that is everything before and including the 3rd slash, and I think this should be possible within the DB. I would like to say something like
SELECT substring(attribute from '^http://[^/]*/') from pg_atp where attribute like 'http://%'
(which works) but get only the distinct values. SELECT DISTINCT ON substring.. doesn't work. Probably I haven't understood the semantics of the DISTINCT keyword. Can anybody help?
thanks in advance
Christoph
SELECT DISTINCT substring(attribute from '^http://[^/]*/') from pg_atp where attribute like 'http://%'; w/o DISTINCT there should be duplicates (if any) don't use "DISTINCT ON" at all, it's evil :-) (why? http://chernowiki.ru/index.php?node=38#A13) On 8/8/06, Christoph Pingel <ch.pingel@web.de> wrote: > > Hello to the list, > > here's an SQL question, I hope it's not off topic. From a list of URLs I > want to get only the distinct values of the *web sites* these URLs belong > to, that is everything before and including the 3rd slash, and I think this > should be possible within the DB. I would like to say something like > > SELECT substring(attribute from '^http://[^/]*/') from pg_atp where > attribute like 'http://%' > > (which works) but get only the distinct values. SELECT DISTINCT ON > substring.. doesn't work. Probably I haven't understood the semantics of the > DISTINCT keyword. Can anybody help? > > thanks in advance > Christoph > > -- Best regards, Nikolay
DISTINCT ON is extremely useful when you know what you're doing. It's postgres' version of oracle's first_value analytical function, and when you need it, nothing else really suffices. On Tue, 8 Aug 2006, Nikolay Samokhvalov wrote: > SELECT DISTINCT substring(attribute from '^http://[^/]*/') from pg_atp > where attribute like 'http://%'; > > w/o DISTINCT there should be duplicates (if any) > > don't use "DISTINCT ON" at all, it's evil :-) (why? > http://chernowiki.ru/index.php?node=38#A13) > > On 8/8/06, Christoph Pingel <ch.pingel@web.de> wrote: >> >> Hello to the list, >> >> here's an SQL question, I hope it's not off topic. From a list of URLs I >> want to get only the distinct values of the *web sites* these URLs belong >> to, that is everything before and including the 3rd slash, and I think this >> should be possible within the DB. I would like to say something like >> >> SELECT substring(attribute from '^http://[^/]*/') from pg_atp where >> attribute like 'http://%' >> >> (which works) but get only the distinct values. SELECT DISTINCT ON >> substring.. doesn't work. Probably I haven't understood the semantics of >> the >> DISTINCT keyword. Can anybody help? >> >> thanks in advance >> Christoph >> >> > > > -- > Best regards, > Nikolay > > ---------------------------(end of broadcast)--------------------------- > TIP 1: if posting/reading through Usenet, please send an appropriate > subscribe-nomail command to majordomo@postgresql.org so that your > message can get through to the mailing list cleanly >
Am 08.08.2006 um 19:49 schrieb Nikolay Samokhvalov: > SELECT DISTINCT substring(attribute from '^http://[^/]*/') from pg_atp > where attribute like 'http://%'; > > w/o DISTINCT there should be duplicates (if any) > > don't use "DISTINCT ON" at all, it's evil :-) (why? > http://chernowiki.ru/index.php?node=38#A13) Thanks for the good advice! From reading this, it seems to be a *really* bad thing. And I didn't get it from the official documentation. :-) ok, SELECT DISTINCT works, and it seems that the results are ordered (by the substring) - is this the default behaviour or just by chance (and probably version dependent)? best regards, Christoph
Thanks for the input, I think I get this now. In my case, the query
SELECT DISTINCT ON (substring(attribute from '^http://[^/]*/')) attribute from pg_atp where attribute like 'http://%'
doesn't get me just the root of the URL, but the whole URL - but only for the first row for each individual root. While
SELECT DISTINCT substring(attribute from '^http://[^/]*/') from pg_atp where attribute like 'http://%'
does what I first intended - get a list of all (distinct) root URLs. Wieder was gelernt. (Learnt something again. :-)
best regards,
Christoph
Am 08.08.2006 um 20:36 schrieb Ben:
DISTINCT ON is extremely useful when you know what you're doing. It's postgres' version of oracle's first_value analytical function, and when you need it, nothing else really suffices.On Tue, 8 Aug 2006, Nikolay Samokhvalov wrote:SELECT DISTINCT substring(attribute from '^http://[^/]*/') from pg_atpwhere attribute like 'http://%';w/o DISTINCT there should be duplicates (if any)don't use "DISTINCT ON" at all, it's evil :-) (why?On 8/8/06, Christoph Pingel <ch.pingel@web.de> wrote:Hello to the list,here's an SQL question, I hope it's not off topic. From a list of URLs Iwant to get only the distinct values of the *web sites* these URLs belongto, that is everything before and including the 3rd slash, and I think thisshould be possible within the DB. I would like to say something likeSELECT substring(attribute from '^http://[^/]*/') from pg_atp whereattribute like 'http://%'(which works) but get only the distinct values. SELECT DISTINCT ONsubstring.. doesn't work. Probably I haven't understood the semantics of theDISTINCT keyword. Can anybody help?thanks in advanceChristoph--Best regards,Nikolay---------------------------(end of broadcast)---------------------------TIP 1: if posting/reading through Usenet, please send an appropriatesubscribe-nomail command to majordomo@postgresql.org so that yourmessage can get through to the mailing list cleanly---------------------------(end of broadcast)---------------------------TIP 2: Don't 'kill -9' the postmaster
Christoph Pingel <ch.pingel@web.de> writes: > Am 08.08.2006 um 19:49 schrieb Nikolay Samokhvalov: >> don't use "DISTINCT ON" at all, it's evil :-) (why? >> http://chernowiki.ru/index.php?node=38#A13 > Thanks for the good advice! From reading this, it seems to be a > *really* bad thing. And I didn't get it from the official > documentation. :-) That page is complaining about DISTINCT ON as it was defined in 1999. It's a lot harder to shoot yourself in the foot now: regression=# select distinct on (ten) hundred from tenk1 order by unique2; ERROR: SELECT DISTINCT ON expressions must match initial ORDER BY expressions I don't deny that it's nonstandard and pretty ugly, but sometimes it's just really hard to solve a problem any other way. regards, tom lane