Thread: Extracting hostname from URI column
Hi, I'm trying to use substr() and position() functions to extract the full host name (and later a domain) from a column thatholds URLs. This is what I'm trying, but it clearly doesn't do the job. => select substr(href, position('://' in href)+3, position('://' in href)+3+position('/' in href)), href from url where id<10; substr | href ----------------+----------------------------------------------------------texturizer.net | http://texturizer.net/firebird/extensions/texturizer.net| http://texturizer.net/firebird/themes/forums.mozilla | http://forums.mozillazine.org/index.php?c=4www.mozillazin| http://www.mozillazine.org/devedge.netsca | http://devedge.netscape.com/viewsource/2002/bookmarks/www.google.com| http://www.google.com/search?&q=%sgroups.google. |http://groups.google.com/groups?scoring=d&q=%swww.google.com | http://www.google.com/search?q=%s&btnI=I'm+Feeling+Luckydictionary.ref| http://dictionary.reference.com/search?q=%s The 3rd param to the substr function is clearly wrong. Is it even doable without writing a procedure? Finally, is this the fastest way to get this data, or is there regex-based function that might be faster? Thanks, Otis
> I'm trying to use substr() and position() functions to extract the > full host name (and later a domain) from a column that holds URLs. substring( href from '.*://\([^/]*)' ); ____________________________________________________________________________________ Pinpoint customers who are looking for what you sell. http://searchmarketing.yahoo.com/
ogjunk-pgjedan@yahoo.com wrote: > Hi, > > I'm trying to use substr() and position() functions to extract the full host name (and later a domain) from a column thatholds URLs. > This is what I'm trying, but it clearly doesn't do the job. > > => select substr(href, position('://' in href)+3, position('://' in href)+3+position('/' in href)), href from url whereid <10; > substr | href > ----------------+---------------------------------------------------------- > texturizer.net | http://texturizer.net/firebird/extensions/ > texturizer.net | http://texturizer.net/firebird/themes/ > forums.mozilla | http://forums.mozillazine.org/index.php?c=4 > www.mozillazin | http://www.mozillazine.org/ > devedge.netsca | http://devedge.netscape.com/viewsource/2002/bookmarks/ > www.google.com | http://www.google.com/search?&q=%s > groups.google. | http://groups.google.com/groups?scoring=d&q=%s > www.google.com | http://www.google.com/search?q=%s&btnI=I'm+Feeling+Lucky > dictionary.ref | http://dictionary.reference.com/search?q=%s > > The 3rd param to the substr function is clearly wrong. Is it even doable without writing a procedure? > > Finally, is this the fastest way to get this data, or is there regex-based function that might be faster? > > Thanks, > Otis > > > > ---------------------------(end of broadcast)--------------------------- > TIP 5: don't forget to increase your free space map settings > > Try this: test=# select substr(href, position('://' in href)+3, position('/' in substr( href,position('://' in href)+3))-1), href from url; substr | href --------------------------+----------------------------------------------------- ----- devedge.netscape.com | http://devedge.netscape.com/viewsource/2002/bookmark s/ texturizer.net | http://texturizer.net/firebird/extensions/ texturizer.net | http://texturizer.net/firebird/themes/forums.mozillazine.org | http://forums.mozillazine.org/index.php?c=4 www.mozillazine.org | http://www.mozillazine.org/ devedge.netscape.com | http://devedge.netscape.com/viewsource/2002/bookmark s/ www.google.com | http://www.google.com/search?&q=%s groups.google.com | http://groups.google.com/groups?scoring=d&q=%swww.google.com | http://www.google.com/search?q=%s&btnI=I'm+Feeling+L ucky dictionary.reference.com | http://dictionary.reference.com/search?q=%s (10 rows) -- Paul Lambert Database Administrator AutoLedgers
chester c young wrote: >> I'm trying to use substr() and position() functions to extract the >> full host name (and later a domain) from a column that holds URLs. > > substring( href from '.*://\([^/]*)' ); > Ok, your solution looks better than mine... but I have no idea how to interpret that, time to consult some manuals. -- Paul Lambert Database Administrator AutoLedgers
On 9/12/07, Paul Lambert <paul.lambert@autoledgers.com.au> wrote: > > substring( href from '.*://\([^/]*)' ); > Ok, your solution looks better than mine... but I have no idea how to > interpret that, time to consult some manuals. Plain regex.... The key are the parenthesis () ... basically it will omit ANYTHING + two slashes at the beginning of a string. Then it will match everything BUT a slash, and as much of that as possible since regex are greedy by default (hence the host name he was looking for) ... and everything AFTER a slash will be omitted. Cheers, Andrej -- Please don't top post, and don't use HTML e-Mail :} Make your quotes concise. http://www.american.edu/econ/notes/htmlmail.htm
Andrej Ricnik-Bay wrote: > Plain regex.... The key are the parenthesis () ... > basically it will omit ANYTHING + two slashes at the beginning > of a string. Then it will match everything BUT a slash, and as > much of that as possible since regex are greedy by default > (hence the host name he was looking for) ... and everything > AFTER a slash will be omitted. > > > > Cheers, > Andrej > > Thanks - that makes a bit more sense. I'm in the middle of reading chapter 9.3.7 of the manual - POSIX Regular Expressions - which I'm assuming is dealing with this, so it's looking clearer. -- Paul Lambert Database Administrator AutoLedgers
Hi, Thanks, perfect! (though I'll have to look into the regex warning): => select substring( href from '.*://\([^/]*)' ) as hostname from url where id<10; WARNING: nonstandard use of escape in a string literal at character 29 HINT: Use the escape string syntax for escapes, e.g., E'\r\n'. So now I have this: hostname --------------------------texturizer.nettexturizer.netforums.mozillazine.orgwww.mozillazine.orgdevedge.netscape.comwww.google.comgroups.google.comwww.google.comdictionary.reference.com And what I'd like is something that would give me the counts for the number of occurrences of each unique hostname. Somethingmuch like `uniq -c'. Can anyone tell me how that's done or where I should look for info? (I'm not sure what tolook for, that's the problem). Thanks, Otis ----- Original Message ---- From: chester c young <chestercyoung@yahoo.com> To: ogjunk-pgjedan@yahoo.com Cc: sql pgsql <pgsql-sql@postgresql.org> Sent: Tuesday, September 11, 2007 8:42:46 PM Subject: Re: [SQL] Extracting hostname from URI column > I'm trying to use substr() and position() functions to extract the > full host name (and later a domain) from a column that holds URLs. substring( href from '.*://\([^/]*)' ); ____________________________________________________________________________________ Pinpoint customers who are looking for what you sell. http://searchmarketing.yahoo.com/
> >> I'm trying to use substr() and position() functions to extract the > >> full host name (and later a domain) from a column that holds URLs. > > > > substring( href from '.*://\([^/]*)' ); > > typo: no backslash in front of left paren substring( href from '.*://([^/]*)' ) match up thru // within parens, match anything except / return match within parens ____________________________________________________________________________________ Building a website is a piece of cake. Yahoo! Small Business gives you all the tools to get online. http://smallbusiness.yahoo.com/webhosting
ogjunk-pgjedan@yahoo.com wrote: > > And what I'd like is something that would give me the counts for the number of occurrences of each unique hostname. Somethingmuch like `uniq -c'. Can anyone tell me how that's done or where I should look for info? (I'm not sure what tolook for, that's the problem). > > Thanks, > Otis > Just use distinct... test=# select distinct count(*),substring( href from '.*://([^/]*)' ) as domain from url group by domain order by domain; count | domain -------+-------------------------- 3 | devedge.netscape.com 1 | dictionary.reference.com 2 | forums.mozillazine.org 1 | groups.google.com 4 | texturizer.net 11 | www.google.com 2 | www.mozillazine.org (7 rows) -- Paul Lambert Database Administrator AutoLedgers
> And what I'd like is something that would give me the counts for the > number of occurrences of each unique hostname. Something much like > `uniq -c'. Can anyone tell me how that's done or where I should look > for info? (I'm not sure what to look for, that's the problem). > select substring( ... ), count(1) from your_table group by 1; ____________________________________________________________________________________ Be a better Globetrotter. Get better travel answers from someone who knows. Yahoo! Answers - Check it out. http://answers.yahoo.com/dir/?link=list&sid=396545469
Ah, I figured out what to look for and found my uniq -c solution: select substring( href from '.*://([^/]*)' ) as hostname, count(substring( href from '.*://([^/]*)' )) from url where id<10group by hostname order by count desc; hostname | count --------------------------+-------texturizer.net | 2www.google.com | 2dictionary.reference.com| 1www.mozillazine.org | 1devedge.netscape.com | 1groups.google.com | 1forums.mozillazine.org | 1 Thanks for the quick help with substring func, people! Otis ----- Original Message ---- From: "ogjunk-pgjedan@yahoo.com" <ogjunk-pgjedan@yahoo.com> To: pgsql-sql@postgresql.org Sent: Tuesday, September 11, 2007 11:16:15 PM Subject: Re: [SQL] Extracting hostname from URI column Hi, Thanks, perfect! (though I'll have to look into the regex warning): => select substring( href from '.*://\([^/]*)' ) as hostname from url where id<10; WARNING: nonstandard use of escape in a string literal at character 29 HINT: Use the escape string syntax for escapes, e.g., E'\r\n'. So now I have this: hostname --------------------------texturizer.nettexturizer.netforums.mozillazine.orgwww.mozillazine.orgdevedge.netscape.comwww.google.comgroups.google.comwww.google.comdictionary.reference.com And what I'd like is something that would give me the counts for the number of occurrences of each unique hostname. Somethingmuch like `uniq -c'. Can anyone tell me how that's done or where I should look for info? (I'm not sure what tolook for, that's the problem). Thanks, Otis ----- Original Message ---- From: chester c young <chestercyoung@yahoo.com> To: ogjunk-pgjedan@yahoo.com Cc: sql pgsql <pgsql-sql@postgresql.org> Sent: Tuesday, September 11, 2007 8:42:46 PM Subject: Re: [SQL] Extracting hostname from URI column > I'm trying to use substr() and position() functions to extract the > full host name (and later a domain) from a column that holds URLs. substring( href from '.*://\([^/]*)' ); ____________________________________________________________________________________ Pinpoint customers who are looking for what you sell. http://searchmarketing.yahoo.com/ ---------------------------(end of broadcast)--------------------------- TIP 6: explain analyze is your friend
Paul Lambert wrote: > > Just use distinct... > > test=# select distinct count(*),substring( href from '.*://([^/]*)' ) as > domain from url group by domain order by domain; OK so distinct was redundant there... it gives the same results without it. AutoDRS=# select count(*) as occurances,substring( href from '.*://([^/]*)' ) as domain from url group by domain order by occurances desc,domain; occurances | domain ------------+-------------------------- 11 | www.google.com 4 | dictionary.reference.com 4 | texturizer.net 3 | devedge.netscape.com 3 | groups.google.com 2 | forums.mozillazine.org 2 | www.mozillazine.org (7 rows) -- Paul Lambert Database Administrator AutoLedgers
Paul Lambert wrote: > Andrej Ricnik-Bay wrote: > >> Plain regex.... The key are the parenthesis () ... >> basically it will omit ANYTHING + two slashes at the beginning >> of a string. Then it will match everything BUT a slash, and as >> much of that as possible since regex are greedy by default >> (hence the host name he was looking for) ... and everything >> AFTER a slash will be omitted. >> Cheers, >> Andrej > Thanks - that makes a bit more sense. I'm in the middle of reading chapter > 9.3.7 of the manual - POSIX Regular Expressions - which I'm assuming is > dealing with this, so it's looking clearer. A very good resource on regular expressions is "Mastering Regular Expressions" by Jeffrey Friedl, now in its third edition: http://www.oreilly.com/catalog/regex3/ I read the first ed. years ago and it was very illuminating. -- Alvaro Herrera Developer, http://www.PostgreSQL.org/ "Acepta los honores y aplausos y perderás tu libertad"
Paul Lambert wrote: > chester c young wrote: >>> I'm trying to use substr() and position() functions to extract the >>> full host name (and later a domain) from a column that holds URLs. >> >> substring( href from '.*://\([^/]*)' ); >> > > Ok, your solution looks better than mine... but I have no idea how to > interpret that, time to consult some manuals. > OK - following on from this, I'm doing a similar thing to the OP to analyze my proxy's traffic (never occured to me to do it in a db until that post) I've done the above regex to pull out the domain part of the URL and am left with results such as: "acvs.mediaonenetwork.net" "profile.ak.facebook.com" "www.bankwest.com.au" What I want to do next is pull out the major domain part of the URL I.e. for the above three records I should end up with "mediaonenetwork.net" "facebook.com" "bankwest.com.au" What would be the best way to do something like that? I assume it won't be a simple regex like the above due to the country codes on the end of some domains. My thought is look at the last portion of the domain, if it's 2 characters long then assume it's a country code and grab the last three sections, if it's not three characters long then assume it's an international domain and grab the last two... but that sounds a bit dodgy. -- Paul Lambert Database Administrator AutoLedgers
"Paul Lambert" <paul.lambert@autoledgers.com.au> writes: > What would be the best way to do something like that? I assume it won't be a > simple regex like the above due to the country codes on the end of some > domains. My thought is look at the last portion of the domain, if it's 2 > characters long then assume it's a country code and grab the last three > sections, if it's not three characters long then assume it's an international > domain and grab the last two... but that sounds a bit dodgy. Not all countries break up their national tld space into sections like .co.uk or .com.au. Canadian domains can be bought directly under .ca like amazon.ca. I think you'll have to store a specific list of tlds and how deep you want to look. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com
On Mon, Sep 17, 2007 at 12:18:56AM +0100, Gregory Stark wrote: > > Not all countries break up their national tld space into sections like .co.uk > or .com.au. Canadian domains can be bought directly under .ca like amazon.ca. > > I think you'll have to store a specific list of tlds and how deep you want to > look. Another way to look at this is that any containing domain is just a domain. So .info is a domain, example.info is a domain, and in case of the existence of host1.some.example.info, some.example.info is a domain. Also, co.uk is just as much a domain as yahoo.co.uk. Some poor misguided people at Netscape badly misunderstood the DNS many years ago, and web browser cookies have been attempting to recover from that misunderstanding ever since (with sometimes serious security implications). There's really no reliable way to deduce common responsibility from the DNS labels, and it's sometimes terrifically important not to make that mistake. A -- Andrew Sullivan | ajs@crankycanuck.ca The plural of anecdote is not data. --Roger Brinner
Paul Lambert wrote: > Paul Lambert wrote: >> chester c young wrote: >>>> I'm trying to use substr() and position() functions to extract the >>>> full host name (and later a domain) from a column that holds URLs. >>> >>> substring( href from '.*://\([^/]*)' ); >>> >> >> Ok, your solution looks better than mine... but I have no idea how to >> interpret that, time to consult some manuals. >> > > OK - following on from this, I'm doing a similar thing to the OP to > analyze my proxy's traffic (never occured to me to do it in a db until > that post) > > I've done the above regex to pull out the domain part of the URL and am > left with results such as: > "acvs.mediaonenetwork.net" > "profile.ak.facebook.com" > "www.bankwest.com.au" > > What I want to do next is pull out the major domain part of the URL > I.e. for the above three records I should end up with > "mediaonenetwork.net" > "facebook.com" > "bankwest.com.au" > > What would be the best way to do something like that? I assume it won't > be a simple regex like the above due to the country codes on the end of > some domains. My thought is look at the last portion of the domain, if > it's 2 characters long then assume it's a country code and grab the last > three sections, if it's not three characters long then assume it's an > international domain and grab the last two... but that sounds a bit dodgy. and csiro.au? There may be a few others in .au too, legacies from before we went commercial.