Thread: Extracting hostname from URI column

Extracting hostname from URI column

From
ogjunk-pgjedan@yahoo.com
Date:
Hi,

I'm trying to use substr() and position() functions to extract the full host name (and later a domain) from a column
thatholds URLs. 
This is what I'm trying, but it clearly doesn't do the job.

=> select substr(href, position('://' in href)+3, position('://' in href)+3+position('/' in href)), href from url where
id<10;    substr     |                           href 
----------------+----------------------------------------------------------texturizer.net |
http://texturizer.net/firebird/extensions/texturizer.net| http://texturizer.net/firebird/themes/forums.mozilla |
http://forums.mozillazine.org/index.php?c=4www.mozillazin| http://www.mozillazine.org/devedge.netsca |
http://devedge.netscape.com/viewsource/2002/bookmarks/www.google.com| http://www.google.com/search?&q=%sgroups.google.
|http://groups.google.com/groups?scoring=d&q=%swww.google.com |
http://www.google.com/search?q=%s&btnI=I'm+Feeling+Luckydictionary.ref| http://dictionary.reference.com/search?q=%s 

The 3rd param to the substr function is clearly wrong.  Is it even doable without writing a procedure?

Finally, is this the fastest way to get this data, or is there  regex-based function that might be faster?

Thanks,
Otis




Re: Extracting hostname from URI column

From
chester c young
Date:
> I'm trying to use substr() and position() functions to extract the
> full host name (and later a domain) from a column that holds URLs.

substring( href from '.*://\([^/]*)' );


      
____________________________________________________________________________________
Pinpoint customers who are looking for what you sell. 
http://searchmarketing.yahoo.com/


Re: Extracting hostname from URI column

From
Paul Lambert
Date:
ogjunk-pgjedan@yahoo.com wrote:
> Hi,
> 
> I'm trying to use substr() and position() functions to extract the full host name (and later a domain) from a column
thatholds URLs.
 
> This is what I'm trying, but it clearly doesn't do the job.
> 
> => select substr(href, position('://' in href)+3, position('://' in href)+3+position('/' in href)), href from url
whereid <10;
 
>      substr     |                           href
> ----------------+----------------------------------------------------------
>  texturizer.net | http://texturizer.net/firebird/extensions/
>  texturizer.net | http://texturizer.net/firebird/themes/
>  forums.mozilla | http://forums.mozillazine.org/index.php?c=4
>  www.mozillazin | http://www.mozillazine.org/
>  devedge.netsca | http://devedge.netscape.com/viewsource/2002/bookmarks/
>  www.google.com | http://www.google.com/search?&q=%s
>  groups.google. | http://groups.google.com/groups?scoring=d&q=%s
>  www.google.com | http://www.google.com/search?q=%s&btnI=I'm+Feeling+Lucky
>  dictionary.ref | http://dictionary.reference.com/search?q=%s
> 
> The 3rd param to the substr function is clearly wrong.  Is it even doable without writing a procedure?
> 
> Finally, is this the fastest way to get this data, or is there  regex-based function that might be faster?
> 
> Thanks,
> Otis
> 
> 
> 
> ---------------------------(end of broadcast)---------------------------
> TIP 5: don't forget to increase your free space map settings
> 
> 

Try this:

test=# select substr(href, position('://' in href)+3, position('/' in 
substr(
href,position('://' in href)+3))-1), href from url;          substr          |                           href

--------------------------+-----------------------------------------------------
----- devedge.netscape.com     | 
http://devedge.netscape.com/viewsource/2002/bookmark
s/ texturizer.net           | http://texturizer.net/firebird/extensions/ texturizer.net           |
http://texturizer.net/firebird/themes/forums.mozillazine.org   | http://forums.mozillazine.org/index.php?c=4
www.mozillazine.org     | http://www.mozillazine.org/ devedge.netscape.com     | 
 
http://devedge.netscape.com/viewsource/2002/bookmark
s/ www.google.com           | http://www.google.com/search?&q=%s groups.google.com        |
http://groups.google.com/groups?scoring=d&q=%swww.google.com           | 
 
http://www.google.com/search?q=%s&btnI=I'm+Feeling+L
ucky dictionary.reference.com | http://dictionary.reference.com/search?q=%s
(10 rows)

-- 
Paul Lambert
Database Administrator
AutoLedgers


Re: Extracting hostname from URI column

From
Paul Lambert
Date:
chester c young wrote:
>> I'm trying to use substr() and position() functions to extract the
>> full host name (and later a domain) from a column that holds URLs.
> 
> substring( href from '.*://\([^/]*)' );
> 

Ok, your solution looks better than mine... but I have no idea how to 
interpret that, time to consult some manuals.

-- 
Paul Lambert
Database Administrator
AutoLedgers



Re: Extracting hostname from URI column

From
"Andrej Ricnik-Bay"
Date:
On 9/12/07, Paul Lambert <paul.lambert@autoledgers.com.au> wrote:
> > substring( href from '.*://\([^/]*)' );

> Ok, your solution looks better than mine... but I have no idea how to
> interpret that, time to consult some manuals.
Plain regex.... The key are the parenthesis () ...
basically it will omit ANYTHING + two slashes at the beginning
of a string.  Then it will match everything BUT a slash, and as
much of that as possible since regex are greedy by default
(hence the host name he was looking for) ... and everything
AFTER a slash will be omitted.



Cheers,
Andrej


-- 
Please don't top post, and don't use HTML e-Mail :}  Make your quotes concise.

http://www.american.edu/econ/notes/htmlmail.htm


Re: Extracting hostname from URI column

From
Paul Lambert
Date:
Andrej Ricnik-Bay wrote:

> Plain regex.... The key are the parenthesis () ...
> basically it will omit ANYTHING + two slashes at the beginning
> of a string.  Then it will match everything BUT a slash, and as
> much of that as possible since regex are greedy by default
> (hence the host name he was looking for) ... and everything
> AFTER a slash will be omitted.
> 
> 
> 
> Cheers,
> Andrej
> 
> 
Thanks - that makes a bit more sense. I'm in the middle of reading 
chapter 9.3.7 of the manual - POSIX Regular Expressions - which I'm 
assuming is dealing with this, so it's looking clearer.

-- 
Paul Lambert
Database Administrator
AutoLedgers


Re: Extracting hostname from URI column

From
ogjunk-pgjedan@yahoo.com
Date:
Hi,

Thanks, perfect! (though I'll have to look into the regex warning):

=> select substring( href from '.*://\([^/]*)' ) as hostname from url where id<10;
WARNING:  nonstandard use of escape in a string literal at character 29
HINT:  Use the escape string syntax for escapes, e.g., E'\r\n'.

So now I have this:        hostname

--------------------------texturizer.nettexturizer.netforums.mozillazine.orgwww.mozillazine.orgdevedge.netscape.comwww.google.comgroups.google.comwww.google.comdictionary.reference.com

And what I'd like is something that would give me the counts for the number of occurrences of each unique hostname.
Somethingmuch like `uniq -c'.  Can anyone tell me how that's done or where I should look for info? (I'm not sure what
tolook for, that's the problem). 

Thanks,
Otis


----- Original Message ----
From: chester c young <chestercyoung@yahoo.com>
To: ogjunk-pgjedan@yahoo.com
Cc: sql pgsql <pgsql-sql@postgresql.org>
Sent: Tuesday, September 11, 2007 8:42:46 PM
Subject: Re: [SQL] Extracting hostname from URI column

> I'm trying to use substr() and position() functions to extract the
> full host name (and later a domain) from a column that holds URLs.

substring( href from '.*://\([^/]*)' );



____________________________________________________________________________________
Pinpoint customers who are looking for what you sell.
http://searchmarketing.yahoo.com/





Re: Extracting hostname from URI column

From
chester c young
Date:
> >> I'm trying to use substr() and position() functions to extract the
> >> full host name (and later a domain) from a column that holds URLs.
> > 
> > substring( href from '.*://\([^/]*)' );
> > 

typo: no backslash in front of left paren
substring( href from '.*://([^/]*)' )

match up thru //
within parens, match anything except /

return match within parens

      
____________________________________________________________________________________
Building a website is a piece of cake. Yahoo! Small Business gives you all the tools to get online.
http://smallbusiness.yahoo.com/webhosting 


Re: Extracting hostname from URI column

From
Paul Lambert
Date:
ogjunk-pgjedan@yahoo.com wrote:
> 
> And what I'd like is something that would give me the counts for the number of occurrences of each unique hostname.
Somethingmuch like `uniq -c'.  Can anyone tell me how that's done or where I should look for info? (I'm not sure what
tolook for, that's the problem).
 
> 
> Thanks,
> Otis
> 

Just use distinct...

test=# select distinct count(*),substring( href from '.*://([^/]*)' ) as 
domain from url group by domain order by domain; count |          domain
-------+--------------------------     3 | devedge.netscape.com     1 | dictionary.reference.com     2 |
forums.mozillazine.org    1 | groups.google.com     4 | texturizer.net    11 | www.google.com     2 |
www.mozillazine.org
(7 rows)

-- 
Paul Lambert
Database Administrator
AutoLedgers


Re: Extracting hostname from URI column

From
chester c young
Date:
> And what I'd like is something that would give me the counts for the
> number of occurrences of each unique hostname.  Something much like
> `uniq -c'.  Can anyone tell me how that's done or where I should look
> for info? (I'm not sure what to look for, that's the problem).
> 

select substring( ... ), count(1) from your_table group by 1;


      
____________________________________________________________________________________
Be a better Globetrotter. Get better travel answers from someone who knows. Yahoo! Answers - Check it out.
http://answers.yahoo.com/dir/?link=list&sid=396545469


Re: Extracting hostname from URI column

From
ogjunk-pgjedan@yahoo.com
Date:
Ah, I figured out what to look for and found my uniq -c solution:

select substring( href from '.*://([^/]*)' ) as hostname, count(substring( href from '.*://([^/]*)' )) from url where
id<10group by hostname order by count desc;        hostname         | count 
--------------------------+-------texturizer.net           |     2www.google.com           |
2dictionary.reference.com|     1www.mozillazine.org      |     1devedge.netscape.com     |     1groups.google.com
|     1forums.mozillazine.org   |     1 

Thanks for the quick help with substring func, people!

Otis


----- Original Message ----
From: "ogjunk-pgjedan@yahoo.com" <ogjunk-pgjedan@yahoo.com>
To: pgsql-sql@postgresql.org
Sent: Tuesday, September 11, 2007 11:16:15 PM
Subject: Re: [SQL] Extracting hostname from URI column

Hi,

Thanks, perfect! (though I'll have to look into the regex warning):

=> select substring( href from '.*://\([^/]*)' ) as hostname from url where id<10;
WARNING:  nonstandard use of escape in a string literal at character 29
HINT:  Use the escape string syntax for escapes, e.g., E'\r\n'.

So now I have this:        hostname

--------------------------texturizer.nettexturizer.netforums.mozillazine.orgwww.mozillazine.orgdevedge.netscape.comwww.google.comgroups.google.comwww.google.comdictionary.reference.com

And what I'd like is something that would give me the counts for the number of occurrences of each unique hostname.
Somethingmuch like `uniq -c'.  Can anyone tell me how that's done or where I should look for info? (I'm not sure what
tolook for, that's the problem). 

Thanks,
Otis


----- Original Message ----
From: chester c young <chestercyoung@yahoo.com>
To: ogjunk-pgjedan@yahoo.com
Cc: sql pgsql <pgsql-sql@postgresql.org>
Sent: Tuesday, September 11, 2007 8:42:46 PM
Subject: Re: [SQL] Extracting hostname from URI column

> I'm trying to use substr() and position() functions to extract the
> full host name (and later a domain) from a column that holds URLs.

substring( href from '.*://\([^/]*)' );



____________________________________________________________________________________
Pinpoint customers who are looking for what you sell.
http://searchmarketing.yahoo.com/




---------------------------(end of broadcast)---------------------------
TIP 6: explain analyze is your friend





Re: Extracting hostname from URI column

From
Paul Lambert
Date:
Paul Lambert wrote:
> 
> Just use distinct...
> 
> test=# select distinct count(*),substring( href from '.*://([^/]*)' ) as 
> domain from url group by domain order by domain;

OK so distinct was redundant there... it gives the same results without it.

AutoDRS=# select count(*) as occurances,substring( href from 
'.*://([^/]*)' ) as domain from url group by domain order by occurances 
desc,domain; occurances |          domain
------------+--------------------------         11 | www.google.com          4 | dictionary.reference.com          4 |
texturizer.net         3 | devedge.netscape.com          3 | groups.google.com          2 | forums.mozillazine.org
   2 | www.mozillazine.org
 
(7 rows)

-- 
Paul Lambert
Database Administrator
AutoLedgers



Re: Extracting hostname from URI column

From
Alvaro Herrera
Date:
Paul Lambert wrote:
> Andrej Ricnik-Bay wrote:
>
>> Plain regex.... The key are the parenthesis () ...
>> basically it will omit ANYTHING + two slashes at the beginning
>> of a string.  Then it will match everything BUT a slash, and as
>> much of that as possible since regex are greedy by default
>> (hence the host name he was looking for) ... and everything
>> AFTER a slash will be omitted.
>> Cheers,
>> Andrej
> Thanks - that makes a bit more sense. I'm in the middle of reading chapter 
> 9.3.7 of the manual - POSIX Regular Expressions - which I'm assuming is 
> dealing with this, so it's looking clearer.

A very good resource on regular expressions is "Mastering Regular
Expressions" by Jeffrey Friedl, now in its third edition:
http://www.oreilly.com/catalog/regex3/

I read the first ed. years ago and it was very illuminating.

-- 
Alvaro Herrera                          Developer, http://www.PostgreSQL.org/
"Acepta los honores y aplausos y perderás tu libertad"


Re: Extracting hostname from URI column

From
Paul Lambert
Date:
Paul Lambert wrote:
> chester c young wrote:
>>> I'm trying to use substr() and position() functions to extract the
>>> full host name (and later a domain) from a column that holds URLs.
>>
>> substring( href from '.*://\([^/]*)' );
>>
> 
> Ok, your solution looks better than mine... but I have no idea how to 
> interpret that, time to consult some manuals.
> 

OK - following on from this, I'm doing a similar thing to the OP to 
analyze my proxy's traffic (never occured to me to do it in a db until 
that post)

I've done the above regex to pull out the domain part of the URL and am 
left with results such as:
"acvs.mediaonenetwork.net"
"profile.ak.facebook.com"
"www.bankwest.com.au"

What I want to do next is pull out the major domain part of the URL
I.e. for the above three records I should end up with
"mediaonenetwork.net"
"facebook.com"
"bankwest.com.au"

What would be the best way to do something like that? I assume it won't 
be a simple regex like the above due to the country codes on the end of 
some domains. My thought is look at the last portion of the domain, if 
it's 2 characters long then assume it's a country code and grab the last 
three sections, if it's not three characters long then assume it's an 
international domain and grab the last two... but that sounds a bit dodgy.

-- 
Paul Lambert
Database Administrator
AutoLedgers



Re: Extracting hostname from URI column

From
Gregory Stark
Date:
"Paul Lambert" <paul.lambert@autoledgers.com.au> writes:

> What would be the best way to do something like that? I assume it won't be a
> simple regex like the above due to the country codes on the end of some
> domains. My thought is look at the last portion of the domain, if it's 2
> characters long then assume it's a country code and grab the last three
> sections, if it's not three characters long then assume it's an international
> domain and grab the last two... but that sounds a bit dodgy.

Not all countries break up their national tld space into sections like .co.uk
or .com.au. Canadian domains can be bought directly under .ca like amazon.ca.

I think you'll have to store a specific list of tlds and how deep you want to
look.

--  Gregory Stark EnterpriseDB          http://www.enterprisedb.com


Re: Extracting hostname from URI column

From
Andrew Sullivan
Date:
On Mon, Sep 17, 2007 at 12:18:56AM +0100, Gregory Stark wrote:
> 
> Not all countries break up their national tld space into sections like .co.uk
> or .com.au. Canadian domains can be bought directly under .ca like amazon.ca.
> 
> I think you'll have to store a specific list of tlds and how deep you want to
> look.

Another way to look at this is that any containing domain is just a
domain.  So .info is a domain, example.info is a domain, and in case
of the existence of host1.some.example.info, some.example.info is a
domain.  Also, co.uk is just as much a domain as yahoo.co.uk.  

Some poor misguided people at Netscape badly misunderstood the DNS
many years ago, and web browser cookies have been attempting to
recover from that misunderstanding ever since (with sometimes serious
security implications).  There's really no reliable way to deduce
common responsibility from the DNS labels, and it's sometimes
terrifically important not to make that mistake.

A

-- 
Andrew Sullivan  | ajs@crankycanuck.ca
The plural of anecdote is not data.    --Roger Brinner


Re: Extracting hostname from URI column

From
John Summerfield
Date:
Paul Lambert wrote:
> Paul Lambert wrote:
>> chester c young wrote:
>>>> I'm trying to use substr() and position() functions to extract the
>>>> full host name (and later a domain) from a column that holds URLs.
>>>
>>> substring( href from '.*://\([^/]*)' );
>>>
>>
>> Ok, your solution looks better than mine... but I have no idea how to 
>> interpret that, time to consult some manuals.
>>
> 
> OK - following on from this, I'm doing a similar thing to the OP to 
> analyze my proxy's traffic (never occured to me to do it in a db until 
> that post)
> 
> I've done the above regex to pull out the domain part of the URL and am 
> left with results such as:
> "acvs.mediaonenetwork.net"
> "profile.ak.facebook.com"
> "www.bankwest.com.au"
> 
> What I want to do next is pull out the major domain part of the URL
> I.e. for the above three records I should end up with
> "mediaonenetwork.net"
> "facebook.com"
> "bankwest.com.au"
> 
> What would be the best way to do something like that? I assume it won't 
> be a simple regex like the above due to the country codes on the end of 
> some domains. My thought is look at the last portion of the domain, if 
> it's 2 characters long then assume it's a country code and grab the last 
> three sections, if it's not three characters long then assume it's an 
> international domain and grab the last two... but that sounds a bit dodgy.

and csiro.au?
There may be a few others in .au too, legacies from before we went 
commercial.