Re: Best method to compare subdomains - Mailing list pgsql-general

From Andrew Sullivan
Subject Re: Best method to compare subdomains
Date
Msg-id 20130116221746.GA211@crankycanuck.ca
In response to Best method to compare subdomains  (Robert James <srobertjames@gmail.com>)
List pgsql-general
On Wed, Jan 16, 2013 at 03:23:30PM -0500, Robert James wrote:
> Is there a recommended, high performance method to check for subdomains?
>
> Something like:
> - www.google.com is subdomain of google.com
> - ilikegoogle.com is not subdomain of google.com
>
> There are many ways to do this (lowercase and reverse the string,
> append a '.' if not there, append a '%', and do a LIKE).  But I'm
> looking for one that will perform well when the master domain list is
> an indexed field in a table, and when the possible subdomain is either
> an individual value, or a field in a table for a join (potentially
> indexed).

Well, the _best_ thing to do would be to convert all the labels to
wire format and compare those, because that way you know you're
matching label by label the way the DNS does.  That sounds like a lot
of work, however, and you probably need to do it in C.
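
To give a rough idea of what "wire format" means here, the following is a minimal Python sketch rather than the C implementation suggested above. Each label is stored as one length octet followed by the label's octets, and the name ends with the zero-length root label. The function names (to_wire, wire_labels, is_subdomain_wire) are made up for illustration, and the sketch assumes ASCII-only input with no backslash escapes, which real presentation-format names may contain.

```python
def to_wire(name: str) -> bytes:
    """Encode a presentation-format name as length-prefixed labels.

    Assumption: ASCII-only input, no backslash escapes.
    """
    out = bytearray()
    stripped = name.rstrip(".")
    for label in stripped.split(".") if stripped else []:
        raw = label.encode("ascii").lower()  # DNS matching ignores ASCII case
        if not 1 <= len(raw) <= 63:
            raise ValueError(f"bad label length in {name!r}")
        out.append(len(raw))  # length octet
        out += raw            # label octets
    out.append(0)  # zero-length root label terminates the name
    if len(out) > 255:
        raise ValueError("name longer than 255 octets")
    return bytes(out)


def wire_labels(wire: bytes) -> list[bytes]:
    """Walk the length octets and return the labels, leftmost first."""
    labels, pos = [], 0
    while wire[pos] != 0:
        length = wire[pos]
        labels.append(wire[pos + 1 : pos + 1 + length])
        pos += 1 + length
    return labels


def is_subdomain_wire(child: str, parent: str) -> bool:
    """True if child is strictly below parent, compared label by label."""
    c = wire_labels(to_wire(child))
    p = wire_labels(to_wire(parent))
    return len(c) > len(p) and c[len(c) - len(p):] == p
```

The comparison walks whole labels instead of using a plain byte-suffix test, because a length octet could in principle coincide with an octet inside a label.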

You could find all the label boundaries (in the presentation format,
the dots) and then split out the labels.  I suppose you could put them
into an array and then count backwards in the array to compare the
different labels.
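
The array idea could look something like the Python sketch below. It is only an illustration under some assumptions: names are ASCII, dots are the only label boundaries (no escaped dots), and the helper names split_labels and is_subdomain are invented here. In PostgreSQL itself, string_to_array could do the splitting step.

```python
def split_labels(name: str) -> list[str]:
    """Split a presentation-format name into labels, leftmost first.

    Assumption: ASCII names, and every dot is a label boundary.
    """
    stripped = name.rstrip(".").lower()
    return stripped.split(".") if stripped else []


def is_subdomain(child: str, parent: str) -> bool:
    """True if child is strictly below parent."""
    c, p = split_labels(child), split_labels(parent)
    if len(c) <= len(p):
        return False
    # Count backwards from the rightmost label and compare pairwise.
    for offset in range(1, len(p) + 1):
        if c[-offset] != p[-offset]:
            return False
    return True
```

Because whole labels are compared, ilikegoogle.com does not match google.com even though one string ends with the other.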

Reversing the string might not actually work: the labels can be
arbitrary octets, and unless you're careful about your locale the
reverse operation could mangle them.  It oughta be safe in the "C"
locale, though.  (Contrary to popular opinion, domain name labels are
not necessarily made of ASCII.)  You can, of course, also force the
labels to be only LDH-labels (letters, digits, hyphen).
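
If you go the LDH-only route, a check along these lines could be run before the names are stored. This is a hedged Python sketch, not anything PostgreSQL provides: the name is_ldh_name is hypothetical, and the rule it encodes is the usual one (ASCII letters, digits and hyphens, no leading or trailing hyphen, 1 to 63 octets per label).

```python
import re

# One LDH label: 1 to 63 characters, letters/digits/hyphen,
# with no hyphen at either end.
_LDH_LABEL = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?")


def is_ldh_name(name: str) -> bool:
    """True if every label of the name is an LDH label."""
    if name.endswith("."):
        name = name[:-1]  # drop a single trailing root dot
    return all(_LDH_LABEL.fullmatch(label) for label in name.split("."))
```

Once every stored name passes a check like this, the octets are plain ASCII and string tricks such as reversing are much less likely to be affected by the locale.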

Best,

A

--
Andrew Sullivan
ajs@crankycanuck.ca

