Re: String Similarity - Mailing list pgsql-hackers

From Mark Dilger
Subject Re: String Similarity
Date
Msg-id 446E3025.8040704@markdilger.com
Whole thread Raw
In response to String Similarity  ("Mark Woodward" <pgsql@mohawksoft.com>)
Responses Re: String Similarity
List pgsql-hackers
Mark Woodward wrote:
> I have a side project that needs to "intelligently" know if two strings
> are contextually similar. Think about how CDDB information is collected
> and sorted. It isn't perfect, but there should be enough information to be
> usable.
> 
> Think about this:
> 
> "pink floyd - dark side of the moon - money"
> "dark side of the moon - pink floyd - money"
> "money - dark side of the moon - pink floyd"
> etc.
> 
> To a human, these strings are almost identical. Similarly:
> 
> "dark floyd of money moon pink side the"
> 
> Is a puzzle to be solved by 13 year old children before the movie starts.
> 
> My post has three questions:
> 
> (1) Does anyone know of an efficient and numerically quantified method of
> detecting these sorts of things? I currently have a fairly inefficient and
> numerically bogus solution that may be the only non-impossible solution
> for the problem.
> 
> (2) Does any one see a need for this feature in PostgreSQL? If so, what
> kind of interface would be best accepted as a patch? I am currently
> returning a match liklihood between 0 and 100;
> 
> (3) Is there also a desire for a Levenshtein distence function for text
> and varchars? I experimented with it, and was forced to write the function
> in item #1.

The Levenshtein distance (also known as "edit distance") won't really give you
what you want above, because operations to transplant whole chunks of the string
aren't supported.  (You can simulate it with inserts and deletes, but you pay
individually for each of them.)  Also, Levenshtein distances don't charge much
for changing a word into a similarly spelled but semantically distinct word,
such as "word" => "work".

What you would want, I think, is some function that recognizes that the whole
substring "pink floyd" has been moved from the beginning to the middle of the
string, and only charges you a small edit cost for having done so.  It would
need to recognize both the word boundaries and the transplants.  Off the top of
my head, I'm not sure how you would achieve that with good runtime
characteristics.  You can go even further and allow synonyms, so that "pink
floyd" is more related to "red floyd" than it is to "large floyd", but for that
sort of thing you would probably need to pull in wordnet.

If you want to notice that two strings contain local similarity, but don't have
an overall good Levenshtein distance, take a look at global vs. local alignment
algorithms used in biological applications.  Local alignment can be achieved in
O(n*m) time, where n and m are the lengths of the two strings, using the
Smith-Waterman algorithm.  (Temple Smith and Michael Waterman).  There are
faster heuristic algorithms, but they don't have the same guarantees.  These
local alignments might tell you something useful as a part of the overall solution.

Hmmm...  I think I like this problem.  Maybe I'll work on it a bit as a contrib
module.

mark


pgsql-hackers by date:

Previous
From: Bruce Momjian
Date:
Subject: Re: [OT] MySQL is bad, but THIS bad?
Next
From: Martijn van Oosterhout
Date:
Subject: Re: [OT] MySQL is bad, but THIS bad?