String Similarity - Mailing list pgsql-hackers

From Mark Woodward
Subject String Similarity
Msg-id 18405.24.91.171.78.1148068848.squirrel@mail.mohawksoft.com
Responses Re: String Similarity  (Martijn van Oosterhout <kleptog@svana.org>)
Re: String Similarity  (Andrew Dunstan <andrew@dunslane.net>)
Re: String Similarity  (Mark Dilger <pgsql@markdilger.com>)
Re: String Similarity  ("Greg Sabino Mullane" <greg@turnstep.com>)
Re: String Similarity  (Oleg Bartunov <oleg@sai.msu.su>)
Re: String Similarity  (Christopher Kings-Lynne <chris.kings-lynne@calorieking.com>)
List pgsql-hackers
I have a side project that needs to "intelligently" know if two strings
are contextually similar. Think about how CDDB information is collected
and sorted. It isn't perfect, but there should be enough information to be
usable.

Think about this:

"pink floyd - dark side of the moon - money"
"dark side of the moon - pink floyd - money"
"money - dark side of the moon - pink floyd"
etc.

To a human, these strings are almost identical. Similarly:

"dark floyd of money moon pink side the"

is a puzzle that 13-year-old children could solve before the movie starts.
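
To make the examples above concrete, here is a minimal sketch of one
order-insensitive measure: compare the sets of words in each string
(the Jaccard index), scaled to 0..100. This is an illustration only, not
the method used in the side project; the names tokens() and
token_set_similarity() are made up for this sketch.

```python
import re


def tokens(s):
    """Lowercase the string and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", s.lower()))


def token_set_similarity(a, b):
    """Jaccard index of the two word sets, scaled to an integer 0..100."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        # Two empty strings are treated as identical (an assumption).
        return 100
    return round(100 * len(ta & tb) / len(ta | tb))


if __name__ == "__main__":
    a = "pink floyd - dark side of the moon - money"
    b = "money - dark side of the moon - pink floyd"
    c = "dark floyd of money moon pink side the"
    print(token_set_similarity(a, b))
    print(token_set_similarity(a, c))
```

Because only the word sets are compared, the scrambled string scores the
same as the correctly grouped ones. That shows the limit of the approach:
it ignores word order and grouping entirely.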

My post has three questions:

(1) Does anyone know of an efficient and numerically quantified method of
detecting these sorts of things? I currently have a fairly inefficient and
numerically bogus solution that may be the only non-impossible solution
for the problem.

(2) Does anyone see a need for this feature in PostgreSQL? If so, what
kind of interface would be best accepted as a patch? I am currently
returning a match likelihood between 0 and 100.

(3) Is there also a desire for a Levenshtein distance function for text
and varchars? I experimented with it, and was forced to write the function
in item #1.
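
For reference on item (3), a minimal sketch of the textbook
dynamic-programming Levenshtein distance, written in Python rather than
as a PostgreSQL function. It is not the code from my experiments, and the
function name is only illustrative.

```python
def levenshtein(a, b):
    """Return the minimum number of single-character insertions,
    deletions, and substitutions needed to turn a into b."""
    # Keep the shorter string in the inner loop so the row stays small.
    if len(a) < len(b):
        a, b = b, a
    # prev[j] is the distance between the empty prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance between a[:i] and the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(levenshtein("kitten", "sitting"))
    print(levenshtein("pink floyd - money", "money - pink floyd"))
```

Note that a character-level edit distance counts reordered fields as many
edits, so on its own it does not capture the kind of similarity described
in item (1).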


