String Similarity - Mailing list pgsql-hackers

From	Mark Woodward
Subject	String Similarity
Date	May 19, 2006 19:49:23
Msg-id	18405.24.91.171.78.1148068848.squirrel@mail.mohawksoft.com
Responses	Re: String Similarity (6 replies)
List	pgsql-hackers

I have a side project that needs to "intelligently" know if two strings
are contextually similar. Think about how CDDB information is collected
and sorted. It isn't perfect, but there should be enough information to be
usable.

Think about this:

"pink floyd - dark side of the moon - money"
"dark side of the moon - pink floyd - money"
"money - dark side of the moon - pink floyd"
etc.

To a human, these strings are almost identical. Similarly,

"dark floyd of money moon pink side the"

is a puzzle that 13-year-old children could solve before the movie starts.
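
To make the idea of "almost identical" concrete, here is a minimal sketch of one
simple way to score two strings while ignoring word order. It is not the
solution mentioned in this post; the function name token_similarity and the
choice of a Jaccard-style word overlap are assumptions made purely for
illustration.

```python
import re


def token_similarity(a: str, b: str) -> int:
    """Return a 0..100 score based on shared words, ignoring word order.

    Hypothetical helper: Jaccard overlap of the lower-cased word sets.
    """
    words_a = set(re.findall(r"\w+", a.lower()))
    words_b = set(re.findall(r"\w+", b.lower()))
    if not words_a and not words_b:
        # Two empty strings are treated as identical.
        return 100
    shared = words_a & words_b
    total = words_a | words_b
    return round(100 * len(shared) / len(total))


if __name__ == "__main__":
    print(token_similarity(
        "pink floyd - dark side of the moon - money",
        "money - dark side of the moon - pink floyd",
    ))
```

A score like this treats the alphabetized "puzzle" string the same as the
properly ordered ones, because it discards word order and punctuation entirely.
That may or may not be what is wanted.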

My post has three questions:

(1) Does anyone know of an efficient and numerically quantified method of
detecting these sorts of things? I currently have a fairly inefficient and
numerically bogus solution that may be the only non-impossible solution
for the problem.

(2) Does anyone see a need for this feature in PostgreSQL? If so, what
kind of interface would be best accepted as a patch? I am currently
returning a match likelihood between 0 and 100.

(3) Is there also a desire for a Levenshtein distance function for text
and varchars? I experimented with it, and ended up having to write the
function in item (1) instead.
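
For reference, the Levenshtein distance mentioned in item (3) can be sketched
as follows. This is the textbook dynamic-programming formulation, written in
Python for brevity rather than as a proposed PostgreSQL interface; the function
name levenshtein is an assumption for illustration.

```python
def levenshtein(s: str, t: str) -> int:
    """Return the minimum number of single-character insertions,
    deletions, and substitutions needed to turn s into t."""
    # Keep the shorter string in t so each row stays as small as possible.
    if len(s) < len(t):
        s, t = t, s
    # previous[j] holds the distance between the empty prefix of s and t[:j].
    previous = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        current = [i]  # distance between s[:i] and the empty string
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            current.append(min(
                previous[j] + 1,         # delete cs
                current[j - 1] + 1,      # insert ct
                previous[j - 1] + cost,  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


if __name__ == "__main__":
    print(levenshtein(
        "pink floyd - dark side of the moon - money",
        "dark side of the moon - pink floyd - money",
    ))
```

Because the distance counts character edits in order, two strings that contain
the same fields in a different order come out far apart, which is one reason a
plain edit distance is a poor fit for the reordering problem described above.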
