Thread: String Similarity

String Similarity

From
"Mark Woodward"
Date:
I have a side project that needs to "intelligently" know if two strings
are contextually similar. Think about how CDDB information is collected
and sorted. It isn't perfect, but there should be enough information to be
usable.

Think about this:

"pink floyd - dark side of the moon - money"
"dark side of the moon - pink floyd - money"
"money - dark side of the moon - pink floyd"
etc.

To a human, these strings are almost identical. Similarly:

"dark floyd of money moon pink side the"

Is a puzzle to be solved by 13 year old children before the movie starts.

My post has three questions:

(1) Does anyone know of an efficient and numerically quantified method of
detecting these sorts of things? I currently have a fairly inefficient and
numerically bogus solution that may be the only non-impossible solution
for the problem.

(2) Does any one see a need for this feature in PostgreSQL? If so, what
kind of interface would be best accepted as a patch? I am currently
returning a match liklihood between 0 and 100;

(3) Is there also a desire for a Levenshtein distence function for text
and varchars? I experimented with it, and was forced to write the function
in item #1.



Re: String Similarity

From
Martijn van Oosterhout
Date:
On Fri, May 19, 2006 at 04:00:48PM -0400, Mark Woodward wrote:
> (3) Is there also a desire for a Levenshtein distence function for text
> and varchars? I experimented with it, and was forced to write the function
> in item #1.

Postgres already has a Levenshtein distence function, see fuzzystrmatch
in contrib. Whatever you come up with might fit in well there...

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.

Re: String Similarity

From
Andrew Dunstan
Date:
Mark Woodward wrote:
>
> (3) Is there also a desire for a Levenshtein distence function for text
> and varchars? I experimented with it, and was forced to write the function
> in item #1.
>
>   


fuzzystrmatch in contrib already has a Levenshtein function.

cheers

andrew


Re: String Similarity

From
Mark Dilger
Date:
Mark Woodward wrote:
> I have a side project that needs to "intelligently" know if two strings
> are contextually similar. Think about how CDDB information is collected
> and sorted. It isn't perfect, but there should be enough information to be
> usable.
> 
> Think about this:
> 
> "pink floyd - dark side of the moon - money"
> "dark side of the moon - pink floyd - money"
> "money - dark side of the moon - pink floyd"
> etc.
> 
> To a human, these strings are almost identical. Similarly:
> 
> "dark floyd of money moon pink side the"
> 
> Is a puzzle to be solved by 13 year old children before the movie starts.
> 
> My post has three questions:
> 
> (1) Does anyone know of an efficient and numerically quantified method of
> detecting these sorts of things? I currently have a fairly inefficient and
> numerically bogus solution that may be the only non-impossible solution
> for the problem.
> 
> (2) Does any one see a need for this feature in PostgreSQL? If so, what
> kind of interface would be best accepted as a patch? I am currently
> returning a match liklihood between 0 and 100;
> 
> (3) Is there also a desire for a Levenshtein distence function for text
> and varchars? I experimented with it, and was forced to write the function
> in item #1.

The Levenshtein distance (also known as "edit distance") won't really give you
what you want above, because operations to transplant whole chunks of the string
aren't supported.  (You can simulate it with inserts and deletes, but you pay
individually for each of them.)  Also, Levenshtein distances don't charge much
for changing a word into a similarly spelled but semantically distinct word,
such as "word" => "work".

What you would want, I think, is some function that recognizes that the whole
substring "pink floyd" has been moved from the beginning to the middle of the
string, and only charges you a small edit cost for having done so.  It would
need to recognize both the word boundaries and the transplants.  Off the top of
my head, I'm not sure how you would achieve that with good runtime
characteristics.  You can go even further and allow synonyms, so that "pink
floyd" is more related to "red floyd" than it is to "large floyd", but for that
sort of thing you would probably need to pull in wordnet.

If you want to notice that two strings contain local similarity, but don't have
an overall good Levenshtein distance, take a look at global vs. local alignment
algorithms used in biological applications.  Local alignment can be achieved in
O(n*m) time, where n and m are the lengths of the two strings, using the
Smith-Waterman algorithm.  (Temple Smith and Michael Waterman).  There are
faster heuristic algorithms, but they don't have the same guarantees.  These
local alignments might tell you something useful as a part of the overall solution.

Hmmm...  I think I like this problem.  Maybe I'll work on it a bit as a contrib
module.

mark


Re: String Similarity

From
"Mark Woodward"
Date:
> Mark Woodward wrote:
>> I have a side project that needs to "intelligently" know if two strings
>> are contextually similar. Think about how CDDB information is collected
>> and sorted. It isn't perfect, but there should be enough information to
>> be
>> usable.
>>
>> Think about this:
>>
>> "pink floyd - dark side of the moon - money"
>> "dark side of the moon - pink floyd - money"
>> "money - dark side of the moon - pink floyd"
>> etc.
>>
>> To a human, these strings are almost identical. Similarly:
>>
>> "dark floyd of money moon pink side the"
>>
>> Is a puzzle to be solved by 13 year old children before the movie
>> starts.
[snip]
>
> Hmmm...  I think I like this problem.  Maybe I'll work on it a bit as a
> contrib
> module.

I *have* a working function, but it is not very efficient and it is not
what I would call numerically predictable. And it does find the various
sub-strings between the two strings in question.

Email me offline and we can make something for contrib.


Re: String Similarity

From
"Greg Sabino Mullane"
Date:
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1


> I have a side project that needs to "intelligently" know if two strings
> are contextually similar.

The examples you gave seem heavy on word order and whitespace consideration,
before applying any algorithms. Here's a quick perl version that does the
job:

CREATE OR REPLACE FUNCTION matchval(text,text)
RETURNS INT LANGUAGE plperlu AS
$$ 
use strict;
use String::Approx 'adist'; 
my $uno = join ' ', sort split /\s+/ => lc shift;
my $dos = join ' ', sort split /\s+/ => lc shift; 
return adist(length $uno<length $dos ? ($uno,$dos) : ($dos,$uno));

$$;

Some sample runs: 
SELECT matchval('pink floyd - dark side of the moon - money', 'dark side of the moon - pink floyd - money');
SELECT matchval('dark floyd of money moon pink side the', 'Money - dark side of the moon - Pink Floyd');
SELECT matchval('dark floyd of money moon pink side the', 'monee - drk sidez of da moon - pink floyd');
SELECT matchval('dark floyd of money moon pink side the', 'pink floyd - animals');
SELECT matchval('dark floyd of money moon pink side the', 'walking on the moon - the police');

The above returns 0, 0, 6, 10, and 17; a score of 0 is an exact match.

- --
Greg Sabino Mullane greg@turnstep.com
PGP Key: 0x14964AC8 200605191835
http://biglumber.com/x/web?pk=2529DF6AB8F79407E94445B4BC9B906714964AC8
-----BEGIN PGP SIGNATURE-----

iD8DBQFEbktUvJuQZxSWSsgRAiCtAJ9nlpqGxlYnimDPp8t5XQsc8y9RywCfZZL6
iU9iPnxHaWOvYCUD7+rK8Do=
=zo3T
-----END PGP SIGNATURE-----




Re: String Similarity

From
Josh Berkus
Date:
> > I have a side project that needs to "intelligently" know if two
> > strings are contextually similar.

Also check out the "fuzzystrmatch" module in /contrib, which offers 
soundex, metaphone and levenschtein functions.

-- 
--Josh

Josh Berkus
PostgreSQL @ Sun
San Francisco


Re: String Similarity

From
"Mark Woodward"
Date:
>
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
>
>> I have a side project that needs to "intelligently" know if two strings
>> are contextually similar.
>
> The examples you gave seem heavy on word order and whitespace
> consideration,
> before applying any algorithms. Here's a quick perl version that does the
> job:

[SNIP]

This is a case where the example was too simple to explain the problem,
sorry. I have an implementation of Oracle's "contains" function for
PostgreSQL, and it does basically what you are doing, and, in fact, also
has Mohawk Software Extensions (LOL) that provide metaphone. The problem
is that parsing white space realy isn't reliable. Sometimes it is
pinkfloyd-darksideofthemoon.

Also, I have been thinking of other applications.

I have a piece of code that does this:

apps$ ./stratest "pink foyd dark side of the moon money" "money dark side
of the moon pink floyd"
Match:  dark side of the moon
Match: pink f
Match: money
Match: oyd

apps$ ./stratest "pinkfoyddarksideofthemoonmoney"
"moneydarksideofthemoonpinkfloyd"
Match: darksideofthemoon
Match: pinkf
Match: money
Match: oyd

I need to come up with a numerically sane way of taking this information
and understanding overall "similarity."


Re: String Similarity

From
Oleg Bartunov
Date:
Get pg_trgm http://www.sai.msu.su/~megera/oddmuse/index.cgi/ReadmeTrgm
It doesn't depends on language.

Oleg
On Fri, 19 May 2006, Mark Woodward wrote:

> I have a side project that needs to "intelligently" know if two strings
> are contextually similar. Think about how CDDB information is collected
> and sorted. It isn't perfect, but there should be enough information to be
> usable.
>
> Think about this:
>
> "pink floyd - dark side of the moon - money"
> "dark side of the moon - pink floyd - money"
> "money - dark side of the moon - pink floyd"
> etc.
>
> To a human, these strings are almost identical. Similarly:
>
> "dark floyd of money moon pink side the"
>
> Is a puzzle to be solved by 13 year old children before the movie starts.
>
> My post has three questions:
>
> (1) Does anyone know of an efficient and numerically quantified method of
> detecting these sorts of things? I currently have a fairly inefficient and
> numerically bogus solution that may be the only non-impossible solution
> for the problem.
>
> (2) Does any one see a need for this feature in PostgreSQL? If so, what
> kind of interface would be best accepted as a patch? I am currently
> returning a match liklihood between 0 and 100;
>
> (3) Is there also a desire for a Levenshtein distence function for text
> and varchars? I experimented with it, and was forced to write the function
> in item #1.
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 1: if posting/reading through Usenet, please send an appropriate
>       subscribe-nomail command to majordomo@postgresql.org so that your
>       message can get through to the mailing list cleanly
>
    Regards,        Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83


Re: String Similarity

From
"Mark Woodward"
Date:
> Get pg_trgm http://www.sai.msu.su/~megera/oddmuse/index.cgi/ReadmeTrgm
> It doesn't depends on language.

That's an interesting approach.

This is what I got:

apps$ ./stratest "pink floyd dark side of the moon money" "dark side of
the moon pink floyd"
Match: dark side of the moon
Match: pink floyd
Similarity: 89

One function finds the substring runs, in descending order of length,
between the two strings. After the function, I have number of runs, length
of best run, total number of characters matched.

Without going into too lengthy description, while space and punctuation
are not reliable. Like this "pinkfloyd" or "pink floyd" "darkside" or
"dark side"

Humans are VERY good at seeing these things, computers, pardon, suck.

What I was hoping someone had was a function that could find the substring
runs in something less than a strlen1*strlen2 number of operations and a
numerically sane way of representing the similarity or difference.


Re: String Similarity

From
"Mark Woodward"
Date:
> What I was hoping someone had was a function that could find the substring
> runs in something less than a strlen1*strlen2 number of operations and a
> numerically sane way of representing the similarity or difference.

Acually, it is more like strlen1*strlen2*N, where N is the number of valid
runs.

Unless someone has a GREAT algorithm, I think it will always be at least
strlen1*strlen2. The amount of processing for N is the question. Is N *
(strlen1*strlen2) less than sorting an array of N elements, scanning
through those elements and eliminating duplicate character matches?

Depending on the max value of N, I could save all the runs, sort by max
length, then exclude based on overlapp, but it isn't clear that this is a
performance win unless the strings are long, even then, I'm not completely
convinced as N still has some strlen ramifications for removing
duplicates.


Re: String Similarity

From
Christopher Kings-Lynne
Date:
Try contrib/pg_trgm...

Chris

Mark Woodward wrote:
> I have a side project that needs to "intelligently" know if two strings
> are contextually similar. Think about how CDDB information is collected
> and sorted. It isn't perfect, but there should be enough information to be
> usable.
> 
> Think about this:
> 
> "pink floyd - dark side of the moon - money"
> "dark side of the moon - pink floyd - money"
> "money - dark side of the moon - pink floyd"
> etc.
> 
> To a human, these strings are almost identical. Similarly:
> 
> "dark floyd of money moon pink side the"
> 
> Is a puzzle to be solved by 13 year old children before the movie starts.
> 
> My post has three questions:
> 
> (1) Does anyone know of an efficient and numerically quantified method of
> detecting these sorts of things? I currently have a fairly inefficient and
> numerically bogus solution that may be the only non-impossible solution
> for the problem.
> 
> (2) Does any one see a need for this feature in PostgreSQL? If so, what
> kind of interface would be best accepted as a patch? I am currently
> returning a match liklihood between 0 and 100;
> 
> (3) Is there also a desire for a Levenshtein distence function for text
> and varchars? I experimented with it, and was forced to write the function
> in item #1.
> 
> 
> ---------------------------(end of broadcast)---------------------------
> TIP 1: if posting/reading through Usenet, please send an appropriate
>        subscribe-nomail command to majordomo@postgresql.org so that your
>        message can get through to the mailing list cleanly

-- 
Christopher Kings-Lynne

Technical Manager
CalorieKing
Tel: +618.9389.8777
Fax: +618.9389.8444
chris.kings-lynne@calorieking.com
www.calorieking.com



Re: String Similarity

From
"Mark Woodward"
Date:
> Try contrib/pg_trgm...

Tri-graphs are interesting, and I'll try to reconsider whether they fit or
not, ut I suspect that do not. (You are the second to recommend it)

Anything based on a word parser is probably not appropriate, the example I
first gave is a little misleading in that it is not the whole of the
problem. Consider this:

"pinkfloyd darkside of the moon - money"

Again, we humans see that this string is almost identical to the others.

I have a working system right now, and it strips all non "alnum()" out of
the strings, then searches the strings for runs, and compiles a list of
runs from longest to shortest.

The algorithm is strlen1*strlen2*N where N is the number of runs detected.
As you can see it is merely brute force.

Secondly, there is a subtle difference between comparing a known string
for which you are searching and comparing two arbitrary strings. The
"known" string is assumed to be in some sort of regular form and the
string to be compared must break down into that form based on your
alorithm. When trying to understand the similarity of two arbitrary
strings, you don't always know what similarities are or what the parsing
rules should be.

ThisIsSomethinThatHumansCanReadButSpaceBasedParsersCanNot.
tHISiSaLSOsOMETHING
Yo uca nalmos trea dthi s

can you see the similarity in these two strings?
CanYouSeeTheSimilarityInTheseTwoStrings?

Ideally I would metahone the individual words, strip out all the white
spaces, and find the run lengths that compare in the two strings, and
calculate the similarity based on the number and size of the runs. I do
not currently metaphone (It isn't clear to me that endcases can be handled
correctly) and I'm not sure how to best calculate similarity. I've been
trying best run, total match, and a host of others, but haven't found one
I really like.

>
> Chris
>
> Mark Woodward wrote:
>> I have a side project that needs to "intelligently" know if two strings
>> are contextually similar. Think about how CDDB information is collected
>> and sorted. It isn't perfect, but there should be enough information to
>> be
>> usable.
>>
>> Think about this:
>>
>> "pink floyd - dark side of the moon - money"
>> "dark side of the moon - pink floyd - money"
>> "money - dark side of the moon - pink floyd"
>> etc.
>>
>> To a human, these strings are almost identical. Similarly:
>>
>> "dark floyd of money moon pink side the"
>>
>> Is a puzzle to be solved by 13 year old children before the movie
>> starts.
>>
>> My post has three questions:
>>
>> (1) Does anyone know of an efficient and numerically quantified method
>> of
>> detecting these sorts of things? I currently have a fairly inefficient
>> and
>> numerically bogus solution that may be the only non-impossible solution
>> for the problem.
>>
>> (2) Does any one see a need for this feature in PostgreSQL? If so, what
>> kind of interface would be best accepted as a patch? I am currently
>> returning a match liklihood between 0 and 100;
>>
>> (3) Is there also a desire for a Levenshtein distence function for text
>> and varchars? I experimented with it, and was forced to write the
>> function
>> in item #1.
>>
>>
>> ---------------------------(end of broadcast)---------------------------
>> TIP 1: if posting/reading through Usenet, please send an appropriate
>>        subscribe-nomail command to majordomo@postgresql.org so that your
>>        message can get through to the mailing list cleanly
>
> --
> Christopher Kings-Lynne
>
> Technical Manager
> CalorieKing
> Tel: +618.9389.8777
> Fax: +618.9389.8444
> chris.kings-lynne@calorieking.com
> www.calorieking.com
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 6: explain analyze is your friend
>



Re: String Similarity

From
"Pang Zaihu"
Date:
Hello!
Would you like to give me a simple introduction of Levenshtein distence function?
Thank you!
 
 On 2006-05-19 19:54, Martijn van Oosterhout wrote: >  On Fri, May 19, 2006 at 04:00:48PM -0400, Mark Woodward wrote: >  >  (3) Is there also a desire for a Levenshtein distence function for text >  >  and varchars? I experimented with it, and was forced to write the function >  >  in item #1. >   >  Postgres already has a Levenshtein distence function, see fuzzystrmatch >  in contrib. Whatever you come up with might fit in well there... >   >  Have a nice day, >  >  From each according to his ability. To each according to his ability to litigate. >

Re: String Similarity

From
tomas@tuxteam.de
Date:
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Tue, Sep 26, 2006 at 09:09:33AM +0800, Pang Zaihu wrote:
> Hello!
> Would you like to give me a simple introduction of Levenshtein distence function?

Better than I could explain:
 <http://en.wikipedia.org/wiki/Levenshtein_distance>

> Thank you!

Thank Wikipedia ;-)

HTH
- -- tomas
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.1 (GNU/Linux)

iD8DBQFFG1UIBcgs9XrR2kYRAr42AJ0TjRnUBqmogcKg12mXRVFl6oAjqQCeP/hw
HmqRS+AANLP9eNbNIWp7jOM=
=FHks
-----END PGP SIGNATURE-----