
Re: String Similarity - Mailing list pgsql-hackers

From	Mark Woodward
Subject	Re: String Similarity
Date	May 20, 2006 10:00:51
Msg-id	18626.24.91.171.78.1148130741.squirrel@mail.mohawksoft.com
In response to	Re: String Similarity ("Mark Woodward" <pgsql@mohawksoft.com>)
List	pgsql-hackers


> What I was hoping someone had was a function that could find the substring
> runs in something less than a strlen1*strlen2 number of operations and a
> numerically sane way of representing the similarity or difference.

Actually, it is more like strlen1*strlen2*N, where N is the number of valid
runs.

Unless someone has a GREAT algorithm, I think it will always be at least
strlen1*strlen2. The amount of processing for N is the question. Is N *
(strlen1*strlen2) less than sorting an array of N elements, scanning
through those elements and eliminating duplicate character matches?
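
To make the strlen1*strlen2 baseline concrete, here is a minimal sketch of
one way to collect every maximal common run in a single pass over a
len(a) x len(b) table. It is an illustration under stated assumptions, not
the code discussed in this thread: the function name common_runs, the
min_len cutoff, and the (length, start_a, start_b) tuple layout are all
hypothetical.

```python
def common_runs(a, b, min_len=2):
    """Return maximal common runs as (length, start_a, start_b) tuples.

    Hypothetical helper: costs len(a) * len(b) character comparisons.
    """
    runs = []
    # prev[j] holds the length of the common run ending at a[i-2], b[j-1].
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the run that ended on the previous diagonal cell.
                cur[j] = prev[j - 1] + 1
                # A run is maximal when the next characters differ
                # or when either string has ended.
                at_end = i == len(a) or j == len(b)
                if (at_end or a[i] != b[j]) and cur[j] >= min_len:
                    runs.append((cur[j], i - cur[j], j - cur[j]))
        prev = cur
    return runs
```

The number of tuples returned corresponds to N above; the per-run work that
follows is where the extra factor would come from.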

Depending on the maximum value of N, I could save all the runs, sort them
by length (longest first), then exclude runs based on overlap. It isn't
clear that this is a performance win unless the strings are long. Even
then, I'm not completely convinced, as N still has some strlen
ramifications for removing duplicates.
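
The save, sort, and exclude idea might look roughly like the sketch below.
It assumes the runs have already been collected as hypothetical
(length, start_a, start_b) tuples, and it turns the surviving runs into a
score of 2 * matched / (len_a + len_b). That score is only one possible
"numerically sane" measure and is an assumption here, not something settled
in this thread.

```python
def similarity(runs, len_a, len_b):
    """Score two strings from their common runs, longest run first.

    runs: iterable of (length, start_a, start_b) tuples (assumed layout).
    Returns a value in [0.0, 1.0]; 1.0 means every character was matched.
    """
    used_a = [False] * len_a
    used_b = [False] * len_b
    matched = 0
    # Sort by run length, longest first, so long runs win over short ones.
    for length, sa, sb in sorted(runs, key=lambda r: r[0], reverse=True):
        # Exclude a run that overlaps characters already claimed.
        # This scan is the per-run strlen cost mentioned above.
        if any(used_a[sa:sa + length]) or any(used_b[sb:sb + length]):
            continue
        for k in range(length):
            used_a[sa + k] = True
            used_b[sb + k] = True
        matched += length
    total = len_a + len_b
    # Two empty strings are treated as identical (an assumption).
    return 2.0 * matched / total if total else 1.0
```

Dropping an overlapping run entirely is the simplest policy; trimming it to
its non-overlapping part would keep more matches at the cost of more
bookkeeping.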
