Thread: Levenshtein Distance with more than 255 characters

Levenshtein Distance with more than 255 characters

From
"Janek Sendrowski"
Date:
Hi,

 

I'm searching for an optimized Levenshtein Distance like Postgresql's. My problem is that I want to compare
stringswith a length over 255 characters. 

Does anyone know a solution?

 

Janek Sendrowski

Re: Levenshtein Distance with more than 255 characters

From
Szymon Guz
Date:

On 6 September 2013 01:00, Janek Sendrowski <janek12@web.de> wrote:
Hi,
 
I'm searching for an optimized Levenshtein Distance like Postgresql's. My problem is that I want to compare strings with a length over 255 characters.
Does anyone know a solution?
 
Janek Sendrowski


Hi,
I'm not sure there is anything different from what you've found in core/contribs. But you can always use pg/plpython or pg/plperl procedure with some external library calculating the distance.

regards
Szymon

Re: Levenshtein Distance with more than 255 characters

From
Tom Lane
Date:
Szymon Guz <mabewlun@gmail.com> writes:
> On 6 September 2013 01:00, Janek Sendrowski <janek12@web.de> wrote:
>> I'm searching for an optimized Levenshtein Distance like Postgresql's. My
>> problem is that I want to compare strings with a length over 255 characters.
>> Does anyone know a solution?

> I'm not sure there is anything different from what you've found in
> core/contribs. But you can always use pg/plpython or pg/plperl procedure
> with some external library calculating the distance.

Well, you could just rebuild the fuzzystrmatch module with a different
value for MAX_LEVENSHTEIN_STRLEN.  The comments in the code note that the
comparison cost is roughly O(N^2) in the string length, and the reason for
having a limit at all is to ensure the function runtime doesn't get out of
hand --- but it seems likely to me that 255 is an unnecessarily
conservative limit.  If you wanted to do a few tests and report back on
just how slow it can get, we might be persuaded to raise the stock
setting.

            regards, tom lane


Re: Levenshtein Distance with more than 255 characters

From
Szymon Guz
Date:
On 6 September 2013 08:47, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Szymon Guz <mabewlun@gmail.com> writes:
> On 6 September 2013 01:00, Janek Sendrowski <janek12@web.de> wrote:
>> I'm searching for an optimized Levenshtein Distance like Postgresql's. My
>> problem is that I want to compare strings with a length over 255 characters.
>> Does anyone know a solution?

> I'm not sure there is anything different from what you've found in
> core/contribs. But you can always use pg/plpython or pg/plperl procedure
> with some external library calculating the distance.

Well, you could just rebuild the fuzzystrmatch module with a different
value for MAX_LEVENSHTEIN_STRLEN.  The comments in the code note that the
comparison cost is roughly O(N^2) in the string length, and the reason for
having a limit at all is to ensure the function runtime doesn't get out of
hand --- but it seems likely to me that 255 is an unnecessarily
conservative limit.  If you wanted to do a few tests and report back on
just how slow it can get, we might be persuaded to raise the stock
setting.

                        regards, tom lane


I've checked that and I think we could raise the limit without any problem

I've set 
#define MAX_LEVENSHTEIN_STRLEN 255 * 255

I was using the levenshtein function comparing two strings of the same length

Strings length: 640
Difference: 531
Time: 2.5ms

Strings length: 1920
Difference: 1258
Time: 20ms


Strings length: 5760
Difference: 1811
Time: 146ms


regards,
Szymon