2010/8/2 Alexander Korotkov <aekorotkov@gmail.com>:
> The dump of the table with russian dictionary is in attachment.
>
> I use following tests:
> SELECT SUM(levenshtein(a, 'foo')) from words;
> SELECT SUM(levenshtein(a, 'Urbański')) FROM words;
> SELECT SUM(levenshtein(a, 'ańs')) FROM words;
> SELECT SUM(levenshtein(a, 'foo')) from words2;
> SELECT SUM(levenshtein(a, 'дом')) FROM words2;
> SELECT SUM(levenshtein(a, 'компьютер')) FROM words2;
>
> With your last version of patch results are:
> 63ms 94ms 61ms 100ms 121ms 217ms.
>
> But I found some other optimization. With this condition:
> if (x[x_char_len-1] == y[y_char_len-1] && x_char_len == y_char_len &&
>     (x_char_len == 1 || char_same(x, y, x_char_len - 1)))
> results become:
> 58ms 96ms 63ms 102ms 107ms 171ms
>
> On single-byte characters the results almost didn't change, but they
> improve with multi-byte characters. Generally, this improvement comes
> about because the first bytes of different characters are frequently the
> same (for example, almost all Cyrillic characters start with the same
> byte, and I think the situation is similar in other alphabets), whereas a
> match of the last bytes is usually just a coincidence.
Hey, that's really cool. It helps a lot here, too:
My previous patch, with your 6 test cases:
Time: 100.556 ms
Time: 205.254 ms
Time: 127.070 ms
Time: 77.734 ms
Time: 90.345 ms
Time: 166.954 ms
Your original patch, same 6 test cases:
Time: 131.489 ms
Time: 215.048 ms
Time: 125.471 ms
Time: 80.068 ms
Time: 87.110 ms
Time: 166.918 ms
My patch, with your proposed change to compare the last byte
first, same 6 test cases:
Time: 96.489 ms
Time: 201.497 ms
Time: 121.876 ms
Time: 75.005 ms
Time: 80.219 ms
Time: 142.844 ms
Updated patch attached.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company