Re: levenshtein_less_equal (was: multibyte charater set in levenshtein function) - Mailing list pgsql-hackers

From Robert Haas
Subject Re: levenshtein_less_equal (was: multibyte charater set in levenshtein function)
Date
Msg-id AANLkTinin8HjBeJyMdZCaGRTr8C1O4OawLJnQcrg9-W4@mail.gmail.com
In response to Re: levenshtein_less_equal (was: multibyte charater set in levenshtein function)  (Alexander Korotkov <aekorotkov@gmail.com>)
Responses Re: levenshtein_less_equal (was: multibyte charater set in levenshtein function)  (Alexander Korotkov <aekorotkov@gmail.com>)
List pgsql-hackers
On Mon, Sep 27, 2010 at 8:42 AM, Alexander Korotkov
<aekorotkov@gmail.com> wrote:
> The second idea is to make the values in the matrix as large as possible. I
> looked at what the matrix actually is in this case. It is the sum of the
> original matrix, which represents the distances between prefixes of the
> strings, and a second matrix, which represents the cost of the insertions or
> deletions needed to reach the diagonal that contains the bottom-right cell.
> Here is an example of the second matrix summand:
>
>       k  i  t  t  e  n
>    1  2  3  4  5  6  7
> s  0  1  2  3  4  5  6
> i  1  0  1  2  3  4  5
> t  2  1  0  1  2  3  4
> t  3  2  1  0  1  2  3
> i  4  3  2  1  0  1  2
> n  5  4  3  2  1  0  1
> g  6  5  4  3  2  1  0
>
> And an example of the resulting matrix:
>
>       k  i  t  t  e  n
>    1  3  5  7  9  11 13
> s  1  2  4  6  8  10 12
> i  3  2  2  4  6  8  10
> t  5  4  2  2  4  6  8
> t  7  6  4  2  2  4  6
> i  9  8  6  4  2  3  5
> n  11 10 8  6  4  3  3
> g  13 12 10 8  6  5  3
>
> The resulting matrix preserves an important property of the original
> matrix: a cell's value is always greater than or equal to the values used
> to compute it.
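
(Just to check my reading of those tables against the numbers: for the
prefixes "sit" and "kit", the ordinary distance is 1 and the second
summand is |(7-3)-(6-3)| = 1, which matches the 2 in the corresponding
cell of the resulting matrix.)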

Hmm.  So if I understand correctly (and it's possible that I don't),
the idea here is that for each cell in the matrix, we store the sum of
the following two quantities:

1. The cost of transforming an initial substring of s into an initial
substring of t (as before), and
2. The smallest imaginable cost of transforming the remaining portion
of s into the remaining portion of t, based on the difference in the
string lengths.

Thus, any cell whose value is already > max_d can't be relevant to the
final result.
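
If that's right, then whether a given cell can still matter boils down
to a check along these lines (purely illustrative; unit
insertion/deletion costs assumed, dist is the ordinary prefix distance
for cell (i, j), and m and n are the lengths of s and t -- none of
these names come from the patch):

/*
 * Illustration only: can cell (i, j), holding ordinary prefix distance
 * "dist", still contribute to a final distance <= max_d?
 */
static int
cell_can_matter(int dist, int i, int j, int m, int n, int max_d)
{
    int     diff = (m - i) - (n - j);   /* remaining length difference */
    int     residual = diff < 0 ? -diff : diff; /* cheapest conceivable
                                                 * remaining cost */

    return dist + residual <= max_d;
}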

The code seems a bit complex, though.  In particular, I'm wondering if
we couldn't simplify things by leaving the meaning of the cells in the
matrix unchanged and just using this calculation to adjust the lower
and upper bounds for each scan.  Perhaps instead of
curr/prev_left/right we could call them min_i and max_i, and after
each inner loop from min_i to max_i we could try to increment min_i
and decrement max_i based on whether cur[min/max_i] + the smallest
possible residual cost > max_d.  Then you've got to also increment
max_i once for each value of j.  Does that make any sense or am I all
wet?
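
To make that concrete, here's a rough, untested sketch of the loop
structure I have in mind, for the single-byte case with unit costs
(the variable names are mine, not the patch's):

#include <stdlib.h>
#include <string.h>

/* smallest of three ints */
static int
min3(int a, int b, int c)
{
    int     v = a < b ? a : b;

    return v < c ? v : c;
}

/*
 * Sketch only: return levenshtein(s, t) if it is <= max_d, otherwise
 * max_d + 1.  The matrix cells keep their ordinary meaning; the
 * "smallest possible residual cost" is used only to shrink the window
 * [min_i, max_i] of cells worth computing in each row.
 */
static int
levenshtein_less_equal_sketch(const char *s, const char *t, int max_d)
{
    int     m = (int) strlen(s);
    int     n = (int) strlen(t);
    int     big = max_d + 1;    /* stands in for "more than max_d" */
    int    *prev = malloc((m + 1) * sizeof(int));
    int    *curr = malloc((m + 1) * sizeof(int));
    int     min_i = 0;
    int     max_i = m;
    int     i, j, result;

    /* row j = 0: distance from each s-prefix to the empty string */
    for (i = 0; i <= m; i++)
        prev[i] = i;

    /* trim the initial window: cell + residual must stay <= max_d */
    while (min_i <= max_i && prev[min_i] + abs((m - min_i) - n) > max_d)
        min_i++;
    while (max_i >= min_i && prev[max_i] + abs((m - max_i) - n) > max_d)
        max_i--;

    for (j = 1; j <= n && min_i <= max_i; j++)
    {
        int    *tmp;

        /* let the window grow by one cell to the right each row */
        if (max_i < m)
        {
            max_i++;
            prev[max_i] = big;  /* that cell wasn't computed last row */
        }

        /* cells left of the window count as "more than max_d" */
        if (min_i > 0)
            curr[min_i - 1] = big;
        else
            curr[0] = j;

        for (i = (min_i > 0 ? min_i : 1); i <= max_i; i++)
            curr[i] = min3(prev[i] + 1,         /* delete s[i-1] */
                           curr[i - 1] + 1,     /* insert t[j-1] */
                           prev[i - 1] +
                           (s[i - 1] == t[j - 1] ? 0 : 1));

        /* shrink the window from both ends, as described above */
        while (min_i <= max_i &&
               curr[min_i] + abs((m - min_i) - (n - j)) > max_d)
            min_i++;
        while (max_i >= min_i &&
               curr[max_i] + abs((m - max_i) - (n - j)) > max_d)
            max_i--;

        tmp = prev;
        prev = curr;
        curr = tmp;
    }

    result = (min_i <= max_i && max_i == m && prev[m] <= max_d) ?
        prev[m] : big;

    free(prev);
    free(curr);
    return result;
}

The point is that the cells keep their usual meaning and all of the
max_d logic lives in the window bookkeeping.  Whether that actually
works out simpler than what the patch does now, I'm not sure.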

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

