Re: levenshtein_less_equal (was: multibyte charater set in levenshtein function) - Mailing list pgsql-hackers

From Robert Haas
Subject Re: levenshtein_less_equal (was: multibyte charater set in levenshtein function)
Date
Msg-id AANLkTinin8HjBeJyMdZCaGRTr8C1O4OawLJnQcrg9-W4@mail.gmail.com
Whole thread Raw
In response to Re: levenshtein_less_equal (was: multibyte charater set in levenshtein function)  (Alexander Korotkov <aekorotkov@gmail.com>)
Responses Re: levenshtein_less_equal (was: multibyte charater set in levenshtein function)
List pgsql-hackers
On Mon, Sep 27, 2010 at 8:42 AM, Alexander Korotkov
<aekorotkov@gmail.com> wrote:
> The second idea is to make values in matrix possible greater. I analyze what
> exactly is matrix in this case. It is sum of original matrix, which
> represent distances between prefixes of strings, and matrix, which represent
> cost of insertions or deletions for moving to diagonal, which containing
> bottom right cell. There is an example of second matrix summand:
>
>       k  i  t  t  e  n
>    1  2  3  4  5  6  7
> s  0  1  2  3  4  5  6
> i  1  0  1  2  3  4  5
> t  2  1  0  1  2  3  4
> t  3  2  1  0  1  2  3
> i  4  3  2  1  0  1  2
> n  5  4  3  2  1  0  1
> g  6  5  4  3  2  1  0
>
> And an example of resulting matrix:
>
>       k  i  t  t  e  n
>    1  3  5  7  9  11 13
> s  1  2  4  6  8  10 12
> i  3  2  2  4  6  8  10
> t  5  4  2  2  4  6  8
> t  7  6  4  2  2  4  6
> i  9  8  6  4  2  3  5
> n  11 10 8  6  4  3  3
> g  13 12 10 8  6  5  3
>
> The resulting matrix saves important property of original matrix, that cell
> value always greater or equal than values, which are used for it's
> calculation.

Hmm.  So if I understand correctly (and it's possible that I don't),
the idea here is that for each cell in the matrix, we store the sum of
the following two quantities:

1. The cost of transforming an initial substring of s into an initial
substring of t (as before), and
2. The smallest imaginable cost of transforming the remaining portion
of s into the remaining portion of t, based on the difference in the
string lengths.

Thus, any cell whose value is already > max_d can't be relevant to the
final result.

The code seems a bit complex, though.  In particular, I'm wondering if
we couldn't simplify things by leaving the meaning of the cells in the
matrix unchanged and just using this calculation to adjust the lower
and upper bounds for each scan.  Perhaps instead of
curr/prev_left/right we could call them min_i and max_i, and after
each inner loop from min_i to max_i we could try to increment min_i
and decrement max_i based on whether cur[min/max_i] + the smallest
possible residual cost > max_d.  Then you've got to also increment
max_i once for each value of j.  Does that make any sense or am I all
wet?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company


pgsql-hackers by date:

Previous
From: Itagaki Takahiro
Date:
Subject: Re: I: About "Our CLUSTER implementation is pessimal" patch
Next
From: Robert Haas
Date:
Subject: Re: Hot Standby tuning for btree_xlog_vacuum()