Re: multibyte charater set in levenshtein function - Mailing list pgsql-hackers

From Alexander Korotkov
Subject Re: multibyte charater set in levenshtein function
Date
Msg-id AANLkTinEIfNbX4cZx5GfH2iHbyRIAcMfCsx6hlm5QIj7@mail.gmail.com
Whole thread Raw
In response to Re: multibyte charater set in levenshtein function  (Alexander Korotkov <aekorotkov@gmail.com>)
List pgsql-hackers
Hello Hackers!

I have extended my patch by introducing levenshtein_less_equal function. This function have additional argument max_d and stops calculating when distance exceeds max_d. With low values of max_d function works much faster than original one. 

The example of original levenshtein function usage:

test=# select word, levenshtein(word, 'consistent') as dist from words where levenshtein(word, 'consistent') <= 2 order by dist;
    word     | dist 
-------------+------
 consistent  |    0
 insistent   |    2
 consistency |    2
 coexistent  |    2
 consistence |    2
(5 rows)

test=# explain analyze select word, levenshtein(word, 'consistent') as dist from words where levenshtein(word, 'consistent') <= 2 order by dist;
                                                  QUERY PLAN                                                   
---------------------------------------------------------------------------------------------------------------
 Sort  (cost=2779.13..2830.38 rows=20502 width=8) (actual time=203.652..203.658 rows=5 loops=1)
   Sort Key: (levenshtein(word, 'consistent'::text))
   Sort Method:  quicksort  Memory: 25kB
   ->  Seq Scan on words  (cost=0.00..1310.83 rows=20502 width=8) (actual time=19.019..203.601 rows=5 loops=1)
         Filter: (levenshtein(word, 'consistent'::text) <= 2)
 Total runtime: 203.723 ms
(6 rows)

Example of levenshtein_less_equal usage in this case:

test=# select word, levenshtein_less_equal(word, 'consistent', 2) as dist from words where levenshtein_less_equal(word, 'consistent', 2) <= 2 order by dist;
    word     | dist 
-------------+------
 consistent  |    0
 insistent   |    2
 consistency |    2
 coexistent  |    2
 consistence |    2

test=# explain analyze select word, levenshtein_less_equal(word, 'consistent', 2) as dist from words where levenshtein_less_equal(word, 'consistent', 2) <= 2 order by dist;
                                                 QUERY PLAN                                                  
-------------------------------------------------------------------------------------------------------------
 Sort  (cost=2779.13..2830.38 rows=20502 width=8) (actual time=42.198..42.203 rows=5 loops=1)
   Sort Key: (levenshtein_less_equal(word, 'consistent'::text, 2))
   Sort Method:  quicksort  Memory: 25kB
   ->  Seq Scan on words  (cost=0.00..1310.83 rows=20502 width=8) (actual time=5.391..42.143 rows=5 loops=1)
         Filter: (levenshtein_less_equal(word, 'consistent'::text, 2) <= 2)
 Total runtime: 42.292 ms
(6 rows)

In the example above levenshtein_less_equal works about 5 times faster.

With best regards,
Alexander Korotkov.
Attachment

pgsql-hackers by date:

Previous
From: Alexander Korotkov
Date:
Subject: Using multidimensional indexes in ordinal queries
Next
From: Greg Smith
Date:
Subject: Re: Idea for getting rid of VACUUM FREEZE on cold pages