Thread: Index for Levenshtein distance

Index for Levenshtein distance

From
"Janek Sendrowski"
Date:
Hi,

I'm searching for a suitable Index for my query witch compares Strings with the Levenshtein distance.

I read that a prefix index would fit, but I don't know how to build it. I only know that its supported by Gist.

I couldn't find an instructions so I hope someone can help me.

 

Best regards,
Janek

 

 

Index for Levenshtein distance (better format)

From
"Janek Sendrowski"
Date:
Hi,

Im searching for a suitable Index for my query witch compares Strings with the Levenshtein distance.

I read that a prefix index would fit, but I dont know how to build it. I only know that its supported by Gist.

I couldn't find an instructions so I hope someone can help me.
 
Best regards,
Janek
 
 


Re: Index for Levenshtein distance (better format)

From
Giuseppe Broccolo
Date:
Hi Janek,

Il 21/07/2013 16:46, Janek Sendrowski ha scritto:
> Hi,
>
> Im searching for a suitable Index for my query witch compares Strings with the Levenshtein distance.
>
> I read that a prefix index would fit, but I dont know how to build it. I only know that its supported by Gist.
>
> I couldn't find an instructions so I hope someone can help me.

Consider this simple example: suppose you have a table "table1" with a
column "string" of type TEXT, and you are interested to find "string" in
your table equal to 'ciao' basing your query in Levenshtein distance.
Then, improve your query creating an index based on gist.

First of all, create extensions to use levenshtein() function, and gist
indexing for types different from PostGIS objects:

CREATE EXTENSION fuzzystrmatch;
CREATE EXTENSION btree_gist;

You can check that all works fine trying a query like:

SELECT string FROM table1 WHERE levenshtein(string,'ciao') = 0;

which finds also "string" equal to 'ciao'. Create now the index to
improve this query:

CREATE INDEX lev_idx ON table1 USING GIST(levenshtein(string,'ciao'));

And relaunch the same query. I made a simple check with a very simple
table1 example with ten raws using EXPLAIN ANALYSE to exploit the
performances: query time changes from 1.077 ms to 0.677 ms after gist
index creation.

Hope it helps,

Giuseppe.

--
Giuseppe Broccolo - 2ndQuadrant Italy
PostgreSQL Training, Services and Support
giuseppe.broccolo@2ndQuadrant.it  |www.2ndQuadrant.it



Fastest Index/Algorithm to find similar sentences

From
"Janek Sendrowski"
Date:
Hi,

I'm searching for an algorithm/Index to find similar sentences in a database.

The Fulltextsearch is not really suitable because it doesn't have a tolerance.

The Levenshtein-distance ist to slow.

I also tried pg_trgm module, which works with tri-grams, but it's also very slow with 100.000+ rows.

I hope someone can help, I can't really find sth. which is fast enough.

Best regards,
Janek
 
 


Re: Fastest Index/Algorithm to find similar sentences

From
Dann Corbit
Date:
Of course, you can use regular expressions and LIKE.  Without understanding the structure of your database, I don't
knowif that can be made efficient.  For a collection of sentences, I suspect it would get complicated.  It would
probablybe slow.  I guess that what you want to do will be hard to perform  in an efficient manner using a standard
relationaldatabase with commonly used functions such as LIKE and REGEX.
 

Perhaps one of the bioinformatics projects like PostBIO or PostBIS can be adapted to suit your needs.  They deal with
quicklyfinding similar sequences that are very complex, but they are designed specifically for DNA sequences.
 
Just a thought.

-----Original Message-----
From: pgsql-general-owner@postgresql.org [mailto:pgsql-general-owner@postgresql.org] On Behalf Of Janek Sendrowski
Sent: Thursday, July 25, 2013 3:55 PM
To: pgsql-general@postgresql.org
Subject: [GENERAL] Fastest Index/Algorithm to find similar sentences

Hi,

I'm searching for an algorithm/Index to find similar sentences in a database.

The Fulltextsearch is not really suitable because it doesn't have a tolerance.

The Levenshtein-distance ist to slow.

I also tried pg_trgm module, which works with tri-grams, but it's also very slow with 100.000+ rows.

I hope someone can help, I can't really find sth. which is fast enough.

Best regards,
Janek
 
 


--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Re: Fastest Index/Algorithm to find similar sentences

From
Amit Langote
Date:
On Fri, Jul 26, 2013 at 7:54 AM, Janek Sendrowski <janek12@web.de> wrote:
> Hi,
>
> I'm searching for an algorithm/Index to find similar sentences in a database.
>
> The Fulltextsearch is not really suitable because it doesn't have a tolerance.
>
> The Levenshtein-distance ist to slow.
>
> I also tried pg_trgm module, which works with tri-grams, but it's also very slow with 100.000+ rows.
>
> I hope someone can help, I can't really find sth. which is fast enough.
>

Have you tried pg_bigm (a bi-gram based implementation)? It's still in
development phase, but you could give it a try and see if it can
perform better where pg_trgm can not.


--
Amit Langote


Re: Fastest Index/Algorithm to find similar sentences

From
"Janek Sendrowski"
Date:
Thanks for your answers.

@Amit Langote: I had a look and found out that pg_bigm doesn't support similar matches 

@Dann Corbit: The idea with the sequences makes sence. I had a look and I'm not sure, if they support similar sequences

Janek


Re: Fastest Index/Algorithm to find similar sentences

From
Sergey Konoplev
Date:
On Thu, Jul 25, 2013 at 3:54 PM, Janek Sendrowski <janek12@web.de> wrote:
> The Fulltextsearch is not really suitable because it doesn't have a tolerance.

What do you exactly mean by tolerance here?

--
Kind regards,
Sergey Konoplev
PostgreSQL Consultant and DBA

Profile: http://www.linkedin.com/in/grayhemp
Phone: USA +1 (415) 867-9984, Russia +7 (901) 903-0499, +7 (988) 888-1979
Skype: gray-hemp
Jabber: gray.ru@gmail.com


Re: Fastest Index/Algorithm to find similar sentences

From
"Janek Sendrowski"
Date:
Hi Sergey Konoplev,
 
If I'm searching for a sentence like "The tiger is the largest cat species" for example.
 
I can only find the sentences, which include the words "tiger, largest, cat, species", but I also like to have the
sentenceswith only three or even two of these words. 
 
Janek


Re: Fastest Index/Algorithm to find similar sentences

From
Sergey Konoplev
Date:
On Sat, Jul 27, 2013 at 10:04 AM, Janek Sendrowski <janek12@web.de> wrote:
> If I'm searching for a sentence like "The tiger is the largest cat species" for example.
> I can only find the sentences, which include the words "tiger, largest, cat, species", but I also like to have the
sentenceswith only three or even two of these words. 

You can use & (AND), | (OR), and ! (NOT) operators in tsquery, so you
can achieve what you want just like this:

[local]:5432 grayhemp@grayhemp=# select to_tsquery('tiger | largest |
cat | species') @@ to_tsvector('The tiger is the largest cat');
 ?column?
----------
 t

Or may be I understand something wrong again?

>
> Janek
>
>
> --
> Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-general



--
Kind regards,
Sergey Konoplev
PostgreSQL Consultant and DBA

Profile: http://www.linkedin.com/in/grayhemp
Phone: USA +1 (415) 867-9984, Russia +7 (901) 903-0499, +7 (988) 888-1979
Skype: gray-hemp
Jabber: gray.ru@gmail.com


Re: Fastest Index/Algorithm to find similar sentences

From
Dann Corbit
Date:
I worked on a library project once that needed to perform similarity searches.

The first thing needed was to construct a word dictionary where there was a number corresponding to each word.
1, 'aardvark'
...
99999, 'zygote'

Then you need a list of stop words like 'AND', 'THE':
https://en.wikipedia.org/wiki/Stop_words

Then, you write a sentence parser that turns words into their numbers
So now, a bibliography entry (for example) will be a vector of numbers.

You can query with things like wordcount, word x NEAR word y, etc.
If the database supports it, you can also query with bitmap indexes.
I have not used the PostgreSQL bitmap indexes much, but they look like they might be quite useful:
http://wiki.postgresql.org/wiki/Bitmap_Indexes

We used something called ALA library parsing rules that stripped off special characters, made capitalization uniform,
etc.
http://www.ala.org/tools/guidelines/standardsguidelines
Something like this project was the outcome:
http://www.ala.org/lita/ital/21/4/su

You might look into library software.  Maybe you can find something useful here:
http://www.loc.gov/marc/marctools.html

I see that there are some sourceforge MARC record projects:
http://sourceforge.net/directory/os:windows/freshness:recently-updated/?q=marc%20records




Re: Fastest Index/Algorithm to find similar sentences

From
Beena Emerson
Date:
On Sat, Jul 27, 2013 at 10:34 PM, Janek Sendrowski <janek12@web.de> wrote:
Hi Sergey Konoplev,
 
If I'm searching for a sentence like "The tiger is the largest cat species" for example.
 
I can only find the sentences, which include the words "tiger, largest, cat, species", but I also like to have the sentences with only three or even two of these words.
 
Janek


--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Hi,

You may use similarity functions of pg_trgm.

Example:
=# \d+ test
                        Table "public.test"
 Column | Type | Modifiers | Storage  | Stats target | Description 
--------+------+-----------+----------+--------------+-------------
 col    | text |           | extended |              | 
Indexes:
    "test_idx" gin (col gin_trgm_ops)
Has OIDs: no

# SELECT * FROM test;
                   col                   
-----------------------------------------
 The tiger is the largest cat species
 The cheetah is the fastest  cat species
 The peacock is the largest bird species
(3 rows)

=# SELECT show_limit();
 show_limit 
------------
        0.3
(1 row)

=# SELECT col, similarity(col, 'The tiger is the largest cat species') AS sml
  FROM test WHERE col % 'The tiger is the largest cat species'
  ORDER BY sml DESC, col;
                   col                   |   sml    
-----------------------------------------+----------
 The tiger is the largest cat species    |        1
 The peacock is the largest bird species | 0.511111
 The cheetah is the fastest  cat species | 0.466667
(3 rows)

=# SELECT set_limit(0.5);
 set_limit 
-----------
       0.5
(1 row)

=# SELECT col, similarity(col, 'The tiger is the largest cat species') AS sml
  FROM test WHERE col % 'The tiger is the largest cat species'
  ORDER BY sml DESC, col;
                   col                   |   sml    
-----------------------------------------+----------
 The tiger is the largest cat species    |        1
 The peacock is the largest bird species | 0.511111
(2 rows)

=# SELECT set_limit(0.9);
 set_limit 
-----------
       0.9
(1 row)

=# SELECT col, similarity(col, 'The tiger is the largest cat species') AS sml
  FROM test WHERE col % 'The tiger is the largest cat species'
  ORDER BY sml DESC, col;
                 col                  | sml 
--------------------------------------+-----
 The tiger is the largest cat species |   1
(1 row)


When you set a higher limit, you get more exact matches.


--
Beena Emerson

Re: Fastest Index/Algorithm to find similar sentences

From
Beena Emerson
Date:

I am sorry, I just re-read your mail and realized  you have already tried with pg_trgm.



On Wed, Jul 31, 2013 at 7:23 PM, Beena Emerson <memissemerson@gmail.com> wrote:
On Sat, Jul 27, 2013 at 10:34 PM, Janek Sendrowski <janek12@web.de> wrote:
Hi Sergey Konoplev,
 
If I'm searching for a sentence like "The tiger is the largest cat species" for example.
 
I can only find the sentences, which include the words "tiger, largest, cat, species", but I also like to have the sentences with only three or even two of these words.
 
Janek


--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Hi,

You may use similarity functions of pg_trgm.

Example:
=# \d+ test
                        Table "public.test"
 Column | Type | Modifiers | Storage  | Stats target | Description 
--------+------+-----------+----------+--------------+-------------
 col    | text |           | extended |              | 
Indexes:
    "test_idx" gin (col gin_trgm_ops)
Has OIDs: no

# SELECT * FROM test;
                   col                   
-----------------------------------------
 The tiger is the largest cat species
 The cheetah is the fastest  cat species
 The peacock is the largest bird species
(3 rows)

=# SELECT show_limit();
 show_limit 
------------
        0.3
(1 row)

=# SELECT col, similarity(col, 'The tiger is the largest cat species') AS sml
  FROM test WHERE col % 'The tiger is the largest cat species'
  ORDER BY sml DESC, col;
                   col                   |   sml    
-----------------------------------------+----------
 The tiger is the largest cat species    |        1
 The peacock is the largest bird species | 0.511111
 The cheetah is the fastest  cat species | 0.466667
(3 rows)

=# SELECT set_limit(0.5);
 set_limit 
-----------
       0.5
(1 row)

=# SELECT col, similarity(col, 'The tiger is the largest cat species') AS sml
  FROM test WHERE col % 'The tiger is the largest cat species'
  ORDER BY sml DESC, col;
                   col                   |   sml    
-----------------------------------------+----------
 The tiger is the largest cat species    |        1
 The peacock is the largest bird species | 0.511111
(2 rows)

=# SELECT set_limit(0.9);
 set_limit 
-----------
       0.9
(1 row)

=# SELECT col, similarity(col, 'The tiger is the largest cat species') AS sml
  FROM test WHERE col % 'The tiger is the largest cat species'
  ORDER BY sml DESC, col;
                 col                  | sml 
--------------------------------------+-----
 The tiger is the largest cat species |   1
(1 row)


When you set a higher limit, you get more exact matches.


--
Beena Emerson




--
Beena Emerson

Re: Fastest Index/Algorithm to find similar sentences

From
Kevin Grittner
Date:
Janek Sendrowski <janek12@web.de> wrote:

> I also tried pg_trgm module, which works with tri-grams, but it's
> also very slow with 100.000+ rows.

Hmm.  I found the pg_trgm module very fast for name searches with
millions of rows *as long as I used KNN-GiST techniques*.  Were you
careful to do so?  Check out the "Index Support" section of this
page:

http://www.postgresql.org/docs/current/static/pgtrgm.html

While I have not tested this technique with a column containing
sentences, I would expect it to work well.  As a quick
confirmation, I imported the text form of War and Peace into a
table, with one row per *line* (because that was easier than
parsing sentence boundaries for a quick test).  That was over
65,000 rows.

test=# select * from war_and_peace order by linetext <-> 'young wealthy gay gentlemen' limit 3;
 lineno |                               linetext                                
--------+-----------------------------------------------------------------------
   9082 | The gentlemen assembled at Bilibin's were young, wealthy, gay society
  36575 | "Gentlemen, you are crushing me!..."
  55997 | * "Good day, gentlemen."
(3 rows)

test=# explain analyze select * from war_and_peace order by linetext <-> 'young wealthy gay gentlemen' limit 3;
                                                               QUERY PLAN

-----------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.28..0.80 rows=3 width=53) (actual time=37.986..38.002 rows=3 loops=1)
   ->  Index Scan using wap_text on war_and_peace  (cost=0.28..11216.42 rows=65007 width=53) (actual
time=37.984..37.999rows=3 loops=1) 
         Order By: (linetext <-> 'young wealthy gay gentlemen'::text)
 Total runtime: 38.180 ms
(4 rows)

To me, 38 milliseconds to search War and Peace for best matches to
a text string seems reasonable; I'm not sure what you're looking
for, since you didn't give any numbers.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Fastest Index/Algorithm to find similar sentences

From
Sergey Konoplev
Date:
On Sun, Aug 11, 2013 at 6:20 AM, Janek Sendrowski <janek12@web.de> wrote:
> the ranking functions are nice, but if I search for one more word, which is not included in the sentence I'm
searchingfor, It doesn't find the sentence. 

I have already wrote you earlier about it. You can solve the problem
by using & (AND), | (OR), and ! (NOT) operators in tsquery.

select to_tsquery('tiger | largest | cat | species') @@
to_tsvector('The tiger is the largest cat');
 ?column?
----------
 t

The query contains one word more than the sentence (the word is
"species"), and it successfully finds it.

--
Kind regards,
Sergey Konoplev
PostgreSQL Consultant and DBA

http://www.linkedin.com/in/grayhemp
+1 (415) 867-9984, +7 (901) 903-0499, +7 (988) 888-1979
gray.ru@gmail.com


Re: Fastest Index/Algorithm to find similar sentences

From
"andres.pascal"
Date:
As I understand, you need to search strings by similarity (using levenshtein
or any other metric distance). In that case, you can use metric indexes like
FHQT or FQA (this is better if you are using a relational database, like
postgres). But they are not implemented yet in most DBMS, so you need to
program the index. It s not too hard, but you need to understand the base
concepts. You can look for "searching in metric spaces" to read about it. If
you are in a hurry, you can mail me and maybe I can help you.

Andres.



--
View this message in context:
http://postgresql.1045698.n5.nabble.com/Index-for-Levenshtein-distance-tp5764546p5768127.html
Sent from the PostgreSQL - general mailing list archive at Nabble.com.


Re: Fastest Index/Algorithm to find similar sentences

From
Merlin Moncure
Date:
On Fri, Aug 2, 2013 at 10:25 AM, Kevin Grittner <kgrittn@ymail.com> wrote:
> Janek Sendrowski <janek12@web.de> wrote:
>
>> I also tried pg_trgm module, which works with tri-grams, but it's
>> also very slow with 100.000+ rows.
>
> Hmm.  I found the pg_trgm module very fast for name searches with
> millions of rows *as long as I used KNN-GiST techniques*.  Were you
> careful to do so?  Check out the "Index Support" section of this
> page:
>
> http://www.postgresql.org/docs/current/static/pgtrgm.html
>
> While I have not tested this technique with a column containing
> sentences, I would expect it to work well.  As a quick
> confirmation, I imported the text form of War and Peace into a
> table, with one row per *line* (because that was easier than
> parsing sentence boundaries for a quick test).  That was over
> 65,000 rows.

+ 1 this.  pg_trgm is black magic.  search time (when using index) is
mostly dependent on number of trigrams in search string vs average
number of trigrams in database.

merlin