Thread: Index for Levenshtein distance
Hi, I'm searching for a suitable Index for my query witch compares Strings with the Levenshtein distance. I read that a prefix index would fit, but I don't know how to build it. I only know that its supported by Gist. I couldn't find an instructions so I hope someone can help me. Best regards, Janek
Hi, Im searching for a suitable Index for my query witch compares Strings with the Levenshtein distance. I read that a prefix index would fit, but I dont know how to build it. I only know that its supported by Gist. I couldn't find an instructions so I hope someone can help me. Best regards, Janek
Hi Janek, Il 21/07/2013 16:46, Janek Sendrowski ha scritto: > Hi, > > Im searching for a suitable Index for my query witch compares Strings with the Levenshtein distance. > > I read that a prefix index would fit, but I dont know how to build it. I only know that its supported by Gist. > > I couldn't find an instructions so I hope someone can help me. Consider this simple example: suppose you have a table "table1" with a column "string" of type TEXT, and you are interested to find "string" in your table equal to 'ciao' basing your query in Levenshtein distance. Then, improve your query creating an index based on gist. First of all, create extensions to use levenshtein() function, and gist indexing for types different from PostGIS objects: CREATE EXTENSION fuzzystrmatch; CREATE EXTENSION btree_gist; You can check that all works fine trying a query like: SELECT string FROM table1 WHERE levenshtein(string,'ciao') = 0; which finds also "string" equal to 'ciao'. Create now the index to improve this query: CREATE INDEX lev_idx ON table1 USING GIST(levenshtein(string,'ciao')); And relaunch the same query. I made a simple check with a very simple table1 example with ten raws using EXPLAIN ANALYSE to exploit the performances: query time changes from 1.077 ms to 0.677 ms after gist index creation. Hope it helps, Giuseppe. -- Giuseppe Broccolo - 2ndQuadrant Italy PostgreSQL Training, Services and Support giuseppe.broccolo@2ndQuadrant.it |www.2ndQuadrant.it
Hi, I'm searching for an algorithm/Index to find similar sentences in a database. The Fulltextsearch is not really suitable because it doesn't have a tolerance. The Levenshtein-distance ist to slow. I also tried pg_trgm module, which works with tri-grams, but it's also very slow with 100.000+ rows. I hope someone can help, I can't really find sth. which is fast enough. Best regards, Janek
Of course, you can use regular expressions and LIKE. Without understanding the structure of your database, I don't knowif that can be made efficient. For a collection of sentences, I suspect it would get complicated. It would probablybe slow. I guess that what you want to do will be hard to perform in an efficient manner using a standard relationaldatabase with commonly used functions such as LIKE and REGEX. Perhaps one of the bioinformatics projects like PostBIO or PostBIS can be adapted to suit your needs. They deal with quicklyfinding similar sequences that are very complex, but they are designed specifically for DNA sequences. Just a thought. -----Original Message----- From: pgsql-general-owner@postgresql.org [mailto:pgsql-general-owner@postgresql.org] On Behalf Of Janek Sendrowski Sent: Thursday, July 25, 2013 3:55 PM To: pgsql-general@postgresql.org Subject: [GENERAL] Fastest Index/Algorithm to find similar sentences Hi, I'm searching for an algorithm/Index to find similar sentences in a database. The Fulltextsearch is not really suitable because it doesn't have a tolerance. The Levenshtein-distance ist to slow. I also tried pg_trgm module, which works with tri-grams, but it's also very slow with 100.000+ rows. I hope someone can help, I can't really find sth. which is fast enough. Best regards, Janek -- Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general
On Fri, Jul 26, 2013 at 7:54 AM, Janek Sendrowski <janek12@web.de> wrote: > Hi, > > I'm searching for an algorithm/Index to find similar sentences in a database. > > The Fulltextsearch is not really suitable because it doesn't have a tolerance. > > The Levenshtein-distance ist to slow. > > I also tried pg_trgm module, which works with tri-grams, but it's also very slow with 100.000+ rows. > > I hope someone can help, I can't really find sth. which is fast enough. > Have you tried pg_bigm (a bi-gram based implementation)? It's still in development phase, but you could give it a try and see if it can perform better where pg_trgm can not. -- Amit Langote
Thanks for your answers. @Amit Langote: I had a look and found out that pg_bigm doesn't support similar matches @Dann Corbit: The idea with the sequences makes sence. I had a look and I'm not sure, if they support similar sequences Janek
On Thu, Jul 25, 2013 at 3:54 PM, Janek Sendrowski <janek12@web.de> wrote: > The Fulltextsearch is not really suitable because it doesn't have a tolerance. What do you exactly mean by tolerance here? -- Kind regards, Sergey Konoplev PostgreSQL Consultant and DBA Profile: http://www.linkedin.com/in/grayhemp Phone: USA +1 (415) 867-9984, Russia +7 (901) 903-0499, +7 (988) 888-1979 Skype: gray-hemp Jabber: gray.ru@gmail.com
Hi Sergey Konoplev, If I'm searching for a sentence like "The tiger is the largest cat species" for example. I can only find the sentences, which include the words "tiger, largest, cat, species", but I also like to have the sentenceswith only three or even two of these words. Janek
On Sat, Jul 27, 2013 at 10:04 AM, Janek Sendrowski <janek12@web.de> wrote: > If I'm searching for a sentence like "The tiger is the largest cat species" for example. > I can only find the sentences, which include the words "tiger, largest, cat, species", but I also like to have the sentenceswith only three or even two of these words. You can use & (AND), | (OR), and ! (NOT) operators in tsquery, so you can achieve what you want just like this: [local]:5432 grayhemp@grayhemp=# select to_tsquery('tiger | largest | cat | species') @@ to_tsvector('The tiger is the largest cat'); ?column? ---------- t Or may be I understand something wrong again? > > Janek > > > -- > Sent via pgsql-general mailing list (pgsql-general@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-general -- Kind regards, Sergey Konoplev PostgreSQL Consultant and DBA Profile: http://www.linkedin.com/in/grayhemp Phone: USA +1 (415) 867-9984, Russia +7 (901) 903-0499, +7 (988) 888-1979 Skype: gray-hemp Jabber: gray.ru@gmail.com
I worked on a library project once that needed to perform similarity searches. The first thing needed was to construct a word dictionary where there was a number corresponding to each word. 1, 'aardvark' ... 99999, 'zygote' Then you need a list of stop words like 'AND', 'THE': https://en.wikipedia.org/wiki/Stop_words Then, you write a sentence parser that turns words into their numbers So now, a bibliography entry (for example) will be a vector of numbers. You can query with things like wordcount, word x NEAR word y, etc. If the database supports it, you can also query with bitmap indexes. I have not used the PostgreSQL bitmap indexes much, but they look like they might be quite useful: http://wiki.postgresql.org/wiki/Bitmap_Indexes We used something called ALA library parsing rules that stripped off special characters, made capitalization uniform, etc. http://www.ala.org/tools/guidelines/standardsguidelines Something like this project was the outcome: http://www.ala.org/lita/ital/21/4/su You might look into library software. Maybe you can find something useful here: http://www.loc.gov/marc/marctools.html I see that there are some sourceforge MARC record projects: http://sourceforge.net/directory/os:windows/freshness:recently-updated/?q=marc%20records
On Sat, Jul 27, 2013 at 10:34 PM, Janek Sendrowski <janek12@web.de> wrote:
--
Hi Sergey Konoplev,
If I'm searching for a sentence like "The tiger is the largest cat species" for example.
I can only find the sentences, which include the words "tiger, largest, cat, species", but I also like to have the sentences with only three or even two of these words.
Janek
--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
Hi,
You may use similarity functions of pg_trgm.
Example:
=# \d+ test
Table "public.test"
Column | Type | Modifiers | Storage | Stats target | Description
--------+------+-----------+----------+--------------+-------------
col | text | | extended | |
Indexes:
"test_idx" gin (col gin_trgm_ops)
Has OIDs: no
# SELECT * FROM test;
col
-----------------------------------------
The tiger is the largest cat species
The cheetah is the fastest cat species
The peacock is the largest bird species
(3 rows)
=# SELECT show_limit();
show_limit
------------
0.3
(1 row)
=# SELECT col, similarity(col, 'The tiger is the largest cat species') AS sml
FROM test WHERE col % 'The tiger is the largest cat species'
ORDER BY sml DESC, col;
col | sml
-----------------------------------------+----------
The tiger is the largest cat species | 1
The peacock is the largest bird species | 0.511111
The cheetah is the fastest cat species | 0.466667
(3 rows)
=# SELECT set_limit(0.5);
set_limit
-----------
0.5
(1 row)
=# SELECT col, similarity(col, 'The tiger is the largest cat species') AS sml
FROM test WHERE col % 'The tiger is the largest cat species'
ORDER BY sml DESC, col;
col | sml
-----------------------------------------+----------
The tiger is the largest cat species | 1
The peacock is the largest bird species | 0.511111
(2 rows)
=# SELECT set_limit(0.9);
set_limit
-----------
0.9
(1 row)
=# SELECT col, similarity(col, 'The tiger is the largest cat species') AS sml
FROM test WHERE col % 'The tiger is the largest cat species'
ORDER BY sml DESC, col;
col | sml
--------------------------------------+-----
The tiger is the largest cat species | 1
(1 row)
When you set a higher limit, you get more exact matches.
Beena Emerson
I am sorry, I just re-read your mail and realized you have already tried with pg_trgm.
On Wed, Jul 31, 2013 at 7:23 PM, Beena Emerson <memissemerson@gmail.com> wrote:
On Sat, Jul 27, 2013 at 10:34 PM, Janek Sendrowski <janek12@web.de> wrote:Hi Sergey Konoplev,
If I'm searching for a sentence like "The tiger is the largest cat species" for example.
I can only find the sentences, which include the words "tiger, largest, cat, species", but I also like to have the sentences with only three or even two of these words.
Janek
--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-generalHi,You may use similarity functions of pg_trgm.Example:=# \d+ testTable "public.test"Column | Type | Modifiers | Storage | Stats target | Description--------+------+-----------+----------+--------------+-------------col | text | | extended | |Indexes:"test_idx" gin (col gin_trgm_ops)Has OIDs: no# SELECT * FROM test;col-----------------------------------------The tiger is the largest cat speciesThe cheetah is the fastest cat speciesThe peacock is the largest bird species(3 rows)=# SELECT show_limit();show_limit------------0.3(1 row)=# SELECT col, similarity(col, 'The tiger is the largest cat species') AS smlFROM test WHERE col % 'The tiger is the largest cat species'ORDER BY sml DESC, col;col | sml-----------------------------------------+----------The tiger is the largest cat species | 1The peacock is the largest bird species | 0.511111The cheetah is the fastest cat species | 0.466667(3 rows)=# SELECT set_limit(0.5);set_limit-----------0.5(1 row)=# SELECT col, similarity(col, 'The tiger is the largest cat species') AS smlFROM test WHERE col % 'The tiger is the largest cat species'ORDER BY sml DESC, col;col | sml-----------------------------------------+----------The tiger is the largest cat species | 1The peacock is the largest bird species | 0.511111(2 rows)=# SELECT set_limit(0.9);set_limit-----------0.9(1 row)=# SELECT col, similarity(col, 'The tiger is the largest cat species') AS smlFROM test WHERE col % 'The tiger is the largest cat species'ORDER BY sml DESC, col;col | sml--------------------------------------+-----The tiger is the largest cat species | 1(1 row)When you set a higher limit, you get more exact matches.--Beena Emerson
Beena Emerson
Janek Sendrowski <janek12@web.de> wrote: > I also tried pg_trgm module, which works with tri-grams, but it's > also very slow with 100.000+ rows. Hmm. I found the pg_trgm module very fast for name searches with millions of rows *as long as I used KNN-GiST techniques*. Were you careful to do so? Check out the "Index Support" section of this page: http://www.postgresql.org/docs/current/static/pgtrgm.html While I have not tested this technique with a column containing sentences, I would expect it to work well. As a quick confirmation, I imported the text form of War and Peace into a table, with one row per *line* (because that was easier than parsing sentence boundaries for a quick test). That was over 65,000 rows. test=# select * from war_and_peace order by linetext <-> 'young wealthy gay gentlemen' limit 3; lineno | linetext --------+----------------------------------------------------------------------- 9082 | The gentlemen assembled at Bilibin's were young, wealthy, gay society 36575 | "Gentlemen, you are crushing me!..." 55997 | * "Good day, gentlemen." (3 rows) test=# explain analyze select * from war_and_peace order by linetext <-> 'young wealthy gay gentlemen' limit 3; QUERY PLAN ----------------------------------------------------------------------------------------------------------------------------------------- Limit (cost=0.28..0.80 rows=3 width=53) (actual time=37.986..38.002 rows=3 loops=1) -> Index Scan using wap_text on war_and_peace (cost=0.28..11216.42 rows=65007 width=53) (actual time=37.984..37.999rows=3 loops=1) Order By: (linetext <-> 'young wealthy gay gentlemen'::text) Total runtime: 38.180 ms (4 rows) To me, 38 milliseconds to search War and Peace for best matches to a text string seems reasonable; I'm not sure what you're looking for, since you didn't give any numbers. -- Kevin Grittner EDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Sun, Aug 11, 2013 at 6:20 AM, Janek Sendrowski <janek12@web.de> wrote: > the ranking functions are nice, but if I search for one more word, which is not included in the sentence I'm searchingfor, It doesn't find the sentence. I have already wrote you earlier about it. You can solve the problem by using & (AND), | (OR), and ! (NOT) operators in tsquery. select to_tsquery('tiger | largest | cat | species') @@ to_tsvector('The tiger is the largest cat'); ?column? ---------- t The query contains one word more than the sentence (the word is "species"), and it successfully finds it. -- Kind regards, Sergey Konoplev PostgreSQL Consultant and DBA http://www.linkedin.com/in/grayhemp +1 (415) 867-9984, +7 (901) 903-0499, +7 (988) 888-1979 gray.ru@gmail.com
As I understand, you need to search strings by similarity (using levenshtein or any other metric distance). In that case, you can use metric indexes like FHQT or FQA (this is better if you are using a relational database, like postgres). But they are not implemented yet in most DBMS, so you need to program the index. It s not too hard, but you need to understand the base concepts. You can look for "searching in metric spaces" to read about it. If you are in a hurry, you can mail me and maybe I can help you. Andres. -- View this message in context: http://postgresql.1045698.n5.nabble.com/Index-for-Levenshtein-distance-tp5764546p5768127.html Sent from the PostgreSQL - general mailing list archive at Nabble.com.
On Fri, Aug 2, 2013 at 10:25 AM, Kevin Grittner <kgrittn@ymail.com> wrote: > Janek Sendrowski <janek12@web.de> wrote: > >> I also tried pg_trgm module, which works with tri-grams, but it's >> also very slow with 100.000+ rows. > > Hmm. I found the pg_trgm module very fast for name searches with > millions of rows *as long as I used KNN-GiST techniques*. Were you > careful to do so? Check out the "Index Support" section of this > page: > > http://www.postgresql.org/docs/current/static/pgtrgm.html > > While I have not tested this technique with a column containing > sentences, I would expect it to work well. As a quick > confirmation, I imported the text form of War and Peace into a > table, with one row per *line* (because that was easier than > parsing sentence boundaries for a quick test). That was over > 65,000 rows. + 1 this. pg_trgm is black magic. search time (when using index) is mostly dependent on number of trigrams in search string vs average number of trigrams in database. merlin