Thread: [GENERAL] tgrm index for word_similarity
Hello,
I want to use Postgres for a fuzzy auto-suggest search field. As the user will be typing their search phrase, Postgres will show a list of items that fuzzy-matches what they typed so far, ordered by popularity (ntile(20)) and distance, i.e. 1 - word_similarity().
I created a Materialized View with two columns: name text, popularity int.
My query at the moment is:
SELECT name, popularity
FROM temp.items3_v
,(values ('some phrase'::text)) consts(input)
WHERE true
and word_similarity(input, name) > 0.01 -- be lenient as some names are 75 characters long and we want to match even on a few characters of input
ORDER BY 2, input <<-> name
Which seems to yield pretty good results, but takes over 40+ ms on a table that's not that large.
So I tried to add a GIN trgm index on `name`:
CREATE INDEX temp_items3_v_tgrm_item_name ON temp.items3_v USING GIN(name gin_trgm_ops);
But it is not used:
QUERY PLAN |
-----------------------------------------------------------------------------------------------------------------|
Sort (cost=264.42..269.91 rows=2198 width=43) (actual time=41.060..41.117 rows=1044 loops=1) |
Sort Key: items3_v.popularity, (('kandels'::text <<-> items3_v.name)) |
Sort Method: quicksort Memory: 149kB |
-> Seq Scan on items3_v (cost=0.00..142.41 rows=2198 width=43) (actual time=0.217..40.471 rows=1044 loops=1) |
Filter: (word_similarity('kandels'::text, name) > '0.01'::double precision) |
Rows Removed by Filter: 5550 |
Planning time: 0.149 ms |
Execution time: 41.308 ms |
What index would be good for that kind of query?
Thanks,
Igal Sapir
Lucee Core Developer
Lucee.org
On 10/19/2017 4:54 PM, Igal @ Lucee.org wrote: > I want to use Postgres for a fuzzy auto-suggest search field. As the > user will be typing their search phrase, Postgres will show a list of > items that fuzzy-matches what they typed so far, ordered by popularity > (ntile(20)) and distance, i.e. 1 - word_similarity(). > > I created a Materialized View with two columns: name text, popularity int. > > My query at the moment is: > > SELECT name, popularity > FROM temp.items3_v > ,(values ('some phrase'::text)) consts(input) > WHERE true > and word_similarity(input, name) > 0.01 -- be lenient as some > names are 75 characters long and we want to match even on a few > characters of input > ORDER BY 2, input <<-> name > > I tried to add a GIN trgm index on `name`: > > CREATE INDEX temp_items3_v_tgrm_item_name ON temp.items3_v USING > GIN(name gin_trgm_ops); > > But it is not used > > What index would be good for that kind of query? I see that when I use LIKE or ILIKE the index is used, but I lose all of the "fuzzy" benefits by doing that. Is there any type of INDEX or even building my own COLUMN of trgm that can help speed my word_similarity() results? When used in auto-suggest there are usually several queries for each user in a relatively short period of time, so speed is important. Thanks, Igal -- Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general
On Thu, Oct 19, 2017 at 04:54:19PM -0700, Igal @ Lucee.org wrote: > > My query at the moment is: > > SELECT name, popularity > FROM temp.items3_v > ,(values ('some phrase'::text)) consts(input) > WHERE true > and word_similarity(input, name) > 0.01 -- be lenient as some names > are 75 characters long and we want to match even on a few characters of > input > ORDER BY 2, input <<-> name > PostgreSQL doesn't use index scan with functions within WHERE clause. So you always need to use operators instead. You can try <% operator and pg_trgm.word_similarity_threshold variable: =# SET pg_trgm.word_similarity_threshold TO 0.1; =# SELECT name, popularityFROM temp.items3_v ,(values ('some phrase'::text)) consts(input)WHERE input <% nameORDER BY2, input <<-> name; -- Arthur Zakirov Postgres Professional: http://www.postgrespro.com Russian Postgres Company -- Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general
PostgreSQL doesn't use index scan with functions within WHERE clause. So you always need to use operators instead. You can try <% operator and pg_trgm.word_similarity_threshold variable: =# SET pg_trgm.word_similarity_threshold TO 0.1; =# SELECT name, popularityFROM temp.items3_v ,(values ('some phrase'::text)) consts(input)WHERE input <% nameORDER BY 2, input <<-> name;
Thank you, your solution does show that the index is used when I do `explain analyze`, and makes the query finish in about 20ms so it's about 1.5 - 2 times faster than without the index, but that raises a few questions for me:
1) I thought that the whole idea behind indexes on expressions is that the index would be used in a WHERE clause? See https://www.postgresql.org/docs/10/static/indexes-expressional.html - Am I missing something?
2) A query with `WHERE input <% name` utilizes the index, but a query without a WHERE clause at all does not?
3) What happens if I do not create an index at all? Does the query that I run in 30 - 40ms, the one that does not utilize an index, creates all of the tri-grams on the fly each time that it runs? Would it be possible for me to create a TABLE or a VIEW with the tri-grams so that there is no need to create them each time the query runs?
Thanks,
Igal Sapir
Lucee Core Developer
Lucee.org
On Sat, Oct 21, 2017 at 10:01:56PM -0700, Igal @ Lucee.org wrote: > > 1) I thought that the whole idea behind indexes on expressions is that the > index would be used in a WHERE clause? See > https://www.postgresql.org/docs/10/static/indexes-expressional.html - Am I > missing something? > I think the idea is a little bit different. It is about computing index entries only once, during index creation. During scan PostgreSQL doesn't compute such entries every time. I am not very good at PostgreSQL's planner. But I know that PostgreSQL uses index scan for pg_trgm only with %, <%, ~~, ~~*, ~, ~* operators. pg_trgm's operator classes (which should be implemented for index scan) are designed in this way. > 2) A query with `WHERE input <% name` utilizes the index, but a query > without a WHERE clause at all does not? Because sequential scan is cheaper here than index scan. > > 3) What happens if I do not create an index at all? Does the query that I > run in 30 - 40ms, the one that does not utilize an index, creates all of the > tri-grams on the fly each time that it runs? Would it be possible for me to > create a TABLE or a VIEW with the tri-grams so that there is no need to > create them each time the query runs? > As far as I know you can't do it nowadays. You can't create an trigram column, as you can do it for FTS, you can create an tsvector column. -- Arthur Zakirov Postgres Professional: http://www.postgrespro.com Russian Postgres Company -- Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general