Thread: [GENERAL] tgrm index for word_similarity

[GENERAL] tgrm index for word_similarity

From

"Igal @ Lucee.org"

Date:

20 October 2017, 02:54:19

Hello,

I want to use Postgres for a fuzzy auto-suggest search field. As the user will be typing their search phrase, Postgres will show a list of items that fuzzy-matches what they typed so far, ordered by popularity (ntile(20)) and distance, i.e. 1 - word_similarity().

I created a Materialized View with two columns: name text, popularity int.

My query at the moment is:

SELECT name, popularity FROM temp.items3_v ,(values ('some phrase'::text)) consts(input) WHERE true and word_similarity(input, name) > 0.01 -- be lenient as some names are 75 characters long and we want to match even on a few characters of input ORDER BY 2, input <<-> name

Which seems to yield pretty good results, but takes over 40+ ms on a table that's not that large.

So I tried to add a GIN trgm index on `name`:

CREATE INDEX temp_items3_v_tgrm_item_name ON temp.items3_v USING GIN(name gin_trgm_ops);

But it is not used:

QUERY PLAN |-----------------------------------------------------------------------------------------------------------------|Sort (cost=264.42..269.91 rows=2198 width=43) (actual time=41.060..41.117 rows=1044 loops=1) | Sort Key: items3_v.popularity, (('kandels'::text <<-> items3_v.name)) | Sort Method: quicksort Memory: 149kB | -> Seq Scan on items3_v (cost=0.00..142.41 rows=2198 width=43) (actual time=0.217..40.471 rows=1044 loops=1) | Filter: (word_similarity('kandels'::text, name) > '0.01'::double precision) | Rows Removed by Filter: 5550 |Planning time: 0.149 ms |Execution time: 41.308 ms |

What index would be good for that kind of query?

Thanks,

Igal Sapir
Lucee Core Developer
Lucee.org

Re: [GENERAL] tgrm index for word_similarity

From

"Igal @ Lucee.org"

Date:

20 October 2017, 21:22:06

On 10/19/2017 4:54 PM, Igal @ Lucee.org wrote:
> I want to use Postgres for a fuzzy auto-suggest search field.  As the 
> user will be typing their search phrase, Postgres will show a list of 
> items that fuzzy-matches what they typed so far, ordered by popularity 
> (ntile(20)) and distance, i.e. 1 - word_similarity().
>
> I created a Materialized View with two columns: name text, popularity int.
>
> My query at the moment is:
>
>     SELECT name, popularity
>     FROM   temp.items3_v
>           ,(values ('some phrase'::text)) consts(input)
>     WHERE  true
>         and word_similarity(input, name) > 0.01  -- be lenient as some 
> names are 75 characters long and we want to match even on a few 
> characters of input
>     ORDER BY 2, input <<-> name
>
> I tried to add a GIN trgm index on `name`:
>
>     CREATE INDEX temp_items3_v_tgrm_item_name ON temp.items3_v USING 
> GIN(name gin_trgm_ops);
>
> But it is not used
>
> What index would be good for that kind of query?

I see that when I use LIKE or ILIKE the index is used, but I lose all of 
the "fuzzy" benefits by doing that.

Is there any type of INDEX or even building my own COLUMN of trgm that 
can help speed my word_similarity() results?  When used in auto-suggest 
there are usually several queries for each user in a relatively short 
period of time, so speed is important.

Thanks,


Igal



-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Re: [GENERAL] tgrm index for word_similarity

From

Arthur Zakirov

Date:

21 October 2017, 15:01:05

On Thu, Oct 19, 2017 at 04:54:19PM -0700, Igal @ Lucee.org wrote:
> 
> My query at the moment is:
> 
>     SELECT name, popularity
>     FROM   temp.items3_v
>           ,(values ('some phrase'::text)) consts(input)
>     WHERE  true
>         and word_similarity(input, name) > 0.01  -- be lenient as some names
> are 75 characters long and we want to match even on a few characters of
> input
>     ORDER BY 2, input <<-> name
> 

PostgreSQL doesn't use index scan with functions within WHERE clause. So
you always need to use operators instead. You can try <% operator and
pg_trgm.word_similarity_threshold variable:

=# SET pg_trgm.word_similarity_threshold TO 0.1;
=# SELECT name, popularityFROM temp.items3_v    ,(values ('some phrase'::text)) consts(input)WHERE input <% nameORDER
BY2, input <<-> name;
 

-- 
Arthur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Re: [GENERAL] tgrm index for word_similarity

From

"Igal @ Lucee.org"

Date:

22 October 2017, 08:01:56

On 10/21/2017 5:01 AM, Arthur Zakirov wrote:

PostgreSQL doesn't use index scan with functions within WHERE clause. So
you always need to use operators instead. You can try <% operator and
pg_trgm.word_similarity_threshold variable:

=# SET pg_trgm.word_similarity_threshold TO 0.1;
=# SELECT name, popularityFROM temp.items3_v	,(values ('some phrase'::text)) consts(input)WHERE input <% nameORDER BY 2, input <<-> name;

Thank you, your solution does show that the index is used when I do `explain analyze`, and makes the query finish in about 20ms so it's about 1.5 - 2 times faster than without the index, but that raises a few questions for me:

1) I thought that the whole idea behind indexes on expressions is that the index would be used in a WHERE clause? See https://www.postgresql.org/docs/10/static/indexes-expressional.html - Am I missing something?

2) A query with `WHERE input <% name` utilizes the index, but a query without a WHERE clause at all does not?

3) What happens if I do not create an index at all? Does the query that I run in 30 - 40ms, the one that does not utilize an index, creates all of the tri-grams on the fly each time that it runs? Would it be possible for me to create a TABLE or a VIEW with the tri-grams so that there is no need to create them each time the query runs?

Thanks,

Igal Sapir
Lucee Core Developer
Lucee.org

Re: [GENERAL] tgrm index for word_similarity

From

Arthur Zakirov

Date:

22 October 2017, 22:58:33

On Sat, Oct 21, 2017 at 10:01:56PM -0700, Igal @ Lucee.org wrote:
> 
> 1) I thought that the whole idea behind indexes on expressions is that the
> index would be used in a WHERE clause?  See
> https://www.postgresql.org/docs/10/static/indexes-expressional.html - Am I
> missing something?
>

I think the idea is a little bit different. It is about computing index
entries only once, during index creation. During scan PostgreSQL doesn't
compute such entries every time.
I am not very good at PostgreSQL's planner. But I know that PostgreSQL
uses index scan for pg_trgm only with %, <%, ~~, ~~*, ~, ~* operators.
pg_trgm's operator classes (which should be implemented for index scan) are designed in this way.
> 2) A query with `WHERE input <% name` utilizes the index, but a query
> without a WHERE clause at all does not?

Because sequential scan is cheaper here than index scan.

> 
> 3) What happens if I do not create an index at all?  Does the query that I
> run in 30 - 40ms, the one that does not utilize an index, creates all of the
> tri-grams on the fly each time that it runs?  Would it be possible for me to
> create a TABLE or a VIEW with the tri-grams so that there is no need to
> create them each time the query runs?
> 

As far as I know you can't do it nowadays. You can't create an trigram
column, as you can do it for FTS, you can create an tsvector column.

-- 
Arthur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general