Thread: Question regarding specifics of GIN and pg_trgm performance and potential use of show_trgm to improve it

Hello PostgreSQL community,

I am trying to improve the performance of similarity-based queries on a large dataset, and I would like to confirm my understanding of how GIN indexes work and how pg_trgm uses them.

If I understand it correctly, using a GIN index is always a two-step process: first, potential matches are found in the index; second, the matching tuples are actually fetched and rechecked to see whether they really satisfy the query. This two-step process can degrade performance when the index scan matches too many items that are then read from disk only to be dropped as non-matching during the recheck phase. Is that understanding correct?

Now to the issue... pg_trgm's similarity search can use the similarity operator % to search for "similar" documents. "Similarity" is computed between the trigram array extracted from the query string and the trigram arrays of the searched values. This is tricky in the sense that, just by matching trigrams in the GIN index, PostgreSQL cannot tell whether the final value will match, because it does not know how many trigrams that value contains in total...
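
To make that concrete (a rough sketch; the exact output formatting may differ between versions): the query string 'lorem' expands into just six trigrams, and the similarity of two strings is the number of trigrams they share divided by the number of distinct trigrams in both combined, so a very long value can contain all six and still score low:

SELECT show_trgm('lorem');
-- {"  l"," lo","em ","lor","ore","rem"}   -- i.e. 6 trigrams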

Consider the following example:

CREATE TABLE test (id SERIAL, value TEXT);
CREATE INDEX test_idx ON test USING GIN (value gin_trgm_ops);
INSERT INTO test (value) SELECT 'lorem ipsum ' || id || repeat('foo bar', CAST(random() * 100 AS INT)) FROM generate_series(1, 100000) source(id);

SET pg_trgm.similarity_threshold TO 0.5;
EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM test WHERE value % 'lorem';
                                                         QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on test  (cost=2024.08..2062.86 rows=10 width=374) (actual time=2261.727..2261.728 rows=0 loops=1)
   Recheck Cond: (value % 'lorem'::text)
   Rows Removed by Index Recheck: 100000
   Heap Blocks: exact=5025
   Buffers: shared hit=5636
   ->  Bitmap Index Scan on test_idx  (cost=0.00..2024.08 rows=10 width=0) (actual time=19.242..19.242 rows=100000 loops=1)
         Index Cond: (value % 'lorem'::text)
         Buffers: shared hit=611
 Planning:
   Buffers: shared hit=1
 Planning Time: 2.417 ms
 Execution Time: 2261.765 ms
(12 rows)


If I understand this correctly, the index scan matches all 100000 items, which are then read from disk only to be discarded during the recheck. That is 2 seconds of work to return zero results (and I was lucky in my example to only have shared buffer hits, so no real disk I/O).

Is my understanding correct that this happens only because pg_trgm cannot tell from the index search alone that the matched item is much longer than the query? Is there any way to improve performance in this case? I thought I could store the number of trigrams in the index and filter on it (with the threshold at 0.5, a matching value can have at most array_length(show_trgm('lorem'), 1) / 0.5 trigrams), but that expression is not being used by the query planner:

CREATE INDEX test_idx2 ON test USING GIN (value gin_trgm_ops, array_length(show_trgm(value), 1));

EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM test WHERE value % 'lorem' AND array_length(show_trgm(value), 1) < array_length(show_trgm('lorem'), 1) / 0.5;
                                                        QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on test  (cost=56.08..94.96 rows=3 width=376) (actual time=2273.225..2273.226 rows=0 loops=1)
   Recheck Cond: (value % 'lorem'::text)
   Rows Removed by Index Recheck: 100000
   Filter: ((array_length(show_trgm(value), 1))::numeric < 12.0000000000000000)
   Heap Blocks: exact=5025
   Buffers: shared hit=5134
   ->  Bitmap Index Scan on test_idx3  (cost=0.00..56.08 rows=10 width=0) (actual time=15.945..15.946 rows=100000 loops=1)
         Index Cond: (value % 'lorem'::text)
         Buffers: shared hit=109
 Planning:
   Buffers: shared hit=3
 Planning Time: 2.394 ms
 Execution Time: 2273.256 ms
(13 rows)


Thank you for any sort of insight into this.

Regards,
Pavel

On 5/24/23 22:28, Pavel Horal wrote:
> Hello PostgreSQL community,
> 
> I am trying to improve the performance of similarity-based queries on
> a large dataset, and I would like to confirm my understanding of how
> GIN indexes work and how pg_trgm uses them.
> 
> If I understand it correctly, using a GIN index is always a two-step
> process: first, potential matches are found in the index; second, the
> matching tuples are actually fetched and rechecked to see whether they
> really satisfy the query. This two-step process can degrade performance
> when the index scan matches too many items that are then read from
> disk only to be dropped as non-matching during the recheck phase. *Is
> that understanding correct?*
> 

Correct. GIN gives you "candidate" rows, but the recheck may still
eliminate those.

> Now to the issue... pg_trgm's similarity search can use the similarity
> operator % to search for "similar" documents. "Similarity" is computed
> between the trigram array extracted from the query string and the
> trigram arrays of the searched values. This is tricky in the sense that,
> just by matching trigrams in the GIN index, PostgreSQL cannot tell
> whether the final value will match, because it does not know how many
> trigrams that value contains in total...
> 
> Consider the following example:
> 
> CREATE TABLE test (id SERIAL, value TEXT);
> CREATE INDEX test_idx ON test USING GIN (value gin_trgm_ops);
> INSERT INTO test (value) SELECT 'lorem ipsum ' || id || repeat('foo
> bar', CAST(random() * 100 AS INT)) FROM generate_series(1, 100000)
> source(id);
> 
> SET pg_trgm.similarity_threshold TO 0.5;
> EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM test WHERE value % 'lorem';
>                                                          QUERY PLAN
> ----------------------------------------------------------------------------------------------------------------------------
>  Bitmap Heap Scan on test  (cost=2024.08..2062.86 rows=10 width=374) (actual time=2261.727..2261.728 rows=0 loops=1)
>    Recheck Cond: (value % 'lorem'::text)
>    Rows Removed by Index Recheck: 100000
>    Heap Blocks: exact=5025
>    Buffers: shared hit=5636
>    ->  Bitmap Index Scan on test_idx  (cost=0.00..2024.08 rows=10 width=0) (actual time=19.242..19.242 rows=100000 loops=1)
>          Index Cond: (value % 'lorem'::text)
>          Buffers: shared hit=611
>  Planning:
>    Buffers: shared hit=1
>  Planning Time: 2.417 ms
>  Execution Time: 2261.765 ms
> (12 rows)
> 
> 
> If I understand this correctly, the *index scan matches all 100000
> items, which are then read from disk* only *to be discarded during the
> recheck*. That is 2 seconds of work to return zero results (and I was
> lucky in my example to only have shared buffer hits, so no real disk
> I/O).
> 
> *Is my understanding correct that this happens only because pg_trgm
> cannot tell from the index search alone that the matched item is much
> longer than the query?*

Yes, I think that's a mostly correct understanding. The trouble here is
that the posting lists for the trigrams of 'lorem' are very long - they
contain a TID for pretty much every row in the table. We can't calculate
the similarity at that point from the trigrams alone, so we have to
fetch the rows.
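
To illustrate with data similar to the test case (a sketch, the number is approximate): a long value that contains every trigram of 'lorem' still scores far below the 0.5 threshold, because the similarity is the shared trigrams divided by the union of trigrams, and the long value contributes dozens of trigrams of its own:

SELECT similarity('lorem', 'lorem ipsum 1' || repeat('foo bar', 50));
-- roughly 0.26 - well below 0.5, even though every trigram of 'lorem'
-- appears in the long string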

> Is there any way to improve
> performance in this case? I thought I could store the number of
> trigrams in the index and filter on it, but that expression is not
> being used by the query planner:
> 
> CREATE INDEX test_idx2 ON test USING GIN (value gin_trgm_ops,
> array_length(show_trgm(value), 1));
> 
> EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM test WHERE value % 'lorem' AND
> array_length(show_trgm(value), 1) < array_length(show_trgm('lorem'), 1)
> / 0.5;
>                                                         QUERY PLAN
> ---------------------------------------------------------------------------------------------------------------------------
>  Bitmap Heap Scan on test  (cost=56.08..94.96 rows=3 width=376) (actual time=2273.225..2273.226 rows=0 loops=1)
>    Recheck Cond: (value % 'lorem'::text)
>    Rows Removed by Index Recheck: 100000
>    Filter: ((array_length(show_trgm(value), 1))::numeric < 12.0000000000000000)
>    Heap Blocks: exact=5025
>    Buffers: shared hit=5134
>    ->  Bitmap Index Scan on test_idx3  (cost=0.00..56.08 rows=10 width=0) (actual time=15.945..15.946 rows=100000 loops=1)
>          Index Cond: (value % 'lorem'::text)
>          Buffers: shared hit=109
>  Planning:
>    Buffers: shared hit=3
>  Planning Time: 2.394 ms
>  Execution Time: 2273.256 ms
> (13 rows)
> 
> 
> Thank you for any sort of insight into this.

I don't think indexing the number of trigrams like this can help, and
I'm not sure how to improve this (at least for the built-in GIN). It
seems the cost of similarity searches is bound to be proportional to
the frequency of the most common trigram in the query.

I wonder if the "newer" GIN variants like RUM [1] could improve this,
but I don't think it has trgm opclasses.


regards

[1] https://github.com/postgrespro/rum

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



On Wed, May 24, 2023 at 4:35 PM Pavel Horal <pavel.horal@orchitech.cz> wrote:

I didn't see your email when it was first sent, and stumbled upon it while searching for something else.  But it still might be worth commenting on even after all this time.
 
 
> Is my understanding correct that this happens only because pg_trgm cannot tell from the index search alone that the matched item is much longer than the query? Is there any way to improve performance in this case? I thought I could store the number of trigrams in the index and filter on it, but that expression is not being used by the query planner:
>
> CREATE INDEX test_idx2 ON test USING GIN (value gin_trgm_ops, array_length(show_trgm(value), 1));
>
> EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM test WHERE value % 'lorem' AND array_length(show_trgm(value), 1) < array_length(show_trgm('lorem'), 1) / 0.5;

The main problem here is one of expression type.  You have an index on an expression returning an int, while you are comparing it to an expression returning a numeric.  That mismatch inhibits the use of the index for that expression.

Just casting the type when creating the index is enough (given your test case) to get this to do what you want:

CREATE INDEX test_idx2 ON test USING GIN (value gin_trgm_ops, (array_length(show_trgm(value), 1)::numeric));
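
With that in place, re-running the original query should (I have not re-tested it here) let the planner use the trigram-count expression as part of the index scan instead of a heap-level filter, since both sides of the comparison are now numeric. Note that the scalar expression column in a multi-column GIN index needs the btree_gin extension, which must already be installed for the index above to be created at all:

CREATE EXTENSION IF NOT EXISTS btree_gin;

EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM test
WHERE value % 'lorem'
  AND array_length(show_trgm(value), 1) < array_length(show_trgm('lorem'), 1) / 0.5;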

However, it would probably be more efficient to partition the table on the trigram count, rather than adding that count to the index.  Then it could just skip any partition with too many trigrams.
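
A rough sketch of that idea (illustrative names and an arbitrary cut-off of 100 trigrams, not tested): partition by the trigram count, and keep the WHERE clause as a plain integer comparison on the partition key expression so plan-time pruning can kick in:

CREATE TABLE test_part (id SERIAL, value TEXT)
    PARTITION BY RANGE ((array_length(show_trgm(value), 1)));

CREATE TABLE test_part_short PARTITION OF test_part
    FOR VALUES FROM (MINVALUE) TO (100);
CREATE TABLE test_part_long PARTITION OF test_part
    FOR VALUES FROM (100) TO (MAXVALUE);

CREATE INDEX ON test_part USING GIN (value gin_trgm_ops);

-- With similarity_threshold = 0.5 and a 6-trigram query string, only values
-- with at most 12 trigrams can possibly match, so test_part_long is skipped:
SELECT * FROM test_part
WHERE value % 'lorem'
  AND array_length(show_trgm(value), 1) <= 12;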

Cheers,

Jeff