Re: Patch: pg_trgm: gin index scan performance for similarity search - Mailing list pgsql-hackers
From | Teodor Sigaev |
---|---|
Subject | Re: Patch: pg_trgm: gin index scan performance for similarity search |
Date | |
Msg-id | 567D1624.6090702@sigaev.ru Whole thread Raw |
In response to | Patch: pg_trgm: gin index scan performance for similarity search (Fornaroli Christophe <cfornaro@gmail.com>) |
Responses |
Re: Patch: pg_trgm: gin index scan performance for
similarity search
|
List | pgsql-hackers |
Good catch, committed. Fornaroli Christophe wrote: > Hi, > > I think that we can improve the gin index scan performance for similarity search > defined in the pg_trgm extension. The similarity function is (for the default > case where DIVUNION is defined in the code): > > count / (len1 + len2 - count) >= trgm_limit > > where > len1 is the number of unique trigrams for the first string, > len2 is the same number for the second string, > count is the number of common trigrams between both strings, > trgm_limit is a user specfied limit in [0, 1]. > > The code used to determine if a tuple may match the query string is: > > case SimilarityStrategyNumber: > /* Count the matches */ > ntrue = 0; > for (i = 0; i < nkeys; i++) > { > if (check[i] != GIN_FALSE) > ntrue++; > } > #ifdef DIVUNION > res = (nkeys == ntrue) ? GIN_MAYBE : (((((float4) ntrue) / > ((float4) (nkeys - ntrue))) >= trgm_limit) ? GIN_MAYBE : GIN_FALSE); > #else > res = (nkeys == 0) ? GIN_FALSE : (((((float4) ntrue) / ((float4) > nkeys)) >= trgm_limit) ? GIN_MAYBE : GIN_FALSE); > #endif > > where > ntrue is the number of common trigrams in both strings, > nkeys is the number of trigrams in the search string. > > This code uses this upper bound for the similarity: ntrue / (nkeys - ntrue). But > if there is ntrue trigrams in common, we know that the indexed string is at > least ntrue trigrams long. We can then use a more aggressive upper bound: ntrue > / (ntrue + nkeys - ntrue) or ntrue / nkeys. Attached is a patch that changes this. > > Here are some performance gains with this test case: > > create table foo as select > substring(md5(random()::text) for random() * 5) || '123' as bar > from generate_series(1,1000000); > > create index on foo using gin (bar gin_trgm_ops); > > patched: > > test=# explain analyze select count(*) from foo where bar % 'abc123'; > QUERY PLAN > --------------------------------------------------------------------------------------------------------------------------------------- > Aggregate (cost=2511.14..2511.15 rows=1 width=0) (actual > time=807.434..807.435 rows=1 loops=1) > -> Bitmap Heap Scan on foo (cost=99.75..2508.64 rows=1000 width=0) (actual > time=109.893..787.261 rows=54746 loops=1) > Recheck Cond: (bar % 'abc123'::text) > Rows Removed by Index Recheck: 55125 > Heap Blocks: exact=4514 > -> Bitmap Index Scan on foo_bar_idx (cost=0.00..99.50 rows=1000 > width=0) (actual time=108.456..108.456 rows=109871 loops=1) > Index Cond: (bar % 'abc123'::text) > Planning time: 0.353 ms > Execution time: 807.593 ms > (9 rows) > > test=# explain analyze select count(*) from foo where bar % 'abcdef'; > QUERY PLAN > -------------------------------------------------------------------------------------------------------------------------------- > Aggregate (cost=2511.14..2511.15 rows=1 width=0) (actual time=4.829..4.830 > rows=1 loops=1) > -> Bitmap Heap Scan on foo (cost=99.75..2508.64 rows=1000 width=0) (actual > time=3.512..4.794 rows=5 loops=1) > Recheck Cond: (bar % 'abcdef'::text) > Rows Removed by Index Recheck: 137 > Heap Blocks: exact=139 > -> Bitmap Index Scan on foo_bar_idx (cost=0.00..99.50 rows=1000 > width=0) (actual time=3.355..3.355 rows=142 loops=1) > Index Cond: (bar % 'abcdef'::text) > Planning time: 0.363 ms > Execution time: 5.061 ms > (9 rows) > > > master: > > test=# explain analyze select count(*) from foo where bar % 'abc123'; > QUERY PLAN > --------------------------------------------------------------------------------------------------------------------------------------- > Aggregate (cost=2511.14..2511.15 rows=1 width=0) (actual > time=6416.554..6416.554 rows=1 loops=1) > -> Bitmap Heap Scan on foo (cost=99.75..2508.64 rows=1000 width=0) (actual > time=484.359..6389.819 rows=54746 loops=1) > Recheck Cond: (bar % 'abc123'::text) > Rows Removed by Index Recheck: 945250 > Heap Blocks: exact=4514 > -> Bitmap Index Scan on foo_bar_idx (cost=0.00..99.50 rows=1000 > width=0) (actual time=482.677..482.677 rows=999996 loops=1) > Index Cond: (bar % 'abc123'::text) > Planning time: 0.359 ms > Execution time: 6416.945 ms > (9 rows) > > test=# explain analyze select count(*) from foo where bar % 'abcdef'; > QUERY PLAN > --------------------------------------------------------------------------------------------------------------------------------- > Aggregate (cost=2511.14..2511.15 rows=1 width=0) (actual time=30.678..30.679 > rows=1 loops=1) > -> Bitmap Heap Scan on foo (cost=99.75..2508.64 rows=1000 width=0) (actual > time=9.020..30.643 rows=5 loops=1) > Recheck Cond: (bar % 'abcdef'::text) > Rows Removed by Index Recheck: 2789 > Heap Blocks: exact=2110 > -> Bitmap Index Scan on foo_bar_idx (cost=0.00..99.50 rows=1000 > width=0) (actual time=7.696..7.696 rows=2794 loops=1) > Index Cond: (bar % 'abcdef'::text) > Planning time: 0.254 ms > Execution time: 30.809 ms > (9 rows) > > > Cheers, > > Christophe > > > -- Teodor Sigaev E-mail: teodor@sigaev.ru WWW: http://www.sigaev.ru/
pgsql-hackers by date: