Thread: pg_trgm word_similarity query does not use index for input strings longer than 8 characters
From
pgsql-performance@jhacker.de
Date:
Hello,

recently I wrote a query that provides suggestions from a Postgres table. It should work despite small typos, so I chose to use the pg_trgm extension (https://www.postgresql.org/docs/current/pgtrgm.html). When measuring the performance, I observed great differences in query time depending on the input string. Analysis showed that Postgres sometimes used the created indexes and sometimes didn't, even though they would provide a considerable speedup. In the included test case the degradation occurs for all input strings of length 8 or longer; for shorter strings the index is used.

My questions: Why doesn't the query planner choose to use the index? Can I make Postgres use the index, and if so, how? I understand that trying to outsmart the planner is generally a bad idea. Maybe the query can be rewritten, or there are some parameters that could be tweaked.

## Setup Information

Hardware: Intel i5-8250U, 8GB RAM, encrypted SSD, no RAID

$ uname -a
Linux 5.11.0-40-generic #44~20.04.2-Ubuntu SMP Tue Oct 26 18:07:44 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

Software:
OS: Ubuntu 20.04
Postgres: PostgreSQL 14.1 (Debian 14.1-1.pgdg110+1) on x86_64-pc-linux-gnu, compiled by gcc (Debian 10.2.1-6) 10.2.1 20210110, 64-bit
The Postgres docker image was used.
Docker: Docker version 20.10.5, build 55c4c88
Image used: postgres:14.1

Configuration:
The config file was not changed.
            name            |  current_setting   |        source
----------------------------+--------------------+----------------------
 application_name           | psql               | client
 client_encoding            | UTF8               | client
 DateStyle                  | ISO, MDY           | configuration file
 default_text_search_config | pg_catalog.english | configuration file
 dynamic_shared_memory_type | posix              | configuration file
 enable_seqscan             | off                | session
 lc_messages                | en_US.utf8         | configuration file
 lc_monetary                | en_US.utf8         | configuration file
 lc_numeric                 | en_US.utf8         | configuration file
 lc_time                    | en_US.utf8         | configuration file
 listen_addresses           | *                  | configuration file
 log_timezone               | Etc/UTC            | configuration file
 max_connections            | 100                | configuration file
 max_stack_depth            | 2MB                | environment variable
 max_wal_size               | 1GB                | configuration file
 min_wal_size               | 80MB               | configuration file
 shared_buffers             | 128MB              | configuration file
 TimeZone                   | Etc/UTC            | configuration file

## Test Case

The test case creates a simple table and fills it with 10000 identical entries. The query is executed twice with an 8-character string, once with sequential scans enabled and once with sequential scans disabled. The first query does not use the index, even though the second query shows that it would be much faster.
docker run --name postgres -e POSTGRES_PASSWORD=postgres -d postgres:14.1
docker exec -it postgres bash
psql -U postgres

CREATE EXTENSION pg_trgm;

CREATE TABLE song (
    artist varchar(20),
    title varchar(20)
);

INSERT INTO song (artist, title)
SELECT 'artist','title'
FROM generate_series(1,10000);

CREATE INDEX artist_trgm ON song USING GIN (artist gin_trgm_ops);
CREATE INDEX title_trgm ON song USING GIN (title gin_trgm_ops);

-- Tips from https://wiki.postgresql.org/wiki/Slow_Query_Questions
ANALYZE;
VACUUM;
REINDEX TABLE song;

\set query '12345678'

-- This query is slow
EXPLAIN ANALYZE
SELECT song.artist, song.title
FROM song
WHERE (song.artist %> :'query' OR song.title %> :'query')
;

set enable_seqscan=off;

-- This query is fast
EXPLAIN ANALYZE
SELECT song.artist, song.title
FROM song
WHERE (song.artist %> :'query' OR song.title %> :'query')
;

## Additional Test Case Info

Schemata:

                                    Table "public.song"
 Column |         Type          | Collation | Nullable | Default | Storage  | Compression | Stats target | Description
--------+-----------------------+-----------+----------+---------+----------+-------------+--------------+-------------
 artist | character varying(20) |           |          |         | extended |             |              |
 title  | character varying(20) |           |          |         | extended |             |              |
Indexes:
    "artist_trgm" gin (artist gin_trgm_ops)
    "title_trgm" gin (title gin_trgm_ops)
Access method: heap

    Index "public.artist_trgm"
 Column |  Type   | Key? | Definition | Storage | Stats target
--------+---------+------+------------+---------+--------------
 artist | integer | yes  | artist     | plain   |
gin, for table "public.song"

    Index "public.title_trgm"
 Column |  Type   | Key? | Definition | Storage | Stats target
--------+---------+------+------------+---------+--------------
 title  | integer | yes  | title      | plain   |
gin, for table "public.song"

Table Metadata:

postgres=# SELECT relname, relpages, reltuples, relallvisible, relkind, relnatts, relhassubclass, reloptions, pg_table_size(oid) FROM pg_class WHERE relname='song';
 relname | relpages | reltuples | relallvisible | relkind | relnatts | relhassubclass | reloptions | pg_table_size
---------+----------+-----------+---------------+---------+----------+----------------+------------+---------------
 song    |       55 |     10000 |            55 | r       |        2 | f              |            |        483328

EXPLAIN ANALYZE of the "slow" query:

                                            QUERY PLAN
---------------------------------------------------------------------------------------------------
 Seq Scan on song  (cost=0.00..205.00 rows=1 width=13) (actual time=68.896..68.897 rows=0 loops=1)
   Filter: (((artist)::text %> '12345678'::text) OR ((title)::text %> '12345678'::text))
   Rows Removed by Filter: 10000
 Planning Time: 0.304 ms
 Execution Time: 68.928 ms

EXPLAIN ANALYZE of the "fast" query:

                                                         QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on song  (cost=288.00..292.02 rows=1 width=13) (actual time=0.023..0.024 rows=0 loops=1)
   Recheck Cond: (((artist)::text %> '12345678'::text) OR ((title)::text %> '12345678'::text))
   ->  BitmapOr  (cost=288.00..288.00 rows=1 width=0) (actual time=0.022..0.023 rows=0 loops=1)
         ->  Bitmap Index Scan on artist_trgm  (cost=0.00..144.00 rows=1 width=0) (actual time=0.013..0.014 rows=0 loops=1)
               Index Cond: ((artist)::text %> '12345678'::text)
         ->  Bitmap Index Scan on title_trgm  (cost=0.00..144.00 rows=1 width=0) (actual time=0.008..0.008 rows=0 loops=1)
               Index Cond: ((title)::text %> '12345678'::text)
 Planning Time: 0.224 ms
 Execution Time: 0.052 ms

The behaviour is identical when using similarity instead of word_similarity.
GIN indexes were chosen because the table is queried far more often than it is updated. I tried increasing shared_buffers, effective_cache_size, and work_mem, to no avail.

Any help would be greatly appreciated.

Regards
Jonathan
Re: pg_trgm word_similarity query does not use index for input strings longer than 8 characters
From
Laurenz Albe
Date:
On Tue, 2021-11-30 at 22:38 +0100, pgsql-performance@jhacker.de wrote:
> ## Setup Information
> Hardware: Intel i5-8250U, 8GB RAM, encrypted SSD, no RAID
> [...]
>
> Configuration:
> The config file was not changed.
> [...]
>
> ## Test Case
> [...]
> CREATE EXTENSION pg_trgm;
>
> CREATE TABLE song (
>     artist varchar(20),
>     title varchar(20)
> );
>
> INSERT INTO song (artist, title)
> SELECT 'artist','title'
> FROM generate_series(1,10000);
>
> CREATE INDEX artist_trgm ON song USING GIN (artist gin_trgm_ops);
> CREATE INDEX title_trgm ON song USING GIN (title gin_trgm_ops);
>
> -- Tips from https://wiki.postgresql.org/wiki/Slow_Query_Questions
> ANALYZE;
> VACUUM;
> REINDEX TABLE song;
>
> \set query '12345678'
>
> -- This query is slow
> EXPLAIN ANALYZE
> SELECT song.artist, song.title
> FROM song
> WHERE (song.artist %> :'query' OR song.title %> :'query')
> ;
>
> set enable_seqscan=off;
>
> -- This query is fast
> EXPLAIN ANALYZE
> SELECT song.artist, song.title
> FROM song
> WHERE (song.artist %> :'query' OR song.title %> :'query')
> ;

The table is quite small; with a bigger table, the test would be more meaningful.

Since you have SSDs, you should tune "random_page_cost = 1.1". This makes the planner prefer index scans, and it leads to the index scan being chosen in your case.

Yours,
Laurenz Albe
--
Cybertec | https://www.cybertec-postgresql.com
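[Editor's note: the setting above can be tried per session before persisting it cluster-wide. A minimal sketch, using only the table and value already discussed in this thread:]

```sql
-- Lower the estimated cost of a random page fetch for this session only:
SET random_page_cost = 1.1;

-- Re-run the test query to see whether the planner now picks the index:
EXPLAIN ANALYZE
SELECT song.artist, song.title
FROM song
WHERE (song.artist %> '12345678' OR song.title %> '12345678');

-- If it helps, persist the setting for all sessions (applied on reload):
ALTER SYSTEM SET random_page_cost = 1.1;
SELECT pg_reload_conf();
```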
Re: pg_trgm word_similarity query does not use index for input strings longer than 8 characters
From
Tom Lane
Date:
Laurenz Albe <laurenz.albe@cybertec.at> writes:
> On Tue, 2021-11-30 at 22:38 +0100, pgsql-performance@jhacker.de wrote:
>> INSERT INTO song (artist, title)
>> SELECT 'artist','title'
>> FROM generate_series(1,10000);
>>
>> \set query '12345678'
>>
>> -- This query is slow
>> EXPLAIN ANALYZE
>> SELECT song.artist, song.title
>> FROM song
>> WHERE (song.artist %> :'query' OR song.title %> :'query')
>> ;

> The table is quite small; with a bigger table, the test would be more meaningful.

Yeah, this test case seems very unrealistic, both as to table size and as to the lack of variability of the table entries.  I think the latter is causing the indexscans to take less time than they otherwise might, because none of the extracted trigrams find any matches.

> Since you have SSDs, you should tune "random_page_cost = 1.1".

Right.  Poking at gincostestimate a bit, I see that for this operator the indexscan cost estimate is basically driven by the number of trigrams extracted from the query string (nine in this test case) and the index size; those lead to a predicted number of index page fetches that's then scaled by random_page_cost.  That's coming out to make it look more expensive than the seqscan.  It's actually not more expensive, but that's partially because page fetch costs are really zero in this test case (everything will stay in shared buffers the whole time), and partially because the unrealistic data pattern is leading to not having to look at as much of the index as gincostestimate expected.

In general, it appears correct that longer query strings lead to a higher index cost estimate, because they produce more trigrams so there's more work for the index match to do.  (At some level, a longer query means more work in the seqscan case too; but our cost models are inadequate to predict that.)

			regards, tom lane
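[Editor's note: the trigram count mentioned above can be checked directly with pg_trgm's show_trgm() function; a quick sketch in psql:]

```sql
-- pg_trgm pads each word with two leading spaces and one trailing space,
-- so an n-character word produces n+1 trigrams; for '12345678' that is 9:
SELECT show_trgm('12345678');                   -- lists the extracted trigrams
SELECT array_length(show_trgm('12345678'), 1);  -- returns 9
```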
Re: pg_trgm word_similarity query does not use index for input strings longer than 8 characters
From
pgsql-performance@jhacker.de
Date:
Thank you both for the insights and your input.

> Yeah, this test case seems very unrealistic, both as to table size
> and as to the lack of variability of the table entries.

The example was based on real data with a more complicated query, which prompted me to investigate the issue. The distinction between slow and fast queries is not as clear-cut as with the generated data, but the general problem remains.

>> Since you have SSDs, you should tune "random_page_cost = 1.1".

I tested different values of random_page_cost with various queries. Values that were too small increased the execution time again, due to overly eager index usage. I identified the optimum for my use case at 1.4. This solved my problem, thanks.

Regards
Jonathan

On 07.12.21 18:08, Tom Lane wrote:
> Laurenz Albe <laurenz.albe@cybertec.at> writes:
>> On Tue, 2021-11-30 at 22:38 +0100, pgsql-performance@jhacker.de wrote:
>>> INSERT INTO song (artist, title)
>>> SELECT 'artist','title'
>>> FROM generate_series(1,10000);
>>>
>>> \set query '12345678'
>>>
>>> -- This query is slow
>>> EXPLAIN ANALYZE
>>> SELECT song.artist, song.title
>>> FROM song
>>> WHERE (song.artist %> :'query' OR song.title %> :'query')
>>> ;
>
>> The table is quite small; with a bigger table, the test would be more meaningful.
>
> Yeah, this test case seems very unrealistic, both as to table size
> and as to the lack of variability of the table entries.  I think the
> latter is causing the indexscans to take less time than they otherwise
> might, because none of the extracted trigrams find any matches.
>
>> Since you have SSDs, you should tune "random_page_cost = 1.1".
>
> Right.  Poking at gincostestimate a bit, I see that for this
> operator the indexscan cost estimate is basically driven by the
> number of trigrams extracted from the query string (nine in this
> test case) and the index size; those lead to a predicted number
> of index page fetches that's then scaled by random_page_cost.
> That's coming out to make it look more expensive than the seqscan.
> It's actually not more expensive, but that's partially because
> page fetch costs are really zero in this test case (everything
> will stay in shared buffers the whole time), and partially because
> the unrealistic data pattern is leading to not having to look at
> as much of the index as gincostestimate expected.
>
> In general, it appears correct that longer query strings lead to a
> higher index cost estimate, because they produce more trigrams so
> there's more work for the index match to do.  (At some level, a
> longer query means more work in the seqscan case too; but our cost
> models are inadequate to predict that.)
>
> regards, tom lane