Thread: FTS performance issue - planner problem identified (but only partially resolved)
FTS performance issue - planner problem identified (but only partially resolved)
From:
Stefan Keller
Date:
Hi

At 2013/2/8 I wrote:
> I have problems with the performance of FTS in a query like this:
>
> SELECT * FROM FullTextSearch WHERE content_tsv_gin @@
> plainto_tsquery('english', 'good');
>
> It's slow (> 30 sec.) for some GB (27886 html files, originally 73 MB zipped).
> The planner obviously always chooses table scan

Now I've identified (but only partially resolved) the issue. Here are my comments.

This is the query in question (see the commented log below):

select id,title,left(content,100)
from fulltextsearch
where plainto_tsquery('pg_catalog.english','good')
      @@ to_tsvector('pg_catalog.english',content);

After the GIN index has been created, the FTS query is unexpectedly fast because the planner chooses a "Bitmap Index Scan". After the index statistics have been updated, the same query becomes slow - the only remedy being the "trick" of moving the plainto_tsquery() call out of the WHERE clause into the FROM clause. (I think GiST doesn't change anything.)

select id,title,left(content,100)
from fulltextsearch, plainto_tsquery('pg_catalog.english','good') query
where query @@ to_tsvector('pg_catalog.english',content);

=> This hint should be mentioned in the docs!

Then, setting enable_seqscan to off makes the original query fast again. But that's a setting I want to avoid in a multi-user database. Finally, setting random_page_cost to 1 helps as well - but I don't like that setting either.

=> In my view the planner should be updated to recognize the immutable plainto_tsquery() function in the WHERE clause and choose a "Bitmap Index Scan" in the first place.

What do you think?
Yours, Stefan

----
Let's look at the table fulltextsearch:

movies=# \d fulltextsearch
                         Table "public.fulltextsearch"
 Column  |  Type   |                          Modifiers
---------+---------+-------------------------------------------------------------
 id      | integer | not null default nextval('fulltextsearch_id_seq'::regclass)
 docid   | integer | default 0
 title   | text    |
 content | text    | not null

movies=# CREATE INDEX fulltextsearch_gincontent ON fulltextsearch
         USING gin(to_tsvector('pg_catalog.english',content));

movies=# SELECT * FROM pg_class c WHERE relname LIKE 'fullt%';
  oid   |           name            | kind |   tuples    | pages | allvisible | toastrelid | hasindex
--------+---------------------------+------+-------------+-------+------------+------------+----------
 476289 | fulltextsearch            | r    |       27886 |   555 |          0 |     476293 | t
 503080 | fulltextsearch_gincontent | i    | 8.97135e+06 | 11133 |          0 |          0 | f
 476296 | fulltextsearch_id_seq     | S    |           1 |     1 |          0 |          0 | f
 503075 | fulltextsearch_pkey       | i    |       27886 |    79 |          0 |          0 | f
(4 rows)

=> fulltextsearch_gincontent has an arbitrarily large number of tuples (the statistics are wrong and not yet updated).

movies=# explain (analyze,costs,timing,buffers)
select id,title,left(content,100)
from fulltextsearch
where plainto_tsquery('pg_catalog.english','good')
      @@ to_tsvector('pg_catalog.english',content);

=> Unexpectedly, the query is fast!
See query plan http://explain.depesz.com/s/ewn

Let's update the statistics:

movies=# VACUUM ANALYZE VERBOSE fulltextsearch;

movies=# SELECT * FROM pg_class c WHERE relname LIKE 'fullt%';
  oid   |           name            | kind | tuples | pages | allvisible | toastrelid | hasindex
--------+---------------------------+------+--------+-------+------------+------------+----------
 476289 | fulltextsearch            | r    |  27886 |   555 |        555 |     476293 | t
 503080 | fulltextsearch_gincontent | i    |  27886 | 11133 |          0 |          0 | f
 476296 | fulltextsearch_id_seq     | S    |      1 |     1 |          0 |          0 | f
 503075 | fulltextsearch_pkey       | i    |  27886 |    79 |          0 |          0 | f
(4 rows)

=> Now, after the statistics have been updated (note especially the tuples count of fulltextsearch_gincontent), the original query is slow!
See query plan http://explain.depesz.com/s/MQ60

Now let's reformulate the original query and move the plainto_tsquery() call to the FROM clause:

movies=# explain (analyze,costs,timing,buffers)
select id,title,left(content,100)
from fulltextsearch, plainto_tsquery('pg_catalog.english','good') query
where query @@ to_tsvector('pg_catalog.english',content);

=> This particular query is fast again! See query plan http://explain.depesz.com/s/FVT

Setting enable_seqscan to off makes the original query fast, too: see query plan http://explain.depesz.com/s/eOr

Finally, setting random_page_cost to 1 helps as well (the default is 4):

movies=# set enable_seqscan to default;
movies=# set random_page_cost to 1.0;

=> Query is fast. See query plan http://explain.depesz.com/s/M5Ke

----
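[Editorial note: Stefan wants to avoid flipping enable_seqscan or random_page_cost globally in a multi-user database. One hedge worth recording here: both settings can be scoped to a single transaction with SET LOCAL, so other sessions are unaffected. A minimal sketch using the table and query from the log above:]

```sql
BEGIN;
-- Applies only inside this transaction; reverts automatically
-- at COMMIT or ROLLBACK, leaving other sessions untouched.
SET LOCAL random_page_cost = 1.0;

SELECT id, title, left(content, 100)
FROM fulltextsearch
WHERE plainto_tsquery('pg_catalog.english', 'good')
      @@ to_tsvector('pg_catalog.english', content);
COMMIT;
```

The same pattern works for enable_seqscan; it changes the cost model for one query's planning only, rather than for the whole server.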
Re: FTS performance issue - planner problem identified (but only partially resolved)
From:
Stefan Keller
Date:
Hi

Sorry, referring to the GiST index in my previous mail was not a good idea. The bottom line still is that, for the query as recommended by the docs, the planner doesn't choose the index, which makes it slow - unless the original query

> select id,title,left(content,100)
> from fulltextsearch
> where plainto_tsquery('pg_catalog.english','good')
>       @@ to_tsvector('pg_catalog.english',content);

is reformulated as

> select id,title,left(content,100)
> from fulltextsearch, plainto_tsquery('pg_catalog.english','good') query
> where query @@ to_tsvector('pg_catalog.english',content);

using the default values for enable_seqscan and random_page_cost.

Yours, S.
Re: FTS performance issue - planner problem identified (but only partially resolved)
From:
Marc Mamin
Date:
> SELECT * FROM FullTextSearch WHERE content_tsv_gin @@
> plainto_tsquery('english', 'good');
>
> It's slow (> 30 sec.) for some GB (27886 html files, originally 73 MB zipped).
> The planner obviously always chooses table scan

Hello,

A probable reason for the time difference is the cost of decompressing toasted content. At least in 8.3, the planner was not good at estimating it.

I'm getting better overall performance since I stopped collecting statistics on tsvectors. An alternative would have been to disallow compression on them.

I'm aware this is a drastic measure and would not recommend it without testing. The benefit may depend on the type of data you are indexing. In our use case these are error logs with many Java stack traces, hence with many poorly discriminative lexemes.

See http://www.postgresql.org/message-id/27953.1329434125@sss.pgh.pa.us
as a comment on
http://www.postgresql.org/message-id/C4DAC901169B624F933534A26ED7DF310861B363@JENMAIL01.ad.intershop.net

Regards,

Marc Mamin
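[Editorial note: Marc's "disallow compression" alternative maps onto the column storage mode: SET STORAGE EXTERNAL keeps large values out-of-line in TOAST but stored uncompressed, so reading them skips the decompression step. A sketch, assuming a stored tsvector column named content_tsv_gin as in Stefan's original query; untested here, and as Marc says, benchmark before adopting:]

```sql
-- EXTERNAL = allow out-of-line (TOAST) storage, but never compress,
-- so fetching the tsvector avoids the decompression cost Marc describes.
ALTER TABLE FullTextSearch
    ALTER COLUMN content_tsv_gin SET STORAGE EXTERNAL;
```

Only newly written rows are affected; existing rows keep their current storage until rewritten (e.g. by an UPDATE of that column).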
Re: FTS performance issue - planner problem identified (but only partially resolved)
From:
Stefan Keller
Date:
Hi Marc

Thanks a lot for your hint!

You mean doing a "SET track_counts (true);" for the whole session? That would be OK if it were possible just for the GIN index.

It's obviously an issue of the planner's cost estimation. The data I'm speaking about ("movies") has a text attribute with a length of more than 8 KB, so it obviously has to do with detoasting. But the thoughts about the @@ operator together with this GIN index also seem valid.

I hope this issue is being tracked in preparation for 9.3.

Regards, Stefan
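[Editorial note: one standard way to take both the detoasting of content and the repeated to_tsvector() call out of the query path is to materialize the tsvector in its own column and index that column directly - the pattern documented in the PostgreSQL full-text-search chapter. A sketch with hypothetical column and trigger names; this is not something proposed in the thread for this table:]

```sql
-- Precompute the tsvector once per row instead of at query time.
ALTER TABLE fulltextsearch ADD COLUMN content_tsv tsvector;
UPDATE fulltextsearch
   SET content_tsv = to_tsvector('pg_catalog.english', content);
CREATE INDEX fulltextsearch_tsv_gin
    ON fulltextsearch USING gin(content_tsv);

-- Keep the column current on INSERT/UPDATE using the built-in trigger.
CREATE TRIGGER fulltextsearch_tsv_update
    BEFORE INSERT OR UPDATE ON fulltextsearch
    FOR EACH ROW EXECUTE PROCEDURE
    tsvector_update_trigger(content_tsv, 'pg_catalog.english', content);

-- Queries then hit the stored column, so content is never detoasted
-- just to evaluate the match:
SELECT id, title, left(content, 100)
FROM fulltextsearch
WHERE content_tsv @@ plainto_tsquery('pg_catalog.english', 'good');
```

This also gives the planner ordinary column statistics on the tsvector itself, rather than statistics on an index expression.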
Re: FTS performance issue - planner problem identified (but only partially resolved)
From:
Marc Mamin
Date:
________________________________________
From: Stefan Keller [sfkeller@gmail.com]
> Sent: Saturday, 20 July 2013, 01:55
>
> Hi Marc
>
> Thanks a lot for your hint!
>
> You mean doing a "SET track_counts (true);" for the whole session?

No, I mean:

ALTER TABLE <table> ALTER <ts_vector_column> SET STATISTICS 0;

And remove the existing statistics:

DELETE FROM pg_catalog.pg_statistic
WHERE starelid = '<table>'::regclass
  AND staattnum = (SELECT attnum
                   FROM pg_attribute
                   WHERE attrelid = '<table>'::regclass
                     AND attname = '<ts_vector_column>'::name);

But you should first try to find out which proportion of your ts queries are faster when using a table scan, as such plans will probably no longer be chosen afterwards! (Except if further columns of your table 'FullTextSearch' are considered by the planner.)
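[Editorial note: the effect of Marc's two statements can be verified from the catalogs - the per-column statistics target after the ALTER, and the absence of statistics rows after the DELETE. A small sketch; table and column names are placeholders as in Marc's mail:]

```sql
-- attstattarget = 0 means ANALYZE will skip this column from now on.
SELECT attname, attstattarget
FROM pg_attribute
WHERE attrelid = 'fulltextsearch'::regclass
  AND attname = 'content_tsv_gin';

-- Should return zero rows once the old statistics have been deleted.
SELECT tablename, attname
FROM pg_stats
WHERE tablename = 'fulltextsearch'
  AND attname = 'content_tsv_gin';
```

Checking pg_stats (the readable view over pg_statistic) is safer than querying pg_statistic directly.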
Re: FTS performance issue - planner problem identified (but only partially resolved)
From:
Kevin Grittner
Date:
Stefan Keller <sfkeller@gmail.com> wrote:

> Finally, setting random_page_cost to 1 helps also - but I don't
> like this setting neither.

Well, you should learn to like whichever settings best model your actual costs given your level of caching and your workload. ;-)

FWIW, I have found page costs less volatile and easier to tune with cpu_tuple_cost increased. I just always start by bumping that to 0.03.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
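[Editorial note: Kevin's starting point can be tried per session and checked with EXPLAIN before committing anything to postgresql.conf. A sketch against the query from Stefan's log; the value shown is Kevin's suggested starting point, not a measured optimum:]

```sql
-- Default cpu_tuple_cost is 0.01; raising it makes per-row CPU work
-- weigh more against page fetches in the planner's cost model.
SET cpu_tuple_cost = 0.03;

EXPLAIN (ANALYZE, BUFFERS)
SELECT id, title, left(content, 100)
FROM fulltextsearch
WHERE plainto_tsquery('pg_catalog.english', 'good')
      @@ to_tsvector('pg_catalog.english', content);
```

A plain SET lasts only for the session, so the plan change can be compared side by side with the default before changing the server-wide configuration.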
Re: FTS performance issue - planner problem identified (but only partially resolved)
From:
Stefan Keller
Date:
Hi Kevin

Well, you're right :-) But my use cases are unspecific "by design", since I'm using FTS as a general-purpose function.

So I still propose to enhance the planner too, as Tom Lane and your colleague suggested based on repeated similar complaints [1].

Yours, Stefan

[1] http://www.postgresql.org/message-id/CA+TgmoZgQBeu2KN305hwDS+aXW7YP0YN9vZwBsbWA8Unst+cew@mail.gmail.com