Thread: sequential scan unduly favored over text search gin index
I have a tsvector column docvector and a gin index on it,
docmeta1_docvector_idx.

I have a simple query "select * from docmeta1 where docvector @@
plainto_tsquery('english', 'free');"

I find that the planner chooses a sequential scan of the table even
though the index performs orders of magnitude faster. I set
random_page_cost = 1.0 for the database to favor index use. However, I
still see that the cost estimate for a sequential scan of the entire
table (23000) is cheaper than the cost of using the index (33000). The
time taken for sequential access is 5200 ms and for index usage is only
85 ms.

Details here:

postgres version 9.0.2
The statistics target on docvector is set to 10000 and, as you can see,
the row estimates are fine.

lawdb=# \d docmeta1
            Table "public.docmeta1"
   Column    |   Type    | Modifiers
-------------+-----------+-----------
 tid         | integer   | not null
 docweight   | integer   |
 doctype     | integer   |
 publishdate | date      |
 covertids   | integer[] |
 titlevector | tsvector  |
 docvector   | tsvector  |
Indexes:
    "docmeta1_pkey" PRIMARY KEY, btree (tid)
    "docmeta1_date_idx" btree (publishdate)
    "docmeta1_docvector_idx" gin (docvector)
    "docmeta1_title_idx" gin (titlevector)

lawdb=# SELECT relpages, reltuples FROM pg_class WHERE relname
='docmeta1';
 relpages | reltuples
----------+-----------
    18951 |    329940

lawdb=# explain analyze select * from docmeta1 where docvector @@
plainto_tsquery('english', 'free');
                                                    QUERY PLAN
-------------------------------------------------------------------------------------------------------------------
 Seq Scan on docmeta1  (cost=0.00..23075.25 rows=35966 width=427) (actual time=0.145..5189.556 rows=35966 loops=1)
   Filter: (docvector @@ '''free'''::tsquery)
 Total runtime: 5196.231 ms
(3 rows)

lawdb=# set enable_seqscan = off;
SET
lawdb=# explain analyze select * from docmeta1 where docvector @@
plainto_tsquery('english', 'free');
                                                               QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on docmeta1  (cost=14096.25..33000.83 rows=35966 width=427) (actual time=9.543..82.754 rows=35966 loops=1)
   Recheck Cond: (docvector @@ '''free'''::tsquery)
   ->  Bitmap Index Scan on docmeta1_docvector_idx  (cost=0.00..14087.26 rows=35966 width=0) (actual time=8.059..8.059 rows=35967 loops=1)
         Index Cond: (docvector @@ '''free'''::tsquery)
 Total runtime: 85.304 ms
(5 rows)

-Sushant.
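As a side note, the 23075.25 seq scan estimate above can be reproduced
exactly from the default planner cost parameters (seq_page_cost = 1.0,
cpu_tuple_cost = 0.01, cpu_operator_cost = 0.0025), assuming none
besides random_page_cost were changed:

  seq_page_cost * relpages + (cpu_tuple_cost + cpu_operator_cost) * reltuples
    = 1.0 * 18951 + (0.01 + 0.0025) * 329940
    = 18951 + 4124.25
    = 23075.25

Note that the @@ filter is charged only cpu_operator_cost = 0.0025 per
row here, which is relevant to the procost discussion later in the
thread.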
Sushant Sinha <sushant354@gmail.com> wrote:

> I have a tsvector column docvector and a gin index on it
> docmeta1_docvector_idx
>
> I have a simple query "select * from docmeta1 where docvector @@
> plainto_tsquery('english', 'free');"
>
> I find that the planner chooses a sequential scan of the table
> even though the index performs orders of magnitude faster.

Did you ANALYZE the table after loading the data and building the
index?

-Kevin
On Mon, 2011-06-20 at 10:58 -0500, Kevin Grittner wrote:
> Sushant Sinha <sushant354@gmail.com> wrote:
>
> > I have a tsvector column docvector and a gin index on it
> > docmeta1_docvector_idx
> >
> > I have a simple query "select * from docmeta1 where docvector @@
> > plainto_tsquery('english', 'free');"
> >
> > I find that the planner chooses a sequential scan of the table
> > even though the index performs orders of magnitude faster.
>
> Did you ANALYZE the table after loading the data and building the
> index?

Yes, and I mentioned that the row estimates are correct, which indicates
that the problem is somewhere else.

-Sushant.
On 2011-06-20 17:38, Sushant Sinha wrote:
> I have a tsvector column docvector and a gin index on it
> docmeta1_docvector_idx
>
> I have a simple query "select * from docmeta1 where docvector @@
> plainto_tsquery('english', 'free');"
>
> I find that the planner chooses a sequential scan of the table even
> though the index performs orders of magnitude faster. I set
> random_page_cost = 1.0 for the database to favor index use. However, I
> still see that the cost estimate for a sequential scan of the entire
> table (23000) is cheaper than the cost of using the index (33000). The
> time taken for sequential access is 5200 ms and for index usage is
> only 85 ms.

The cost-estimation code for gin indices is not good in 9.0; it has been
hugely improved in 9.1:

http://git.postgresql.org/gitweb?p=postgresql.git&a=search&h=HEAD&st=commit&s=gincost

As far as I remember, the individual patches apply quite cleanly to 9.0.

-- 
Jesper
On 20.6.2011 18:04, Sushant Sinha wrote:
> On Mon, 2011-06-20 at 10:58 -0500, Kevin Grittner wrote:
>> Sushant Sinha <sushant354@gmail.com> wrote:
>>
>>> I have a tsvector column docvector and a gin index on it
>>> docmeta1_docvector_idx
>>>
>>> I have a simple query "select * from docmeta1 where docvector @@
>>> plainto_tsquery('english', 'free');"
>>>
>>> I find that the planner chooses a sequential scan of the table
>>> even though the index performs orders of magnitude faster.
>>
>> Did you ANALYZE the table after loading the data and building the
>> index?
>
> Yes, and I mentioned that the row estimates are correct, which
> indicates that the problem is somewhere else.

Hi,

I agree the estimates are damn precise in this case (actually the
estimates are exact). The problem is the planner thinks the seq scan is
about 30% cheaper than the bitmap index scan.

I guess you could poke the planner towards the bitmap scan by lowering
the random_page_cost (the default value is 4; I'd say lowering it to 2
should do the trick). But be careful, this will influence all the other
queries! Those values should somehow reflect the hardware of your system
(type of drives, amount of RAM, etc.), so you have to test the effects.

regards
Tomas
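For reference, a minimal sketch of both ways to apply that change,
assuming the database name lawdb from the thread; the value 2.0 is only
illustrative and needs testing against your own workload:

  -- session-only, handy for testing a single query
  SET random_page_cost = 2.0;

  -- persist for one database; new sessions pick it up
  ALTER DATABASE lawdb SET random_page_cost = 2.0;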
> I agree the estimates are damn precise in this case (actually the
> estimates are exact). The problem is the planner thinks the seq scan is
> about 30% cheaper than the bitmap index scan.
>
> I guess you could poke the planner towards the bitmap scan by lowering
> the random_page_cost (the default value is 4; I'd say lowering it to 2
> should do the trick).

The numbers that I gave were after setting random_page_cost = 1.0. After
this I don't know what to do.

-Sushant.
Sushant Sinha <sushant354@gmail.com> writes:
>> I guess you could poke the planner towards the bitmap scan by lowering
>> the random_page_cost (the default value is 4; I'd say lowering it to 2
>> should do the trick).

> The numbers that I gave were after setting random_page_cost = 1.0. After
> this I don't know what to do.

I think part of the issue here is that the @@ operator is expensive,
and so evaluating it once per row is expensive, but the pg_proc.procost
setting for it doesn't adequately reflect that. You could experiment
with tweaking that setting ...

			regards, tom lane
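One hedged sketch of how to experiment with this: on 9.0 the
tsvector @@ tsquery operator should be backed by the ts_match_vq
function (the first query confirms that before anything is altered),
and ALTER FUNCTION ... COST changes its planner cost estimate. The
value 100 below is only a starting guess:

  -- confirm which function implements tsvector @@ tsquery
  SELECT oprcode
  FROM pg_operator
  WHERE oprname = '@@'
    AND oprleft = 'tsvector'::regtype
    AND oprright = 'tsquery'::regtype;

  -- raise its estimated per-call cost (default procost is 1)
  ALTER FUNCTION ts_match_vq(tsvector, tsquery) COST 100;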
On Mon, Jun 20, 2011 at 8:38 AM, Sushant Sinha <sushant354@gmail.com> wrote:
>
> postgres version 9.0.2
> The statistics target on docvector is set to 10000 and, as you can see,
> the row estimates are fine.
>
> lawdb=# \d docmeta1
>             Table "public.docmeta1"
>    Column    |   Type    | Modifiers
> -------------+-----------+-----------
>  tid         | integer   | not null
>  docweight   | integer   |
>  doctype     | integer   |
>  publishdate | date      |
>  covertids   | integer[] |
>  titlevector | tsvector  |
>  docvector   | tsvector  |
> Indexes:
>     "docmeta1_pkey" PRIMARY KEY, btree (tid)
>     "docmeta1_date_idx" btree (publishdate)
>     "docmeta1_docvector_idx" gin (docvector)
>     "docmeta1_title_idx" gin (titlevector)
>
> lawdb=# SELECT relpages, reltuples FROM pg_class WHERE relname
> ='docmeta1';
>  relpages | reltuples
> ----------+-----------
>     18951 |    329940

What are the sizes of the associated toast tables for the tsvector
columns?

> lawdb=# explain analyze select * from docmeta1 where docvector @@
> plainto_tsquery('english', 'free');

It would be nice to see the results of explain (analyze, buffers).

Cheers,
Jeff
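For reference, a minimal sketch of how to answer both questions, using
only names that appear in the thread:

  -- size of the toast table holding docmeta1's out-of-line tsvector data
  SELECT reltoastrelid::regclass AS toast_table,
         pg_size_pretty(pg_relation_size(reltoastrelid)) AS toast_size
  FROM pg_class
  WHERE relname = 'docmeta1';

  -- the original query with buffer statistics (available since 9.0)
  EXPLAIN (ANALYZE, BUFFERS)
  SELECT * FROM docmeta1
  WHERE docvector @@ plainto_tsquery('english', 'free');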
On Mon, Jun 20, 2011 at 10:53 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Sushant Sinha <sushant354@gmail.com> writes:
>>> I guess you could poke the planner towards the bitmap scan by lowering
>>> the random_page_cost (the default value is 4; I'd say lowering it to 2
>>> should do the trick).
>
>> The numbers that I gave were after setting random_page_cost = 1.0. After
>> this I don't know what to do.
>
> I think part of the issue here is that the @@ operator is expensive,
> and so evaluating it once per row is expensive, but the pg_proc.procost
> setting for it doesn't adequately reflect that. You could experiment
> with tweaking that setting ...

In something I was testing a couple of months ago, by far the biggest
expense of the @@ operator in a full table scan was in crawling through
the entire toast table (and not necessarily in sequential order) in
order to get the tsvector data on which to apply the operator.

So increasing the cost of @@ might very well be the best immediate
solution, but should the cost estimation code be changed to explicitly
take page reads associated with toast into account, so that the cost of
@@ itself can remain a CPU-based estimate rather than an amalgam of CPU
and IO?

Cheers,
Jeff