Thread: planner row-estimates for tsvector seems horribly wrong

planner row-estimates for tsvector seems horribly wrong

From: Sushant Sinha
I am using a GIN index on a tsvector and doing basic searches. The
planner's row estimate is horribly wrong: it is 4843 for every query,
whether the query matches zero rows, a medium number of rows (88,000)
or a large number of rows (726,000).

The table has roughly a million docs.

I see a similar problem reported here, but I thought it was fixed in
9.0, which I am running.

http://archives.postgresql.org/pgsql-hackers/2010-05/msg01389.php

Here is the version info and detailed planner output for all the three
queries:


select version();

                                                version
---------------------------------------------------------------------------------------------------------------
 PostgreSQL 9.0.0 on x86_64-unknown-linux-gnu, compiled by GCC gcc (Gentoo 4.3.4 p1.1, pie-10.1.5) 4.3.4, 64-bit


Case I: FOR A NON-MATCHING WORD
===============================

explain analyze select count(*) from  docmeta,
plainto_tsquery('english', 'dyfdfdf') as qdoc where  docvector @@ qdoc;
                                                      QUERY PLAN
----------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=20322.17..20322.18 rows=1 width=0) (actual time=0.058..0.058 rows=1 loops=1)
   ->  Nested Loop  (cost=5300.28..20310.06 rows=4843 width=0) (actual time=0.055..0.055 rows=0 loops=1)
         ->  Function Scan on qdoc  (cost=0.00..0.01 rows=1 width=32) (actual time=0.005..0.005 rows=1 loops=1)
         ->  Bitmap Heap Scan on docmeta  (cost=5300.28..20249.51 rows=4843 width=270) (actual time=0.046..0.046 rows=0 loops=1)
               Recheck Cond: (docmeta.docvector @@ qdoc.qdoc)
               ->  Bitmap Index Scan on doc_index  (cost=0.00..5299.07 rows=4843 width=0) (actual time=0.044..0.044 rows=0 loops=1)
                     Index Cond: (docmeta.docvector @@ qdoc.qdoc)
 Total runtime: 0.092 ms

Case II: FOR A MEDIUM-MATCHING WORD
===================================

explain analyze select count(*) from  docmeta,
plainto_tsquery('english', 'quit') as qdoc where  docvector @@ qdoc;

                                                      QUERY PLAN
----------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=20322.17..20322.18 rows=1 width=0) (actual time=1222.856..1222.857 rows=1 loops=1)
   ->  Nested Loop  (cost=5300.28..20310.06 rows=4843 width=0) (actual time=639.275..1212.460 rows=88545 loops=1)
         ->  Function Scan on qdoc  (cost=0.00..0.01 rows=1 width=32) (actual time=0.006..0.007 rows=1 loops=1)
         ->  Bitmap Heap Scan on docmeta  (cost=5300.28..20249.51 rows=4843 width=270) (actual time=639.264..1196.542 rows=88545 loops=1)
               Recheck Cond: (docmeta.docvector @@ qdoc.qdoc)
               ->  Bitmap Index Scan on doc_index  (cost=0.00..5299.07 rows=4843 width=0) (actual time=621.877..621.877 rows=88545 loops=1)
                     Index Cond: (docmeta.docvector @@ qdoc.qdoc)
 Total runtime: 1222.907 ms


Case III: FOR A HIGH-MATCHING WORD
==================================

explain analyze select count(*) from  docmeta,
plainto_tsquery('english', 'j') as qdoc where  docvector @@ qdoc;
                                                      QUERY PLAN
----------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=20322.17..20322.18 rows=1 width=0) (actual time=742.857..742.858 rows=1 loops=1)
   ->  Nested Loop  (cost=5300.28..20310.06 rows=4843 width=0) (actual time=126.804..660.895 rows=726985 loops=1)
         ->  Function Scan on qdoc  (cost=0.00..0.01 rows=1 width=32) (actual time=0.004..0.006 rows=1 loops=1)
         ->  Bitmap Heap Scan on docmeta  (cost=5300.28..20249.51 rows=4843 width=270) (actual time=126.795..530.422 rows=726985 loops=1)
               Recheck Cond: (docmeta.docvector @@ qdoc.qdoc)
               ->  Bitmap Index Scan on doc_index  (cost=0.00..5299.07 rows=4843 width=0) (actual time=113.742..113.742 rows=726985 loops=1)
                     Index Cond: (docmeta.docvector @@ qdoc.qdoc)
 Total runtime: 742.906 ms

Thanks,
Sushant.



Re: planner row-estimates for tsvector seems horribly wrong

From: Jan Urbański
On 24/10/10 14:44, Sushant Sinha wrote:
> I am using a GIN index on a tsvector and doing basic searches. The
> planner's row estimate is horribly wrong: it is 4843 for every query,
> whether the query matches zero rows, a medium number of rows (88,000)
> or a large number of rows (726,000).
> 
> The table has roughly a million docs.
> 
> I see a similar problem reported here, but I thought it was fixed in
> 9.0, which I am running.

Hi,

What's your default statistics target? Could you attach the output of

select * from pg_stats where tablename = '<the-tablename>';

and

select typanalyze from pg_type where typname = 'tsvector';
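
(If you are not sure what the statistics target currently is, the
standard ways to check it are shown below; the table and column names
"docmeta" and "docvector" are just taken from your plans.)

-- default target used by ANALYZE when no per-column target is set
SHOW default_statistics_target;

-- any per-column override on the tsvector column
SELECT attname, attstattarget
FROM pg_attribute
WHERE attrelid = 'docmeta'::regclass AND attname = 'docvector';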

Thanks,
Jan


Re: planner row-estimates for tsvector seems horribly wrong

From: Jan Urbański
On 24/10/10 14:44, Sushant Sinha wrote:
> I am using a GIN index on a tsvector and doing basic searches. The
> planner's row estimate is horribly wrong: it is 4843 for every query,
> whether the query matches zero rows, a medium number of rows (88,000)
> or a large number of rows (726,000).
> 
> The table has roughly a million docs.

> explain analyze select count(*) from  docmeta,
> plainto_tsquery('english', 'dyfdfdf') as qdoc where  docvector @@ qdoc;

OK, forget my previous message. The problem is that you are doing a join
using @@ as the operator for the join condition, so the planner uses the
operator's join selectivity estimate. For @@ the tsmatchjoinsel function
simply returns 0.005.
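
(As a rough sanity check: 0.005 times the planner's estimate of
docmeta's size should reproduce the constant rows=4843 you see in every
plan, since 4843 / 0.005 is about 968,600 rows. The query below just
reads back the stored estimate; "docmeta" is your table name from the
plans.)

SELECT reltuples, reltuples * 0.005 AS estimated_join_rows
FROM pg_class
WHERE relname = 'docmeta';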

Try doing:

explain analyze select count(*) from docmeta where docvector @@
plainto_tsquery('english', 'dyfdfdf');

It should help.

Cheers,
Jan


Re: planner row-estimates for tsvector seems horribly wrong

From: Sushant Sinha
Thanks a ton, Jan! It now works correctly. But many tsearch tutorials
ask for the tsquery to be placed in the FROM clause, and that can
produce a bad plan. Isn't it possible to return the correct estimate
for a join with the query as well?
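
(To be concrete, these are the two spellings I mean, using the table
and column names from this thread; the first is what most tutorials
show, the second is your rewrite.)

-- tutorial style: tsquery as a FROM item, matched through a join
-- condition, so the planner falls back to the flat 0.005 selectivity
SELECT count(*)
FROM docmeta, plainto_tsquery('english', 'quit') AS qdoc
WHERE docvector @@ qdoc;

-- rewritten style: tsquery inline in WHERE, so the planner can use the
-- tsvector statistics for the estimate
SELECT count(*)
FROM docmeta
WHERE docvector @@ plainto_tsquery('english', 'quit');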

-Sushant.

On Sun, 2010-10-24 at 15:07 +0200, Jan Urbański wrote:
> On 24/10/10 14:44, Sushant Sinha wrote:
> > I am using a GIN index on a tsvector and doing basic searches. The
> > planner's row estimate is horribly wrong: it is 4843 for every query,
> > whether the query matches zero rows, a medium number of rows (88,000)
> > or a large number of rows (726,000).
> > 
> > The table has roughly a million docs.
> 
> > explain analyze select count(*) from  docmeta,
> > plainto_tsquery('english', 'dyfdfdf') as qdoc where  docvector @@ qdoc;
> 
> OK, forget my previous message. The problem is that you are doing a join
> using @@ as the operator for the join condition, so the planner uses the
> operator's join selectivity estimate. For @@ the tsmatchjoinsel function
> simply returns 0.005.
> 
> Try doing:
> 
> explain analyze select count(*) from docmeta where docvector @@
> plainto_tsquery('english', 'dyfdfdf');
> 
> It should help.
> 
> Cheers,
> Jan