Thread: sequential scan unduly favored over text search gin index

sequential scan unduly favored over text search gin index

From: Sushant Sinha
I have a tsvector column docvector and a gin index on it
docmeta1_docvector_idx

I have a simple query "select * from docmeta1 where docvector @@
plainto_tsquery('english', 'free');"

I find that the planner chooses a sequential scan of the table even though
the index scan performs orders of magnitude better. I set random_page_cost = 1.0 for
the database to favor index use. However, I still see that the cost
estimate for sequential scan of the entire table (23000) is cheaper than
the cost of using the index (33000). The time taken for sequential
access is 5200 ms and for index usage is only 85 ms.

Details here:

Postgres version: 9.0.2
The statistics target on docvector is set to 10000 and, as you can see, the
row estimates are fine.
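
Roughly, the statistics target was set with something like (a sketch, not
the exact statements used):

-- raise the per-column statistics target and re-analyze
ALTER TABLE docmeta1 ALTER COLUMN docvector SET STATISTICS 10000;
ANALYZE docmeta1;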

lawdb=# \d docmeta1
       Table "public.docmeta1"
   Column    |   Type    | Modifiers
-------------+-----------+-----------
 tid         | integer   | not null
 docweight   | integer   |
 doctype     | integer   |
 publishdate | date      |
 covertids   | integer[] |
 titlevector | tsvector  |
 docvector   | tsvector  |
Indexes:
    "docmeta1_pkey" PRIMARY KEY, btree (tid)
    "docmeta1_date_idx" btree (publishdate)
    "docmeta1_docvector_idx" gin (docvector)
    "docmeta1_title_idx" gin (titlevector)

lawdb=# SELECT relpages, reltuples FROM pg_class WHERE relname = 'docmeta1';
 relpages | reltuples
----------+-----------
    18951 |    329940


lawdb=# explain analyze select * from docmeta1 where docvector @@
plainto_tsquery('english', 'free');
                                                     QUERY PLAN
--------------------------------------------------------------------------------------------------------------------
 Seq Scan on docmeta1  (cost=0.00..23075.25 rows=35966 width=427) (actual time=0.145..5189.556 rows=35966 loops=1)
   Filter: (docvector @@ '''free'''::tsquery)
 Total runtime: 5196.231 ms
(3 rows)

lawdb=# set enable_seqscan = off;
SET
lawdb=# explain analyze select * from docmeta1 where docvector @@
plainto_tsquery('english', 'free');
                                                          QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on docmeta1  (cost=14096.25..33000.83 rows=35966 width=427) (actual time=9.543..82.754 rows=35966 loops=1)
   Recheck Cond: (docvector @@ '''free'''::tsquery)
   ->  Bitmap Index Scan on docmeta1_docvector_idx  (cost=0.00..14087.26 rows=35966 width=0) (actual time=8.059..8.059 rows=35967 loops=1)
         Index Cond: (docvector @@ '''free'''::tsquery)
 Total runtime: 85.304 ms
(5 rows)


-Sushant.


Re: sequential scan unduly favored over text search gin index

From: "Kevin Grittner"
Sushant Sinha <sushant354@gmail.com> wrote:

> I have a tsvector column docvector and a gin index on it
> docmeta1_docvector_idx
>
> I have a simple query "select * from docmeta1 where docvector @@
> plainto_tsquery('english', 'free');"
>
> I find that the planner chooses a sequential scan of the table
> even though the index scan performs orders of magnitude better.

Did you ANALYZE the table after loading the data and building the
index?
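
For example:

ANALYZE docmeta1;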

-Kevin

Re: sequential scan unduly favored over text search gin index

From: Sushant Sinha
On Mon, 2011-06-20 at 10:58 -0500, Kevin Grittner wrote:
> Sushant Sinha <sushant354@gmail.com> wrote:
>
> > I have a tsvector column docvector and a gin index on it
> > docmeta1_docvector_idx
> >
> > I have a simple query "select * from docmeta1 where docvector @@
> > plainto_tsquery('english', 'free');"
> >
> > I find that the planner chooses a sequential scan of the table
> > even though the index scan performs orders of magnitude better.
>
> Did you ANALYZE the table after loading the data and building the
> index?

Yes, and I mentioned that the row estimates are correct, which indicates
that the problem is somewhere else.

-Sushant.

> -Kevin



Re: sequential scan unduly favored over text search gin index

From: Jesper Krogh
On 2011-06-20 17:38, Sushant Sinha wrote:
> I have a tsvector column docvector and a gin index on it
> docmeta1_docvector_idx
>
> I have a simple query "select * from docmeta1 where docvector @@
> plainto_tsquery('english', 'free');"
>
> I find that the planner chooses a sequential scan of the table even though
> the index scan performs orders of magnitude better. I set random_page_cost = 1.0 for
> the database to favor index use. However, I still see that the cost
> estimate for sequential scan of the entire table (23000) is cheaper than
> the cost of using the index (33000). The time taken for sequential
> access is 5200 ms and for index usage is only 85 ms.

The cost-estimation code for GIN indexes is not good in 9.0; this has
been hugely improved in 9.1:

http://git.postgresql.org/gitweb?p=postgresql.git&a=search&h=HEAD&st=commit&s=gincost

I think the individual patches apply quite cleanly to 9.0 as far
as I remember.

--
Jesper

Re: sequential scan unduly favored over text search gin index

From: Tomas Vondra
On 20.6.2011 18:04, Sushant Sinha wrote:
>
> On Mon, 2011-06-20 at 10:58 -0500, Kevin Grittner wrote:
>> Sushant Sinha <sushant354@gmail.com> wrote:
>>
>>> I have a tsvector column docvector and a gin index on it
>>> docmeta1_docvector_idx
>>>
>>> I have a simple query "select * from docmeta1 where docvector @@
>>> plainto_tsquery('english', 'free');"
>>>
>>> I find that the planner chooses a sequential scan of the table
>>> even though the index scan performs orders of magnitude better.
>>
>> Did you ANALYZE the table after loading the data and building the
>> index?
> Yes, and I mentioned that the row estimates are correct, which indicates
> that the problem is somewhere else.

Hi,

I agree the estimates are damn precise in this case (actually the
estimates are exact). The problem is the planner thinks the seq scan is
about 30% cheaper than the bitmap index scan.

I guess you could poke the planner towards the bitmap scan by lowering
the random_page_cost (the default value is 4, I'd say lowering it to 2
should do the trick).

But be careful, this will influence all the other queries! Those values
should somehow reflect the hardware of your system (type of drives,
amount of RAM, etc.) so you have to test the effects.
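
For experimenting, something along these lines (the value is only an
example):

-- try it for the current session only
SET random_page_cost = 2.0;

-- or make it stick for just this database
ALTER DATABASE lawdb SET random_page_cost = 2.0;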


regards
Tomas

Re: sequential scan unduly favored over text search gin index

From: Sushant Sinha
>
> I agree the estimates are damn precise in this case (actually the
> estimates are exact). The problem is the planner thinks the seq scan is
> about 30% cheaper than the bitmap index scan.
>
> I guess you could poke the planner towards the bitmap scan by lowering
> the random_page_cost (the default value is 4, I'd say lowering it to 2
> should do the trick).

The numbers that I gave were after setting random_page_cost = 1.0. After
that I don't know what to do.

-Sushant.


Re: sequential scan unduly favored over text search gin index

From: Tom Lane
Sushant Sinha <sushant354@gmail.com> writes:
>> I guess you could poke the planner towards the bitmap scan by lowering
>> the random_page_cost (the default value is 4, I'd say lowering it to 2
>> should do the trick).

> The numbers that I gave were after setting random_page_cost = 1.0. After
> that I don't know what to do.

I think part of the issue here is that the @@ operator is expensive,
and so evaluating it once per row is expensive, but the pg_proc.procost
setting for it doesn't adequately reflect that.  You could experiment
with tweaking that setting ...
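
For instance (ts_match_vq is the function behind the tsvector @@ tsquery
operator; the COST value below is just a number to experiment with):

-- see the current cost estimates for the functions implementing @@
SELECT p.proname, p.procost
FROM pg_operator o JOIN pg_proc p ON p.oid = o.oprcode
WHERE o.oprname = '@@';

-- make per-row evaluation of @@ look more expensive to the planner
ALTER FUNCTION ts_match_vq(tsvector, tsquery) COST 100;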

            regards, tom lane

Re: sequential scan unduly favored over text search gin index

From: Jeff Janes
On Mon, Jun 20, 2011 at 8:38 AM, Sushant Sinha <sushant354@gmail.com> wrote:
>
> postgres version 9.0.2
> statistics on docvector is set to 10000 and as you can see the row
> estimates are fine.
>
> lawdb=# \d docmeta1
>       Table "public.docmeta1"
>   Column    |   Type    | Modifiers
> -------------+-----------+-----------
>  tid         | integer   | not null
>  docweight   | integer   |
>  doctype     | integer   |
>  publishdate | date      |
>  covertids   | integer[] |
>  titlevector | tsvector  |
>  docvector   | tsvector  |
> Indexes:
>    "docmeta1_pkey" PRIMARY KEY, btree (tid)
>    "docmeta1_date_idx" btree (publishdate)
>    "docmeta1_docvector_idx" gin (docvector)
>    "docmeta1_title_idx" gin (titlevector)
>
> lawdb=# SELECT relpages, reltuples FROM pg_class WHERE relname
> ='docmeta1';
> relpages | reltuples
> ----------+-----------
>    18951 |    329940


What are the sizes of the associated toast tables for the tsvector columns?
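
Something like this should show it (just a sketch):

SELECT c.relname,
       t.relname AS toast_table,
       pg_size_pretty(pg_relation_size(t.oid)) AS toast_size
FROM pg_class c
JOIN pg_class t ON t.oid = c.reltoastrelid
WHERE c.relname = 'docmeta1';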

>
> lawdb=# explain analyze select * from docmeta1 where docvector @@
> plainto_tsquery('english', 'free');

It would be nice to see the results of explain (analyze, buffers).
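
That is:

explain (analyze, buffers) select * from docmeta1
where docvector @@ plainto_tsquery('english', 'free');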

Cheers,

Jeff

Re: sequential scan unduly favored over text search gin index

From: Jeff Janes
On Mon, Jun 20, 2011 at 10:53 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Sushant Sinha <sushant354@gmail.com> writes:
>>> I guess you could poke the planner towards the bitmap scan by lowering
>>> the random_page_cost (the default value is 4, I'd say lowering it to 2
>>> should do the trick).
>
>> The numbers that I gave were after setting random_page_cost = 1.0. After
>> that I don't know what to do.
>
> I think part of the issue here is that the @@ operator is expensive,
> and so evaluating it once per row is expensive, but the pg_proc.procost
> setting for it doesn't adequately reflect that.  You could experiment
> with tweaking that setting ...

In something I was testing a couple of months ago, by far the biggest
expense of the @@ operator in a full table scan was in crawling
through the entire toast table (and not necessarily in sequential
order) in order to get the tsvector data on which to apply the
operator.  So increasing the cost of @@ might very well be the best
immediate solution, but should the cost estimation code be changed to
explicitly take the page reads associated with toast into account, so
that the cost of @@ itself can remain a CPU-based estimate rather than
an amalgam of CPU and I/O?
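
A quick way to get a feel for how much of that tsvector data lives out of
line (only a sketch; the second column lumps indexes in with toast):

SELECT avg(pg_column_size(docvector)) AS avg_docvector_bytes,
       pg_size_pretty(pg_total_relation_size('docmeta1')
                      - pg_relation_size('docmeta1')) AS toast_plus_index_bytes
FROM docmeta1;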

Cheers,

Jeff