Thread: Full text search - query plan? PG 8.4.1

From: Jesper Krogh
Hi.

I'm currently testing out PostgreSQL's full text search capabilities.
We're currently using Xapian; it has some nice features and some
drawbacks (sorting), so that is especially the area I'm investigating.

I've loaded the database with 50K documents, and the table definition
is:

ftstest=# \d textbody
                                Table "public.textbody"
      Column       |   Type   |                       Modifiers
-------------------+----------+--------------------------------------------------------
 id                | integer  | not null default nextval('textbody_id_seq'::regclass)
 body              | text     | not null default ''::text
 textbody_body_fts | tsvector |
 accession_number  | text     | not null default ''::text
Indexes:
    "accno_unique_idx" UNIQUE, btree (accession_number)
    "textbody_tfs_idx" gin (textbody_body_fts)
Triggers:
    tsvectorupdate BEFORE INSERT OR UPDATE ON textbody FOR EACH ROW
    EXECUTE PROCEDURE tsvector_update_trigger('textbody_body_fts',
    'pg_catalog.english', 'body')

"commonterm" matches 37K of the 50K documents (majority), but the query
plan is "odd" in my eyes.

* Why does it mis-guess the cost of a Seq Scan on textbody so much?
* Why doesn't it use the index on "id" to fetch the 10 records?

ftstest=# ANALYZE textbody;
ANALYZE
ftstest=# explain analyze select body from textbody where
textbody_body_fts @@ to_tsquery('commonterm') order by id limit 10 offset 0;
                                                           QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=2841.08..2841.11 rows=10 width=5) (actual time=48031.563..48031.568 rows=10 loops=1)
   ->  Sort  (cost=2841.08..2933.01 rows=36771 width=5) (actual time=48031.561..48031.564 rows=10 loops=1)
         Sort Key: id
         Sort Method:  top-N heapsort  Memory: 31kB
         ->  Seq Scan on textbody  (cost=0.00..2046.47 rows=36771 width=5) (actual time=100.107..47966.590 rows=37133 loops=1)
               Filter: (textbody_body_fts @@ to_tsquery('commonterm'::text))
 Total runtime: 48031.612 ms
(7 rows)

The following query plan doesn't answer the questions above, but it
does indeed speed things up significantly (by switching to a Bitmap
Index Scan instead of a Seq Scan):

ftstest=# set enable_seqscan=off;
SET
ftstest=# explain analyze select body from textbody where
textbody_body_fts @@ to_tsquery('commonterm') order by id limit 10 offset 0;
                                                           QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=269942.41..269942.43 rows=10 width=5) (actual time=47.567..47.572 rows=10 loops=1)
   ->  Sort  (cost=269942.41..270034.34 rows=36771 width=5) (actual time=47.565..47.567 rows=10 loops=1)
         Sort Key: id
         Sort Method:  top-N heapsort  Memory: 31kB
         ->  Bitmap Heap Scan on textbody  (cost=267377.23..269147.80 rows=36771 width=5) (actual time=15.763..30.576 rows=37133 loops=1)
               Recheck Cond: (textbody_body_fts @@ to_tsquery('commonterm'::text))
               ->  Bitmap Index Scan on textbody_tfs_idx  (cost=0.00..267368.04 rows=36771 width=0) (actual time=15.419..15.419 rows=37134 loops=1)
                     Index Cond: (textbody_body_fts @@ to_tsquery('commonterm'::text))
 Total runtime: 47.634 ms
(9 rows)

To me it seems like the query planner could do a better job here.

On "rare" terms everything seems to work excellently.

N.B.: looks a lot like this:
http://archives.postgresql.org/pgsql-performance/2009-07/msg00190.php

--
Jesper

Re: Full text search - query plan? PG 8.4.1

From: Tom Lane
Jesper Krogh <jesper@krogh.cc> writes:
> "commonterm" matches 37K of the 50K documents (majority), but the query
> plan is "odd" in my eyes.

> * Why does it mis-guess the cost of a Seq Scan on textbody so much?

The cost looks about right to me.  The cost units are not milliseconds.

> * Why doesn't it use the index on "id" to fetch the 10 records?

You haven't *got* an index on id, according to the \d output.

The only part of your results that looks odd to me is the very high cost
estimate for the bitmapscan:

>          ->  Bitmap Heap Scan on textbody  (cost=267377.23..269147.80 rows=36771 width=5) (actual time=15.763..30.576 rows=37133 loops=1)
>                Recheck Cond: (textbody_body_fts @@ to_tsquery('commonterm'::text))
>                ->  Bitmap Index Scan on textbody_tfs_idx  (cost=0.00..267368.04 rows=36771 width=0) (actual time=15.419..15.419 rows=37134 loops=1)
>                      Index Cond: (textbody_body_fts @@ to_tsquery('commonterm'::text))

When I try this with a 64K-row table having 'commonterm' in half of the
rows, what I get is estimates of 1530 cost units for the seqscan and
1405 for the bitmapscan (so it prefers the latter).  It will switch over
to using an index on id if I add one, but that's not the point at the
moment.  There's something strange about your tsvector index.  Maybe
it's really huge because the documents are huge?

            regards, tom lane

Re: Full text search - query plan? PG 8.4.1

From: Jesper Krogh
Tom Lane wrote:
> Jesper Krogh <jesper@krogh.cc> writes:
>> "commonterm" matches 37K of the 50K documents (majority), but the query
>> plan is "odd" in my eyes.
>
>> * Why does it mis-guess the cost of a Seq Scan on textbody so much?
>
> The cost looks about right to me.  The cost units are not milliseconds.
>
>> * Why doesn't it use the index on "id" to fetch the 10 records?
>
> You haven't *got* an index on id, according to the \d output.

Thanks (/me bangs my head against the table). I somehow assumed that "id
SERIAL" automatically created an index for me; I was sure enough that I
didn't even look to confirm it.
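
For the record, SERIAL only creates the sequence and the column default,
not an index; it has to be added explicitly, along these lines (the
index name here is my own choice):

ftstest=# create index textbody_id_idx on textbody (id);
CREATE INDEX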

> The only part of your results that looks odd to me is the very high cost
> estimate for the bitmapscan:
>
>>          ->  Bitmap Heap Scan on textbody  (cost=267377.23..269147.80 rows=36771 width=5) (actual time=15.763..30.576 rows=37133 loops=1)
>>                Recheck Cond: (textbody_body_fts @@ to_tsquery('commonterm'::text))
>>                ->  Bitmap Index Scan on textbody_tfs_idx  (cost=0.00..267368.04 rows=36771 width=0) (actual time=15.419..15.419 rows=37134 loops=1)
>>                      Index Cond: (textbody_body_fts @@ to_tsquery('commonterm'::text))
>
> When I try this with a 64K-row table having 'commonterm' in half of the
> rows, what I get is estimates of 1530 cost units for the seqscan and
> 1405 for the bitmapscan (so it prefers the latter).  It will switch over
> to using an index on id if I add one, but that's not the point at the
> moment.  There's something strange about your tsvector index.  Maybe
> it's really huge because the documents are huge?

Huge is a relative term, but length(to_tsvector(body)) is about 200
lexemes for each document. Is that huge? I can postprocess the documents
a bit to get it down, and will eventually do that before going to
"production".
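
That average can be checked with something along these lines (a sketch
using the stored tsvector column; length() on a tsvector counts its
distinct lexemes):

ftstest=# select avg(length(textbody_body_fts)) from textbody;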

Thanks a lot.

--
Jesper

Re: Full text search - query plan? PG 8.4.1

From: Tom Lane
Jesper Krogh <jesper@krogh.cc> writes:
> Tom Lane wrote:
>> ... There's something strange about your tsvector index.  Maybe
>> it's really huge because the documents are huge?

> Huge is a relative term, but length(to_tsvector(body)) is about 200
> lexemes for each document. Is that huge?

It's bigger than the toy example I was trying, but not *that* much
bigger.  I think maybe your index is bloated.  Try dropping and
recreating it and see if the estimates change any.

            regards, tom lane
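
For the archives: the drop-and-recreate can also be done in one step
with REINDEX, though it takes locks that block use of the index (and
writes to the table) while it runs:

ftstest=# REINDEX INDEX textbody_tfs_idx;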

Re: Full text search - query plan? PG 8.4.1

From: Jesper Krogh
Tom Lane wrote:
> Jesper Krogh <jesper@krogh.cc> writes:
>> Tom Lane wrote:
>>> ... There's something strange about your tsvector index.  Maybe
>>> it's really huge because the documents are huge?
>
>> Huge is a relative term, but length(to_tsvector(body)) is about 200
>> lexemes for each document. Is that huge?
>
> It's bigger than the toy example I was trying, but not *that* much
> bigger.  I think maybe your index is bloated.  Try dropping and
> recreating it and see if the estimates change any.

I'm a bit reluctant to drop it and re-create it. It'll take a couple of
days to regenerate, so this should hopefully not be a common situation
for the system.

I have set the statistics target to 1000 for the tsvector; the
documentation didn't mention any serious downsides of doing that, and
since then I haven't seen row estimates that are orders of magnitude off.
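
For reference, that setting is applied roughly like this (a sketch
against the ftstest table used below):

ftstest=# alter table ftstest alter column ftstest_body_fts set statistics 1000;
ftstest=# analyze ftstest;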

It was built from scratch using inserts, all the way up to around 10M
rows now; should that result in index bloat? Can I inspect the amount of
bloat without rebuilding (or a similarly locking operation)?
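
A lock-free starting point is the raw on-disk size of the index; it
won't show the bloat fraction directly, but a wildly oversized index is
a strong hint, e.g.:

ftstest=# select pg_size_pretty(pg_relation_size('ftstest_tfs_idx'));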

The query still has a "wrong" tipping point between the two query-plans:

ftstest=# explain analyze select body from ftstest where
ftstest_body_fts @@ to_tsquery('testterm') order by id limit 100;
                                                           QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..7357.77 rows=100 width=738) (actual time=3978.974..8595.086 rows=100 loops=1)
   ->  Index Scan using ftstest_id_pri_idx on ftstest  (cost=0.00..1436458.05 rows=19523 width=738) (actual time=3978.971..8594.932 rows=100 loops=1)
         Filter: (ftstest_body_fts @@ to_tsquery('testterm'::text))
 Total runtime: 8595.222 ms
(4 rows)

ftstest=# set enable_indexscan=off;
SET
ftstest=# explain analyze select body from ftstest where
ftstest_body_fts @@ to_tsquery('testterm') order by id limit 100;
                                                           QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=59959.61..59959.86 rows=100 width=738) (actual time=338.832..339.055 rows=100 loops=1)
   ->  Sort  (cost=59959.61..60008.41 rows=19523 width=738) (actual time=338.828..338.908 rows=100 loops=1)
         Sort Key: id
         Sort Method:  top-N heapsort  Memory: 32kB
         ->  Bitmap Heap Scan on ftstest  (cost=22891.18..59213.45 rows=19523 width=738) (actual time=5.097..316.780 rows=19444 loops=1)
               Recheck Cond: (ftstest_body_fts @@ to_tsquery('testterm'::text))
               ->  Bitmap Index Scan on ftstest_tfs_idx  (cost=0.00..22886.30 rows=19523 width=0) (actual time=4.259..4.259 rows=20004 loops=1)
                     Index Cond: (ftstest_body_fts @@ to_tsquery('testterm'::text))
 Total runtime: 339.201 ms
(9 rows)

So for getting 100 rows where the term exists in 19,444 of 10,000,000
documents, it chooses the index scan, where (given a random distribution
of the documents) it should expect to scan 100 * (10,000,000 / 19,444)
≈ 51,429 documents.  So it somehow believes that the cost of the bitmap
index scan is higher than it actually is, or that the cost of the index
scan is lower than it actually is.

Is it possible to manually set the cost for the @@ operator? It seems
natural that matching a tsquery against a tsvector, which is a much
heavier operation than = (and the tsvector is even kept in EXTENDED
storage), should cost much more than comparing integers in plain storage.

I tried searching the docs for operator costs, but I only found the
overall base values in the configuration file.

--
Jesper

Re: Full text search - query plan? PG 8.4.1

From: Tom Lane
Jesper Krogh <jesper@krogh.cc> writes:
> Is it possible to manually set the cost for the @@ operator?

You want to set the cost for the underlying function.

            regards, tom lane
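
For the archives: the function behind an operator can be looked up in
pg_operator; a sketch for this operand pair:

ftstest=# select oprcode from pg_operator
          where oprname = '@@'
            and oprleft = 'tsvector'::regtype
            and oprright = 'tsquery'::regtype;

which returns ts_match_vq, the function used in the follow-up below.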

Re: Full text search - query plan? PG 8.4.1

From: Scott Marlowe
On Fri, Oct 23, 2009 at 2:32 PM, Jesper Krogh <jesper@krogh.cc> wrote:
> Tom Lane wrote:
>> Jesper Krogh <jesper@krogh.cc> writes:
>>> Tom Lane wrote:
>>>> ... There's something strange about your tsvector index.  Maybe
>>>> it's really huge because the documents are huge?
>>
>>> Huge is a relative term, but length(to_tsvector(body)) is about 200
>>> lexemes for each document. Is that huge?
>>
>> It's bigger than the toy example I was trying, but not *that* much
>> bigger.  I think maybe your index is bloated.  Try dropping and
>> recreating it and see if the estimates change any.
>
> I'm a bit reluctant to drop it and re-create it. It'll take a couple of
> days to regenerate, so this should hopefully not be a common situation
> for the system.

Note that if it is bloated, you can build the replacement index with
CREATE INDEX CONCURRENTLY, then drop the old one when the new one
finishes.  So, no time is spent without an index.
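
A sketch of that recipe (the temporary index name is arbitrary):

CREATE INDEX CONCURRENTLY ftstest_tfs_idx2 ON ftstest USING gin (ftstest_body_fts);
DROP INDEX ftstest_tfs_idx;
ALTER INDEX ftstest_tfs_idx2 RENAME TO ftstest_tfs_idx;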

> I have set the statistics target to 1000 for the tsvector; the
> documentation didn't mention any serious downsides of doing that, and
> since then I haven't seen row estimates that are orders of magnitude off.

It increases planning time mostly.  Also increases analyze times but
not that much.

> It was built from scratch using inserts, all the way up to around 10M
> rows now; should that result in index bloat? Can I inspect the amount of
> bloat without rebuilding (or a similarly locking operation)?

Depends on how many lost inserts there were.  If 95% of all your
inserts failed then yeah, it would be bloated.

Re: Full text search - query plan? PG 8.4.1

From: Jesper Krogh
Scott Marlowe wrote:
> On Fri, Oct 23, 2009 at 2:32 PM, Jesper Krogh <jesper@krogh.cc> wrote:
>> Tom Lane wrote:
>>> Jesper Krogh <jesper@krogh.cc> writes:
>>>> Tom Lane wrote:
>>>>> ... There's something strange about your tsvector index.  Maybe
>>>>> it's really huge because the documents are huge?
>>>> Huge is a relative term, but length(to_tsvector(body)) is about 200
>>>> lexemes for each document. Is that huge?
>>> It's bigger than the toy example I was trying, but not *that* much
>>> bigger.  I think maybe your index is bloated.  Try dropping and
>>> recreating it and see if the estimates change any.
>> I'm a bit reluctant to drop it and re-create it. It'll take a couple of
>> days to regenerate, so this should hopefully not be a common situation
>> for the system.
>
> Note that if it is bloated, you can build the replacement index with
> CREATE INDEX CONCURRENTLY, then drop the old one when the new one
> finishes.  So, no time is spent without an index.

Nice tip, thanks.

>> It was built from scratch using inserts, all the way up to around 10M
>> rows now; should that result in index bloat? Can I inspect the amount of
>> bloat without rebuilding (or a similarly locking operation)?
>
> Depends on how many lost inserts there were.  If 95% of all your
> inserts failed then yeah, it would be bloated.

Fewer than 10,000, I'd bet; the import script more or less ran by
itself, and the only failures were when I manually stopped it to add
some more code.

--
Jesper

Re: Full text search - query plan? PG 8.4.1

From: Jesper Krogh
Tom Lane wrote:
> Jesper Krogh <jesper@krogh.cc> writes:
> Is it possible to manually set the cost for the @@ operator?
>
> You want to set the cost for the underlying function.

alter function ts_match_vq(tsvector, tsquery) cost 500;

seems to change my test queries in a very positive way (e.g. most
resolve to a bitmap index scan, but fall back to index scans on
alternative columns when the query terms are common enough).

According to the documentation the default cost is 1 for built-in
functions and 100 for others; is this true for the ts-stuff also?
Can I query the database for the cost of the functions?

It somehow seems natural that comparing a 1-3 term tsquery to a 200+
term tsvector is orders of magnitude more expensive than simple operations.

I somehow suspect that this is a bad thing to do if I have other gin
indexes in the same database where the tsvector is much smaller. But
then I can just use ts_match_qv for those queries, or add my own
function which just raises the cost.
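
A sketch of that wrapper idea (hypothetical name; note that a plain SQL
wrapper is not attached to the GIN operator class, so a query calling it
directly could not use the index — it only illustrates the cost knob):

create function my_ts_match(tsvector, tsquery) returns boolean
    as 'select $1 @@ $2' language sql immutable cost 500;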

Thanks.

--
Jesper

Re: Full text search - query plan? PG 8.4.1

From: Tom Lane
Jesper Krogh <jesper@krogh.cc> writes:
> According to the documentation the default cost is 1 for built-in
> functions and 100 for others; is this true for the ts-stuff also?

Yeah.  There was some recent discussion of pushing up the default cost
for some of these obviously-not-so-cheap functions, but nothing's been
done yet.

> Can I query the database for the cost of the functions?

See pg_proc.

            regards, tom lane
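
For the archives: a sketch of reading those costs out of pg_proc
(procost holds the planner's per-call cost estimate):

ftstest=# select proname, procost from pg_proc
          where proname in ('ts_match_vq', 'ts_match_qv');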