Thread: FTS query, statistics and planner estimations…

FTS query, statistics and planner estimations…

From
Pierre Ducroquet
Date:
Hello

I recently stumbled on a slow query in my database that showed an odd
behaviour related to the statistics of FTS queries.
The query does a few joins «after» running a FTS query on a main table.
The FTS query returns a few thousand rows, but the estimations are wrong,
leading the optimizer to terrible plans compared to what should happen, and
thus creates a far higher execution time.
I managed to isolate the odd behaviour in a single query, and I would like
your opinion about it.

I have modified the table name, columns and query to hide sensitive values,
but the issue remain the same. The table contains about 295,000 documents, and
all is running under PostgreSQL 9.5.

EXPLAIN ANALYZE
SELECT COUNT(*)
FROM documents
WHERE
    to_tsvector('french', subject || ' ' || body) @@ plainto_tsquery('XXX');

Of course, there is an index on to_tsvector('french', subject || ' ' || body).

That query gives me the following results for several values of XXX :

 Request                          | Estimated rows | Real rows
----------------------------------+----------------+-----------
'word1'                           | 38050          | 37500
'word1 word2'                     | 4680           | 32000
'word1 word2 word3'               | 270            | 12300
'word1 word2 word3 word4'         | 10             | 9930
'word1 word2 word3 word4 word5'   | 1              | 9930

You can see that with more words in query, the estimation falls far behind
reality.

Is that a known limitation of the FTS indexing ? Am I missing something
obvious, or a poor configuration ?

Thanks a lot

Attachment

Re: [GENERAL] FTS query, statistics and planner estimations…

From
Francisco Olarte
Date:
Pierre:

On Wed, Nov 9, 2016 at 10:22 AM, Pierre Ducroquet
<pierre.ducroquet@people-doc.com> wrote:
> The query does a few joins «after» running a FTS query on a main table.
> The FTS query returns a few thousand rows, but the estimations are wrong,
> leading the optimizer to terrible plans compared to what should happen, and
> thus creates a far higher execution time.
....
> but the issue remain the same. The table contains about 295,000 documents, and
....
>  Request                          | Estimated rows | Real rows
> ----------------------------------+----------------+-----------
> 'word1'                           | 38050          | 37500
> 'word1 word2'                     | 4680           | 32000
> 'word1 word2 word3'               | 270            | 12300
> 'word1 word2 word3 word4'         | 10             | 9930
> 'word1 word2 word3 word4 word5'   | 1              | 9930
>
> You can see that with more words in query, the estimation falls far behind
> reality.

I'm not really familiar with FTS but, doing a few division of
estimations and rows it seems it estimates as uncorrelated words, and
you real rows clearly indicate some of them are clearly correlated (
like w1/w2 and w4/s5, and partially w3/w45 ) and very common.

> Is that a known limitation of the FTS indexing ? Am I missing something
> obvious, or a poor configuration ?

Someone more familiar with it needed for that, but what I've found
several times is FTS does not mix too well with relational queries at
the optimizer level ( as FTS terms can have very diverse degrees of
correlation, which is very difficult to store in the statistics a
relational optimizer normally uses ).

Francisco Olarte.


Re: [GENERAL] Re: [GENERAL] FTS query, statistics and planner estimations…

From
Pavel Stehule
Date:


2016-11-09 10:40 GMT+01:00 Francisco Olarte <folarte@peoplecall.com>:
Pierre:

On Wed, Nov 9, 2016 at 10:22 AM, Pierre Ducroquet
<pierre.ducroquet@people-doc.com> wrote:
> The query does a few joins «after» running a FTS query on a main table.
> The FTS query returns a few thousand rows, but the estimations are wrong,
> leading the optimizer to terrible plans compared to what should happen, and
> thus creates a far higher execution time.
....
> but the issue remain the same. The table contains about 295,000 documents, and
....
>  Request                          | Estimated rows | Real rows
> ----------------------------------+----------------+-----------
> 'word1'                           | 38050          | 37500
> 'word1 word2'                     | 4680           | 32000
> 'word1 word2 word3'               | 270            | 12300
> 'word1 word2 word3 word4'         | 10             | 9930
> 'word1 word2 word3 word4 word5'   | 1              | 9930
>
> You can see that with more words in query, the estimation falls far behind
> reality.

I'm not really familiar with FTS but, doing a few division of
estimations and rows it seems it estimates as uncorrelated words, and
you real rows clearly indicate some of them are clearly correlated (
like w1/w2 and w4/s5, and partially w3/w45 ) and very common.

> Is that a known limitation of the FTS indexing ? Am I missing something
> obvious, or a poor configuration ?

Someone more familiar with it needed for that, but what I've found
several times is FTS does not mix too well with relational queries at
the optimizer level ( as FTS terms can have very diverse degrees of
correlation, which is very difficult to store in the statistics a
relational optimizer normally uses ).

there is workaround - the FTS query can be wrapped to immutable function - then can be executed in planner time, and the estimations can be better

 http://postgres.cz/wiki/PostgreSQL_SQL_Tricks_II#Using_IMMUTABLE_functions_as_hints_for_the_optimizer

Regards

Pavel


Francisco Olarte.


--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Re: FTS query, statistics and planner estimations…

From
Artur Zakirov
Date:
Hello,

On 09.11.2016 12:22, Pierre Ducroquet wrote:
> Hello
>
> I recently stumbled on a slow query in my database that showed an odd
> behaviour related to the statistics of FTS queries.
> The query does a few joins «after» running a FTS query on a main table.
> The FTS query returns a few thousand rows, but the estimations are wrong,
> leading the optimizer to terrible plans compared to what should happen, and
> thus creates a far higher execution time.
> I managed to isolate the odd behaviour in a single query, and I would like
> your opinion about it.
>
> I have modified the table name, columns and query to hide sensitive values,
> but the issue remain the same. The table contains about 295,000 documents, and
> all is running under PostgreSQL 9.5.
>
> EXPLAIN ANALYZE
> SELECT COUNT(*)
> FROM documents
> WHERE
>     to_tsvector('french', subject || ' ' || body) @@ plainto_tsquery('XXX');
>
> Of course, there is an index on to_tsvector('french', subject || ' ' || body).

Did you try in the query to specify FTS configuration:

WHERE
     to_tsvector('french', subject || ' ' || body) @@
plainto_tsquery('french', 'XXX');

plainto_tsquery() with specified configuration should be faster and
plainto_tsquery(regconfig,test) is immutable function.

--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company


Re: FTS query, statistics and planner estimations…

From
Pierre Ducroquet
Date:
On Wednesday, November 9, 2016 1:01:29 PM CET you wrote:
> Hello,
>
> On 09.11.2016 12:22, Pierre Ducroquet wrote:
> > Hello
> >
> > I recently stumbled on a slow query in my database that showed an odd
> > behaviour related to the statistics of FTS queries.
> > The query does a few joins «after» running a FTS query on a main table.
> > The FTS query returns a few thousand rows, but the estimations are wrong,
> > leading the optimizer to terrible plans compared to what should happen,
> > and
> > thus creates a far higher execution time.
> > I managed to isolate the odd behaviour in a single query, and I would like
> > your opinion about it.
> >
> > I have modified the table name, columns and query to hide sensitive
> > values,
> > but the issue remain the same. The table contains about 295,000 documents,
> > and all is running under PostgreSQL 9.5.
> >
> > EXPLAIN ANALYZE
> > SELECT COUNT(*)
> > FROM documents
> > WHERE
> >
> >     to_tsvector('french', subject || ' ' || body) @@
> >     plainto_tsquery('XXX');
> >
> > Of course, there is an index on to_tsvector('french', subject || ' ' ||
> > body).
> Did you try in the query to specify FTS configuration:
>
> WHERE
>      to_tsvector('french', subject || ' ' || body) @@
> plainto_tsquery('french', 'XXX');
>
> plainto_tsquery() with specified configuration should be faster and
> plainto_tsquery(regconfig,test) is immutable function.

Ho my bad, yes indeed I used plainto_tsquery('french', 'XXX') that is
immutable, sorry for the confusion.

Attachment

Re: Re: [GENERAL] FTS query,statistics and planner estimations…

From
Pierre Ducroquet
Date:
On Wednesday, November 9, 2016 10:51:11 AM CET Pavel Stehule wrote:
> 2016-11-09 10:40 GMT+01:00 Francisco Olarte <folarte@peoplecall.com>:
> > Pierre:
> >
> > On Wed, Nov 9, 2016 at 10:22 AM, Pierre Ducroquet
> >
> > <pierre.ducroquet@people-doc.com> wrote:
> > > The query does a few joins «after» running a FTS query on a main table.
> > > The FTS query returns a few thousand rows, but the estimations are
> > > wrong,
> > > leading the optimizer to terrible plans compared to what should happen,
> >
> > and
> >
> > > thus creates a far higher execution time.
> >
> > ....
> >
> > > but the issue remain the same. The table contains about 295,000
> >
> > documents, and
> > ....
> >
> > >  Request                          | Estimated rows | Real rows
> > >
> > > ----------------------------------+----------------+-----------
> > > 'word1'                           | 38050          | 37500
> > > 'word1 word2'                     | 4680           | 32000
> > > 'word1 word2 word3'               | 270            | 12300
> > > 'word1 word2 word3 word4'         | 10             | 9930
> > > 'word1 word2 word3 word4 word5'   | 1              | 9930
> > >
> > > You can see that with more words in query, the estimation falls far
> >
> > behind
> >
> > > reality.
> >
> > I'm not really familiar with FTS but, doing a few division of
> > estimations and rows it seems it estimates as uncorrelated words, and
> > you real rows clearly indicate some of them are clearly correlated (
> > like w1/w2 and w4/s5, and partially w3/w45 ) and very common.
> >
> > > Is that a known limitation of the FTS indexing ? Am I missing something
> > > obvious, or a poor configuration ?
> >
> > Someone more familiar with it needed for that, but what I've found
> > several times is FTS does not mix too well with relational queries at
> > the optimizer level ( as FTS terms can have very diverse degrees of
> > correlation, which is very difficult to store in the statistics a
> > relational optimizer normally uses ).
>
> there is workaround - the FTS query can be wrapped to immutable function -
> then can be executed in planner time, and the estimations can be better
>
>
> http://postgres.cz/wiki/PostgreSQL_SQL_Tricks_II#Using_IMMUTABLE_functions_a
> s_hints_for_the_optimizer

My bad, I used the plainto_tsquery(configuration, query) function that is
immutable, so the planner knows what is being searched.

Attachment

Re: FTS query, statistics and planner estimations…

From
Pierre Ducroquet
Date:
On Wednesday, November 9, 2016 10:40:10 AM CET Francisco Olarte wrote:
> Pierre:
>
> On Wed, Nov 9, 2016 at 10:22 AM, Pierre Ducroquet
>
> <pierre.ducroquet@people-doc.com> wrote:
> > The query does a few joins «after» running a FTS query on a main table.
> > The FTS query returns a few thousand rows, but the estimations are wrong,
> > leading the optimizer to terrible plans compared to what should happen,
> > and
> > thus creates a far higher execution time.
>
> ....
>
> > but the issue remain the same. The table contains about 295,000 documents,
> > and
> ....
>
> >  Request                          | Estimated rows | Real rows
> >
> > ----------------------------------+----------------+-----------
> > 'word1'                           | 38050          | 37500
> > 'word1 word2'                     | 4680           | 32000
> > 'word1 word2 word3'               | 270            | 12300
> > 'word1 word2 word3 word4'         | 10             | 9930
> > 'word1 word2 word3 word4 word5'   | 1              | 9930
> >
> > You can see that with more words in query, the estimation falls far behind
> > reality.
>
> I'm not really familiar with FTS but, doing a few division of
> estimations and rows it seems it estimates as uncorrelated words, and
> you real rows clearly indicate some of them are clearly correlated (
> like w1/w2 and w4/s5, and partially w3/w45 ) and very common.
>
> > Is that a known limitation of the FTS indexing ? Am I missing something
> > obvious, or a poor configuration ?
>
> Someone more familiar with it needed for that, but what I've found
> several times is FTS does not mix too well with relational queries at
> the optimizer level ( as FTS terms can have very diverse degrees of
> correlation, which is very difficult to store in the statistics a
> relational optimizer normally uses ).

Indeed the words in the query are correlated, but I do hope that the FTS
indexing is able to cope with that. Otherwise it makes it far less usable than
what one would expect since real world queries will often contain sentences or
related words. Also, PostgreSQL 9.6 introduced phrase search in FTS, and I
don't see how that would work without a working multi-words query.

Attachment

Re: [GENERAL] FTS query, statistics and planner estimations…

From
Pavel Stehule
Date:


2016-11-09 11:19 GMT+01:00 Pierre Ducroquet <pierre.ducroquet@people-doc.com>:
On Wednesday, November 9, 2016 10:40:10 AM CET Francisco Olarte wrote:
> Pierre:
>
> On Wed, Nov 9, 2016 at 10:22 AM, Pierre Ducroquet
>
> <pierre.ducroquet@people-doc.com> wrote:
> > The query does a few joins «after» running a FTS query on a main table.
> > The FTS query returns a few thousand rows, but the estimations are wrong,
> > leading the optimizer to terrible plans compared to what should happen,
> > and
> > thus creates a far higher execution time.
>
> ....
>
> > but the issue remain the same. The table contains about 295,000 documents,
> > and
> ....
>
> >  Request                          | Estimated rows | Real rows
> >
> > ----------------------------------+----------------+-----------
> > 'word1'                           | 38050          | 37500
> > 'word1 word2'                     | 4680           | 32000
> > 'word1 word2 word3'               | 270            | 12300
> > 'word1 word2 word3 word4'         | 10             | 9930
> > 'word1 word2 word3 word4 word5'   | 1              | 9930
> >
> > You can see that with more words in query, the estimation falls far behind
> > reality.
>
> I'm not really familiar with FTS but, doing a few division of
> estimations and rows it seems it estimates as uncorrelated words, and
> you real rows clearly indicate some of them are clearly correlated (
> like w1/w2 and w4/s5, and partially w3/w45 ) and very common.
>
> > Is that a known limitation of the FTS indexing ? Am I missing something
> > obvious, or a poor configuration ?
>
> Someone more familiar with it needed for that, but what I've found
> several times is FTS does not mix too well with relational queries at
> the optimizer level ( as FTS terms can have very diverse degrees of
> correlation, which is very difficult to store in the statistics a
> relational optimizer normally uses ).

Indeed the words in the query are correlated, but I do hope that the FTS
indexing is able to cope with that. Otherwise it makes it far less usable than
what one would expect since real world queries will often contain sentences or
related words. Also, PostgreSQL 9.6 introduced phrase search in FTS, and I
don't see how that would work without a working multi-words query.

The PostgreSQL statistics are not multidimensional - so bad estimation is expected :(

Regards

Pavel

Re: [GENERAL] FTS query, statistics and planner estimations…

From
Francisco Olarte
Date:
On Wed, Nov 9, 2016 at 11:19 AM, Pierre Ducroquet
<pierre.ducroquet@people-doc.com> wrote:
> Indeed the words in the query are correlated, but I do hope that the FTS
> indexing is able to cope with that.

If the query returns correct results in reasonable time it can. OTOH
the planner, and the statistics system, is another beast. Correlation
info in FTS is HUGE, and the planner is supposed to work with a
smallish summary of the index.

> Otherwise it makes it far less usable than
> what one would expect since real world queries will often contain sentences or
> related words.

Well, I concur it would be great to have it, but having written FTS
engines I suspect it would be difficult to have it AND maintain it. I
have built an FTS system, and I built an index as a compressed list of
(stemed-word, document, position), and then compressed it. The
information for word-word correlation would be huge, as its
cardinality could grow with n^2. Especially if you have to keep it in
an updatabale format. And it would not help you for the three, four,
etc.. cases. And even then, the optimizer may be spending a lot of
time reading and processing it, as it would not fit easily in the
cache.

> Also, PostgreSQL 9.6 introduced phrase search in FTS, and I
> don't see how that would work without a working multi-words query.

Queries work, is just they are not as fast as you want/expect them to
be. Phrase search is normally done by locating documents with all the
words and then filtering, just with the index if it includes word
position or by reading the docs. In general, in FTS, you need to use
selective terms for fast queries.

Francisco Olarte.