On Wednesday, November 9, 2016 10:40:10 AM CET Francisco Olarte wrote: > Pierre: > > On Wed, Nov 9, 2016 at 10:22 AM, Pierre Ducroquet > > <pierre.ducroquet@people-doc.com> wrote: > > The query does a few joins «after» running a FTS query on a main table. > > The FTS query returns a few thousand rows, but the estimations are wrong, > > leading the optimizer to terrible plans compared to what should happen, > > and > > thus creates a far higher execution time. > > .... > > > but the issue remain the same. The table contains about 295,000 documents, > > and > .... > > > Request | Estimated rows | Real rows > > > > ----------------------------------+----------------+----------- > > 'word1' | 38050 | 37500 > > 'word1 word2' | 4680 | 32000 > > 'word1 word2 word3' | 270 | 12300 > > 'word1 word2 word3 word4' | 10 | 9930 > > 'word1 word2 word3 word4 word5' | 1 | 9930 > > > > You can see that with more words in query, the estimation falls far behind > > reality. > > I'm not really familiar with FTS but, doing a few division of > estimations and rows it seems it estimates as uncorrelated words, and > you real rows clearly indicate some of them are clearly correlated ( > like w1/w2 and w4/s5, and partially w3/w45 ) and very common. > > > Is that a known limitation of the FTS indexing ? Am I missing something > > obvious, or a poor configuration ? > > Someone more familiar with it needed for that, but what I've found > several times is FTS does not mix too well with relational queries at > the optimizer level ( as FTS terms can have very diverse degrees of > correlation, which is very difficult to store in the statistics a > relational optimizer normally uses ).
Indeed the words in the query are correlated, but I do hope that the FTS indexing is able to cope with that. Otherwise it makes it far less usable than what one would expect since real world queries will often contain sentences or related words. Also, PostgreSQL 9.6 introduced phrase search in FTS, and I don't see how that would work without a working multi-words query.
The PostgreSQL statistics are not multidimensional - so bad estimation is expected :(