Tom Lane <tgl@sss.pgh.pa.us> wrote:
> The answer to that clearly is to not index common terms
My understanding is that we don't currently get statistics on how
common the terms in a tsvector column are until we ANALYZE the *index*
created from it. That seems like something of a Catch-22. Also, if we
exclude words that appear in the tsvector from the index on it, we
need to know which words were excluded, both so that we don't search
the index for them and so that we force a recheck of the full tsquery
(unless that always happens already?).
> It may well be that Jesper's identified a place where the GIN code
> could be improved
My naive assumption has been that it would be possible to get an
improvement without touching the index logic, by changing this part of
the query plan:
  Index Cond: (ftsbody_body_fts @@ to_tsquery('TERM1 & TERM2 & TERM3 & TERM4 & TERM5'::text))
to something like this:
  Index Cond: (ftsbody_body_fts @@ to_tsquery('TERM1'::text))
and count on this doing the rest:
  Recheck Cond: (ftsbody_body_fts @@ to_tsquery('TERM1 & TERM2 & TERM3 & TERM4 & TERM5'::text))
I'm wondering if anyone has ever confirmed that probing for the more
frequent terms through the index is *ever* a win, versus using the
index only for the least common of the top-level AND conditions and
doing the rest on recheck. Assuming that it is a win seems like a
dangerous place from which to start.
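To make the comparison concrete, here is a toy sketch in Python. It is
not how GIN works internally; it models the index as a plain dict of
posting sets, and every name in it (build_index, search_all_terms,
search_rarest_then_recheck) is made up for illustration. It only
counts how many posting entries or candidate rows each strategy
touches, and ignores the fact that a heap recheck costs far more than
reading an index entry, which is exactly the trade-off that would need
measuring.

```python
from collections import defaultdict


def build_index(docs):
    """Map each term to the set of document ids containing it.

    A toy stand-in for an inverted index; docs is {doc_id: set_of_terms}.
    """
    index = defaultdict(set)
    for doc_id, terms in docs.items():
        for term in terms:
            index[term].add(doc_id)
    return index


def search_all_terms(index, query_terms):
    """Probe the index for every AND-ed term and intersect the results.

    Returns (matching doc ids, number of posting entries read).
    """
    postings = [index.get(term, set()) for term in query_terms]
    entries_read = sum(len(p) for p in postings)
    matches = set.intersection(*postings) if postings else set()
    return matches, entries_read


def search_rarest_then_recheck(index, docs, query_terms):
    """Probe the index for the rarest term only, then recheck each
    candidate document against the full AND query.

    Returns (matching doc ids, number of candidate rows rechecked).
    """
    rarest = min(query_terms, key=lambda term: len(index.get(term, set())))
    candidates = index.get(rarest, set())
    matches = {
        doc_id
        for doc_id in candidates
        if all(term in docs[doc_id] for term in query_terms)
    }
    return matches, len(candidates)


if __name__ == "__main__":
    # Hypothetical corpus: "common" is in every row, "rare" in one.
    docs = {
        1: {"common", "mid", "rare"},
        2: {"common", "mid"},
        3: {"common"},
        4: {"common", "mid"},
    }
    index = build_index(docs)
    query = ["common", "mid", "rare"]
    print(search_all_terms(index, query))
    print(search_rarest_then_recheck(index, docs, query))
```

Both functions return the same matches; what differs is the work
counter, and which one is cheaper in a real system depends on how
expensive a recheck is relative to scanning a long posting list.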
> But the particular example shown here doesn't make a very good case
> for that, because it's hard to tell how much of a penalty would be
> taken in more realistic examples.
Fair enough. We're in the early stages of moving to tsearch2 and I
haven't run across this yet in practice. If I do, I'll follow up.
-Kevin