Re: Queryplan within FTS/GIN index -search. - Mailing list pgsql-performance

From Kevin Grittner
Subject Re: Queryplan within FTS/GIN index -search.
Msg-id 4AEFF056020000250002C195@gw.wicourts.gov
In response to Re: Queryplan within FTS/GIN index -search.  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-performance
Tom Lane <tgl@sss.pgh.pa.us> wrote:
> "Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
>> Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> Any sane text search application is going to try to filter out
>>> common words as stopwords; it's only the failure to do that that's
>>> making this run slow.
>
>> I'd rather have the index used for the selective test, and apply
>> the remaining tests to the rows retrieved from the heap.
>
> Uh, that was exactly my point.  Indexing common words is a waste.

Perhaps I'm missing something.  My point was that there are words
which are too common to be useful for index searches, yet uncommon
enough to usefully limit the results.  These words could typically
benefit from tsearch2-style parsing and dictionaries; so declaring
them as stop words would be bad from a functional perspective, yet
searching an index for them would be bad from a performance
perspective.

One solution would be for the users to rigorously identify all of
these words, include them on one stop word list but not another,
include *two* tsvector columns in the table (with and without the
"iffy" words), index only the one with the larger stop word list, and
generate two tsquery values to search the two different columns.  Best
of both worlds.  Sort of.  The staff time to create and maintain such
a list would obviously be costly and writing the queries would be
error-prone.
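
To make the two-column idea concrete, here is a toy Python model of it.
This is not PostgreSQL code: the stop word lists, the "iffy" words, and
the sample documents are all made up for illustration, and plain sets
and dicts stand in for tsvector columns and a GIN index.

```python
# Toy model of the two-tsvector approach; all names and data are hypothetical.
STOP_STRICT = {"the", "a", "of"}              # true stop words
STOP_WIDE = STOP_STRICT | {"court", "case"}   # plus assumed "iffy" words

def tokens(text, stop):
    """Lowercase, split on whitespace, and drop the given stop words."""
    return {w for w in text.lower().split() if w not in stop}

docs = {
    1: "the court ruled on the patent case",
    2: "a patent of interest",
    3: "the court case of record",
}

# "Column" 1: full vector (small stop list), stored but not indexed.
full_vec = {doc_id: tokens(t, STOP_STRICT) for doc_id, t in docs.items()}

# "Column" 2: reduced vector (large stop list), the only one indexed.
index = {}
for doc_id, t in docs.items():
    for w in tokens(t, STOP_WIDE):
        index.setdefault(w, set()).add(doc_id)

def search(query):
    """AND-search: probe the index with the reduced query, then check the
    full query against the unindexed full vector of each candidate."""
    q_full = tokens(query, STOP_STRICT)
    q_index = tokens(query, STOP_WIDE)
    if q_index:
        candidates = set.intersection(*(index.get(w, set()) for w in q_index))
    else:
        candidates = set(docs)  # nothing selective left: scan everything
    return sorted(d for d in candidates if q_full <= full_vec[d])
```

Here search("patent court") probes the index only for "patent" and
applies "court" as a filter on the two candidate rows. The sketch also
shows the maintenance burden: both stop word lists and both derived
queries have to stay in sync by hand.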

Second best would be to somehow recognize the "iffy" words and exclude
them from the index and the index search phase, but apply the check
when the row is retrieved from the heap.  I really have a hard time
seeing how the conditional exclusion from the index could be
accomplished, though.  Next best would be to let them fall into the
index, but exclude top level ANDed values from the index search,
applying them only to the recheck when the row is read from the heap.
That seems, at least conceptually, like it could be done.
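
A rough sketch of that last option, again as a toy Python model rather
than anything resembling the actual GIN code: every word is indexed,
but top-level ANDed terms whose posting lists are too long are skipped
during the index probe and checked only against the candidate rows.
The cutoff MAX_FRACTION, the sample documents, and the function name
are assumptions made for illustration.

```python
# Toy model: index everything, but defer common ANDed terms to the recheck.
docs = {
    1: "court ruled on patent",
    2: "court heard appeal",
    3: "court dismissed patent appeal",
    4: "court adjourned",
}

# Every word goes into the index, common or not.
index = {}
for doc_id, text in docs.items():
    for w in text.split():
        index.setdefault(w, set()).add(doc_id)

MAX_FRACTION = 0.5  # hypothetical cutoff: skip terms matching more than half the rows

def search_and(terms):
    """Return (matching ids, terms probed in the index, terms deferred to recheck)."""
    limit = MAX_FRACTION * len(docs)
    probed = [t for t in terms if len(index.get(t, ())) <= limit]
    if not probed:
        # Every term is common: probe with the least common one anyway.
        probed = [min(terms, key=lambda t: len(index.get(t, ())))]
    deferred = [t for t in terms if t not in probed]
    candidates = set.intersection(*(index.get(t, set()) for t in probed))
    # "Recheck": apply the deferred terms to the rows fetched from the heap.
    hits = sorted(d for d in candidates
                  if all(t in docs[d].split() for t in deferred))
    return hits, probed, deferred
```

For search_and(["court", "patent"]) only "patent" is looked up in the
index, and "court" is tested against the two rows that come back.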

-Kevin
