Re: tsearch2/GIST performance factors? - Mailing list pgsql-performance

From Oleg Bartunov
Subject Re: tsearch2/GIST performance factors?
Date
Msg-id Pine.GSO.4.63.0510172127510.10366@ra.sai.msu.su
Whole thread Raw
In response to tsearch2/GIST performance factors?  ("Craig A. James" <cjames@modgraph-usa.com>)
Responses Re: tsearch2/GIST performance factors?  ("Craig A. James" <cjames@modgraph-usa.com>)
List pgsql-performance
On Sat, 15 Oct 2005, Craig A. James wrote:

> We are indexing about 5 million small documents using tsearch2/GIST.  Each
> "document" contains 2 to 50 words.  This is a "write once, read many"
> situation.  Write performance is unimportant, and the database contents are
> static.  (We build it offline.)
>
> We're having problems with inconsistent performance, and it's very hard to
> separate the effects of various factors.  Here are the things we think may be
> relevant.
>
> 1. Total number of words
>
> Our documents currently contain about 110,000 unique words.  Oleg wrote:
> "[The limit is] 100K, but it's very fuzzy limit." By trial and error, we've
> learned that 50,000 works well, and 150,000 works poorly, so Oleg's comment
> appears to be a good rule-of-thumb.  (With SIGLENINT enlarged, see below.)
> But there may be other factors that affect this conclusion (such as shared
> memory, total memory, etc.).
>

Did you consider *decreasing* SIGLENINT ? Size of index will diminish
and performance could be increased. I use in current project SIGLENINT=15

>
> 2. Total size of the table
>
> 5 million documents is not a very big database (each document is a few to a
> few hundred bytes), so we don't think this is relevant.
>
>
> 3. Number of documents per word
>
> There seems to be a VERY strong effect related to "common" words.  When a
> word occurs in more than about 1% of the documents (say 50,000 to 150,000
> documents), performance goes WAY down.  Not just for that specific query, but
> it screws up tsearch2/GIST completely.
>
> We have a test of 100 queries that return 382,000 documents total.  The first
> time we run it, it's slow, about 20 minutes (as expected).  The second time
> we run it, it's very fast, about 72 seconds -- very fast!!  As long as we
> avoid queries with common words, performance is very good.
>
> But, if we run just one query that contains a common word (a word that's in
> more than about 2% of the documents, roughly 150,000 documents), then the
> next time we run the 100 test queries, it will take 20 minutes again.
>

> We can't simply eliminate these common words.  First of all, they can be very
> significant.  Second, it doesn't seem like 2% is "common".  I can understand
> that a words like "the" which occur in most documents shouldn't be indexed.
> But a word that occurs in 2% of the database seems like a very good word to
> index, yet it causes us great problems.
>

tsearch2's index is a lossy index, read http://www.sai.msu.su/~megera/oddmuse/index.cgi/Tsearch_V2_internals
so search results  should be rechecked !


> I've read a bit about tsearchd, and wonder if it would solve our problem.
> For our application, consistent performance is VERY important.  If we could
> lock the GIST index into memory, I think it would fix our problem.

I think so, tsearchd was designed for static contents in mind and it's
index doesn't require rechecking !

>
> I tried copying the GIST indexes (which are in a separate tablespace) to a 1
> GB RAM disk, and it made the initial query faster, but overall performance
> seemed worse, probably because the RAM disk was using memory that could have
> been used by the file-system cache.
>
>
> 4. Available RAM and Disk drives
>
> Would more RAM help?  How would we tell Postgres to use it effectively?  The
> GIST indexes are currently about 2.6 GB on the disk.

try to decrease signature size, say,
#define SIGLENINT  15


> I feel like I'm shooting in the dark -- Linux, Postgres and tsearch2/GIST are
> interacting in ways that I can't predict or analyze.  Thanks very much for
> any comments and advice.

We have our TODO http://www.sai.msu.su/~megera/oddmuse/index.cgi/todo
and hope to find sponsorhips for fts project for 8.2 release.
Unfortunately, I didn't find spare time to package tsearchd for you,
it should certainly help you.


     Regards,
         Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83

pgsql-performance by date:

Previous
From: Richard Huxton
Date:
Subject: Re: Sequential scan on FK join
Next
From: Martin Nickel
Date:
Subject: Re: Sequential scan on FK join