Thread: tsearch2/GIST performance factors?

tsearch2/GIST performance factors?

From

"Craig A. James"

Date:

17 October 2005, 14:09:43

We are indexing about 5 million small documents using tsearch2/GIST. Each "document" contains 2 to 50 words. This is
a"write once, read many" situation. Write performance is unimportant, and the database contents are static. (We build
itoffline.)

We're having problems with inconsistent performance, and it's very hard to separate the effects of various factors.
Hereare the things we think may be relevant.

1. Total number of words

Our documents currently contain about 110,000 unique words. Oleg wrote: "[The limit is] 100K, but it's very fuzzy
limit."By trial and error, we've learned that 50,000 works well, and 150,000 works poorly, so Oleg's comment appears to
bea good rule-of-thumb. (With SIGLENINT enlarged, see below.) But there may be other factors that affect this
conclusion(such as shared memory, total memory, etc.).

2. Total size of the table

5 million documents is not a very big database (each document is a few to a few hundred bytes), so we don't think this
isrelevant.

3. Number of documents per word

There seems to be a VERY strong effect related to "common" words. When a word occurs in more than about 1% of the
documents(say 50,000 to 150,000 documents), performance goes WAY down. Not just for that specific query, but it screws
uptsearch2/GIST completely.

We have a test of 100 queries that return 382,000 documents total. The first time we run it, it's slow, about 20
minutes(as expected). The second time we run it, it's very fast, about 72 seconds -- very fast!! As long as we avoid
querieswith common words, performance is very good.

But, if we run just one query that contains a common word (a word that's in more than about 2% of the documents,
roughly150,000 documents), then the next time we run the 100 test queries, it will take 20 minutes again.

We can't simply eliminate these common words. First of all, they can be very significant. Second, it doesn't seem
like2% is "common". I can understand that a words like "the" which occur in most documents shouldn't be indexed. But
aword that occurs in 2% of the database seems like a very good word to index, yet it causes us great problems.

I've read a bit about tsearchd, and wonder if it would solve our problem. For our application, consistent performance
isVERY important. If we could lock the GIST index into memory, I think it would fix our problem.

I tried copying the GIST indexes (which are in a separate tablespace) to a 1 GB RAM disk, and it made the initial query
faster,but overall performance seemed worse, probably because the RAM disk was using memory that could have been used
bythe file-system cache.

4. Available RAM and Disk drives

Would more RAM help? How would we tell Postgres to use it effectively? The GIST indexes are currently about 2.6 GB on
thedisk.

Would more disks help? I know they would make it faster -- the 20-minute initial query would be reduce with a RAID
drive,etc. But I'm not concerned about the 20-minute initial query, I'm concerned about keeping the system in that
super-faststate where the GIST indexes are all in memory.

Hardware:
Dual-CPU Xeon Dell server with 4 GB memory and a single SATA 7200 RPM 150GB disk.

tsearch2/gistidx.h
modified as: #define SIGLENINT 120

System configuration:
echo 2147483648 >/proc/sys/kernel/shmmax
echo 4096 >/proc/sys/kernel/shmmni
echo 2097152 >/proc/sys/kernel/shmall

Postgres Configuration:
shared_buffers = 20000
work_mem = 32768
effective_cache_size = 300000

I feel like I'm shooting in the dark -- Linux, Postgres and tsearch2/GIST are interacting in ways that I can't predict
oranalyze. Thanks very much for any comments and advice.

Craig

Re: tsearch2/GIST performance factors?

From

Oleg Bartunov

Date:

17 October 2005, 15:01:23

On Sat, 15 Oct 2005, Craig A. James wrote:

> We are indexing about 5 million small documents using tsearch2/GIST.  Each
> "document" contains 2 to 50 words.  This is a "write once, read many"
> situation.  Write performance is unimportant, and the database contents are
> static.  (We build it offline.)
>
> We're having problems with inconsistent performance, and it's very hard to
> separate the effects of various factors.  Here are the things we think may be
> relevant.
>
> 1. Total number of words
>
> Our documents currently contain about 110,000 unique words.  Oleg wrote:
> "[The limit is] 100K, but it's very fuzzy limit." By trial and error, we've
> learned that 50,000 works well, and 150,000 works poorly, so Oleg's comment
> appears to be a good rule-of-thumb.  (With SIGLENINT enlarged, see below.)
> But there may be other factors that affect this conclusion (such as shared
> memory, total memory, etc.).
>

Did you consider *decreasing* SIGLENINT ? Size of index will diminish
and performance could be increased. I use in current project SIGLENINT=15

>
> 2. Total size of the table
>
> 5 million documents is not a very big database (each document is a few to a
> few hundred bytes), so we don't think this is relevant.
>
>
> 3. Number of documents per word
>
> There seems to be a VERY strong effect related to "common" words.  When a
> word occurs in more than about 1% of the documents (say 50,000 to 150,000
> documents), performance goes WAY down.  Not just for that specific query, but
> it screws up tsearch2/GIST completely.
>
> We have a test of 100 queries that return 382,000 documents total.  The first
> time we run it, it's slow, about 20 minutes (as expected).  The second time
> we run it, it's very fast, about 72 seconds -- very fast!!  As long as we
> avoid queries with common words, performance is very good.
>
> But, if we run just one query that contains a common word (a word that's in
> more than about 2% of the documents, roughly 150,000 documents), then the
> next time we run the 100 test queries, it will take 20 minutes again.
>

> We can't simply eliminate these common words.  First of all, they can be very
> significant.  Second, it doesn't seem like 2% is "common".  I can understand
> that a words like "the" which occur in most documents shouldn't be indexed.
> But a word that occurs in 2% of the database seems like a very good word to
> index, yet it causes us great problems.
>

tsearch2's index is a lossy index, read http://www.sai.msu.su/~megera/oddmuse/index.cgi/Tsearch_V2_internals
so search results  should be rechecked !


> I've read a bit about tsearchd, and wonder if it would solve our problem.
> For our application, consistent performance is VERY important.  If we could
> lock the GIST index into memory, I think it would fix our problem.

I think so, tsearchd was designed for static contents in mind and it's
index doesn't require rechecking !

>
> I tried copying the GIST indexes (which are in a separate tablespace) to a 1
> GB RAM disk, and it made the initial query faster, but overall performance
> seemed worse, probably because the RAM disk was using memory that could have
> been used by the file-system cache.
>
>
> 4. Available RAM and Disk drives
>
> Would more RAM help?  How would we tell Postgres to use it effectively?  The
> GIST indexes are currently about 2.6 GB on the disk.

try to decrease signature size, say,
#define SIGLENINT  15


> I feel like I'm shooting in the dark -- Linux, Postgres and tsearch2/GIST are
> interacting in ways that I can't predict or analyze.  Thanks very much for
> any comments and advice.

We have our TODO http://www.sai.msu.su/~megera/oddmuse/index.cgi/todo
and hope to find sponsorhips for fts project for 8.2 release.
Unfortunately, I didn't find spare time to package tsearchd for you,
it should certainly help you.


     Regards,
         Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83

Re: tsearch2/GIST performance factors?

From

"Craig A. James"

Date:

18 October 2005, 03:09:41

Oleg wrote:
> Did you consider *decreasing* SIGLENINT ? Size of index will diminish
> and performance could be increased. I use in current project SIGLENINT=15

The default value for SIGLENINT actually didn't work at all.  It was only by increasing it that I got any performance
atall.  An examination of the GIST indexes showed that most of the first level and many of the second level bitmaps
weresaturated. 

> tsearch2's index is a lossy index, read
> http://www.sai.msu.su/~megera/oddmuse/index.cgi/Tsearch_V2_internals
> so search results  should be rechecked !

Yes, thanks.  We do indeed recheck the actual results.  The tests I'm running are just on the raw index performance -
howlong does it take to "select ... where dockeys @@ to_tsquery(...)". 

> We have our TODO http://www.sai.msu.su/~megera/oddmuse/index.cgi/todo
> and hope to find sponsorhips for fts project for 8.2 release.
> Unfortunately, I didn't find spare time to package tsearchd for you,
> it should certainly help you.

At this point we may not have time to try tsearchd, and unfortunately we're not in a position to sponsor anything yet.

My original question is still bothering me.  Is it normal for a keyword that occurs in more than about 2% of the
documentsto cause such inconsistent performance?  Is there any single thing I might look at that would help improve
performance(like, do I need more memory?  More shared memory?  Different config parameters?) 

Thanks,
Craig

Re: tsearch2/GIST performance factors?

From

Oleg Bartunov

Date:

18 October 2005, 03:21:03

Craig,

could you prepare excerption from your db (if possible), so I could
play myself ?

     Oleg
On Mon, 17 Oct 2005, Craig A. James wrote:

> Oleg wrote:
>> Did you consider *decreasing* SIGLENINT ? Size of index will diminish
>> and performance could be increased. I use in current project SIGLENINT=15
>
> The default value for SIGLENINT actually didn't work at all.  It was only by
> increasing it that I got any performance at all.  An examination of the GIST
> indexes showed that most of the first level and many of the second level
> bitmaps were saturated.
>
>> tsearch2's index is a lossy index, read
>> http://www.sai.msu.su/~megera/oddmuse/index.cgi/Tsearch_V2_internals
>> so search results  should be rechecked !
>
> Yes, thanks.  We do indeed recheck the actual results.  The tests I'm running
> are just on the raw index performance - how long does it take to "select ...
> where dockeys @@ to_tsquery(...)".
>
>> We have our TODO http://www.sai.msu.su/~megera/oddmuse/index.cgi/todo
>> and hope to find sponsorhips for fts project for 8.2 release.
>> Unfortunately, I didn't find spare time to package tsearchd for you,
>> it should certainly help you.
>
> At this point we may not have time to try tsearchd, and unfortunately we're
> not in a position to sponsor anything yet.
>
> My original question is still bothering me.  Is it normal for a keyword that
> occurs in more than about 2% of the documents to cause such inconsistent
> performance?  Is there any single thing I might look at that would help
> improve performance (like, do I need more memory?  More shared memory?
> Different config parameters?)
>
> Thanks,
> Craig
>

     Regards,
         Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83