Thread: Large Text Search Help

Large Text Search Help

From: psql-mail@freeuk.com
Date: 14 October 2003, 13:44:58

Hi,
I am trying to design a large text search database.

It will have upwards of 6 million documents, along with metadata on
each.

I am currently looking at tsearch2 to provide fast text searching and
also playing around with different hardware configurations.

1. With tsearch2 I get very good query times up until I insert more
records. For example with 100,000 records tsearch2 returns in around 6
seconds, with 200,000 records tsearch2 returns in just under a minute.
Is this due to the indices fitting entirely in memory with 100,000
records?

2. As well as whole-word matching I also need to be able to do
substring matching. Is the FTI module the way to approach this?

3. I have just begun to look into distributed queries. Is there an
existing solution for distributing a PostgreSQL database amongst
multiple servers, so each has the same schema but only a subset of the
total data?

Any other helpful comments or suggestions on how to improve query times
using different hardware or software techniques would be appreciated.

Thanks,

Mat

Re: Large Text Search Help

From: Josh Berkus
Date: 14 October 2003, 14:16:04

Mat,

> 1. With tsearch2 I get very good query times up until I insert more
> records. For example with 100,000 records tsearch2 returns in around 6
> seconds, with 200,000 records tsearch2 returns in just under a minute.
> Is this due to the indices fitting entirely in memory with 100,000
> records?

Maybe, maybe not.  If you want a definitive answer, post your EXPLAIN ANALYZE
results along with the original query.

I assume that you have run VACUUM ANALYZE first?  Don't bother to respond
until you have.
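
For anyone following along, the order of operations suggested above can be
sketched roughly as below. This is only an illustration: the helper name
diagnostic_statements, the table name "documents", the column name "idxfti"
and the sample query are all assumptions, not anything taken from Mat's
setup. The helper just builds the two statements to run in psql, statistics
refresh first, timed plan second.

```python
def diagnostic_statements(table, query):
    """Return the statements to run, in order, before posting results.

    `table` and `query` are supplied by the caller; nothing here talks
    to a database, it only assembles the SQL text.
    """
    # Drop surrounding whitespace and any trailing semicolon so that
    # exactly one terminator is added back below.
    query = query.strip().rstrip(";")
    return [
        # 1. Reclaim dead rows and refresh the planner statistics.
        f"VACUUM ANALYZE {table};",
        # 2. Execute the query for real and report the plan with timings.
        f"EXPLAIN ANALYZE {query};",
    ]


# Hypothetical example: table, column and search terms are made up.
example = diagnostic_statements(
    "documents",
    "SELECT id FROM documents WHERE idxfti @@ to_tsquery('large & text');",
)
```

Running the two resulting statements in psql, in that order, and posting the
output of the second one gives the list something concrete to look at.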

> 2. As well as whole-word matching I also need to be able to do
> substring matching. Is the FTI module the way to approach this?

Yes.
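
As a rough illustration of why a suffix-based index helps with substring
matching: if every suffix of every word is stored, then "contains this
fragment" becomes "some stored suffix starts with this fragment", which is
an anchored prefix match. The sketch below is a toy in-memory version under
that assumption; the minimum fragment length of 2, the function names and
the sample documents are all made up for the example and are not the FTI
module's actual code.

```python
import re
from collections import defaultdict

MIN_LEN = 2  # assumed minimum fragment length worth indexing


def suffixes(word, min_len=MIN_LEN):
    """All suffixes of `word` that are at least `min_len` characters long."""
    return [word[i:] for i in range(len(word) - min_len + 1)]


def build_index(docs):
    """Map each word suffix to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            for suffix in suffixes(word):
                index[suffix].add(doc_id)
    return index


def substring_search(index, fragment):
    """Ids of documents with a word containing `fragment`.

    A database would answer the prefix match with an ordered index;
    this toy version simply scans every stored suffix.
    """
    fragment = fragment.lower()
    hits = set()
    for suffix, ids in index.items():
        if suffix.startswith(fragment):
            hits |= ids
    return hits


# Made-up sample data.
docs = {1: "PostgreSQL database", 2: "text search"}
index = build_index(docs)
```

The cost is index size: a word of n characters produces n - 1 entries here,
so on millions of documents the suffix table grows much faster than a
whole-word index would.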

> 3. I have just begun to look into distributed queries. Is there an
> existing solution for distributing a PostgreSQL database amongst
> multiple servers, so each has the same schema but only a subset of the
> total data?

No, it would be ad-hoc.  So far, Moore's law has prevented us from needing to
devote serious effort to the above approach.
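
To make "ad-hoc" a little more concrete, one possible shape is sketched
below: the application hashes each document id to pick a server, and a
search is sent to every server and the results merged. Everything here is
hypothetical. The Shard class is an in-memory stand-in for a separate
PostgreSQL server with the same schema, and the hashing scheme is just one
simple choice, not an existing PostgreSQL feature.

```python
import zlib


class Shard:
    """In-memory stand-in for one server holding a subset of the rows."""

    def __init__(self, name):
        self.name = name
        self.rows = {}  # doc_id -> text

    def insert(self, doc_id, text):
        self.rows[doc_id] = text

    def search(self, term):
        # Whole-word match; a real shard would run the text search query.
        term = term.lower()
        return [doc_id for doc_id, text in self.rows.items()
                if term in text.lower().split()]


def shard_for(doc_id, shards):
    """Pick a shard with a stable hash of the document id."""
    return shards[zlib.crc32(str(doc_id).encode()) % len(shards)]


def insert(doc_id, text, shards):
    """Store each document on exactly one shard."""
    shard_for(doc_id, shards).insert(doc_id, text)


def search_all(term, shards):
    """Ask every shard, then merge the partial results in the application."""
    results = []
    for shard in shards:
        results.extend(shard.search(term))
    return sorted(results)


# Made-up sample data spread over three stand-in servers.
shards = [Shard("a"), Shard("b"), Shard("c")]
for i in range(10):
    insert(i, "large text search" if i % 2 == 0 else "other words", shards)
```

The fan-out and merge live entirely in application code, which is the
ad-hoc part: adding a server changes the modulus, so existing rows would
have to be redistributed by hand.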

> Any other helpful comments or suggestions on how to improve query times
> using different hardware or software techniques would be appreciated.

Read the archives of this list.

--
Josh Berkus
Aglio Database Solutions
San Francisco