Thread: full-text indexing

full-text indexing

From: "Brett W. McCoy"
Does anyone have any experience using Postgres's full-text indexing in a
production environment?  We're thinking about using it for a project (the
other solution is to regexp it with Perl...).  I've set up the stuff
before for experimentation, but am mainly curious about its performance
on a live, fairly heavily trafficked server.

Brett W. McCoy
                                              http://www.chapelperilous.net
---------------------------------------------------------------------------
"The chain which can be yanked is not the eternal chain."
        -- G. Fitch


Re: full-text indexing

From: Bruce Momjian
> Does anyone have any experience using Postgres's full-text indexing in a
> production environment?  We're thinking about using it for a project (the
> other solution is to regexp it with Perl...).  I've set up the stuff
> before for experimentation, but am mainly curious about its performance
> on a live, fairly heavily trafficked server.

I have one word for you:  CLUSTER.  Without it, index lookups are too
slow.  With it, they are rapid.  I have done some work like this
commercially with Ingres, which has an ISAM type that keeps the matching
rows pretty close together on a newly-created ISAM index.  In PostgreSQL, a
regular CLUSTER will keep you in good shape.
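
To make it concrete, here is roughly the shape of it (the table and column
names below are just illustrative, not the real contrib schema):

    -- hypothetical fragment table maintained by the full-text trigger:
    --   article_fti(string varchar, id oid), pointing back at article
    CREATE INDEX article_fti_string_idx ON article_fti (string);

    -- pack matching fragments onto the same disk pages
    CLUSTER article_fti_string_idx ON article_fti;

    -- a lookup now touches a handful of pages instead of one page per row
    SELECT a.*
    FROM article a, article_fti f
    WHERE f.string = 'postgres' AND f.id = a.oid;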

If you find it slow, let me know.  I have done some benchmarking with
the author and he found it pretty fast, usually a few seconds.  See the
section in my book on CLUSTER for information on _why_ it helps.

http://www.postgresql.org/docs/awbook.html


--
  Bruce Momjian                        |  http://www.op.net/~candle
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026

Re: full-text indexing

From: Jim Richards
At 08:40 PM 18/04/00 -0400, Bruce Momjian wrote:

>I have one word for you:  CLUSTER.  Without it, index lookups are too
>slow.  With it, they are rapid.  I have done some work like this
>commercially with Ingres, which has an ISAM type that keeps the matching
>rows pretty close together on a newly-created ISAM index.  In PostgreSQL, a
>regular CLUSTER will keep you in good shape.

The only downside of CLUSTER is that you lose your other keys on the
table, including the primary key, so you lose your integrity checks,
which you then have to explicitly re-apply with CREATE UNIQUE INDEX ...
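
For example, after every CLUSTER you end up doing something like this
(the names here are made up, just to show the idea):

    -- CLUSTER rebuilds the table, so the old constraint indexes are gone;
    -- put the uniqueness check back by hand
    CREATE UNIQUE INDEX article_id_key ON article (id);
    -- plus any secondary indexes you were relying on
    CREATE INDEX article_title_idx ON article (title);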


--
Kumera - a new Open Source Content Management System
for small to medium web sites written in Perl and using XML
http://www.cyber4.org/kumera/index.html

Re: full-text indexing

From: Bruce Momjian
> At 08:40 PM 18/04/00 -0400, Bruce Momjian wrote:
>
> >I have one word for you:  CLUSTER.  Without it, index lookups are too
> >slow.  With it, they are rapid.  I have done some work like this
> >commercially with Ingres, which has an ISAM type that keeps the matching
> >rows pretty close together on a newly-created ISAM index.  In PostgreSQL, a
> >regular CLUSTER will keep you in good shape.
>
> The only downside of CLUSTER is that you lose your other keys on the
> table, including the primary key, so you lose your integrity checks,
> which you then have to explicitly re-apply with CREATE UNIQUE INDEX ...
>

Yes, you basically have to recreate them after CLUSTER.

--
  Bruce Momjian                        |  http://www.op.net/~candle
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026

Re: full-text indexing

From: "Brett W. McCoy"
On Tue, 18 Apr 2000, Bruce Momjian wrote:

> I have one word for you:  CLUSTER.  Without it, index lookups are too
> slow.  With it, they are rapid.  I have done some work like this
> commercially with Ingres, which has an ISAM type that keeps the matching
> rows pretty close together on a newly-created ISAM index.  In PostgreSQL, a
> regular CLUSTER will keep you in good shape.

I agree!  That's the last bit of advice given in the full-text README.  As I
said, I'd built full-text stuff for experimentation (I had maybe 30k of
raw text, which amounted to several hundred thousand indexed entries), and I had
clustered it, and it was pretty darn fast, even on a Pentium 233 with only
48 megs of RAM.  I have significantly better hardware to run it on now.
The original project called for MySQL, but it just didn't have what we needed
to put something like this together.

> If you find it slow, let me know.  I have done some benchmarking with
> the author and he found it pretty fast, usually a few seconds.  See the
> section in my book on CLUSTER for information on _why_ it helps.

Thanks, Bruce.

Brett W. McCoy
                                              http://www.chapelperilous.net
---------------------------------------------------------------------------
Twenty two thousand days.
Twenty two thousand days.
It's not a lot.
It's all you've got.
Twenty two thousand days.
        -- Moody Blues, "Twenty Two Thousand Days"


Re: full-text indexing

From: Bruce Momjian
> On Tue, 18 Apr 2000, Bruce Momjian wrote:
>
> > I have one word for you:  CLUSTER.  Without it, index lookups are too
> > slow.  With it, they are rapid.  I have done some work like this
> > commercially with Ingres, which has an ISAM type that keeps the matching
> > rows pretty close together on a newly-created ISAM index.  In PostgreSQL, a
> > regular CLUSTER will keep you in good shape.
>
> I agree!  That's the last bit of advice given in the full-text README.  As I
> said, I'd built full-text stuff for experimentation (I had maybe 30k of
> raw text, which amounted to several hundred thousand indexed entries), and I had
> clustered it, and it was pretty darn fast, even on a Pentium 233 with only
> 48 megs of RAM.  I have significantly better hardware to run it on now.
> The original project called for MySQL, but it just didn't have what we needed
> to put something like this together.

With the original author, testing was fast, but when he loaded all the
data, it got very slow.  The problem was that as soon as his data
exceeded the buffer cache, performance became terrible.

--
  Bruce Momjian                        |  http://www.op.net/~candle
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026

Re: full-text indexing

From: "Brett W. McCoy"
On Tue, 18 Apr 2000, Bruce Momjian wrote:

> > I agree!  That's the last bit of advice given in the full-text README.  As I
> > said, I'd built full-text stuff for experimentation (I had maybe 30k of
> > raw text, which amounted to several hundred thousand indexed entries), and I had
> > clustered it, and it was pretty darn fast, even on a Pentium 233 with only
> > 48 megs of RAM.  I have significantly better hardware to run it on now.
> > The original project called for MySQL, but it just didn't have what we needed
> > to put something like this together.
>
> With the original author, testing was fast, but when he loaded all the
> data, it got very slow.  The problem was that as soon as his data
> exceeded the buffer cache, performance became terrible.

How much data are we talking here?  How can one get around this buffer
cache problem?

Brett W. McCoy
                                              http://www.chapelperilous.net
---------------------------------------------------------------------------
Twenty two thousand days.
Twenty two thousand days.
It's not a lot.
It's all you've got.
Twenty two thousand days.
        -- Moody Blues, "Twenty Two Thousand Days"


Re: full-text indexing

From: Bruce Momjian
> On Tue, 18 Apr 2000, Bruce Momjian wrote:
>
> > > I agree!  That's the last bit of advice given in the full-text README.  As I
> > > said, I'd built full-text stuff for experimentation (I had maybe 30k of
> > > raw text, which amounted to several hundred thousand indexed entries), and I had
> > > clustered it, and it was pretty darn fast, even on a Pentium 233 with only
> > > 48 megs of RAM.  I have significantly better hardware to run it on now.
> > > The original project called for MySQL, but it just didn't have what we needed
> > > to put something like this together.
> >
> > With the original author, testing was fast, but when he loaded all the
> > data, it got very slow.  The problem was that as soon as his data
> > exceeded the buffer cache, performance became terrible.
>
> How much data are we talking here?  How can one get around this buffer
> cache problem?
>

You would have to fit _all_ your heap data into the PostgreSQL buffer
cache.  That is a lot of shared memory.  If it was that small, you
wouldn't need full-text indexing.  :-)

--
  Bruce Momjian                        |  http://www.op.net/~candle
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026

Re: full-text indexing

From: "Brett W. McCoy"
On Tue, 18 Apr 2000, Bruce Momjian wrote:

> Let me be specific.  The problem is that without cluster, your fragment
> rows are together in the index, but are all over the heap table, so you
> have to read in all those disk buffers, and that is slow.  With cluster,
> most of your matching fragments are on the same disk page, so one access
> gets them all.
>
> The nightly CLUSTER is a pain, but it's the only way I ever got it working
> quickly.

This is pretty painless compared to what I've had to do getting big
full-text systems like Excalibur to work.  I can deal with a nightly
cluster.
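
For the record, the nightly job I have in mind is just a psql script run
out of cron, along these lines (untested, and the object names are made up):

    -- re-order the fragment table along its string index
    CLUSTER article_fti_string_idx ON article_fti;
    -- CLUSTER drops the other indexes, so re-create them
    CREATE INDEX article_fti_id_idx ON article_fti (id);
    -- refresh optimizer statistics after the rewrite
    VACUUM ANALYZE article_fti;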

Brett W. McCoy
                                              http://www.chapelperilous.net
---------------------------------------------------------------------------
Twenty two thousand days.
Twenty two thousand days.
It's not a lot.
It's all you've got.
Twenty two thousand days.
        -- Moody Blues, "Twenty Two Thousand Days"


Re: full-text indexing

From: Maarten Boekhold
Hi,

I guess that 'original author' would be me....

> > With the original author, testing was fast, but when he loaded all the
> > data, it got very slow.  The problem was that as soon as his data
> > exceeded the buffer cache, performance became terrible.
>
> How much data are we talking here?  How can one get around this buffer
> cache problem?

This is all from memory, but if I remember correctly, the main table had about
750,000 rows, of which one varchar(25) field was fti'ed ('full-text indexed' :),
resulting in some 5 million rows in the fti table.

WRT file sizes, I don't really remember. If you're really interested, I can
set this up again to check (over Easter).

I'm curious: Bruce mentioned buffer cache sizes. What exactly is this buffer
cache used for? I thought we relied on the OS filesystem cache to cache database
files? What will increasing the buffer cache give me?

Maarten

--

Maarten Boekhold, maarten.boekhold@tibcofinance.com
TIBCO Finance Technology Inc.
"Sevilla" Building
Entrada 308
1096 ED Amsterdam, The Netherlands
tel: +31 20 6601000 (direct: +31 20 6601066)
fax: +31 20 6601005
http://www.tibcofinance.com

Re: full-text indexing

From: Bruce Momjian
> Hi,
>
> I guess that 'original author' would be me....

Yes.

>
> > > With the original author, testing was fast, but when he loaded all the
> > > data, it got very slow.  The problem was that as soon as his data
> > > exceeded the buffer cache, performance became terrible.
> >
> > How much data are we talking here?  How can one get around this buffer
> > cache problem?
>
> This is all from memory, but if I remember correctly, the main table had about
> 750,000 rows, of which one varchar(25) field was fti'ed ('full-text indexed' :),
> resulting in some 5 million rows in the fti table.
>
> WRT file sizes, I don't really remember. If you're really interested, I can
> set this up again to check (over Easter).
>
> I'm curious: Bruce mentioned buffer cache sizes. What exactly is this buffer
> cache used for? I thought we relied on the OS filesystem cache to cache database
> files? What will increasing the buffer cache give me?

The PostgreSQL shared buffers are used by the database to read/write 8k
db blocks.  The OS has its own buffer cache too, so there is some duplication.
Ours exist in shared memory so all backends can use the information and
mark/flush them as needed.  Increasing the shared buffer cache will keep
more buffers available, but frankly the OS buffer cache is just as
important, if not more so.  It is when the data is in neither cache and we
have to go to disk thousands of times for one query that things get bad.
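
To put rough numbers on it (these are assumptions, not measurements): with
something like the stock 64 shared buffers you get 64 x 8k = 512k of
PostgreSQL cache, while a 5-million-row fti table at maybe 50 bytes of data
and tuple overhead per row is a couple hundred megabytes of heap.  Unless
the matching rows are clustered together, almost every fragment lookup falls
out of both caches and goes to disk.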

--
  Bruce Momjian                        |  http://www.op.net/~candle
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026