Thread: full-text indexing
Does anyone have any experience using Postgres's full-text indexing in a production environment? We're thinking about using it for a project (the other solution is to regexp it with Perl...). I've set up the stuff before for experimentation, but am mainly curious about its performance on a live, fairly heavily trafficked server. Brett W. McCoy http://www.chapelperilous.net --------------------------------------------------------------------------- "The chain which can be yanked is not the eternal chain." -- G. Fitch
> Does anyone have any experience using Postgres's full-text indexing in a > production environment? We're thinking about using it for a project (the > other solution is to regexp it with Perl...). I've set up the stuff > before for experimentation, but am mainly curious about its performance > on a live, fairly heavily trafficked server. I have one word for you: CLUSTER. Without it, index lookups are too slow. With it, they are rapid. I have done some work like this commercially with Ingres, which has an ISAM type that keeps the matching rows pretty close on a newly-created ISAM index. In PostgreSQL, a regular CLUSTER will keep you good. If you find it slow, let me know. I have done some benchmarking with the author and he found it pretty fast, usually a few seconds. See the section in my book on CLUSTER for information on _why_ it helps. http://www.postgresql.org/docs/awbook.html -- Bruce Momjian | http://www.op.net/~candle pgman@candle.pha.pa.us | (610) 853-3000 + If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania 19026
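[For concreteness, a minimal sketch of the setup under discussion. The table and index names are hypothetical, the fragment-table layout follows the contrib full-text-indexing scheme mentioned later in the thread, and the old-style CLUSTER syntax of this era is assumed:]

```sql
-- Hypothetical word-fragment table maintained by the fti trigger:
-- each row maps a text fragment to the oid of the source row.
CREATE TABLE article_fti (string varchar(25), id oid);
CREATE INDEX article_fti_idx ON article_fti (string);

-- Physically reorder the heap so rows sharing a fragment land on
-- the same 8k disk pages; fragment lookups then touch far fewer pages.
CLUSTER article_fti_idx ON article_fti;
```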
At 08:40 PM 18/04/00 -0400, Bruce Momjian wrote: >I have one word for you: CLUSTER. Without it, index lookups are too >slow. With it, they are rapid. I have done some work like this >commercially with Ingres, which has an ISAM type that keeps the matching >rows pretty close on a newly-created ISAM index. In PostgreSQL, a >regular CLUSTER will keep you good. The only down side of CLUSTER is that you lose your other keys on the table, including a primary key, so you lose out on your integrity checks, which you then have to explicitly re-apply with CREATE UNIQUE INDEX ... -- Kumera - a new Open Source Content Management System for small to medium web sites written in Perl and using XML http://www.cyber4.org/kumera/index.html
> At 08:40 PM 18/04/00 -0400, Bruce Momjian wrote: > > >I have one word for you: CLUSTER. Without it, index lookups are too > >slow. With it, they are rapid. I have done some work like this > >commercially with Ingres, which has an ISAM type that keeps the matching > >rows pretty close on a newly-created ISAM index. In PostgreSQL, a > >regular CLUSTER will keep you good. > > The only down side of CLUSTER is that you lose your other keys on the > table, including a primary key, so you lose out on your integrity checks, > which you then have to explicitly re-apply with CREATE UNIQUE INDEX ... > Yes, you basically have to recreate them after CLUSTER. -- Bruce Momjian | http://www.op.net/~candle pgman@candle.pha.pa.us | (610) 853-3000 + If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania 19026
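[In other words, something along these lines after every CLUSTER. Names are hypothetical; old-style CLUSTER rewrites the table and drops its other indexes, including the one backing the primary key:]

```sql
-- Suppose "article" had a primary key before this:
CLUSTER article_title_idx ON article;

-- CLUSTER left only article_title_idx behind; the unique index
-- enforcing the integrity check must be re-created explicitly:
CREATE UNIQUE INDEX article_pkey ON article (article_id);
```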
On Tue, 18 Apr 2000, Bruce Momjian wrote: > I have one word for you: CLUSTER. Without it, index lookups are too > slow. With it, they are rapid. I have done some work like this > commercially with Ingres, which has an ISAM type that keeps the matching > rows pretty close on a newly-created ISAM index. In PostgreSQL, a > regular CLUSTER will keep you good. I agree! That's the last bit of advice given in the full-text README. As I said, I'd built full-text stuff for experimentation (I had maybe 30k of raw text, which amounted to several hundred thousand indexed entries), and I had clustered it, and it was pretty darn fast, even on a Pentium 233 with only 48 megs of RAM. I have significantly better hardware to run it on now. The original project called for MySQL, but it just didn't have what we needed to put something like this together. > If you find it slow, let me know. I have done some benchmarking with > the author and he found it pretty fast, usually a few seconds. See the > section in my book on CLUSTER for information on _why_ it helps. Thanks, Bruce. Brett W. McCoy http://www.chapelperilous.net --------------------------------------------------------------------------- Twenty two thousand days. Twenty two thousand days. It's not a lot. It's all you've got. Twenty two thousand days. -- Moody Blues, "Twenty Two Thousand Days"
> On Tue, 18 Apr 2000, Bruce Momjian wrote: > > > I have one word for you: CLUSTER. Without it, index lookups are too > > slow. With it, they are rapid. I have done some work like this > > commercially with Ingres, which has an ISAM type that keeps the matching > > rows pretty close on a newly-created ISAM index. In PostgreSQL, a > > regular CLUSTER will keep you good. > > I agree! That's the last bit of advice given in the full-text README. As I > said, I'd built full-text stuff for experimentation (I had maybe 30k of > raw text, which amounted to several hundred thousand indexed entries), and I had > clustered it, and it was pretty darn fast, even on a Pentium 233 with only > 48 megs of RAM. I have significantly better hardware to run it on now. > The original project called for MySQL, but it just didn't have what we needed > to put something like this together. With the original author, testing was fast, but when he loaded all the data, it got very slow. The problem was that as soon as his data exceeded the buffer cache, performance became terrible. -- Bruce Momjian | http://www.op.net/~candle pgman@candle.pha.pa.us | (610) 853-3000 + If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania 19026
On Tue, 18 Apr 2000, Bruce Momjian wrote: > > I agree! That's the last bit of advice given in the full-text README. As I > > said, I'd built full-text stuff for experimentation (I had maybe 30k of > > raw text, which amounted to several hundred thousand indexed entries), and I had > > clustered it, and it was pretty darn fast, even on a Pentium 233 with only > > 48 megs of RAM. I have significantly better hardware to run it on now. > > The original project called for MySQL, but it just didn't have what we needed > > to put something like this together. > > With the original author, testing was fast, but when he loaded all the > data, it got very slow. The problem was that as soon as his data > exceeded the buffer cache, performance became terrible. How much data are we talking here? How can one get around this buffer cache problem? Brett W. McCoy http://www.chapelperilous.net --------------------------------------------------------------------------- Twenty two thousand days. Twenty two thousand days. It's not a lot. It's all you've got. Twenty two thousand days. -- Moody Blues, "Twenty Two Thousand Days"
> On Tue, 18 Apr 2000, Bruce Momjian wrote: > > > > I agree! That's the last bit of advice given in the full-text README. As I > > > said, I'd built full-text stuff for experimentation (I had maybe 30k of > > > raw text, which amounted to several hundred thousand indexed entries), and I had > > > clustered it, and it was pretty darn fast, even on a Pentium 233 with only > > > 48 megs of RAM. I have significantly better hardware to run it on now. > > > The original project called for MySQL, but it just didn't have what we needed > > > to put something like this together. > > > > With the original author, testing was fast, but when he loaded all the > > data, it got very slow. The problem was that as soon as his data > > exceeded the buffer cache, performance became terrible. > > How much data are we talking here? How can one get around this buffer > cache problem? > You would have to fit _all_ your heap data into the PostgreSQL buffer cache. That is a lot of shared memory. If it was that small, you wouldn't need full-text indexing. :-) -- Bruce Momjian | http://www.op.net/~candle pgman@candle.pha.pa.us | (610) 853-3000 + If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania 19026
On Tue, 18 Apr 2000, Bruce Momjian wrote: > Let me be specific. The problem is that without cluster, your fragment > rows are together in the index, but are all over the heap table, so you > have to read in all those disk buffers, and that is slow. With cluster, > most of your matching fragments are on the same disk page, so one access > gets them all. > > The nightly CLUSTER is a pain, but the only way I ever got it working > quickly. This is pretty painless compared to what I've had to do getting big full-text systems like Excalibur to work. I can deal with a nightly cluster. Brett W. McCoy http://www.chapelperilous.net --------------------------------------------------------------------------- Twenty two thousand days. Twenty two thousand days. It's not a lot. It's all you've got. Twenty two thousand days. -- Moody Blues, "Twenty Two Thousand Days"
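[A rough back-of-the-envelope sketch of the effect Bruce describes, in plain Python with made-up numbers: when matching rows are scattered across the heap, fetching them costs nearly one 8k-page read per row; once clustered, the same matches pack into a handful of consecutive pages.]

```python
# Count how many 8k heap pages a query touches to fetch 500 matching
# rows out of a 1,000,000-row table, with the matches either scattered
# randomly (no CLUSTER) or packed contiguously (after CLUSTER).
# All sizes here are illustrative assumptions, not measured figures.
import random

ROWS_PER_PAGE = 100        # assume ~80-byte rows on an 8k page
TABLE_ROWS = 1_000_000
MATCHES = 500

random.seed(42)

# Without CLUSTER: matching rows sit at arbitrary heap positions.
scattered = random.sample(range(TABLE_ROWS), MATCHES)
pages_scattered = len({row // ROWS_PER_PAGE for row in scattered})

# With CLUSTER: the same matches occupy consecutive heap positions.
clustered = range(MATCHES)
pages_clustered = len({row // ROWS_PER_PAGE for row in clustered})

print(pages_scattered)   # close to 500: nearly one page read per match
print(pages_clustered)   # 5: MATCHES / ROWS_PER_PAGE pages
```

With these assumed numbers the clustered query reads two orders of magnitude fewer pages, which is why the nightly CLUSTER pays for itself.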
Hi, I guess that 'original author' would be me.... > > With the original author, testing was fast, but when he loaded all the > > data, it got very slow. The problem was that as soon as his data > > exceeded the buffer cache, performance became terrible. > > How much data are we talking here? How can one get around this buffer > cache problem? This is all off the top of my head, but if I remember correctly, the main table had about 750,000 rows, of which one varchar(25) field was fti'ed ('full text indexed' :), resulting in some 5 million rows in the fti table. Wrt file sizes, I don't really remember. If you're really interested, I can make another setup to check this (over Easter time). I'm curious: Bruce mentioned buffer cache sizes. What exactly is this buffer cache used for? I thought we relied on the OS filesystem caching to cache database files? What will increasing buffer caches give me? Maarten -- Maarten Boekhold, maarten.boekhold@tibcofinance.com TIBCO Finance Technology Inc. "Sevilla" Building Entrada 308 1096 ED Amsterdam, The Netherlands tel: +31 20 6601000 (direct: +31 20 6601066) fax: +31 20 6601005 http://www.tibcofinance.com
> Hi, > > I guess that 'original author' would be me.... Yes. > > > > With the original author, testing was fast, but when he loaded all the > > > data, it got very slow. The problem was that as soon as his data > > > exceeded the buffer cache, performance became terrible. > > > > How much data are we talking here? How can one get around this buffer > > cache problem? > > This is all off the top of my head, but if I remember correctly, the main table had about > 750,000 rows, of which one varchar(25) field was fti'ed ('full text indexed' > :), > resulting in some 5 million rows in the fti table. > > Wrt file sizes, I don't really remember. If you're really interested, I can > make another setup to check this (over Easter time). > > I'm curious: Bruce mentioned buffer cache sizes. What exactly is this buffer > cache > used for? I thought we relied on the OS filesystem caching to cache database > files? > What will increasing buffer caches give me? The PostgreSQL shared buffers are used by the database to read/write 8k db buffers. The OS has buffers too, so there is some duplication. Ours exist in shared memory so all backends can use the information and mark/flush them as needed. Increasing the shared buffer cache will keep more buffers available, but frankly the OS buffer cache is just as important, if not more so. It is when the stuff is in neither cache and we have to go to disk thousands of times for one query that things get bad. -- Bruce Momjian | http://www.op.net/~candle pgman@candle.pha.pa.us | (610) 853-3000 + If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania 19026
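[To make that concrete, and as an assumption about how releases of this vintage are configured rather than something stated in the thread: the shared buffer count is a postmaster start-up option, so increasing it looks something like the following. The flag name, default, and data directory are assumptions and vary by release and installation:]

```shell
# Start the postmaster with 2048 shared 8k buffers (16MB) rather than
# the small default; the rest of RAM still serves the OS filesystem cache.
postmaster -B 2048 -D /usr/local/pgsql/data
```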