Thread: Performance for seq. scans
Hi all,

I've had a look over the docs and the FAQ and I can't see anything answering this, so here goes:

I'm in the (slightly unusual, in a relational world) situation that the dominant query on my database is a wildcard search, so that no indexes can be used. (E.g. select * from table_a where foo like '%bar%'.) Without some /very/ clever (and disk-space intensive) subword indexing, this query is doomed to be a sequential scan, which I'm resigned to. It's a question of making that as fast as possible.

My dataset is around 500M as a text file on disk, and around 1500M as postgres data. The machine I'm working on at the moment does the search in around 90 seconds. (For comparison, MS SQL 7, the other solution being considered here, takes around 75 seconds on identical hardware.)

Interestingly, 'vmstat' shows that the CPU is maxed out at 50% (this being a dual-CPU machine), while the disk access is a mere 4M/sec --- bonnie claims this machine is capable of around 25M/sec to this particular disk. So it would seem that the bottleneck is the CPU. [I understand why both CPUs aren't used.]

My previous feeling had been that the bottleneck was going to be the disk, in which case I was going to recommend installing enough memory in the machine that the kernel disk cache could hold the whole file, thus speeding up the search. In the current situation, it seems the only improvement would be a faster CPU (and since we're currently using a PIII 600, I couldn't expect much more than a 60% improvement or so that way).

It seems slightly surprising that postgres can only "service" a 4M/sec stream of data from the disk with a LIKE query -- not such a complex query. Is there some unnecessary data copying in the critical path for the search?

I almost forgot -- this is debian package 7.0.2-2.

Any pointers to whether or not this performance can be improved upon are welcome. Currently I'm feeling like the right solution may be to dump the 500M text file periodically and run 'grep' on a machine with enough memory to cache the text file ;-)

Jules Bean
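A quick way to confirm the behaviour Jules describes is EXPLAIN; a minimal sketch, with an illustrative index name not taken from the thread:

    CREATE INDEX table_a_foo_idx ON table_a (foo);

    -- The pattern has a leading wildcard, so there is no prefix with
    -- which to bound an index scan; the plan comes back as a Seq Scan
    -- on table_a regardless of the index.
    EXPLAIN SELECT * FROM table_a WHERE foo LIKE '%bar%';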
At 11:51 26/07/00 +0100, you wrote:
>Hi all,
>
>I've had a look over the docs and the FAQ and I can't see anything
>answering this, so here goes:
>
>I'm in the (slightly unusual, in a relational world) situation that
>the dominant query on my database is a wildcard search, so that no
>indexes can be used. (E.g. select * from table_a where foo like
>'%bar%'.)

We were in a similar position and went for the 'Full Text Indexing' extra. You'll find it in contrib/fulltextindex.

It creates a function which you call from a trigger to produce an index of words for specified fields. These indexes do get _very_ large (one of ours is ~800 MB), but it does work very well and speeds searches up enormously.

Steve
--
thorNET - Internet Consultancy, Services & Training
Phone: 01454 854413
Fax: 01454 854412
http://www.thornet.co.uk
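As a rough sketch of what the contrib/fulltextindex setup looks like (the shape follows its README; the table, index, and trigger names here are illustrative, and the exact arguments should be checked against the README itself):

    -- Auxiliary table holding the generated substrings; 'id' refers
    -- back to the indexed row in the main table by oid.
    CREATE TABLE table_a_fti (string varchar(25), id oid);
    CREATE INDEX table_a_fti_idx ON table_a_fti (string);

    -- The fti() trigger function keeps the auxiliary table in sync:
    -- its first argument names the fti table, the rest name the
    -- columns to index.
    CREATE TRIGGER table_a_fti_trig
        AFTER UPDATE OR INSERT OR DELETE ON table_a
        FOR EACH ROW EXECUTE PROCEDURE fti(table_a_fti, foo);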
On Wed, Jul 26, 2000 at 12:14:13PM +0100, Steve Heaven wrote:
> We were in a similar position and went for the 'Full Text Indexing' extra.
> You'll find it in contrib/fulltextindex.
>
> It creates a function which you call from a trigger to produce an index of
> words for specified fields. These indexes do get _very_ large (one of ours
> is ~800 MB), but it does work very well and speeds searches up enormously.

If I understand you correctly, that's word-based? It's just splitting on whitespace and punctuation? Unfortunately, that's not quite what we need --- our wildcard searches needn't have their '%' on word boundaries.

Jules
At 12:28 26/07/00 +0100, Jules Bean wrote:
>If I understand you correctly, that's word-based? It's just splitting
>on whitespace and punctuation? Unfortunately, that's not quite what
>we need --- our wildcard searches needn't have their '%' on word
>boundaries.

There is a function in the source called breakup(). This can be customised to create the index entries on sub-word strings.

Steve
--
thorNET - Internet Consultancy, Services & Training
Phone: 01454 854413
Fax: 01454 854412
http://www.thornet.co.uk
The FTI trigger code that's distributed with PostgreSQL now actually breaks the words up into two-character substrings.

I re-wrote it to eliminate duplicates and only split up the words based on whitespace and delimiters -- if you did this you could still use LIKE to match based on substrings, and you would have the added speed of an index scan.

Jules:

    select * from table_a where foo like '%bar%'

Depending on the type of foo, that doesn't have to do a seq scan. If it's anything but text, you can create an index on it -- LIKE can use indexes. If it is type text, then I would look into using the FTI stuff in contrib. If you want mine, let me know; however, it sounds like the distributed version would be more suited to what you'd like to do.

Good luck!

-Mitch
On Wed, Jul 26, 2000 at 10:34:39AM -0400, Mitch Vincent wrote:
> The FTI trigger code that's distributed with PostgreSQL now actually
> breaks the words up into two-character substrings.
>
> I re-wrote it to eliminate duplicates and only split up the words based
> on whitespace and delimiters -- if you did this you could still use LIKE
> to match based on substrings, and you would have the added speed of an
> index scan.
>
> Depending on the type of foo, that doesn't have to do a seq scan. If
> it's anything but text, you can create an index on it -- LIKE can use
> indexes. If it is type text, then I would look into using the FTI stuff
> in contrib. If you want mine, let me know; however, it sounds like the
> distributed version would be more suited to what you'd like to do.

Hmm. Colour me confused. You're suggesting I can use a conventional index on 'foo'? Well, certainly, there is such an index on the table anyhow (since some queries do use it), but I don't understand. Surely LIKE only uses indexes when the string match is anchored on the left? Certainly that's my understanding of indexes, and it's what the docs say...

OTOH, are you saying that the FTI code in 7.0.2 contrib does in fact do the 'index all occurring 2-char substrings' trick? I must go and investigate that straight away. Many thanks.

Jules
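Jules's reading matches the documented behaviour: a B-tree index only helps LIKE when the pattern is anchored on the left, because only then is there a prefix the planner can turn into a range scan (roughly foo >= 'bar' AND foo < 'bas', modulo locale caveats in these versions). Illustrative queries:

    -- Left-anchored: the prefix bounds an index range scan.
    SELECT * FROM table_a WHERE foo LIKE 'bar%';

    -- Unanchored: no usable prefix, so this stays a sequential scan.
    SELECT * FROM table_a WHERE foo LIKE '%bar%';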
On Wed, Jul 26, 2000 at 03:44:47PM +0100, Jules Bean wrote:
> OTOH, are you saying that the FTI code in 7.0.2 contrib does in fact
> do the 'index all occurring 2-char substrings' trick? I must go and
> investigate that straight away. Many thanks.

OK, I apologise for my skepticism. I have just read the FTI code, and as far as a quick glance can tell me, it appears to index, for each 'word' in my main table, all the final segments of that word. This enables me to match any substring, since if %bar% matches, then there must be /some/ final segment which begins with bar, so I can search for bar% in the FTI table. Rather cleverer than the algorithm I'd thought of ;-)

Thanks to everyone for all your help. Now it just remains to make up my mind whether I have the disk space for this ;-)

Jules
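To make the trick concrete: on Jules's reading, for a word like 'barrel' the FTI table would hold the final segments 'barrel', 'arrel', 'rrel', 'rel', 'el' (down to some minimum length), so every substring of the word is a prefix of some stored segment. A query can then reach the main table through a left-anchored search on the FTI table -- a sketch, reusing the illustrative names from the setup above:

    SELECT a.*
      FROM table_a a, table_a_fti f
     WHERE f.string LIKE 'bar%'   -- left-anchored, so the index on string applies
       AND f.id = a.oid;          -- fti rows point back at the main table by oid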
On Wed, 26 Jul 2000, Jules Bean wrote:
> If I understand you correctly, that's word-based? It's just splitting
> on whitespace and punctuation? Unfortunately, that's not quite what
> we need --- our wildcard searches needn't have their '%' on word
> boundaries.

You can have it search on less than word boundaries. It uses a regexp in the query statement, and actually will match against sub-strings. Check the documentation; it talks about how this works.

Brett W. McCoy
http://www.chapelperilous.net/~bmccoy/
-------------------------------------------------------------------------------
If everything is coming your way then you're in the wrong lane.
> From: Jules Bean <jules@jellybean.co.uk>
> To: pgsql-general@postgresql.org
> Subject: Performance for seq. scans
>
> I've had a look over the docs and the FAQ and I can't see anything
> answering this, so here goes:
>
> I'm in the (slightly unusual, in a relational world) situation that
> the dominant query on my database is a wildcard search, so that no
> indexes can be used. (E.g. select * from table_a where foo like
> '%bar%'.)
>
> Interestingly, 'vmstat' shows that the CPU is maxed out at 50%
> (this being a dual-CPU machine), while the disk access is a mere
> 4M/sec --- bonnie claims this machine is capable of around 25M/sec to
> this particular disk. So it would seem that the bottleneck is the
> CPU. [I understand why both CPUs aren't used.]

I'd make sure that you're using the latest compilers to build postgres, and maxing out the optimization. If your binaries are unoptimized, that ought to be good for 15% (for your app, possibly more).

Also, you haven't said what postmaster options you're using - I've seen big changes by tweaking sort memory, buffer pools, etc. Details of the parameters are in the man pages and docs.

-- Mike
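For reference, in the 7.0 era those knobs were command-line options rather than postgresql.conf settings. Something along these lines (values and data directory illustrative, not a recommendation): -B sets the number of shared buffer pages, and -o passes options through to each backend, here -S, the per-sort memory budget in kilobytes.

    postmaster -D /usr/local/pgsql/data -B 2048 -o '-S 4096'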
> > We were in a similar position and went for the 'Full Text Indexing' extra.
> > You'll find it in contrib/fulltextindex.
> > It creates a function which you call from a trigger to produce an index of
> > words for specified fields. These indexes do get _very_ large (one of ours
> > is ~800 MB), but it does work very well and speeds searches up enormously.
>
> If I understand you correctly, that's word-based? It's just splitting
> on whitespace and punctuation? Unfortunately, that's not quite what
> we need --- our wildcard searches needn't have their '%' on word
> boundaries.

It is not word-based. It slices each word into pieces and indexes it.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026