Thread: 50 000 000 Table entries and I have to do a keyword search HELP NEEDED

50 000 000 Table entries and I have to do a keyword search HELP NEEDED

From
"borajetta"
Date:

So I have a table with about 50 million entries in it, I have to do a keyword search.

 

The keyword search is done on the title of the entry.  For example a entry could be "This is a title string which could be searched"

 

I have tried a few ways to search but I get horrible search times.  Some keywords will come up with matches as big as .25 million but most are around 1000-5000.

 

I use an index which narrows the table down to about 1.5-2million entries.

 

I used 2 tables which had a 1:1 correspondence.

One held a gist index which was on a int field which searched the for the keyword.  Then I would join the table to another to retrieve the rest of the information about the items it matched.

 

This was slow even for returning 100 entries.  About 10 seconds, sometimes 5.  But when I start getting 1xxx entries its about 30-50 seconds.  The rest is just horrible.

 

How should I set up my indexes and or tables.

We were thinking of putting the index inside one table then the join would not have to be done but this still returns rather slow results.

 

I have not fully tested this method but it looks like when run for just the keyword search on the title and no joining it can return in about 10 seconds or less.

This is a great improvement but I am currently going to make the table all in one and see how long it will take.  I believe it will not be much more as there will be no join needed only the returning of some attribute fields. 

 

This is still not the kind of time I would like to see, I wanted something around 2 seconds or less.  I know there is a lot of information especially if .25 million rows are to be returned but if there is only 1xxx-9xxx rows to be returned I believe 2 seconds seems about right.

 

How do search engines do it?

Any suggestions are welcome,

 

Thanks

Borajetta,

> So I have a table with about 50 million entries in it, I have to do a
keyword search.

Are you using OpenFTS/TSearch2?     Is your database optimized?

--
-Josh Berkus
 Aglio Database Solutions
 San Francisco


Re: 50 000 000 Table entries and I have to do a keyword search HELP NEEDED

From
"Jeremy Dunn"
Date:
One option that does not take advantage of any fancy indexing methods is to create a trigger on the table, on insert/update/delete, which extracts each individual word from the field you care about, and creates an entry in another 'keyword' table, id = 'word', value = pk of your original table.  then index the keyword table on the 'keyword' field, and do your searches from there.  this should improve performance substantially, even on very large return sets, because the keyword table rows are very small and thus a lot of them fit in a disk block.
 
- Jeremy
-----Original Message-----
From: pgsql-performance-owner@postgresql.org [mailto:pgsql-performance-owner@postgresql.org] On Behalf Of borajetta
Sent: Monday, June 07, 2004 5:47 PM
To: pgsql-performance@postgresql.org
Subject: [PERFORM] 50 000 000 Table entries and I have to do a keyword search HELP NEEDED

So I have a table with about 50 million entries in it, I have to do a keyword search.

 

The keyword search is done on the title of the entry.  For example a entry could be "This is a title string which could be searched"

 

I have tried a few ways to search but I get horrible search times.  Some keywords will come up with matches as big as .25 million but most are around 1000-5000.

 

I use an index which narrows the table down to about 1.5-2million entries.

 

I used 2 tables which had a 1:1 correspondence.

One held a gist index which was on a int field which searched the for the keyword.  Then I would join the table to another to retrieve the rest of the information about the items it matched.

 

This was slow even for returning 100 entries.  About 10 seconds, sometimes 5.  But when I start getting 1xxx entries its about 30-50 seconds.  The rest is just horrible.

 

How should I set up my indexes and or tables.

We were thinking of putting the index inside one table then the join would not have to be done but this still returns rather slow results.

 

I have not fully tested this method but it looks like when run for just the keyword search on the title and no joining it can return in about 10 seconds or less.

This is a great improvement but I am currently going to make the table all in one and see how long it will take.  I believe it will not be much more as there will be no join needed only the returning of some attribute fields. 

 

This is still not the kind of time I would like to see, I wanted something around 2 seconds or less.  I know there is a lot of information especially if .25 million rows are to be returned but if there is only 1xxx-9xxx rows to be returned I believe 2 seconds seems about right.

 

How do search engines do it?

Any suggestions are welcome,

 

Thanks

Borajetta,

The other thing you can do is take a look at your hardware.   What are you
running this on?   How much RAM?   Are other things running on this server as
well?

--
-Josh Berkus
 Aglio Database Solutions
 San Francisco


The words for the keyword can be made up of a sentace, ie 10 or more keywords to one entry.
Also incase I didnt answer before, we are using TSearch2 and all tables have been fully analyzed and indexed.
Any other suggestions?
How long do searches take when 10 000 rows are returned?
We can not use a limit of 100 because we need to analyze the entire data set returned.
Thanks,
----- Original Message -----
Sent: Tuesday, June 15, 2004 5:43 AM
Subject: RE: [PERFORM] 50 000 000 Table entries and I have to do a keyword search HELP NEEDED

One option that does not take advantage of any fancy indexing methods is to create a trigger on the table, on insert/update/delete, which extracts each individual word from the field you care about, and creates an entry in another 'keyword' table, id = 'word', value = pk of your original table.  then index the keyword table on the 'keyword' field, and do your searches from there.  this should improve performance substantially, even on very large return sets, because the keyword table rows are very small and thus a lot of them fit in a disk block.
 
- Jeremy
-----Original Message-----
From: pgsql-performance-owner@postgresql.org [mailto:pgsql-performance-owner@postgresql.org] On Behalf Of borajetta
Sent: Monday, June 07, 2004 5:47 PM
To: pgsql-performance@postgresql.org
Subject: [PERFORM] 50 000 000 Table entries and I have to do a keyword search HELP NEEDED

So I have a table with about 50 million entries in it, I have to do a keyword search.

 

The keyword search is done on the title of the entry.  For example a entry could be "This is a title string which could be searched"

 

I have tried a few ways to search but I get horrible search times.  Some keywords will come up with matches as big as .25 million but most are around 1000-5000.

 

I use an index which narrows the table down to about 1.5-2million entries.

 

I used 2 tables which had a 1:1 correspondence.

One held a gist index which was on a int field which searched the for the keyword.  Then I would join the table to another to retrieve the rest of the information about the items it matched.

 

This was slow even for returning 100 entries.  About 10 seconds, sometimes 5.  But when I start getting 1xxx entries its about 30-50 seconds.  The rest is just horrible.

 

How should I set up my indexes and or tables.

We were thinking of putting the index inside one table then the join would not have to be done but this still returns rather slow results.

 

I have not fully tested this method but it looks like when run for just the keyword search on the title and no joining it can return in about 10 seconds or less.

This is a great improvement but I am currently going to make the table all in one and see how long it will take.  I believe it will not be much more as there will be no join needed only the returning of some attribute fields. 

 

This is still not the kind of time I would like to see, I wanted something around 2 seconds or less.  I know there is a lot of information especially if .25 million rows are to be returned but if there is only 1xxx-9xxx rows to be returned I believe 2 seconds seems about right.

 

How do search engines do it?

Any suggestions are welcome,

 

Thanks