Thread: General search problem - challenge

General search problem - challenge

From
"Postgres User"
Date:
I have a table of around 6,000 places in the world.  Every time my
server receives a ping, I grab the content of an article from an
RSS feed.  Then I search the article for the presence of any of the
6,000 terms.
A typical article is around 1200 words.

I don't need to save the article in a table and the search is
performed only once, so it's not about FTS.

Any thoughts on the best way to execute these searches using a
traditional language like C++ ?

Re: General search problem - challenge

From
Steve Atkins
Date:
On Jul 2, 2007, at 3:36 PM, Postgres User wrote:

> I have a table of around 6,000 places in the world.  Every time my
> server receives a ping, I grab the content of an article from an
> RSS feed.  Then I search the article for the presence of any of the
> 6,000 terms.
> A typical article is around 1200 words.
>
> I don't need to save the article in a table and the search is
> performed only once, so it's not about FTS.
>
> Any thoughts on the best way to execute these searches using a
> traditional language like C++ ?

That'll depend heavily on the performance you need and the
language you use. Plain C++ is very different from C++ with the STL,
which is in turn very different from C++ with Qt.

Naive approach: On receiving an article, read all 6000 terms
from the search table. See if any of them are in the article, with
strstr(3).
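
A minimal sketch of that naive approach, assuming the 6,000 terms have
already been read from the table into a vector. The function name
find_places and the parameter names are hypothetical, and the matching
is case-sensitive and ignores word boundaries:

```cpp
#include <cstring>
#include <string>
#include <vector>

// Returns every term that occurs somewhere in the article.
// Matching is case-sensitive and ignores word boundaries,
// so "Paris" also matches inside "Parisian".
std::vector<std::string> find_places(const std::string& article,
                                     const std::vector<std::string>& terms) {
    std::vector<std::string> found;
    for (const std::string& term : terms) {
        if (term.empty()) continue;  // strstr matches an empty needle everywhere
        if (std::strstr(article.c_str(), term.c_str()) != nullptr) {
            found.push_back(term);
        }
    }
    return found;
}
```

Each call scans the article once per term, so one article costs up to
6,000 passes over roughly 1,200 words of text.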

If that's fast enough for you, you're done. If not, you'll need to
do some work to cache / precompile the search patterns in memory,
or preprocess the articles for fast multi-term search. It's very
unlikely you'd need to do that, though.

(Also, this is an application that screams "I could be written
faster in Perl than C++").

Cheers,
   Steve


Re: General search problem - challenge

From
Richard Huxton
Date:
Postgres User wrote:
> I have a table of around 6,000 places in the world.  Every time my
> server receives a ping, I grab the content of an article from an
> RSS feed.  Then I search the article for the presence of any of the
> 6,000 terms.
> A typical article is around 1200 words.
>
> I don't need to save the article in a table and the search is
> performed only once, so it's not about FTS.
>
> Any thoughts on the best way to execute these searches using a
> traditional language like C++ ?

Not sure that it's got anything to do with PostgreSQL.

1. Pre-process the 6,000 words into a hash lookup table using the hash
library of your choice.
2. Split the article into "words" (however you define that).
3. Use your hash table to look up each word from the article.
4. Stop on the first match.
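
The steps above can be sketched in C++ roughly as follows, assuming
std::unordered_set as the hash table and treating a "word" as a run of
letters. The function name first_place is hypothetical. Note that
single-word lookups like this will not find multi-word place names
such as "New York"; those would need separate handling.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <unordered_set>

// Returns the first word in the article that is a known place,
// or an empty string if there is none. A "word" here is a maximal
// run of letters, which is only one possible definition.
std::string first_place(const std::string& article,
                        const std::unordered_set<std::string>& places) {
    std::string word;
    for (std::size_t i = 0; i <= article.size(); ++i) {
        // Treat the end of the article as a word separator.
        const bool is_letter = i < article.size() &&
            std::isalpha(static_cast<unsigned char>(article[i])) != 0;
        if (is_letter) {
            word += article[i];
        } else if (!word.empty()) {
            if (places.count(word) != 0) return word;  // step 4: stop on first match
            word.clear();
        }
    }
    return "";
}
```

The set is built once from the table (step 1) and reused for every
article, so each article costs about one hash lookup per word.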

Like Steve Atkins says, I'd use Perl instead of C++ and go home early :-)

--
   Richard Huxton
   Archonet Ltd