Thread: full text searching

full text searching

From
Culley Harrelson
Date:
Hi,

OK full text searching.  Will the full text index
catch changes in verb tense?  i.e. will a search for
woman catch women?

I'm researching before I dive in to this later in the
week so please excuse this incompletely informed
question:  Will I need to rebuild postgresql with the
full-text index module included?  Unfortunately I'm
away from my linux machine-- would someone be willing
to email me the README?

Thanks in advance,

Culley


Re: full text searching

From
"Mitch Vincent"
Date:
> Hi,
>
> OK full text searching.  Will the full text index
> catch changes in verb tense?  i.e. will a search for
> woman catch women?
>
> I'm researching before I dive in to this later in the
> week so please excuse this incompletely informed
> question:  Will I need to rebuild postgresql with the
> full-text index module included?  Unfortunately I'm
> away from my linux machine-- would someone be willing
> to email me the README?

Regardless of indexing, you're still searching for a specific string (if you
search using the = operator).

SELECT * from people WHERE whatever = 'woman';

-- Isn't going to catch anything but the literal string "woman".. (it's case
sensitive too, mind you)

SELECT * from people WHERE whatever LIKE 'wom%n';

-- Should match either "woman" or "women" (plus anything else that fits the
pattern).

A regex search is going to get more specific, but when using a regex search
you can't use indexes.
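
If you want one query to catch both forms, the closest thing is probably a
case-insensitive regex (again, no index will be used -- the table and column
names here are just carried over from the example above):

SELECT * from people WHERE whatever ~* 'wom[ae]n';

-- matches any value containing "woman" or "women", in any case.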

Anyone, please correct me if I'm wrong.

-Mitch



Re: full text searching

From
"Mitch Vincent"
Date:
Another thing..

Full text indexing, last time I checked, was just a trigger/function; you
don't have to rebuild anything that I'm aware of to include it..
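
Roughly speaking, you create the keyword table yourself and then hang the
contrib trigger function on the table you want indexed. Something like the
sketch below -- the table and column names are made up, and I'm going from
memory on the function name and argument order, so check the module's README
for the real thing:

CREATE TABLE people_fti (string varchar(25), id oid);
CREATE INDEX people_fti_idx ON people_fti (string);

CREATE TRIGGER people_fti_trigger
    AFTER INSERT OR UPDATE OR DELETE ON people
    FOR EACH ROW EXECUTE PROCEDURE fti(people_fti, whatever);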

-Mitch

> Hi,
>
> OK full text searching.  Will the full text index
> catch changes in verb tense?  i.e. will a search for
> woman catch women?
>
> I'm researching before I dive in to this later in the
> week so please excuse this incompletely informed
> question:  Will I need to rebuild postgresql with the
> full-text index module included?  Unfortunately I'm
> away from my linux machine-- would someone be willing
> to email me the README?
>
> Thanks in advance,
>
> Culley
>


Re: full text searching

From
Gunnar Rønning
Date:
"Mitch Vincent" <mitch@venux.net> writes:

> -- Isn't going to catch anything but the literal string "woman".. (it's case
> sensitive too, mind you)
>
> SELECT * from people WHERE whatever LIKE 'wom%n';
>
> -- Should check either.

Well, you wouldn't want to start building these kinds of rules into your
application - better to have them in the search engine. The fulltextindex
in the contrib package does of course not offer anything like this; it
would be nice to see a third-party add-on provide full-text capabilities
for PostgreSQL.

regards,

    Gunnar

Re: full text searching

From
"Mitch Vincent"
Date:
> Well you wouldn't want to start  building these kind of rules in your
> application - better to have them in the search engine. The fulltextindex
> in the contrib package does of course not offer anything like this, it
> would be nice to see a third party addon provide fulltext capabilities for
> Postgresql.

Well, the search engine isn't the database, IMHO. The search "engine" is
your application... The database will go get anything you tell it to, you
just have to know how to tell it and make sure that your application tells
it in the correct way.

Teaching an application or database the English language is going to be a
hell of a project, good luck!

Personally, I think the FTI trigger and function that's in contrib is
pretty bad. It's not usable in a lot of situations. I re-wrote it to remove
duplicates and index whole words, but it still didn't work out for me --
namely when you have fairly large chunks of text (30k or so), one for each
row (and you have 10,000 rows). Ripping out and indexing 30k * 10k of text
is a lot by itself, but then when you search you have to join the two
tables... It becomes a mess and was actually slower than when I used LIKE
to search the big text fields in my single table. It only takes a few
seconds for the seq scan and the index scan on the FTI table, but with FTI
updating became a 30-second job (of course there were something like 4
million rows and each record did have 30k of text or so). I don't have too
many small databases, so maybe this works for a lot of people :-)

Anyway. Moral of the story.. I'd like to see native PostgreSQL full text
indexing before we go adding on to the contrib'd trigger/function
implementation...

-Mitch


Re: full text searching

From
Gunnar Rønning
Date:
"Mitch Vincent" <mitch@venux.net> writes:


> Well, the search engine isn't the database, IMHO. The search "engine" is
> your application... The database will go get anything you tell it to, you
> just have to know how to tell it and make sure that your application tells
> it in the correct way.
>
> Teaching an application or database the English language is going to be a
> hell of a project, good luck!

Well, I don't want to write another search engine. What I would like to
see is a way to integrate with different third-party products. It would be
cool to have Lucene or some other free search engine as an optional add-on
for PostgreSQL.

> Anyway. Moral of the story.. I'd like to see native PostgreSQL full text
> indexing before we go adding on to the contrib'd trigger/function
> implementation...

Well, I think any attempt at a "complete" full text indexing implementation
in the database itself is futile. Find a way to move this out of the
database and integrate with another product.

I've been using a variant of the FTI system in an application, but this is
far from sufficient when it comes to matching. Speed is OK, but the quality
of the results could have been a lot better.

Regards,

    Gunnar

Re: full text searching

From
"Mitch Vincent"
Date:
> > Well, the search engine isn't the database, IMHO. The search "engine" is
> > your application... The database will go get anything you tell it to, you
> > just have to know how to tell it and make sure that your application tells
> > it in the correct way.
> >
> > Teaching an application or database the English language is going to be a
> > hell of a project, good luck!
>
> Well, I don't want to write another search engine. What I would like to see
> is a way to integrate with different third party products. It would be cool
> with Lucene or some other free search engine as an optional add on
> for PostgreSQL.
>
> > Anyway. Moral of the story.. I'd like to see native PostgreSQL full text
> > indexing before we go adding on to the contrib'd trigger/function
> > implementation...
>
> Well, I think any attempt at a "complete" full text indexing implementation
> in the database itself is futile. Find a way to move this out of the
> database and integrate with another product.

Futile? Nah, I don't think it's futile any more than indexing on any other
field is futile. If you could have both then, well, that would rock.

I'm talking about indexing from the standpoint of fast searching, not
really smart searching (I wouldn't want a search for "woman" to return
results with "women"). I put it upon myself to generate the queries needed
to give the proper results. I work for a custom software shop, and so far
every application I've written needs a search, and the client needs it to
do a very customized, very specific thing. I can't help but write it from
scratch (of course, all I'm really doing is writing a frontend to
PostgreSQL). I'm not sure that a generic search engine would work for me,
because all the clients I've worked with have very specific needs.
PostgreSQL is my generic search engine for all intents and purposes, and I
make it give me what I want. With regard to FTI, I just want it to search
large chunks of text faster...

> I've been using a variant of the FTI system in an application, but this is
> far from sufficient when it comes to matching. Speed is OK, but the
> quality of the results could have been a lot better.

Really? How are you using it? If it's better than the one I wrote (and it
almost has to be!) I'd love to take a look.. Speed is OK on the machine I'm
searching large text chunks on now, because of a combination of a good
database (PostgreSQL) and a hefty machine (Dual PII 800, 512M ECC RAM,
Ultra 160 SCSI drives).. Still, it's only doing sequential scans and using
LIKE to give me matches.. My search is a generic

SELECT * from whatever WHERE textfield LIKE '%<searched word>%';

That's fairly fast -- it would be a hell of a lot faster if I could do an
index scan there.. Of course it was faster with FTI; it's just that
updating and inserting suffered too much, something that will happen any
time you're indexing a large amount of data on the fly. I would just like
to see it suffer less, which might happen if FTI were built into PG.. I'm
just talking here, I don't know how FTI would be implemented better if it
were built in, other than that I'm sure the person doing it would know
more about the internals of PG and more about C than me (sadly I'm not all
that good with C anymore)..

Have a good one!


-Mitch







Re: full text searching

From
Gunnar Rønning
Date:
"Mitch Vincent" <mitch@venux.net> writes:

> > I've been using a variant of the FTI system in an application, but this is
> > far from sufficient when it comes to matching. Speed is OK, but the
> > quality of the results could have been a lot better.
>
> Really? How are you using it? If it's better than the one I wrote (and it
> almost has to be!) I'd love to take a look.. Speed is OK on the machine I'm

It is really not based on the FTI code in PostgreSQL, since we started out
with a Progress database last year before porting to PostgreSQL. The idea
is the same though: a separate lookup table containing the words, for
exact matching. Last time I had a look at the client's database it had
about 50-60K rows in the content table, which amounted to about ~3500K
rows in the lookup table. Searches return results instantly even though
most of them are joins involving 3-4 tables. The database (7.0.2) is
running on a Sun 220R with one 450MHz processor, 10,000 RPM disks, 1GB RAM
and Solaris 7. (As a curiosity, my P466 laptop with Linux is actually
running PostgreSQL faster...)
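
The shape of it is nothing fancy -- a lookup table keyed on the individual
words, joined back to the content table on the id. Something like this
(schema and names simplified and made up for illustration):

CREATE TABLE content (id int4 PRIMARY KEY, body text);
CREATE TABLE content_words (word varchar(32), content_id int4);
CREATE INDEX content_words_idx ON content_words (word);

-- exact word match; this can use the index on content_words:
SELECT c.*
  FROM content c, content_words w
 WHERE w.word = 'car'
   AND w.content_id = c.id;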

Since we're only doing exact searches, the index is utilized. But the
quality isn't good enough - I would love to have language sensitive
searches. "car" should match "cars" but not cartography and "ship"
should/could match "boat" etc.

Regards,

    Gunnar

Re: full text searching

From
"Thomas T. Thai"
Date:
On 7 Feb 2001, Gunnar Rønning wrote:
> It is really not based on the FTI code in PostgreSQL, since with we started
> out with a Progress database last year before porting to PostgreSQL. The
> idea is the same though, a separate lookup table containing the words for
> exact matching. Last time I had a look at the clients database it had about
> 50-60K rows in the content table, which amounted to about ~3500K rows in
> the lookup table. Searches return results instantly even though most of
> them are joins involving 3-4 tables. The database(7.0.2) is running on a
> Sun 220R with one 450MHZ processor, 10000RPM disks, 1GB RAM and Solaris
> 7. (As a curiosity my P466 laptop with Linux is actually running PostgreSQL
> faster...)
>
> Since we're only doing exact searches, the index is utilized. But the
> quality isn't good enough - I would love to have language sensitive
> searches. "car" should match "cars" but not cartography and "ship"
> should/could match "boat" etc.

You can use ispell prefix/suffix rules to search on base words.


Re: full text searching

From
Lincoln Yeoh
Date:
At 11:40 AM 07-02-2001 -0600, Thomas T. Thai wrote:
>On 7 Feb 2001, Gunnar Rønning wrote:
>>
>> Since we're only doing exact searches, the index is utilized. But the
>> quality isn't good enough - I would love to have language sensitive
>> searches. "car" should match "cars" but not cartography and "ship"
>> should/could match "boat" etc.
>
>you can use ispell prefix/suffix for searching base words.

Sometimes I want it literal too, e.g. if I'm searching for "car" I want
"car" and nothing else.

What I figured first was to create yet another dictionary of meanings.

Where ship is converted to meaning1 (verb), meaning2 (noun), etc.
And boat is converted to meaning2 (noun).

But then boat is not really the same as ship, so they should actually be
separate meanings, maybe linked by a weight. Maybe we should make use of
PostgreSQL's polygon stuff, and stick a boat polygon somewhere where it
intersects a ship polygon and a canoe polygon.

In the end it seems too complex. I'm starting to think it may be better to
keep things literal but fast, and do the smart stuff at the client (do
multiple searches if necessary).
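
i.e. the client decides which variants it cares about ("car", maybe "cars")
and just runs a literal, indexed query for all of them, against a word
lookup table like the one Gunnar described (names below are made up):

SELECT DISTINCT c.*
  FROM content c, content_words w
 WHERE w.word IN ('car', 'cars')
   AND w.content_id = c.id;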

Cheerio,
Link.


Re: full text searching

From
Gunnar Rønning
Date:
Lincoln Yeoh <lyeoh@pop.jaring.my> writes:

> Sometimes I want it literal too. e.g. I'm searching for car I want car and
> nothing else.

Of course, you want this as well.

> In the end it seems too complex. I'm starting to think it may be better to
> keep things literal but fast, and do the smart stuff at the client (do
> multiple searches if necessary).

You've got commercial products like Verity that are able to do these kinds
of things. I've used Verity in conjunction with Sybase; apart from
stability problems, that was a very nice combination for doing free-text
searches. I could define which columns I wanted indexed, and then I could
do joins against a proxy table (the index) to do synonym searches, word
searches, regex searches, soundex searches, etc.

Verity ran in a separate process that Sybase forwarded the free-text
search to when you joined against the proxy table. Maybe we could have a
similar architecture in PostgreSQL some day as well.

Does anybody know how Oracle has implemented their "context" search, or
whatever it is called nowadays?

regards,

    Gunnar





Replication and on-line recovery

From
"Gordan Bobic"
Date:
Hi!

I will be setting up a 3-way replicated server setup over the next
week or two, so this is probably a good time to make sure I know
exactly what I'm in for. To my knowledge, rserv is the only currently
available replication server for Postgres - is that correct?

1) Is there a way to designate 1 server as master, and do transparent
clustering by just connecting to that one server? Or does the
clustering have to be handled on the client end by making a connection
to each server and doing a "round-robin" for picking the server for
each query (multi-threaded application)?

2) If a server fails, how can it be resynchronised with the rest of
the cluster? I am assuming that inserts all have to be channeled
through the same server, in order to avoid race conditions. Is this
correct? If a server fails, and inserts are constantly being made,
then I am guessing that a dump/restore will not work properly if the
master (insert) server is not taken off-line and inserts are stopped.
Is this the case? How can this be done without taking the master
server off-line? Taking any server off line would take it out of sync,
so just doing a restore from another secondary server would then
result in two servers being out of sync. How is this worked around?

3) I know this has been asked before, and I've managed to get a few
responses about how to implement a quick and useable solution to this,
but I seem to have misplaced the emails, so please forgive me for
asking this again.

Is it possible to run Linux + Mosix + GFS to achieve the functionality
of a truly distributed database system? The bandwidth of communication
between the servers is not a huge problem, because I have the option
of connecting them either in a "star" configuration with 100 Mb
ethernet, connect them to a switch, or use a fiber link between them.
I've been told that "postmaster" won't get migrated properly due to
IPC and shared memory issues. Can anyone suggest a work-around? DIPC,
perhaps? I can't see how to work around the shared memory, though...

I know Oracle has full distributed clustering support, but I have
made a decision to stick with open source software all the way (plus
the cost of running Oracle on a commercial setup is quite
prohibitive).

Still, even if the answer to the fully distributed database question
here is still just big, fat, flat "NO!", I'd really appreciate some
input regarding failure recovery and insert handling on a replicated
database cluster.

Regards.

Gordan



Re: Replication and on-line recovery

From
Justin Clift
Date:
Hi Gordan,

Whilst probably not really useful just yet (as I presently know very
little about replication, but I'm learning), I have just begun writing
up my initial attempts with rserv and Usogres (another PostgreSQL
replication approach):

techdocs.postgresql.org/installguides.html#replication

If you get it all set up and working in good order, can you write up a
guide on doing it? I haven't found anything "out there" about it, which
is why I'm starting.

:-)

Regards and best wishes,

Justin Clift

Gordan Bobic wrote:
>
> Hi!
>
> I will be setting up a 3-way replicated server setup over the next
> week or two, so this is probably a good time to make sure I know
> exactly what I'm in for. To my knowledge, rserv is the only currently
> available replication server for Postgres - is that correct?
>
> 1) Is there a way to designate 1 server as master, and do transparent
> clustering by just connecting to that one server? Or does the
> clustering have to be handled on the client end by making a connection
> to each server and doing a "round-robin" for picking the server for
> each query (multi-threaded application)?
>
> 2) If a server fails, how can it be resynchronised with the rest of
> the cluster? I am assuming that inserts all have to be channeled
> through the same server, in order to avoid race conditions. Is this
> correct? If a server fails, and inserts are constantly being made,
> then I am guessing that a dump/restore will not work properly if the
> master (insert) server is not taken off-line and inserts are stopped.
> Is this the case? How can this be done without taking the master
> server off-line? Taking any server off line would take it out of sync,
> so just doing a restore from another secondary server would then
> result in two servers being out of sync. How is this worked around?
>
> 3) I know this has been asked before, and I've managed to get a few
> responses about how to implement a quick and useable solution to this,
> but I seem to have misplaced the emails, so please forgive me for
> asking this again.
>
> Is it possible to run Linux + Mosix + GFS to achieve the functionality
> of a truly distributed database system? The bandwidth of communication
> between the servers is not a huge problem, because I have the option
> of connecting them either in a "star" configuration with 100 Mb
> ethernet, connect them to a switch, or use a fiber link between them.
> I've been told that "postmaster" won't get migrated properly due to
> IPC and shared memory issues. Can anyone suggest a work-around? DIPC,
> perhaps? I can't see how to work around the shared memory, though...
>
> I know Oracle has full distributed clustering support, but I have
> made a decision to stick with open source software all the way (plus
> the cost of running Oracle on a commercial setup is quite
> prohibitive).
>
> Still, even if the answer to the fully distributed database question
> here is still just big, fat, flat "NO!", I'd really appreciate some
> input regarding failure recovery and insert handling on a replicated
> database cluster.
>
> Regards.
>
> Gordan
>

--
"My grandfather once told me that there are two kinds of people: those
who work and those who take the credit. He told me to try to be in the
first group; there was less competition there."
     - Indira Gandhi

Re: Replication and on-line recovery

From
"Gregory Wood"
Date:
> I will be setting up a 3-way replicated server setup over the next
> week or two, so this is probably a good time to make sure I know
> exactly what I'm in for. To my knowledge, rserv is the only currently
> available replication server for Postgres - is that correct?

PostgreSQL Replicator (http://pgreplicator.sourceforge.net/) appears to be
another. Justin has also mentioned Usogres (which I think is at
http://usogres.good-day.net/ but appears to be down at the moment).

I'm starting to do some research (kinda sorta) on replication and I'd love
to hear anything about any of these three servers. I'll try to do the
same if my time allows.

Greg