Thread: full text searching
Hi,

OK, full text searching. Will the full text index catch changes in verb tense? i.e. will a search for woman catch women?

I'm researching before I dive in to this later in the week, so please excuse this incompletely informed question: Will I need to rebuild postgresql with the full-text index module included? Unfortunately I'm away from my linux machine -- would someone be willing to email me the README?

Thanks in advance,

Culley

__________________________________________________
Get personalized email addresses from Yahoo! Mail - only $35
a year! http://personal.mail.yahoo.com/
> Hi,
>
> OK full text searching. Will the full text index catch changes in verb
> tense? i.e. will a search for woman catch women?
>
> I'm researching before I dive in to this later in the week so please
> excuse this incompletely informed question: Will I need to rebuild
> postgresql with the full-text index module included? Unfortunately I'm
> away from my linux machine -- would someone be willing to email me the
> README?

Regardless of indexing, you're still searching for a specific string (if you search using the = operator).

SELECT * FROM people WHERE whatever = 'woman';

-- Isn't going to catch anything but the literal string "woman" (it's case sensitive too, mind you).

SELECT * FROM people WHERE whatever LIKE 'wom%n';

-- Should match either "woman" or "women".

A regex search can get more specific, but when you use a regex search you can't use indexes. Anyone, please correct me if I'm wrong.

-Mitch
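A quick way to see the difference between exact and wildcard matching outside the database is a small Python sketch (fnmatch's `*` plays the role of SQL LIKE's `%`; the word list is made up for illustration):

```python
from fnmatch import fnmatchcase

words = ["woman", "women", "Woman", "workman"]

# Exact equality, like the = operator: only the literal,
# case-sensitive string matches.
exact = [w for w in words if w == "woman"]

# Pattern matching, like LIKE 'wom%n': '%' becomes '*' here.
# This catches both "woman" and "women", still case-sensitively.
pattern = [w for w in words if fnmatchcase(w, "wom*n")]

print(exact)    # ['woman']
print(pattern)  # ['woman', 'women']
```

Note that the wildcard also happens to match any other word fitting the shape (e.g. "womyn" would match too); it is a pattern, not a linguistic rule.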
Another thing: full text indexing, last time I checked, was just a trigger/function -- you don't have to rebuild anything that I'm aware of to include it.

-Mitch

> Hi,
>
> OK full text searching. Will the full text index catch changes in verb
> tense? i.e. will a search for woman catch women?
>
> I'm researching before I dive in to this later in the week so please
> excuse this incompletely informed question: Will I need to rebuild
> postgresql with the full-text index module included? Unfortunately I'm
> away from my linux machine -- would someone be willing to email me the
> README?
>
> Thanks in advance,
>
> Culley
"Mitch Vincent" <mitch@venux.net> writes:

> -- Isn't going to catch anything but the literal string "woman".. (it's
> case sensitive too, mind you)
>
> SELECT * from people WHERE whatever LIKE 'wom%n';
>
> -- Should check either.

Well, you wouldn't want to start building these kinds of rules into your application -- better to have them in the search engine. The fulltextindex in the contrib package of course doesn't offer anything like this; it would be nice to see a third-party addon provide fulltext capabilities for PostgreSQL.

regards,

Gunnar
> Well you wouldn't want to start building these kind of rules in your
> application - better to have them in the search engine. The
> fulltextindex in the contrib package does of course not offer anything
> like this, it would be nice to see a third party addon provide fulltext
> capabilities for Postgresql.

Well, the search engine isn't the database, IMHO. The search "engine" is your application... The database will go get anything you tell it to; you just have to know how to tell it, and make sure that your application tells it in the correct way. Teaching an application or database the English language is going to be a hell of a project -- good luck!

Personally, I think the FTI trigger and function that's in contrib is pretty bad. It's not usable in a lot of situations. I re-wrote it to remove duplicates and index whole words, but it still didn't work out for me -- namely when you have fairly large chunks of text (30k or so), one for each record (and you have 10,000 rows). Ripping out and indexing 30k * 10k of text is a lot by itself, but then when you search you have to join the two tables... It becomes a mess, and was actually slower than when I used LIKE to search the big text fields in my single table. It only takes a few seconds for the seq scan and the index scan on the FTI table, but with FTI, updating became a 30-second job (of course there were something like 4 million rows in the index table, and each record did have 30k of text or so). I don't have too many small databases, so maybe this works for a lot of people :-)

Anyway, moral of the story: I'd like to see native PostgreSQL full text indexing before we go adding on to the contrib'd trigger/function implementation...

-Mitch
"Mitch Vincent" <mitch@venux.net> writes:

> Well, the search engine isn't the database, IMHO. The search "engine" is
> your application... The database will go get anything you tell it to,
> you just have to know how to tell it and make sure that your application
> tells it in the correct way.
>
> Teaching an application or database the English language is going to be
> a hell of a project, good luck!

Well, I don't want to write another search engine. What I would like to see is a way to integrate with different third-party products. It would be cool to have Lucene or some other free search engine as an optional add-on for PostgreSQL.

> Anyway. Moral of the story.. I'd like to see native PostgreSQL full text
> indexing before we go adding on to the contrib'd trigger/function
> implementation...

Well, I think any attempt at a "complete" full text indexing implementation in the database itself is futile. Find a way to move this out of the database and integrate with another product.

I've been using a variant of the FTI system in an application, but this is far from sufficient when it comes to matching. Speed is OK, but the quality of the results could have been a lot better.

Regards,

Gunnar
> > Well, the search engine isn't the database, IMHO. The search "engine"
> > is your application... The database will go get anything you tell it
> > to, you just have to know how to tell it and make sure that your
> > application tells it in the correct way.
> >
> > Teaching an application or database the English language is going to
> > be a hell of a project, good luck!
>
> Well, I don't want to write another search engine. What I would like to
> see is a way to integrate with different third party products. It would
> be cool with Lucene or some other free search engine as an optional add
> on for PostgreSQL.
>
> > Anyway. Moral of the story.. I'd like to see native PostgreSQL full
> > text indexing before we go adding on to the contrib'd trigger/function
> > implementation...
>
> Well, I think any attempt at a "complete" full text indexing
> implementation in the database itself is futile. Find a way to move this
> out of the database and integrate with another product.

Futile? Nah, I don't think it's any more futile than indexing on any other field. If you could have both, well, that would rock. I'm talking about indexing from the standpoint of fast searching, not really smart searching (I wouldn't want a search for "woman" to return results with "women"). I put it upon myself to generate the queries needed to give the proper results.

I work for a custom software shop, and so far every application I've written needs a search, and the client needs it to do a very customized, very specific thing. I can't help but write it from scratch (of course, all I'm really doing is writing a frontend to PostgreSQL). I'm not sure that a generic search engine would work for me, because all the clients I've worked with have very specific needs. PostgreSQL is my generic search engine for all intents and purposes, and I make it give me what I want. With regard to FTI, I just want it to search large chunks of text faster...
> I've been using a variant of the FTI system in an application, but this
> is far from sufficient when it comes to matching. Speed is OK, but the
> quality of the results could have been a lot better.

Really? How are you using it? If it's better than the one I wrote (and it almost has to be!) I'd love to take a look. Speed is OK on the machine I'm searching through large text chunks with now, because of a combination of a good database (PostgreSQL) and a hefty machine (Dual PII 800, 512M ECC RAM, Ultra 160 SCSI drives). Still, it's only doing sequential scans and using LIKE to give me matches. My search is a generic

SELECT * FROM whatever WHERE textfield LIKE '%<searched word>%';

That's fairly fast -- it would be a hell of a lot faster if I could do an index scan there. Of course FTI's searching was faster; it's just that updating and inserting suffered too much -- something that will happen anytime you're indexing a large amount of data on the fly. I would just like to see it suffer less, which might happen if FTI was built into PG.

I'm just talking here; I don't know how FTI would be implemented better if it were built in, other than that I'm sure the person doing it would know more about the internals of PG and more about C than me (sadly, I'm not all that good with C anymore).

Have a good one!

-Mitch
"Mitch Vincent" <mitch@venux.net> writes:

> > I've been using a variant of the FTI system in an application, but
> > this is far from sufficient when it comes to matching. Speed is OK,
> > but the quality of the results could have been a lot better.
>
> Really? How are you using it? If it's better than the one I wrote (and
> it almost has to be!) I'd love to take a look..

It is not really based on the FTI code in PostgreSQL, since we started out with a Progress database last year before porting to PostgreSQL. The idea is the same, though: a separate lookup table containing the words, for exact matching. Last time I had a look at the client's database it had about 50-60K rows in the content table, which amounted to about ~3500K rows in the lookup table. Searches return results instantly even though most of them are joins involving 3-4 tables. The database (7.0.2) is running on a Sun 220R with one 450MHz processor, 10000RPM disks, 1GB RAM and Solaris 7. (As a curiosity, my P466 laptop with Linux actually runs PostgreSQL faster...)

Since we're only doing exact searches, the index is utilized. But the quality isn't good enough -- I would love to have language-sensitive searches. "car" should match "cars" but not "cartography", and "ship" should/could match "boat" etc.

Regards,

Gunnar
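The lookup-table scheme described above can be sketched with Python's sqlite3 standing in for PostgreSQL (the table names, column names and sample documents are all invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# One content table plus a separate word-lookup table, as in the
# contrib FTI scheme: one lookup row per (word, content id) pair.
cur.execute("CREATE TABLE content (id INTEGER PRIMARY KEY, body TEXT)")
cur.execute("CREATE TABLE fti (word TEXT, content_id INTEGER)")
cur.execute("CREATE INDEX fti_word_idx ON fti (word)")

docs = {1: "the red car drove away", 2: "cartography of the sea"}
for doc_id, body in docs.items():
    cur.execute("INSERT INTO content VALUES (?, ?)", (doc_id, body))
    for word in set(body.split()):
        cur.execute("INSERT INTO fti VALUES (?, ?)", (word, doc_id))

# An exact word search becomes a join against the lookup table; the
# equality predicate on fti.word is what lets the index be used.
rows = cur.execute(
    "SELECT c.id, c.body FROM content c"
    " JOIN fti f ON f.content_id = c.id WHERE f.word = ?",
    ("car",),
).fetchall()
print(rows)  # [(1, 'the red car drove away')]
```

Matching on whole words is why "car" does not also pull in "cartography"; the price is the size of the lookup table and the extra work at insert/update time that Mitch ran into.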
On 7 Feb 2001, Gunnar R|nning wrote:

> It is really not based on the FTI code in PostgreSQL, since we started
> out with a Progress database last year before porting to PostgreSQL.
> The idea is the same though, a separate lookup table containing the
> words for exact matching. Last time I had a look at the clients
> database it had about 50-60K rows in the content table, which amounted
> to about ~3500K rows in the lookup table. Searches return results
> instantly even though most of them are joins involving 3-4 tables. The
> database(7.0.2) is running on a Sun 220R with one 450MHZ processor,
> 10000RPM disks, 1GB RAM and Solaris 7. (As a curiosity my P466 laptop
> with Linux is actually running PostgreSQL faster...)
>
> Since we're only doing exact searches, the index is utilized. But the
> quality isn't good enough - I would love to have language sensitive
> searches. "car" should match "cars" but not cartography and "ship"
> should/could match "boat" etc.

You can use ispell prefix/suffix rules for searching on base words.
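The ispell prefix/suffix idea can be sketched very roughly in Python -- a naive hand-rolled suffix rule, nowhere near real ispell affix tables, with a made-up function name:

```python
def base_word(word: str) -> str:
    """Reduce a word to a crude base form by stripping a few common
    English suffixes. A toy stand-in for ispell's affix tables;
    irregular forms like woman/women are not handled."""
    for suffix in ("ies", "es", "s"):
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            if len(stem) >= 3:
                # "bodies" -> "body"; "cars" -> "car"
                return stem + ("y" if suffix == "ies" else "")
    return word

# Indexing and searching on base words makes "cars" match "car",
# while "cartography" keeps its own entry.
print(base_word("cars"))         # car
print(base_word("cartography"))  # cartography
print(base_word("bodies"))       # body
```

Applied to both the lookup table and the query term, this gives "car" matching "cars" without also matching "cartography", as Gunnar asked for -- though synonym matching like ship/boat needs a thesaurus, not affix rules.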
At 11:40 AM 07-02-2001 -0600, Thomas T. Thai wrote:

>On 7 Feb 2001, Gunnar R|nning wrote:
>>
>> Since we're only doing exact searches, the index is utilized. But the
>> quality isn't good enough - I would love to have language sensitive
>> searches. "car" should match "cars" but not cartography and "ship"
>> should/could match "boat" etc.
>
>you can use ispell prefix/suffix for searching base words.

Sometimes I want it literal too. e.g. if I'm searching for "car", I want "car" and nothing else.

What I figured first was to create yet another dictionary of meanings, where "ship" is converted to meaning1 (verb), meaning2 (noun), etc., and "boat" is converted to meaning2 (noun). But then a boat is not really the same as a ship, so they should actually be separate meanings, maybe linked by a weight. Maybe we should make use of PostgreSQL's polygon stuff, and stick a boat polygon somewhere where it intersects a ship polygon and a canoe polygon. In the end it seems too complex.

I'm starting to think it may be better to keep things literal but fast, and do the smart stuff at the client (do multiple searches if necessary).

Cheerio,

Link.
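The "literal but fast, smart at the client" idea amounts to expanding the query on the client side and issuing several exact searches. A sketch, where the synonym table, the in-memory "database" and both function names are invented for illustration:

```python
# A small, hand-maintained synonym table; in a real application this
# would be much larger or come from a thesaurus file.
SYNONYMS = {"ship": ["boat", "vessel"], "car": ["automobile"]}

def expand(term: str, literal_only: bool = False) -> list:
    """Return the terms to search for. With literal_only=True the
    search stays exact, as Lincoln sometimes wants."""
    if literal_only:
        return [term]
    return [term] + SYNONYMS.get(term, [])

def search(db: dict, term: str, literal_only: bool = False) -> set:
    """Run one exact word search per expanded term, merging the ids."""
    hits = set()
    for t in expand(term, literal_only):
        for doc_id, body in db.items():
            if t in body.split():
                hits.add(doc_id)
    return hits

db = {1: "the ship sailed", 2: "a boat on the lake", 3: "shipping costs"}
print(search(db, "ship"))                     # {1, 2}
print(search(db, "ship", literal_only=True))  # {1}
```

Each individual search stays an exact (and therefore indexable) match; only the number of queries grows with the expansion.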
Lincoln Yeoh <lyeoh@pop.jaring.my> writes:

> Sometimes I want it literal too. e.g. I'm searching for car I want car
> and nothing else.

Of course, you want this as well.

> In the end it seems too complex. I'm starting to think it may be better
> to keep things literal but fast, and do the smart stuff at the client
> (do multiple searches if necessary).

There are commercial products like Verity that are able to do these kinds of things. I've used Verity in conjunction with Sybase; apart from stability problems, that was a very nice combination for doing free-text searches. I could define which columns I wanted indexed, and then I could do joins against a proxy table (the index) to do synonym searches, word searches, regex searches, soundex searches, etc. Verity was running in a separate process that Sybase forwards the free-text search to when you join against the proxy table. Maybe we could have a similar architecture in PostgreSQL as well some day.

Does anybody know how Oracle has implemented their "context" search, or whatever it is called nowadays?

regards,

Gunnar
Hi!

I will be setting up a 3-way replicated server setup over the next week or two, so this is probably a good time to make sure I know exactly what I'm in for. To my knowledge, rserv is the only currently available replication server for Postgres -- is that correct?

1) Is there a way to designate one server as master, and do transparent clustering by just connecting to that one server? Or does the clustering have to be handled on the client end by making a connection to each server and doing a "round-robin" for picking the server for each query (multi-threaded application)?

2) If a server fails, how can it be resynchronised with the rest of the cluster? I am assuming that inserts all have to be channeled through the same server, in order to avoid race conditions. Is this correct? If a server fails, and inserts are constantly being made, then I am guessing that a dump/restore will not work properly if the master (insert) server is not taken off-line and inserts are stopped. Is this the case? How can this be done without taking the master server off-line? Taking any server off-line would take it out of sync, so just doing a restore from another secondary server would then result in two servers being out of sync. How is this worked around?

3) I know this has been asked before, and I've managed to get a few responses about how to implement a quick and useable solution to this, but I seem to have misplaced the emails, so please forgive me for asking this again. Is it possible to run Linux + Mosix + GFS to achieve the functionality of a truly distributed database system? The bandwidth of communication between the servers is not a huge problem, because I have the option of connecting them in a "star" configuration with 100 Mb ethernet, connecting them to a switch, or using a fiber link between them. I've been told that "postmaster" won't get migrated properly due to IPC and shared memory issues. Can anyone suggest a work-around? DIPC, perhaps?
I can't see how to work around the shared memory, though...

I know Oracle has full distributed clustering support, but I have made a decision to stick with open source software all the way (plus the cost of running Oracle on a commercial setup is quite prohibitive).

Still, even if the answer to the fully distributed database question here is still just a big, fat, flat "NO!", I'd really appreciate some input regarding failure recovery and insert handling on a replicated database cluster.

Regards,

Gordan
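The client-side round-robin from question (1), combined with channeling all inserts through one master as in question (2), can be sketched like this (class name, server names and the crude statement sniffing are all invented for illustration -- real connection handling and failover are not shown):

```python
from itertools import cycle

class Cluster:
    """Toy connection router: reads rotate round-robin over all
    servers, while writes always go to the designated master."""

    def __init__(self, master, replicas):
        self.master = master
        self._reader = cycle([master] + replicas)

    def server_for(self, sql: str) -> str:
        # Channel anything that modifies data through the master so
        # there is a single ordering of writes (avoiding the race
        # conditions mentioned above).
        if sql.lstrip().upper().startswith(("INSERT", "UPDATE", "DELETE")):
            return self.master
        return next(self._reader)

c = Cluster("db0", ["db1", "db2"])
print(c.server_for("SELECT 1"))                  # db0
print(c.server_for("SELECT 2"))                  # db1
print(c.server_for("INSERT INTO t VALUES (1)"))  # db0
print(c.server_for("SELECT 3"))                  # db2
```

Each client thread would hold its own router (or share one behind a lock); resynchronising a failed replica is a separate problem this sketch does not touch.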
Hi Gordan,

Whilst probably not really useful just yet (as I presently know very little about replication, but I'm learning), I have just begun writing up my initial attempts with rserv and Usogres (another PostgreSQL replication approach):

techdocs.postgresql.org/installguides.html#replication

If you get it all set up and working in good order, can you write up a guide on doing it? I haven't found anything "out there" about it, which is why I'm starting. :-)

Regards and best wishes,

Justin Clift

Gordan Bobic wrote:
>
> Hi!
>
> I will be setting up a 3-way replicated server setup over the next
> week or two, so this is probably a good time to make sure I know
> exactly what I'm in for. To my knowledge, rserv is the only currently
> available replication server for Postgres - is that correct?
>
> 1) Is there a way to designate 1 server as master, and do transparent
> clustering by just connecting to that one server? Or does the
> clustering have to be handled on the client end by making a connection
> to each server and doing a "round-robin" for picking the server for
> each query (multi-threaded application)?
>
> 2) If a server fails, how can it be resynchronised with the rest of
> the cluster? I am asuming that inserts all have to be channeled
> through the same server, in order to avoid race conditions. Is this
> correct? If a server fails, and inserts are constantly being made,
> then I am guessing that a dump/restore will not work properly if the
> master (insert) server is not taken off-line and inserts are stopped.
> Is this the case? How can this be done without taking the master
> server off-line? Taking any server off line would take it out of sync,
> so just doing a restore from another secondary server would then
> result in two servers being out of sync. How is this worked around?
>
> 3) I know this has been asked before, and I've managed to get a few
> responses about how to implement a quick and useable solution to this,
> but I seem to have misplaced the emails, so please forgive me for
> asking this again.
>
> Is it possible to run Linux + Mosix + GFS to achieve the functionality
> of a truly distributed database system? The bandwidth of communication
> between the servers is not a huge problem, because I have the option
> of connecting them either in a "star" configuration with 100 Mb
> ethernet, connect them to a switch, or use a fiber link between them.
> I've been told that "postmaster" won't get migrated properly due to
> IPC and shared memory issues. Can anyone suggest a work-around? DIPC,
> perhaps? I can't see how to work around the shared memory, though...
>
> I know Oracle has a full distributed clustering support, but I have
> made a decision to stick with open source software all the way (plus
> the cost of running Oracle on a commercial setup is quite
> prohibitive).
>
> Still, even if the answer to the fully distributed database question
> here is still just big, fat, flat "NO!", I'd really appreciate some
> input regarding failure recovery and insert handling on a replicated
> database cluster.
>
> Regards.
>
> Gordan

--
"My grandfather once told me that there are two kinds of people: those
who work and those who take the credit. He told me to try to be in the
first group; there was less competition there."
 - Indira Gandhi
> I will be setting up a 3-way replicated server setup over the next
> week or two, so this is probably a good time to make sure I know
> exactly what I'm in for. To my knowledge, rserv is the only currently
> available replication server for Postgres - is that correct?

PostgreSQL Replicator (http://pgreplicator.sourceforge.net/) appears to be another. Justin has also mentioned Usogres (which I think is at http://usogres.good-day.net/ but appears to be down at the moment).

I'm starting to do some research (kinda sorta) on replication, and I'd love to hear anything about any of these three servers. I'll try to do the same if my time allows.

Greg