Thread: [SQL, regex, words] how to match word boundaries using regex's?
[SQL, regex, words] how to match word boundaries using regex's?
From
Tony.Curtis@vcpc.univie.ac.at
Date:
Hi, I have been searching all the mailing lists, the on-line docs, and the source! but cannot find an answer/solution to this: I have a varchar() field containing descriptions of things (doesn't matter what they are). I want to do a regex match limited to words. I tried this: where ... ~ '\Wword\W'; where ... ~ '\W*word\W*'; where ... ~ '\b\(word\)\b'; and other things with LIKE but no joy. How can I do this? thanks tony
Re: [SQL] [SQL, regex, words] how to match word boundaries using regex's?
From
"Gene Selkov Jr."
Date:
> I want to do a regex match limited to words. > > I tried this: > > where ... ~ '\Wword\W'; > where ... ~ '\W*word\W*'; > where ... ~ '\b\(word\)\b'; > > and other things with LIKE but no joy. Based on the comments in the source, regexp stuff used in postgres is something like this: http://tiger8.com/us/regexp-manpage.html I guess there are no backslash macros is POSIX expressions. No joy. By the way, I am wondering what determined the choiceof the regexp machine for postgres? Is it performance-related? Is it possible to have the same stuff as in perl? As to your question, how about a poor man's Altavista like this: Split the text into words before loading into a special index table. Words are numbered sequentially, so you can search for"phrases": Table = word +----------------------------------+----------------------------------+-------+ | Field | Type | Length| +----------------------------------+----------------------------------+-------+ | rec | char() | 12 | | seq | int4 | 4 | | word | text | var | +----------------------------------+----------------------------------+-------+ SELECT DISTINCT w1.rec FROM word w1, word w2 WHERE w1.word ~ '^a$' AND w2.word ~ '^phrase$' AND w1.rec = w2.rec AND w2.seq - w1.seq = 1; -- Distance between the words This way, you can control what represents the concept of a 'word' by an external program (perl script, etc.) Certainly, this method will show suboptimal performance with extra large tables and more than three or four words in a seachphrase. But it is possible to optimise by delegating set operations (joins) and position arithmetic to the client. Itworks very well for my ~500k tables and the most common queries. --Gene
Re: [SQL] [SQL, regex, words] how to match word boundaries using regex's?
From
Tony.Curtis@vcpc.univie.ac.at
Date:
Re: [SQL] [SQL, regex, words] how to match word boundaries using regex's? , Gene <selkovjr@mcs.anl.gov> said: Gene> I guess there are no backslash macros is POSIX Gene> expressions. No joy. By the way, I am wondering what No joy, Ren :-( Gene> determined the choice of the regexp machine for Gene> postgres? Is it performance-related? Is it possible to Gene> have the same stuff as in perl? That would be really nice. I was just considering how easy it would be to pull out the current regex backend directory and put perl-type (or GNU) ones there instead. Or is that not the way to do it? Could I write a new backend with overloaded/new operators to do this? Gene> Split the text into words before loading into a Gene> special index table. Words are numbered sequentially, Gene> so you can search for "phrases": I had thought about doing this because the description string will probably require interpretation at some level. But I was looking for a straight-forward inefficiency-be-damned-it's-only-a-demo-for-now solution. tony