Thread: Postgres regexp matching failure?
I have an regular expression, wich works fine in Java or python, but I don't seem to be able to make it work in postgres. Regular expression needs to match everything begining with '+' (plus sign) or letters 'STOP', 'stop', 'StoP', or any other combination pronounced 'stop'. Here is the python example: >>> import re >>> p = re.compile(r'^\s*(?:[\+|-]|(?:[sS][tT][oO][pP]\b)).*$') >>> p.match('+mario').group(0) '+mario' >>> p.match('+mario works').group(0) '+mario works' >>> p.match('mario works').group(0) Traceback (most recent call last): File "<stdin>", line 1, in ? AttributeError: 'NoneType' object has no attribute 'group' >>> p.match('stop works').group(0) 'stop works' >>> Now, here is what happens if I try this in postgres: pulitzer2=# select '+mario' ~ '^\s*(?:[\+|-]|(?:[sS][tT][oO][pP]\b)).* $';?column? ----------t (1 row) This one is ok. pulitzer2=# select '+mario works' ~ '^\s*(?:[\ +|-]|(?:[sS][tT][oO][pP]\b)).*$';?column? ----------t (1 row) This one is also ok. pulitzer2=# select 'mario works' ~ '^\s*(?:[\ +|-]|(?:[sS][tT][oO][pP]\b)).*$';?column? ----------f (1 row) Same as this one, also ok. pulitzer2=# select 'stop works' ~ '^\s*(?:[\ +|-]|(?:[sS][tT][oO][pP]\b)).*$';?column? ----------f (1 row) Here, postgres should return true, but it gives me false. I'd appreciate any hints on what is wrong here. Thank you in advance, Mario -- Mario Splivalo Mob-Art mario.splivalo@mobart.hr "I can do it quick, I can do it cheap, I can do it well. Pick any two."
On 9/5/06, Mario Splivalo <mario.splivalo@mobart.hr> wrote:
\b is a back-space - is that what you are wanting there? If I remove it I get true.
==================================================================
Aaron Bono
Aranya Software Technologies, Inc.
http://www.aranya.com
http://codeelixir.com
==================================================================
pulitzer2=# select 'stop works' ~ '^\s*(?:[\
+|-]|(?:[sS][tT][oO][pP]\b)).*$';
?column?
----------
f
(1 row)
Here, postgres should return true, but it gives me false.
==================================================================
Aaron Bono
Aranya Software Technologies, Inc.
http://www.aranya.com
http://codeelixir.com
==================================================================
Mario Splivalo <mario.splivalo@mobart.hr> writes: > Now, here is what happens if I try this in postgres: > pulitzer2=# select '+mario' ~ '^\s*(?:[\+|-]|(?:[sS][tT][oO][pP]\b)).*$'; I'm thinking you've forgotten to double your backslashes. regards, tom lane
On Tue, 2006-09-05 at 10:11 -0400, Tom Lane wrote: > Mario Splivalo <mario.splivalo@mobart.hr> writes: > > Now, here is what happens if I try this in postgres: > > > pulitzer2=# select '+mario' ~ '^\s*(?:[\+|-]|(?:[sS][tT][oO][pP]\b)).*$'; > > I'm thinking you've forgotten to double your backslashes. > That was the first thing I tried :) What confused me was the difference I get when using python, or perl, or Java, or any other regexp 'matcher', therefore I asked for help here. Mario -- Mario Splivalo Mob-Art mario.splivalo@mobart.hr "I can do it quick, I can do it cheap, I can do it well. Pick any two."
On Tue, 2006-09-05 at 08:42 -0500, Aaron Bono wrote: > On 9/5/06, Mario Splivalo <mario.splivalo@mobart.hr> wrote: > > pulitzer2=# select 'stop works' ~ '^\s*(?:[\ > +|-]|(?:[sS][tT][oO][pP]\b)).*$'; > ?column? > ---------- > f > (1 row) > > Here, postgres should return true, but it gives me false. > > > \b is a back-space - is that what you are wanting there? If I remove > it I get true. Actually, I'm not sure :) As I've mentioned, python/java/perl do as I expected, postgres on the other hand doesn't. If \b was the backspace, then I'd have trouble with '+mario test', and that one seems to be OK. I also tried doublebackslashing, all the same. So, I guess it's obvious that postgres doesn't treat regular expressions the same way as java/perl/pyton/php/awk/sed do... Mario -- Mario Splivalo Mob-Art mario.splivalo@mobart.hr "I can do it quick, I can do it cheap, I can do it well. Pick any two."
Mario Splivalo wrote: > On Tue, 2006-09-05 at 08:42 -0500, Aaron Bono wrote: > > On 9/5/06, Mario Splivalo <mario.splivalo@mobart.hr> wrote: > > > > pulitzer2=# select 'stop works' ~ '^\s*(?:[\ > > +|-]|(?:[sS][tT][oO][pP]\b)).*$'; > > ?column? > > ---------- > > f > > (1 row) > > > > Here, postgres should return true, but it gives me false. > > > > > > \b is a back-space - is that what you are wanting there? If I remove > > it I get true. > > Actually, I'm not sure :) As I've mentioned, python/java/perl do as I > expected, postgres on the other hand doesn't. If \b was the backspace, > then I'd have trouble with '+mario test', and that one seems to be OK. No, because the \b is inside the "stop" arm of the |. You need to do *both*, double backslashes and get rid of \b (or at least understand what you actually mean with it ...) -- Alvaro Herrera http://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc.
On Tue, 2006-09-05 at 10:21 -0400, Alvaro Herrera wrote: > Mario Splivalo wrote: > > On Tue, 2006-09-05 at 08:42 -0500, Aaron Bono wrote: > > > On 9/5/06, Mario Splivalo <mario.splivalo@mobart.hr> wrote: > > > > > > pulitzer2=# select 'stop works' ~ '^\s*(?:[\ > > > +|-]|(?:[sS][tT][oO][pP]\b)).*$'; > > > ?column? > > > ---------- > > > f > > > (1 row) > > > > > > Here, postgres should return true, but it gives me false. > > > > > > > > > \b is a back-space - is that what you are wanting there? If I remove > > > it I get true. > > > > Actually, I'm not sure :) As I've mentioned, python/java/perl do as I > > expected, postgres on the other hand doesn't. If \b was the backspace, > > then I'd have trouble with '+mario test', and that one seems to be OK. > > No, because the \b is inside the "stop" arm of the |. You need to do > *both*, double backslashes and get rid of \b (or at least understand > what you actually mean with it ...) > I know this might not be the right place for this question, but, how come (or better: why?) is above regexp macthed ok (ok as in 'the way I expected') when employed from java/perl/python? Mario
Mario Splivalo wrote: > On Tue, 2006-09-05 at 10:21 -0400, Alvaro Herrera wrote: > > Mario Splivalo wrote: > > > On Tue, 2006-09-05 at 08:42 -0500, Aaron Bono wrote: > > > > On 9/5/06, Mario Splivalo <mario.splivalo@mobart.hr> wrote: > > > > > > > > pulitzer2=# select 'stop works' ~ '^\s*(?:[\ > > > > +|-]|(?:[sS][tT][oO][pP]\b)).*$'; > > > > ?column? > > > > ---------- > > > > f > > > > (1 row) > > > > > > > > Here, postgres should return true, but it gives me false. > > > > > > > > > > > > \b is a back-space - is that what you are wanting there? If I remove > > > > it I get true. > > > > > > Actually, I'm not sure :) As I've mentioned, python/java/perl do as I > > > expected, postgres on the other hand doesn't. If \b was the backspace, > > > then I'd have trouble with '+mario test', and that one seems to be OK. > > > > No, because the \b is inside the "stop" arm of the |. You need to do > > *both*, double backslashes and get rid of \b (or at least understand > > what you actually mean with it ...) > > > > I know this might not be the right place for this question, but, how > come (or better: why?) is above regexp macthed ok (ok as in 'the way I > expected') when employed from java/perl/python? In Perl at least, \b is a word boundary. In PostgreSQL (and probably Tcl as well) it's a backslash AFAICT. -- Alvaro Herrera http://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Mario Splivalo <mario.splivalo@mobart.hr> writes: > So, I guess it's obvious that postgres doesn't treat regular expressions > the same way as java/perl/pyton/php/awk/sed do... When you get into stuff as arcane as word-boundary constraints, you'll find that regexes are not NEARLY as well standardized as that flippant complaint suggests. For penance, actually try it in all six of those languages and report back. Postgres' regexp code is the same as Tcl's (it's Henry Spencer's package) and if you check TFM you will find out that \y, or possibly \M, is what you want. regards, tom lane
On 9/5/06, Alvaro Herrera <alvherre@commandprompt.com> wrote:
More specifically, Perl, Java and Python interpret \b as a backspace in the Character class only (got that from the Regular Expression pocket reference) but otherwise are word boundries. I am not entirely sure what the Character class context is referring to but it does show that \b is interpreted differently under different contexts. PostgreSQL may not have a way (or at least a very good way) to make this distinction.
Any other thoughts?
==================================================================
Aaron Bono
Aranya Software Technologies, Inc.
http://www.aranya.com
http://codeelixir.com
==================================================================
In Perl at least, \b is a word boundary. In PostgreSQL (and probably
Tcl as well) it's a backslash AFAICT.
More specifically, Perl, Java and Python interpret \b as a backspace in the Character class only (got that from the Regular Expression pocket reference) but otherwise are word boundries. I am not entirely sure what the Character class context is referring to but it does show that \b is interpreted differently under different contexts. PostgreSQL may not have a way (or at least a very good way) to make this distinction.
Any other thoughts?
==================================================================
Aaron Bono
Aranya Software Technologies, Inc.
http://www.aranya.com
http://codeelixir.com
==================================================================
On Tue, 2006-09-05 at 11:22 -0400, Tom Lane wrote: > Mario Splivalo <mario.splivalo@mobart.hr> writes: > > So, I guess it's obvious that postgres doesn't treat regular expressions > > the same way as java/perl/pyton/php/awk/sed do... > > When you get into stuff as arcane as word-boundary constraints, you'll > find that regexes are not NEARLY as well standardized as that flippant > complaint suggests. For penance, actually try it in all six of those > languages and report back. So I did, and I was mistaken the first time. Java, perl, python and php treat regexpes differently than awk/sed and differently than tcl. I can post source snippets and results here if anyone is interested in those. > Postgres' regexp code is the same as Tcl's (it's Henry Spencer's package) > and if you check TFM you will find out that \y, or possibly \M, is what > you want. Thnx, I missed that part that 'postgres has same regexp code as Tcl'. Mike -- Mario Splivalo Mob-Art mario.splivalo@mobart.hr "I can do it quick, I can do it cheap, I can do it well. Pick any two."