Thread: Broken select on regular expression !!!

Broken select on regular expression !!!

From
Constantin Teodorescu
Date:
Hi all,

I think I have found a bug in regexp based selections.
Watch this :

create table regdemo (fld1 varchar(32));
CREATE
insert into regdemo values('410');
INSERT 726409 1
insert into regdemo values('7410');
INSERT 726410 1
insert into regdemo values('source');
INSERT 726411 1
insert into regdemo values('destination');
INSERT 726412 1
select * from regdemo where fld1 ~* '^sou|^des';
fld1
-----------
source
destination
(2 rows)

select * from regdemo where fld1 ~* '41|^des';
fld1
-----------
410
7410
destination
(3 rows)

select * from regdemo where fld1 ~* '^41|^des';
fld1
----
(0 rows)

^^^^^^^^^^^^^^
!?!?!?!
I thought it should return '410' and 'destination' rows. But it returns
nothing!
The first select example with ^ in both variants ( ^sou|^des ) works !!!
The last one ( ^41|^des ) don't !

Am I missing something?
I am getting the same result also on 6.4.2 and 6.5 beta 1 versions!

Best regards,
-- 
Constantin Teodorescu
FLEX Consulting Braila, ROMANIA


Re: [HACKERS] Broken select on regular expression !!!

From
Tom Lane
Date:
Constantin Teodorescu <teo@flex.ro> writes:
> select * from regdemo where fld1 ~* '^41|^des';
> fld1
> ----
> (0 rows)

> ^^^^^^^^^^^^^^
> !?!?!?!

I see it too.  Even more interesting is that these variants are OK:

regression=> select * from regdemo where fld1 ~* '^des|^41';
fld1
-----------
410
destination
(2 rows)

regression=> select * from regdemo where fld1 ~* '(^41)|(^des)';
fld1
-----------
410
destination
(2 rows)

And if you want *really* disturbing:

regression=> select * from regdemo where fld1 ~* '^sou|^des';
fld1
-----------
source
destination
(2 rows)

regression=> select * from regdemo where fld1 ~ '^sou|^des';
fld1
----
(0 rows)

Something is rotten in the state of Denmark...
        regards, tom lane


Re: [HACKERS] Broken select on regular expression !!!

From
Tatsuo Ishii
Date:
>> select * from regdemo where fld1 ~* '^41|^des';
>> fld1
>> ----
>> (0 rows)
>
>> ^^^^^^^^^^^^^^
>> !?!?!?!
>
>I see it too.  Even more interesting is that these variants are OK:
>
>regression=> select * from regdemo where fld1 ~* '^des|^41';
>fld1
>-----------
>410
>destination
>(2 rows)
>
>regression=> select * from regdemo where fld1 ~* '(^41)|(^des)';
>fld1
>-----------
>410
>destination
>(2 rows)
>
>And if you want *really* disturbing:
>
>regression=> select * from regdemo where fld1 ~* '^sou|^des';
>fld1
>-----------
>source
>destination
>(2 rows)
>
>regression=> select * from regdemo where fld1 ~ '^sou|^des';
>fld1
>----
>(0 rows)
>
>Something is rotten in the state of Denmark...

These all oddness are caused by the parser (makeIndexable). When
makeIndexable sees ~* '^41|^des' , it tries to rewrite the target
regexp so that an index can be used. The rewritten query might be
something like:

fld1 ~* '^41|^des' and fld1 >= '41|^' and fld1 <= '41|^\377'

Apparently this is wrong. This is because makeIndexable does not
understand '|' and '^' appearing in the middle of the regexp. On the
other hand, 

>regression=> select * from regdemo where fld1 ~* '^des|^41';
>regression=> select * from regdemo where fld1 ~* '^sou|^des';

will work since makeIndexable gave up the optimization if the op is
"~*" and a letter appearing right after '^' is *alphabet*.

Note that:

>regression=> select * from regdemo where fld1 ~ '^sou|^des';

will not work because the op is *not* "~*".

It seems that the only solution is checking '|' to see if it appears
in the target regexp and giving up the optimization in that case.

One might think that ~* '^41|^des' can be rewritten like:

fld1 ~* '^41' or fld1 ~* '^des'

For me this seems not to be a good idea. To accomplish this, we have
to deeply parse the regexp (consider that we might have arbitrary
complex regexps) and such kind thing is a job regexp() shoud
do.

Comments?
---
Tatsuo Ishii




Re: [HACKERS] Broken select on regular expression !!!

From
Tom Lane
Date:
Tatsuo Ishii <t-ishii@sra.co.jp> writes:
>> Something is rotten in the state of Denmark...

> These all oddness are caused by the parser (makeIndexable).

Ah-hah, I'm sure you're right.  That makes *two* serious bugs in
makeIndexable.  (We still don't have an adequate solution
for its known problem in non-C locales...)

> It seems that the only solution is checking '|' to see if it appears
> in the target regexp and giving up the optimization in that case.

I'm feeling a strong urge to just rip out makeIndexable until
it can be redesigned... how many other problems has it got
that we haven't stumbled upon?
        regards, tom lane


Re: [HACKERS] Broken select on regular expression !!!

From
Bruce Momjian
Date:
> These all oddness are caused by the parser (makeIndexable). When
> makeIndexable sees ~* '^41|^des' , it tries to rewrite the target
> regexp so that an index can be used. The rewritten query might be
> something like:
> 
> fld1 ~* '^41|^des' and fld1 >= '41|^' and fld1 <= '41|^\377'
> 
> Apparently this is wrong. This is because makeIndexable does not
> understand '|' and '^' appearing in the middle of the regexp. On the
> other hand, 
> 
> >regression=> select * from regdemo where fld1 ~* '^des|^41';
> >regression=> select * from regdemo where fld1 ~* '^sou|^des';
> 
> will work since makeIndexable gave up the optimization if the op is
> "~*" and a letter appearing right after '^' is *alphabet*.
> 
> Note that:
> 
> >regression=> select * from regdemo where fld1 ~ '^sou|^des';
> 
> will not work because the op is *not* "~*".
> 
> It seems that the only solution is checking '|' to see if it appears
> in the target regexp and giving up the optimization in that case.
> 
> One might think that ~* '^41|^des' can be rewritten like:
> 
> fld1 ~* '^41' or fld1 ~* '^des'
> 
> For me this seems not to be a good idea. To accomplish this, we have
> to deeply parse the regexp (consider that we might have arbitrary
> complex regexps) and such kind thing is a job regexp() shoud
> do.

Again very clear, and caused by the indexing of regex's, as you suggest.
I can easily look for '|' in the string, and skip the optimization.  Is
that the only special case I need to add?


--  Bruce Momjian                        |  http://www.op.net/~candle maillist@candle.pha.pa.us            |  (610)
853-3000+  If your life is a hard drive,     |  830 Blythe Avenue +  Christ can be your backup.        |  Drexel Hill,
Pennsylvania19026
 


Re: [HACKERS] Broken select on regular expression !!!

From
Bruce Momjian
Date:
> Tatsuo Ishii <t-ishii@sra.co.jp> writes:
> >> Something is rotten in the state of Denmark...
> 
> > These all oddness are caused by the parser (makeIndexable).
> 
> Ah-hah, I'm sure you're right.  That makes *two* serious bugs in
> makeIndexable.  (We still don't have an adequate solution
> for its known problem in non-C locales...)
> 
> > It seems that the only solution is checking '|' to see if it appears
> > in the target regexp and giving up the optimization in that case.
> 
> I'm feeling a strong urge to just rip out makeIndexable until
> it can be redesigned... how many other problems has it got
> that we haven't stumbled upon?

But if we rip it out, people will complain we are not using the index
for regex and LIKE.  That is a pretty serious complaint.

--  Bruce Momjian                        |  http://www.op.net/~candle maillist@candle.pha.pa.us            |  (610)
853-3000+  If your life is a hard drive,     |  830 Blythe Avenue +  Christ can be your backup.        |  Drexel Hill,
Pennsylvania19026
 


Re: [HACKERS] Broken select on regular expression !!!

From
Bruce Momjian
Date:
> These all oddness are caused by the parser (makeIndexable). When
> makeIndexable sees ~* '^41|^des' , it tries to rewrite the target
> regexp so that an index can be used. The rewritten query might be
> something like:
> 
> fld1 ~* '^41|^des' and fld1 >= '41|^' and fld1 <= '41|^\377'

I have just applied a fix to gram.y that should fix this case.  Please
let me know how it works at your site.  Thanks.

--  Bruce Momjian                        |  http://www.op.net/~candle maillist@candle.pha.pa.us            |  (610)
853-3000+  If your life is a hard drive,     |  830 Blythe Avenue +  Christ can be your backup.        |  Drexel Hill,
Pennsylvania19026
 


Re: [HACKERS] Broken select on regular expression !!!

From
Tatsuo Ishii
Date:
>Again very clear, and caused by the indexing of regex's, as you suggest.
>I can easily look for '|' in the string, and skip the optimization.  Is
>that the only special case I need to add?

What about '{' ?
---
Tatsuo Ishii



Re: [HACKERS] Broken select on regular expression !!!

From
Tatsuo Ishii
Date:
>I have just applied a fix to gram.y that should fix this case.  Please
>let me know how it works at your site.  Thanks.

It works great!

(This is FreeBSD 2.2.6-RELEASE)
--
Tatsuo Ishii



Re: [HACKERS] Broken select on regular expression !!!

From
Don Baccus
Date:
At 12:08 AM 5/21/99 -0400, Bruce Momjian wrote:

>But if we rip it out, people will complain we are not using the index
>for regex and LIKE.  That is a pretty serious complaint.

For this release, given that it's coming fairly soon (11 days???)
you might consider documenting the shortcomings, rather than
ripping it out.  Indexing is important, serious users will expect
it, at least for LIKE (they might not know about regex if they're
just SQL grunts).



- Don Baccus, Portland OR <dhogaza@pacifier.com> Nature photos, on-line guides, and other goodies at
http://donb.photo.net


Re: [HACKERS] Broken select on regular expression !!!

From
Bruce Momjian
Date:
> >Again very clear, and caused by the indexing of regex's, as you suggest.
> >I can easily look for '|' in the string, and skip the optimization.  Is
> >that the only special case I need to add?
> 
> What about '{' ?

Does it understand {?  Man, what kind of regex library do we have?

--  Bruce Momjian                        |  http://www.op.net/~candle maillist@candle.pha.pa.us            |  (610)
853-3000+  If your life is a hard drive,     |  830 Blythe Avenue +  Christ can be your backup.        |  Drexel Hill,
Pennsylvania19026
 


Re: [HACKERS] Broken select on regular expression !!!

From
Bruce Momjian
Date:
> At 12:08 AM 5/21/99 -0400, Bruce Momjian wrote:
> 
> >But if we rip it out, people will complain we are not using the index
> >for regex and LIKE.  That is a pretty serious complaint.
> 
> For this release, given that it's coming fairly soon (11 days???)
> you might consider documenting the shortcomings, rather than
> ripping it out.  Indexing is important, serious users will expect
> it, at least for LIKE (they might not know about regex if they're
> just SQL grunts).

I didn't even know our ~ operator supported '|'!  Now I am told it know
about '{' too.  I am checking on that one.


--  Bruce Momjian                        |  http://www.op.net/~candle maillist@candle.pha.pa.us            |  (610)
853-3000+  If your life is a hard drive,     |  830 Blythe Avenue +  Christ can be your backup.        |  Drexel Hill,
Pennsylvania19026
 


Re: [HACKERS] Broken select on regular expression !!!

From
Tatsuo Ishii
Date:
>> >Again very clear, and caused by the indexing of regex's, as you suggest.
>> >I can easily look for '|' in the string, and skip the optimization.  Is
>> >that the only special case I need to add?
>> 
>> What about '{' ?
>
>Does it understand {?  Man, what kind of regex library do we have?

I vaguely recall that we used to support only "basic" regex. At least
I thought so. Now looking into the source, I found we have supported
"extended" regex. FYI, our regex routines definitely supprt '{'. See
backend/regex/re_format.7.

P.S.    I will commit a small regex test program under
backend/regex for the testing purpose.  
--
Tatsuo Ishii



Re: [HACKERS] Broken select on regular expression !!!

From
Thomas Lockhart
Date:
> I vaguely recall that we used to support only "basic" regex. At least
> I thought so. Now looking into the source, I found we have supported
> "extended" regex. FYI, our regex routines definitely supprt '{'. See
> backend/regex/re_format.7.

Didn't even know that man page existed. It sure would be nice to have
it in the main docs, perhaps in the chapter on operators...
                    - Thomas

-- 
Thomas Lockhart                lockhart@alumni.caltech.edu
South Pasadena, California


Re: [HACKERS] Broken select on regular expression !!!

From
Bruce Momjian
Date:
> >> >Again very clear, and caused by the indexing of regex's, as you suggest.
> >> >I can easily look for '|' in the string, and skip the optimization.  Is
> >> >that the only special case I need to add?
> >> 
> >> What about '{' ?
> >
> >Does it understand {?  Man, what kind of regex library do we have?
> 
> I vaguely recall that we used to support only "basic" regex. At least
> I thought so. Now looking into the source, I found we have supported
> "extended" regex. FYI, our regex routines definitely supprt '{'. See
> backend/regex/re_format.7.
> 
> P.S.    I will commit a small regex test program under
> backend/regex for the testing purpose.  

I have just commited a fix to skip {} too.

--  Bruce Momjian                        |  http://www.op.net/~candle maillist@candle.pha.pa.us            |  (610)
853-3000+  If your life is a hard drive,     |  830 Blythe Avenue +  Christ can be your backup.        |  Drexel Hill,
Pennsylvania19026
 


Re: [HACKERS] Broken select on regular expression !!!

From
Bruce Momjian
Date:
Fixed in 6.5 final.


> Hi all,
> 
> I think I have found a bug in regexp based selections.
> Watch this :
> 
> create table regdemo (fld1 varchar(32));
> CREATE
> insert into regdemo values('410');
> INSERT 726409 1
> insert into regdemo values('7410');
> INSERT 726410 1
> insert into regdemo values('source');
> INSERT 726411 1
> insert into regdemo values('destination');
> INSERT 726412 1
> select * from regdemo where fld1 ~* '^sou|^des';
> fld1
> -----------
> source
> destination
> (2 rows)
> 
> select * from regdemo where fld1 ~* '41|^des';
> fld1
> -----------
> 410
> 7410
> destination
> (3 rows)
> 
> select * from regdemo where fld1 ~* '^41|^des';
> fld1
> ----
> (0 rows)
> 
> ^^^^^^^^^^^^^^
> !?!?!?!
> I thought it should return '410' and 'destination' rows. But it returns
> nothing!
> The first select example with ^ in both variants ( ^sou|^des ) works !!!
> The last one ( ^41|^des ) don't !
> 
> Am I missing something?
> I am getting the same result also on 6.4.2 and 6.5 beta 1 versions!
> 
> Best regards,
> -- 
> Constantin Teodorescu
> FLEX Consulting Braila, ROMANIA
> 
> 


--  Bruce Momjian                        |  http://www.op.net/~candle maillist@candle.pha.pa.us            |  (610)
853-3000+  If your life is a hard drive,     |  830 Blythe Avenue +  Christ can be your backup.        |  Drexel Hill,
Pennsylvania19026
 


Re: [HACKERS] Broken select on regular expression !!!

From
Bruce Momjian
Date:
> >Again very clear, and caused by the indexing of regex's, as you suggest.
> >I can easily look for '|' in the string, and skip the optimization.  Is
> >that the only special case I need to add?
> 
> What about '{' ?

I dealt with this before 6.5.

--  Bruce Momjian                        |  http://www.op.net/~candle maillist@candle.pha.pa.us            |  (610)
853-3000+  If your life is a hard drive,     |  830 Blythe Avenue +  Christ can be your backup.        |  Drexel Hill,
Pennsylvania19026