Thread: Broken select on regular expression !!!
Hi all, I think I have found a bug in regexp based selections. Watch this : create table regdemo (fld1 varchar(32)); CREATE insert into regdemo values('410'); INSERT 726409 1 insert into regdemo values('7410'); INSERT 726410 1 insert into regdemo values('source'); INSERT 726411 1 insert into regdemo values('destination'); INSERT 726412 1 select * from regdemo where fld1 ~* '^sou|^des'; fld1 ----------- source destination (2 rows) select * from regdemo where fld1 ~* '41|^des'; fld1 ----------- 410 7410 destination (3 rows) select * from regdemo where fld1 ~* '^41|^des'; fld1 ---- (0 rows) ^^^^^^^^^^^^^^ !?!?!?! I thought it should return '410' and 'destination' rows. But it returns nothing! The first select example with ^ in both variants ( ^sou|^des ) works !!! The last one ( ^41|^des ) don't ! Am I missing something? I am getting the same result also on 6.4.2 and 6.5 beta 1 versions! Best regards, -- Constantin Teodorescu FLEX Consulting Braila, ROMANIA
Constantin Teodorescu <teo@flex.ro> writes: > select * from regdemo where fld1 ~* '^41|^des'; > fld1 > ---- > (0 rows) > ^^^^^^^^^^^^^^ > !?!?!?! I see it too. Even more interesting is that these variants are OK: regression=> select * from regdemo where fld1 ~* '^des|^41'; fld1 ----------- 410 destination (2 rows) regression=> select * from regdemo where fld1 ~* '(^41)|(^des)'; fld1 ----------- 410 destination (2 rows) And if you want *really* disturbing: regression=> select * from regdemo where fld1 ~* '^sou|^des'; fld1 ----------- source destination (2 rows) regression=> select * from regdemo where fld1 ~ '^sou|^des'; fld1 ---- (0 rows) Something is rotten in the state of Denmark... regards, tom lane
>> select * from regdemo where fld1 ~* '^41|^des'; >> fld1 >> ---- >> (0 rows) > >> ^^^^^^^^^^^^^^ >> !?!?!?! > >I see it too. Even more interesting is that these variants are OK: > >regression=> select * from regdemo where fld1 ~* '^des|^41'; >fld1 >----------- >410 >destination >(2 rows) > >regression=> select * from regdemo where fld1 ~* '(^41)|(^des)'; >fld1 >----------- >410 >destination >(2 rows) > >And if you want *really* disturbing: > >regression=> select * from regdemo where fld1 ~* '^sou|^des'; >fld1 >----------- >source >destination >(2 rows) > >regression=> select * from regdemo where fld1 ~ '^sou|^des'; >fld1 >---- >(0 rows) > >Something is rotten in the state of Denmark... These all oddness are caused by the parser (makeIndexable). When makeIndexable sees ~* '^41|^des' , it tries to rewrite the target regexp so that an index can be used. The rewritten query might be something like: fld1 ~* '^41|^des' and fld1 >= '41|^' and fld1 <= '41|^\377' Apparently this is wrong. This is because makeIndexable does not understand '|' and '^' appearing in the middle of the regexp. On the other hand, >regression=> select * from regdemo where fld1 ~* '^des|^41'; >regression=> select * from regdemo where fld1 ~* '^sou|^des'; will work since makeIndexable gave up the optimization if the op is "~*" and a letter appearing right after '^' is *alphabet*. Note that: >regression=> select * from regdemo where fld1 ~ '^sou|^des'; will not work because the op is *not* "~*". It seems that the only solution is checking '|' to see if it appears in the target regexp and giving up the optimization in that case. One might think that ~* '^41|^des' can be rewritten like: fld1 ~* '^41' or fld1 ~* '^des' For me this seems not to be a good idea. To accomplish this, we have to deeply parse the regexp (consider that we might have arbitrary complex regexps) and such kind thing is a job regexp() shoud do. Comments? --- Tatsuo Ishii
Tatsuo Ishii <t-ishii@sra.co.jp> writes: >> Something is rotten in the state of Denmark... > These all oddness are caused by the parser (makeIndexable). Ah-hah, I'm sure you're right. That makes *two* serious bugs in makeIndexable. (We still don't have an adequate solution for its known problem in non-C locales...) > It seems that the only solution is checking '|' to see if it appears > in the target regexp and giving up the optimization in that case. I'm feeling a strong urge to just rip out makeIndexable until it can be redesigned... how many other problems has it got that we haven't stumbled upon? regards, tom lane
> These all oddness are caused by the parser (makeIndexable). When > makeIndexable sees ~* '^41|^des' , it tries to rewrite the target > regexp so that an index can be used. The rewritten query might be > something like: > > fld1 ~* '^41|^des' and fld1 >= '41|^' and fld1 <= '41|^\377' > > Apparently this is wrong. This is because makeIndexable does not > understand '|' and '^' appearing in the middle of the regexp. On the > other hand, > > >regression=> select * from regdemo where fld1 ~* '^des|^41'; > >regression=> select * from regdemo where fld1 ~* '^sou|^des'; > > will work since makeIndexable gave up the optimization if the op is > "~*" and a letter appearing right after '^' is *alphabet*. > > Note that: > > >regression=> select * from regdemo where fld1 ~ '^sou|^des'; > > will not work because the op is *not* "~*". > > It seems that the only solution is checking '|' to see if it appears > in the target regexp and giving up the optimization in that case. > > One might think that ~* '^41|^des' can be rewritten like: > > fld1 ~* '^41' or fld1 ~* '^des' > > For me this seems not to be a good idea. To accomplish this, we have > to deeply parse the regexp (consider that we might have arbitrary > complex regexps) and such kind thing is a job regexp() shoud > do. Again very clear, and caused by the indexing of regex's, as you suggest. I can easily look for '|' in the string, and skip the optimization. Is that the only special case I need to add? -- Bruce Momjian | http://www.op.net/~candle maillist@candle.pha.pa.us | (610) 853-3000+ If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania19026
> Tatsuo Ishii <t-ishii@sra.co.jp> writes: > >> Something is rotten in the state of Denmark... > > > These all oddness are caused by the parser (makeIndexable). > > Ah-hah, I'm sure you're right. That makes *two* serious bugs in > makeIndexable. (We still don't have an adequate solution > for its known problem in non-C locales...) > > > It seems that the only solution is checking '|' to see if it appears > > in the target regexp and giving up the optimization in that case. > > I'm feeling a strong urge to just rip out makeIndexable until > it can be redesigned... how many other problems has it got > that we haven't stumbled upon? But if we rip it out, people will complain we are not using the index for regex and LIKE. That is a pretty serious complaint. -- Bruce Momjian | http://www.op.net/~candle maillist@candle.pha.pa.us | (610) 853-3000+ If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania19026
> These all oddness are caused by the parser (makeIndexable). When > makeIndexable sees ~* '^41|^des' , it tries to rewrite the target > regexp so that an index can be used. The rewritten query might be > something like: > > fld1 ~* '^41|^des' and fld1 >= '41|^' and fld1 <= '41|^\377' I have just applied a fix to gram.y that should fix this case. Please let me know how it works at your site. Thanks. -- Bruce Momjian | http://www.op.net/~candle maillist@candle.pha.pa.us | (610) 853-3000+ If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania19026
>Again very clear, and caused by the indexing of regex's, as you suggest. >I can easily look for '|' in the string, and skip the optimization. Is >that the only special case I need to add? What about '{' ? --- Tatsuo Ishii
>I have just applied a fix to gram.y that should fix this case. Please >let me know how it works at your site. Thanks. It works great! (This is FreeBSD 2.2.6-RELEASE) -- Tatsuo Ishii
At 12:08 AM 5/21/99 -0400, Bruce Momjian wrote: >But if we rip it out, people will complain we are not using the index >for regex and LIKE. That is a pretty serious complaint. For this release, given that it's coming fairly soon (11 days???) you might consider documenting the shortcomings, rather than ripping it out. Indexing is important, serious users will expect it, at least for LIKE (they might not know about regex if they're just SQL grunts). - Don Baccus, Portland OR <dhogaza@pacifier.com> Nature photos, on-line guides, and other goodies at http://donb.photo.net
> >Again very clear, and caused by the indexing of regex's, as you suggest. > >I can easily look for '|' in the string, and skip the optimization. Is > >that the only special case I need to add? > > What about '{' ? Does it understand {? Man, what kind of regex library do we have? -- Bruce Momjian | http://www.op.net/~candle maillist@candle.pha.pa.us | (610) 853-3000+ If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania19026
> At 12:08 AM 5/21/99 -0400, Bruce Momjian wrote: > > >But if we rip it out, people will complain we are not using the index > >for regex and LIKE. That is a pretty serious complaint. > > For this release, given that it's coming fairly soon (11 days???) > you might consider documenting the shortcomings, rather than > ripping it out. Indexing is important, serious users will expect > it, at least for LIKE (they might not know about regex if they're > just SQL grunts). I didn't even know our ~ operator supported '|'! Now I am told it know about '{' too. I am checking on that one. -- Bruce Momjian | http://www.op.net/~candle maillist@candle.pha.pa.us | (610) 853-3000+ If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania19026
>> >Again very clear, and caused by the indexing of regex's, as you suggest. >> >I can easily look for '|' in the string, and skip the optimization. Is >> >that the only special case I need to add? >> >> What about '{' ? > >Does it understand {? Man, what kind of regex library do we have? I vaguely recall that we used to support only "basic" regex. At least I thought so. Now looking into the source, I found we have supported "extended" regex. FYI, our regex routines definitely supprt '{'. See backend/regex/re_format.7. P.S. I will commit a small regex test program under backend/regex for the testing purpose. -- Tatsuo Ishii
> I vaguely recall that we used to support only "basic" regex. At least > I thought so. Now looking into the source, I found we have supported > "extended" regex. FYI, our regex routines definitely supprt '{'. See > backend/regex/re_format.7. Didn't even know that man page existed. It sure would be nice to have it in the main docs, perhaps in the chapter on operators... - Thomas -- Thomas Lockhart lockhart@alumni.caltech.edu South Pasadena, California
> >> >Again very clear, and caused by the indexing of regex's, as you suggest. > >> >I can easily look for '|' in the string, and skip the optimization. Is > >> >that the only special case I need to add? > >> > >> What about '{' ? > > > >Does it understand {? Man, what kind of regex library do we have? > > I vaguely recall that we used to support only "basic" regex. At least > I thought so. Now looking into the source, I found we have supported > "extended" regex. FYI, our regex routines definitely supprt '{'. See > backend/regex/re_format.7. > > P.S. I will commit a small regex test program under > backend/regex for the testing purpose. I have just commited a fix to skip {} too. -- Bruce Momjian | http://www.op.net/~candle maillist@candle.pha.pa.us | (610) 853-3000+ If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania19026
Fixed in 6.5 final. > Hi all, > > I think I have found a bug in regexp based selections. > Watch this : > > create table regdemo (fld1 varchar(32)); > CREATE > insert into regdemo values('410'); > INSERT 726409 1 > insert into regdemo values('7410'); > INSERT 726410 1 > insert into regdemo values('source'); > INSERT 726411 1 > insert into regdemo values('destination'); > INSERT 726412 1 > select * from regdemo where fld1 ~* '^sou|^des'; > fld1 > ----------- > source > destination > (2 rows) > > select * from regdemo where fld1 ~* '41|^des'; > fld1 > ----------- > 410 > 7410 > destination > (3 rows) > > select * from regdemo where fld1 ~* '^41|^des'; > fld1 > ---- > (0 rows) > > ^^^^^^^^^^^^^^ > !?!?!?! > I thought it should return '410' and 'destination' rows. But it returns > nothing! > The first select example with ^ in both variants ( ^sou|^des ) works !!! > The last one ( ^41|^des ) don't ! > > Am I missing something? > I am getting the same result also on 6.4.2 and 6.5 beta 1 versions! > > Best regards, > -- > Constantin Teodorescu > FLEX Consulting Braila, ROMANIA > > -- Bruce Momjian | http://www.op.net/~candle maillist@candle.pha.pa.us | (610) 853-3000+ If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania19026
> >Again very clear, and caused by the indexing of regex's, as you suggest. > >I can easily look for '|' in the string, and skip the optimization. Is > >that the only special case I need to add? > > What about '{' ? I dealt with this before 6.5. -- Bruce Momjian | http://www.op.net/~candle maillist@candle.pha.pa.us | (610) 853-3000+ If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania19026