Thread: Minor regexp bug

Minor regexp bug

From

Tom Lane

Date:

07 November 2015, 02:33:02

Happened across this while investigating something else ...

The regexp documentation says:

    Lookahead and lookbehind constraints cannot contain <firstterm>back
    references</> (see <xref linkend="posix-escape-sequences">),
    and all parentheses within them are considered non-capturing.

This is true if you try a simple case, eg

regression=# select 'xyz' ~ 'x(\w)(?=\1)';
ERROR:  invalid regular expression: invalid backreference number

(That's not the greatest choice of error code, perhaps, but the point
is it rejects the backref.)  However, stick an extra set of parentheses
in there, and the regexp parser forgets all about the restriction:

regression=# select 'xyz' ~ 'x(\w)(?=(\1))';
 ?column?
----------
 t
(1 row)

Since the execution machinery has no hope of executing such a backref
properly, you silently get a wrong answer.  (The actual behavior in
a case like this is as though the backref had been replaced by a copy
of the bit of regexp pattern it'd referred to, ie this pattern works
like 'x(\w)(?=\w)', without any enforcement that the two \w's are
matching identical text.)

The fix is a one-liner, as per the attached patch: when recursing
to parse the innards of a parenthesized subexpression, we have to
pass down the current "type" not necessarily PLAIN, so that any
type-related restrictions still apply inside the subexpression.

What I'm wondering about is whether to back-patch this.  It's possible
that people have written patterns like this and not realized that they
aren't doing quite what's expected.  Getting a failure instead might not
be desirable in a minor release.  On the other hand, wrong answers are
wrong answers.

Thoughts?

            regards, tom lane

diff --git a/src/backend/regex/regcomp.c b/src/backend/regex/regcomp.c
index aa759c2..a165b3b 100644
*** a/src/backend/regex/regcomp.c
--- b/src/backend/regex/regcomp.c
*************** parseqatom(struct vars * v,
*** 951,957 ****
              EMPTYARC(lp, s);
              EMPTYARC(s2, rp);
              NOERR();
!             atom = parse(v, ')', PLAIN, s, s2);
              assert(SEE(')') || ISERR());
              NEXT();
              NOERR();
--- 951,957 ----
              EMPTYARC(lp, s);
              EMPTYARC(s2, rp);
              NOERR();
!             atom = parse(v, ')', type, s, s2);
              assert(SEE(')') || ISERR());
              NEXT();
              NOERR();
diff --git a/src/test/regress/expected/regex.out b/src/test/regress/expected/regex.out
index f0e2fc9..07fb023 100644
*** a/src/test/regress/expected/regex.out
--- b/src/test/regress/expected/regex.out
*************** select 'a' ~ '()+\1';
*** 490,492 ****
--- 490,497 ----
   t
  (1 row)

+ -- Error conditions
+ select 'xyz' ~ 'x(\w)(?=\1)';  -- no backrefs in LACONs
+ ERROR:  invalid regular expression: invalid backreference number
+ select 'xyz' ~ 'x(\w)(?=(\1))';
+ ERROR:  invalid regular expression: invalid backreference number
diff --git a/src/test/regress/sql/regex.sql b/src/test/regress/sql/regex.sql
index d3030af..c45bdc9 100644
*** a/src/test/regress/sql/regex.sql
--- b/src/test/regress/sql/regex.sql
*************** select 'a' ~ '$()|^\1';
*** 117,119 ****
--- 117,123 ----
  select 'a' ~ '.. ()|\1';
  select 'a' ~ '()*\1';
  select 'a' ~ '()+\1';
+
+ -- Error conditions
+ select 'xyz' ~ 'x(\w)(?=\1)';  -- no backrefs in LACONs
+ select 'xyz' ~ 'x(\w)(?=(\1))';

Re: Minor regexp bug

From

"David G. Johnston"

Date:

07 November 2015, 03:57:19

On Fri, Nov 6, 2015 at 7:32 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

What I'm wondering about is whether to back-patch this. It's possible
that people have written patterns like this and not realized that they
aren't doing quite what's expected. Getting a failure instead might not
be desirable in a minor release. On the other hand, wrong answers are
wrong answers.

I'd vote to back-patch this. The unscientific reason on my end is that I suspect very few patterns in the wild would be affected and furthermore any using such patterns is likely to be in a position to change it match the existing behavior by replace the "(\1)" with the corresponding "(\w)" as you used in you example. We should probably suggest just that in the release notes. It is not a strongly held position and my first reaction was that introducing an error should be avoided. But regular expressions are tricky enough to get right when the engine does what you tell it...

David J.

Re: Minor regexp bug

From

Greg Stark

Date:

07 November 2015, 15:13:36

On Sat, Nov 7, 2015 at 2:32 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> What I'm wondering about is whether to back-patch this.  It's possible
> that people have written patterns like this and not realized that they
> aren't doing quite what's expected.  Getting a failure instead might not
> be desirable in a minor release.  On the other hand, wrong answers are
> wrong answers.

I would say wrong answers are wrong answers. It's hard to believe
there are many people doing this but if they are they're certainly
expecting the look-ahead to actually test that it's looking at the
same thing as the capturing parens. It might even be something
security-critical like parsing an connection string or something like
that. I can't see it's doing people any favours to let their code
continue doing something unexpected to avoid new errors.

-- 
greg

Re: Minor regexp bug

From

Joe Conway

Date:

07 November 2015, 17:26:03

On 11/07/2015 07:12 AM, Greg Stark wrote:
> On Sat, Nov 7, 2015 at 2:32 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> What I'm wondering about is whether to back-patch this.  It's possible
>> that people have written patterns like this and not realized that they
>> aren't doing quite what's expected.  Getting a failure instead might not
>> be desirable in a minor release.  On the other hand, wrong answers are
>> wrong answers.
>
>
> I would say wrong answers are wrong answers. It's hard to believe
> there are many people doing this but if they are they're certainly
> expecting the look-ahead to actually test that it's looking at the
> same thing as the capturing parens. It might even be something
> security-critical like parsing an connection string or something like
> that. I can't see it's doing people any favours to let their code
> continue doing something unexpected to avoid new errors.

+1


--
Crunchy Data - http://crunchydata.com
PostgreSQL Support for Secure Enterprises
Consulting, Training, & Open Source Development

Re: Minor regexp bug

From

Tom Lane

Date:

07 November 2015, 17:44:37

Joe Conway <mail@joeconway.com> writes:
> On 11/07/2015 07:12 AM, Greg Stark wrote:
>> On Sat, Nov 7, 2015 at 2:32 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> What I'm wondering about is whether to back-patch this.

>> I would say wrong answers are wrong answers.

> +1

Hearing no objections, done.
        regards, tom lane