Thread: Improved regular expression error message for backrefs

Improved regular expression error message for backrefs

From
Mark Dilger
Date:
Hackers,

Please find attached an improvement to the error messages given for invalid backreference usage:

 select 'xyz' ~ '(.)(.)\3';
 ERROR:  invalid regular expression: invalid backreference number
 select 'xyz' ~ '(.)(.)(?=\2)';
-ERROR:  invalid regular expression: invalid backreference number
+ERROR:  invalid regular expression: backreference in lookaround assertion

The first regexp is invalid because only two capture groups exist, so \3 doesn't refer to anything.  The second regexp
isrejected because the regular expression system does not support backreferences within lookaround assertions.  (See
thedocs, section 9.7.3.6. Limits And Compatibility.)  It is flat wrong to say the backreference number is invalid.
Thereis a perfectly valid capture that \2 refers to.  

The patch defines a new error code REG_ENOBREF in regex/regex.h right next to REG_ESUBREG from which it is split out,
ratherthan at the end of the list.  Is there a project preference to add it at the end?  Certainly, that would give a
shortergit diff. 

Are there dependencies on the current error messages which prevent such changes?


—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Attachment

Re: Improved regular expression error message for backrefs

From
Tom Lane
Date:
Mark Dilger <mark.dilger@enterprisedb.com> writes:
> The patch defines a new error code REG_ENOBREF in regex/regex.h right next to REG_ESUBREG from which it is split out,
ratherthan at the end of the list.  Is there a project preference to add it at the end?  Certainly, that would give a
shortergit diff. 
> Are there dependencies on the current error messages which prevent such changes?

Yeah: the POSIX standard says what the error codes from regcomp() are.

POSIX defines

    REG_ESUBREG
        Number in \digit invalid or in error.

which does seem to cover this case, so what I'd argue is that we should
improve the "invalid backreference number" text rather than invent
a nonstandard error code.  Maybe about like "backreference number does
not exist or cannot be referenced from here"?

(Admittedly, there's not a huge reason why src/backend/regex/ needs to
stay compliant with the POSIX API today.  But I still have ambitions to
split that out as a free-standing library someday, as Henry Spencer had
originally planned to do.  So I'd rather stick to the API spec.)

It might be worth checking what text is attributed to this error code
by PCRE and other implementations of the POSIX spec.

            regards, tom lane



Re: Improved regular expression error message for backrefs

From
Mark Dilger
Date:

> On Aug 22, 2021, at 7:47 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Yeah: the POSIX standard says what the error codes from regcomp() are.

I'm not sure how to interpret them.  The language "The implementation may define additional macros or constants using
namesbeginning with REG_" at the bottom of the docs might imply that one can add to the list. 

> POSIX defines
>
>     REG_ESUBREG
>         Number in \digit invalid or in error.
>
> which does seem to cover this case,

Hmm.  The number is neither invalid nor in error.  The only thing arguing in favor of using this code is that the error
messagecontains the word "backreference": 

    "REG_ESUBREG", "invalid backreference number"

which gives the reader a clue that the problem has something to do with a backreference in the pattern. But the POSIX
wording"Number in \digit invalid or in error." doesn't even have that advantage.  We seem to be using the wrong return
code. I would think a more generic  

    REG_BADPAT
        Invalid regular expression.

would be the correct code, though arguably far less informative.

> so what I'd argue is that we should
> improve the "invalid backreference number" text rather than invent
> a nonstandard error code.  Maybe about like "backreference number does
> not exist or cannot be referenced from here"?

Assuming we leave the error codes alone, how about, "backreference number invalid or cannot be referenced from here"?

> (Admittedly, there's not a huge reason why src/backend/regex/ needs to
> stay compliant with the POSIX API today.  But I still have ambitions to
> split that out as a free-standing library someday, as Henry Spencer had
> originally planned to do.  So I'd rather stick to the API spec.)

That's fine.  Something else might kill that ambition, but this quibble over error messages isn't nearly important
enoughto do so. 

> It might be worth checking what text is attributed to this error code
> by PCRE and other implementations of the POSIX spec.

Reading the docs at pcre.org, it appears that capture groups are allowed in look-around assertions.  Our engine doesn't
dothat, instead treating all groups within assertions as non-capturing.  I don't see anything about whether
backreferencesare allowed within pcre assertions, but I know that perl itself does allow them.  So maybe the error text
usedby other implementations is irrelevant? 

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company