Notes about behaviour of SIMILAR TO operator - Mailing list pgsql-bugs

From Adam Buraczewski
Subject Notes about behaviour of SIMILAR TO operator
Date
Msg-id 20031120210823.GA1456@localhost.localdomain
Whole thread Raw
Responses Re: Notes about behaviour of SIMILAR TO operator  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-bugs
Hallo pgsql-bugs,

I was recently playing with SIMILAR TO operator (in both PostgreSQL
7.3.x and 7.4) and discovered that sometimes it behaves not the way I
expected.  Since this operator is quite new for me (I am familiar
mainly with perl, grep, sed and emacs regexps and their syntax is
slightly different from this one), I searched for documentation.  All
I found, except from what is written in PostgreSQL 7.4 manual, is an
ISO/ANSI SQL3 Working Draft from August 1994, which contains section
titled "<similar predicate>".  It is not much, of course, but it made
me think that implementation of SIMILAR TO operator in PostgreSQL is
not ideal and should be somehow improved :-/

First of all, I found that SIMILAR TO regular expressions are
converted to POSIX regular expressions with similar_escape() function
(located in file src/backend/utils/adt/regexp.c) and then "~" operator
is used to check if given string matches the pattern.  This is a very
simple function: all it does is replacing '_' with '.', '%' with '.*',
escaping some characters with a backslash and appending '^' and '$' to
both ends of the pattern.  This causes many inconveniences, for
example the pattern 'a|z' (which should match single 'a' or 'z'
characters only, according to SQL spec) is converted into POSIX
regular expression in the form of '^a|z$' which matches all strings
beginning with 'a' ('abcdef' for example) and all strings ending with
'z' ('xyz' for example).  So the meaning of the pattern is changed,
which is not good.

I also found that there are problems with character lists enclosed
inside square brackets, for example:

    ('_' similar to '[_]') is false
    ('.' similar to '[_]') is true
    ('%' similar to '[%]') is false
    ('*' similar to '[%]') is true
    ('.' similar to '[%]') is true

The behaviour above is also caused by similar_escape(), which converts
'[_]' to '^[.]$' and '[%]' to '^[.*]$', not noticing the simple fact
that these characters are inside brackets.  Of course, one could say
that patterns like '[_]' are not valid and special characters like
'_', '%' etc. should be escaped even inside square brackets, but after
reading SQL specification I am not sure how it should work actually.
All I know is that the old behaviour should be changed: either special
characters should be treated literally, so '[_]' should match an
underscore sign only, or an error message should be generated that
such pattern is not valid.  Of course escaping those special
characters works perfectly:

    ('_' similar to '[x_]' escape 'x') is true
    ('%' similar to '[x%]' escape 'x') is true

Talking about square brackets, it should be noticed that there is a
slight difference between SIMILAR TO and POSIX way of describing named
character classes.  In POSIX regexps one could write '[[:alpha:]]' to
describe a simple Latin letter, but SQL spec clearly says it should be
written as '[:alpha:]' or even '[:ALPHA:]' (notice case-insensitivity
and single brackets surrounding class names).  Since similar_escape()
does not know about this difference either, it passes everything to
POSIX regexp engine giving false matches:

    ('z' similar to '[:alpha:]') is false
    (':' similar to '[:alpha:]') is true
    ('a' similar to '[:ALPHA:]') is false
    ('z' similar to '[[:alpha:]]') is true
    ('z' similar to '[[:ALPHA:]]';) -> ERROR (invalid character class)

Another problem is an ESCAPE clause, which works generally well, but
sometimes not the way I would expect.  As far as I know there is not
such thing as default escape character in SIMILAR TO regular
expressions and when one wants to change the meaning of some special
characters he has always to write ESCAPE '<char>' and use this <char>
inside the pattern.  However, similar_escape() uses a backslash as a
default escape character, and this causes that:

    ('\\' similar to '\\') is false

In my opinion it should not behave this way and the backslash should
be treated by PostgreSQL as an ordinary character.

Finally, I found that a new, wonderful PostgreSQL feature of setting
regular expression "flavours" influences also SIMILAR TO operator.
For example, according to PostgreSQL manual and SQL specification:

    ('z' similar to '(a|z)') is true

but after setting regex_flavor to 'basic', most special characters
lose their meaning and the operator result differs:

    ('z' similar to '(a|z)') is false

This at least could be avoided simply by prepending regular expression
returned by similar_escape() with a magic sequence '***:' which
switches regexp engine into ARE mode.

Summarising, I think that similar_escape() function should be
improved, so it would rewrite patterns into POSIX style more
carefully:

 1. returned regular expression should be prepended with '***:' so
    advanced regular expressions are always switched on,

 2. patterns should be enclosed in '^(' and ')$' instead of just '^'
    and '$', so expressions like 'a|z' would work better,

 3. expression inside square brackets should be rewritten in a special
    way, when characters '_' and '%' (and possibly others) loose their
    special meaning,

 4. character class descriptions should be translated from SQL to
    POSIX syntax,

 5. there should not be any default escape character (only explicitly
    given in ESCAPE clause) so the backslash would be treated as an
    ordinary character.

 6. generally, pattern syntax should be checked more carefully during
    conversion, so users would not be surprised with error messages
    not connected with what they wrote as a SIMILAR TO argument.

I think I am able to write such a patch in my spare time, but I am not
sure if my research is good and complete.  Especially I don't know if
my knowledge of SIMILAR TO pattern syntax is good enough (it is taken
from an old standard draft -- does anyone know anything better?).
Above all, I am afraid that what I discovered is merely a beginning
(for example I don't know yet what with patterns inside SUBSTRING
operator -- are they also converted with similar_escape()?).

Best regards,

--
Adam Buraczewski <adamb (at) nor (dot) pl> * Linux user #165585
GCS/TW d- s-:+>+:- a C+++(++++) UL++++$ P++ L++++ E++ W+ N++ o? K w--
O M- V- PS+ !PE Y PGP+ t+ 5 X+ R tv- b+ DI D G++ e+++>++++ h r+>++ y?

pgsql-bugs by date:

Previous
From: krishna kumar
Date:
Subject: Installation problem
Next
From: Tom Lane
Date:
Subject: Re: Notes about behaviour of SIMILAR TO operator