Thread: Notes about behaviour of SIMILAR TO operator
Hallo pgsql-bugs, I was recently playing with SIMILAR TO operator (in both PostgreSQL 7.3.x and 7.4) and discovered that sometimes it behaves not the way I expected. Since this operator is quite new for me (I am familiar mainly with perl, grep, sed and emacs regexps and their syntax is slightly different from this one), I searched for documentation. All I found, except from what is written in PostgreSQL 7.4 manual, is an ISO/ANSI SQL3 Working Draft from August 1994, which contains section titled "<similar predicate>". It is not much, of course, but it made me think that implementation of SIMILAR TO operator in PostgreSQL is not ideal and should be somehow improved :-/ First of all, I found that SIMILAR TO regular expressions are converted to POSIX regular expressions with similar_escape() function (located in file src/backend/utils/adt/regexp.c) and then "~" operator is used to check if given string matches the pattern. This is a very simple function: all it does is replacing '_' with '.', '%' with '.*', escaping some characters with a backslash and appending '^' and '$' to both ends of the pattern. This causes many inconveniences, for example the pattern 'a|z' (which should match single 'a' or 'z' characters only, according to SQL spec) is converted into POSIX regular expression in the form of '^a|z$' which matches all strings beginning with 'a' ('abcdef' for example) and all strings ending with 'z' ('xyz' for example). So the meaning of the pattern is changed, which is not good. I also found that there are problems with character lists enclosed inside square brackets, for example: ('_' similar to '[_]') is false ('.' similar to '[_]') is true ('%' similar to '[%]') is false ('*' similar to '[%]') is true ('.' similar to '[%]') is true The behaviour above is also caused by similar_escape(), which converts '[_]' to '^[.]$' and '[%]' to '^[.*]$', not noticing the simple fact that these characters are inside brackets. Of course, one could say that patterns like '[_]' are not valid and special characters like '_', '%' etc. should be escaped even inside square brackets, but after reading SQL specification I am not sure how it should work actually. All I know is that the old behaviour should be changed: either special characters should be treated literally, so '[_]' should match an underscore sign only, or an error message should be generated that such pattern is not valid. Of course escaping those special characters works perfectly: ('_' similar to '[x_]' escape 'x') is true ('%' similar to '[x%]' escape 'x') is true Talking about square brackets, it should be noticed that there is a slight difference between SIMILAR TO and POSIX way of describing named character classes. In POSIX regexps one could write '[[:alpha:]]' to describe a simple Latin letter, but SQL spec clearly says it should be written as '[:alpha:]' or even '[:ALPHA:]' (notice case-insensitivity and single brackets surrounding class names). Since similar_escape() does not know about this difference either, it passes everything to POSIX regexp engine giving false matches: ('z' similar to '[:alpha:]') is false (':' similar to '[:alpha:]') is true ('a' similar to '[:ALPHA:]') is false ('z' similar to '[[:alpha:]]') is true ('z' similar to '[[:ALPHA:]]';) -> ERROR (invalid character class) Another problem is an ESCAPE clause, which works generally well, but sometimes not the way I would expect. As far as I know there is not such thing as default escape character in SIMILAR TO regular expressions and when one wants to change the meaning of some special characters he has always to write ESCAPE '<char>' and use this <char> inside the pattern. However, similar_escape() uses a backslash as a default escape character, and this causes that: ('\\' similar to '\\') is false In my opinion it should not behave this way and the backslash should be treated by PostgreSQL as an ordinary character. Finally, I found that a new, wonderful PostgreSQL feature of setting regular expression "flavours" influences also SIMILAR TO operator. For example, according to PostgreSQL manual and SQL specification: ('z' similar to '(a|z)') is true but after setting regex_flavor to 'basic', most special characters lose their meaning and the operator result differs: ('z' similar to '(a|z)') is false This at least could be avoided simply by prepending regular expression returned by similar_escape() with a magic sequence '***:' which switches regexp engine into ARE mode. Summarising, I think that similar_escape() function should be improved, so it would rewrite patterns into POSIX style more carefully: 1. returned regular expression should be prepended with '***:' so advanced regular expressions are always switched on, 2. patterns should be enclosed in '^(' and ')$' instead of just '^' and '$', so expressions like 'a|z' would work better, 3. expression inside square brackets should be rewritten in a special way, when characters '_' and '%' (and possibly others) loose their special meaning, 4. character class descriptions should be translated from SQL to POSIX syntax, 5. there should not be any default escape character (only explicitly given in ESCAPE clause) so the backslash would be treated as an ordinary character. 6. generally, pattern syntax should be checked more carefully during conversion, so users would not be surprised with error messages not connected with what they wrote as a SIMILAR TO argument. I think I am able to write such a patch in my spare time, but I am not sure if my research is good and complete. Especially I don't know if my knowledge of SIMILAR TO pattern syntax is good enough (it is taken from an old standard draft -- does anyone know anything better?). Above all, I am afraid that what I discovered is merely a beginning (for example I don't know yet what with patterns inside SUBSTRING operator -- are they also converted with similar_escape()?). Best regards, -- Adam Buraczewski <adamb (at) nor (dot) pl> * Linux user #165585 GCS/TW d- s-:+>+:- a C+++(++++) UL++++$ P++ L++++ E++ W+ N++ o? K w-- O M- V- PS+ !PE Y PGP+ t+ 5 X+ R tv- b+ DI D G++ e+++>++++ h r+>++ y?
Adam Buraczewski <adamb@nor.pl> writes: > ... for example the pattern 'a|z' (which should match single 'a' or 'z' > characters only, according to SQL spec) is converted into POSIX > regular expression in the form of '^a|z$' which matches all strings > beginning with 'a' ('abcdef' for example) and all strings ending with > 'z' ('xyz' for example). So the meaning of the pattern is changed, > which is not good. Hm, that's a mistake, it should probably translate to ^(a|z)$ instead. > The behaviour above is also caused by similar_escape(), which converts > '[_]' to '^[.]$' and '[%]' to '^[.*]$', not noticing the simple fact > that these characters are inside brackets. As near as I can tell, the SQL spec requires special characters to be escaped when they are inside a bracket construct. So indeed the above are invalid SQL regexes. > Talking about square brackets, it should be noticed that there is a > slight difference between SIMILAR TO and POSIX way of describing named > character classes. Mmm, yeah, that looks like a mess. > This at least could be avoided simply by prepending regular expression > returned by similar_escape() with a magic sequence '***:' which > switches regexp engine into ARE mode. Good point. Actually, do we want to force ARE mode, or something simpler? Perhaps ERE or even BRE would be a better match to the SQL spec. > I think I am able to write such a patch in my spare time, Go to it ... regards, tom lane
Hallo Tom, I decided to improve similar_escape() function during the weekend. Thank you very much for the excerpt from SQL standard (I think this is much more complete than the text I found in a Working Draft from August 1994). However, there are some more issues I'd like to make clear before function redesigning. First of all, I found that SQL spec says that when one of the items of SIMILAR TO construct (<character match value>, <similar pattern> or <escape character>) is NULL, then result of the whole construct should be unknown. However, PostgreSQL treats SIMILAR TO ... ESCAPE NULL the same way as when no ESCAPE clause is present, which is wrong: ('a' similar to 'a') is true ('a' similar to 'a' escape null) is true (should be unknown!) The behaviour above is caused by the fact that escape character is passed to similar_escape() without checking for null value, and the same null value is passed to it when there is no ESCAPE clause at all. I think that either PostgreSQL should check for nulls in SIMILAR TO construct before calling similar_escape(), or there should be two versions of similar_escape() function: one getting only one argument (for SIMILAR TO without ESCAPE) and second, getting two arguments (a pattern and an escape char). Which solution is better? > As near as I can tell, the SQL spec requires special characters to be > escaped when they are inside a bracket construct. So indeed the above > are invalid SQL regexes. How the function should behave when such an invalid pattern is passed as its argument? Should it throw an error (this is what SQL spec says) or tolerate as much mistakes as possible, generating some warnings only? > Good point. Actually, do we want to force ARE mode, or something simpler? > Perhaps ERE or even BRE would be a better match to the SQL spec. I think that there is no difference which regexp dialect is choosen, only the speed matters. Function translating SIMILAR TO patterns into POSIX regular expressions will be more or less the same. What should I choose then? BTW, should I write some regression tests for SIMILAR TO? Is there any guide how to do it (except PostgreSQL sources)? Should the changes be written for CVS HEAD only or 7.4/7.3 branches either? Regards, -- Adam Buraczewski <adamb (at) nor (dot) pl> * Linux user #165585 GCS/TW d- s-:+>+:- a C+++(++++) UL++++$ P++ L++++ E++ W+ N++ o? K w-- O M- V- PS+ !PE Y PGP+ t+ 5 X+ R tv- b+ DI D G++ e+++>++++ h r+>++ y?
Adam Buraczewski <adamb@nor.pl> writes: > ('a' similar to 'a' escape null) is true (should be unknown!) Yeah, you are right; this is because we are overloading a "null" second parameter to mean "the ESCAPE part wasn't present", which in hindsight wasn't such a hot idea. > I think that either PostgreSQL should check for nulls in SIMILAR TO > construct before calling similar_escape(), or there should be two > versions of similar_escape() function: one getting only one argument > (for SIMILAR TO without ESCAPE) and second, getting two arguments > (a pattern and an escape char). Which solution is better? I think the latter is the only reasonable solution, but it will be something that we cannot implement in the 7.4.* series, because adding another function implies initdb. I'd suggest submitting one patch that fixes everything but the NULL problem, which we could back-patch into 7.4, and then a second patch that splits the function into two for 7.5. >> As near as I can tell, the SQL spec requires special characters to be >> escaped when they are inside a bracket construct. So indeed the above >> are invalid SQL regexes. > How the function should behave when such an invalid pattern is passed > as its argument? Should it throw an error (this is what SQL spec > says) or tolerate as much mistakes as possible, generating some > warnings only? I don't have a strong opinion --- could go with either behavior. You might want to take it up on the pgsql-sql list. >> Good point. Actually, do we want to force ARE mode, or something simpler? >> Perhaps ERE or even BRE would be a better match to the SQL spec. > I think that there is no difference which regexp dialect is choosen, > only the speed matters. Function translating SIMILAR TO patterns into > POSIX regular expressions will be more or less the same. What should > I choose then? I doubt there would be any speed difference. The advantage of a dumber RE flavor is that it would have fewer "extra" features that might be unintentionally triggered by a translated pattern, leading to just the sort of non-SQL-compliant behavior you are complaining of ... > BTW, should I write some regression tests for SIMILAR TO? Sure. Look at some of the existing regression tests for examples. > Should the changes be written for CVS HEAD only or 7.4/7.3 branches > either? I don't see that we'd bother applying it to 7.3, but 7.4 branch yes, if you avoid any changes in the function's API for the 7.4 version. regards, tom lane