Thread: SQL-spec incompatibilities in similar_escape() and related stuff
Over in pgsql-bugs [1], we're looking into some bugs associated with mistranslation of SQL-spec regexes to POSIX regexes. However, while poking at that I couldn't help noticing that there are more ways in which we're not following the letter of the SQL spec in this area: * As Andrew noted, somewhere between SQL99 and SQL:2008, the committee just up and changed the syntax of <regular expression substring function>. SQL99 has <regular expression substring function> ::= SUBSTRING <left paren> <character value expression> FROM <character value expression> FOR <escape character> <right paren> but in recent versions it's <regular expression substring function> ::= SUBSTRING <left paren> <character value expression> SIMILAR <character value expression> ESCAPE <escape character> <right paren> I am, frankly, inclined to ignore this as a bad idea. We do have SIMILAR and ESCAPE as keywords already, but they're type_func_name_keyword and unreserved_keyword respectively. To support this syntax, I'm pretty sure we'd have to make them both fully reserved. That seems likely to break existing applications, and I don't think it's worth it. But it's probably something to discuss. * Our function similar_escape() is not documented, but it underlies three things in the grammar: a SIMILAR TO b Translated as "a ~ similar_escape(b, null)" a SIMILAR TO b ESCAPE e Translated as "a ~ similar_escape(b, e)" substring(a, b, e) This is a SQL function expanding to select pg_catalog.substring($1, pg_catalog.similar_escape($2, $3)) To support the first usage, similar_escape is non-strict, and it takes a NULL second argument to mean '\'. This is already a SQL spec violation, because as far as I can tell from the spec, if you don't write an ESCAPE clause then there is *no* escape character; there certainly is not a default of '\'. However, we document this behavior, so I don't know if we want to change it. This behavior also causes spec compatibility problems in the second syntax, because "a SIMILAR TO b ESCAPE NULL" is treated as though it were "ESCAPE '\'", which is again a spec violation: the result should be null. And, just to add icing on the cake, it causes performance problems in the third syntax. 3-argument substring is labeled proisstrict, which is correct behavior per spec (the result is NULL if any of the three arguments are null). But because similar_escape is not strict, the planner fails to inline the SQL function, reasoning (quite accurately) that doing so would change the behavior for null inputs. This costs us something like 4x performance compared to the underlying 2-argument POSIX-regex substring() function. I'm not sure what we want to do here, but we probably ought to do something, because right now substring() and SIMILAR TO aren't even in agreement between themselves let alone with the SQL spec. We could either move towards making all these constructs strict in accordance with the spec (and possibly breaking some existing applications), or we could make substring(a, b, e) not strict so that it inherits similar_escape's idea of what to do for e = NULL. * similar_escape considers a zero-length escape string to mean "no escape character". This is contrary to spec which clearly says that a zero-length escape string is an error condition (just as more-than-one-character is an error condition). It's also undocumented. Should we tighten that up to conform to spec, or document it as an extension? * Per spec, escape-double-quote must appear exactly twice in the second argument of substring(a, b, e), while it's not valid in SIMILAR TO. similar_escape() doesn't enforce this, and it can't do so as long as we are using the same pattern conversion function for both constructs. However, we could do better than we're doing: * If there are zero occurrences, then what you get from substring() is the whole input string if it matches, as if escape-double-quote had appeared at each end of the string. I think this is fine, but we ought to document it. * If there are an odd number of occurrences, similar_escape() doesn't complain, but you'll get this from the regex engine: ERROR: invalid regular expression: parentheses () not balanced The fact of an error isn't a problem, but the error message is pretty confusing considering that what the user wrote was not parentheses. I think similar_escape() ought to throw its own error with an on-point message. * If there are more than two pairs of escape-double-quote, you get some behavior that's completely not per spec --- the patterns between the additional pairs still contribute to whether there's an overall match, but they don't affect the result substring. I'm inclined to think we ought to throw an error for this, too. * The spec is much tighter than we are concerning what's a legal escape sequence. This is partly intentional on our part, I think; notably, you can get at POSIX-regex escapes like "\d", which the SQL spec doesn't provide. Although I think this is intentional, it's not documented. I'm not sure if we want to tighten that up or document what we allow ... thoughts? I am not eager to change any of this in released branches, but I think there's a good case for doing something about these points in HEAD. regards, tom lane [1] https://postgr.es/m/5bb27a41-350d-37bf-901e-9d26f5592dd0@charter.net
>>>>> "Tom" == Tom Lane <tgl@sss.pgh.pa.us> writes: Tom> but in recent versions it's Tom> <regular expression substring function> ::= Tom> SUBSTRING <left paren> <character value expression> Tom> SIMILAR <character value expression> Tom> ESCAPE <escape character> <right paren> Tom> I am, frankly, inclined to ignore this as a bad idea. We do have Tom> SIMILAR and ESCAPE as keywords already, but they're Tom> type_func_name_keyword and unreserved_keyword respectively. To Tom> support this syntax, I'm pretty sure we'd have to make them both Tom> fully reserved. I only did a quick trial but it doesn't seem to require reserving them more strictly - just adding the obvious productions to the grammar doesn't introduce any conflicts. Tom> * Our function similar_escape() is not documented, but it Tom> underlies three things in the grammar: Tom> a SIMILAR TO b Tom> Translated as "a ~ similar_escape(b, null)" Tom> a SIMILAR TO b ESCAPE e Tom> Translated as "a ~ similar_escape(b, e)" Tom> substring(a, b, e) Tom> This is a SQL function expanding to Tom> select pg_catalog.substring($1, pg_catalog.similar_escape($2, $3)) Tom> To support the first usage, similar_escape is non-strict, and it Tom> takes a NULL second argument to mean '\'. This is already a SQL Tom> spec violation, because as far as I can tell from the spec, if you Tom> don't write an ESCAPE clause then there is *no* escape character; Tom> there certainly is not a default of '\'. However, we document this Tom> behavior, so I don't know if we want to change it. This is the same spec violation that we also have for LIKE, which also is supposed to have no escape character in the absense of an explicit ESCAPE clause. -- Andrew (irc:RhodiumToad)
>>>>> "Andrew" == Andrew Gierth <andrew@tao11.riddles.org.uk> writes: Tom> I am, frankly, inclined to ignore this as a bad idea. We do have Tom> SIMILAR and ESCAPE as keywords already, but they're Tom> type_func_name_keyword and unreserved_keyword respectively. To Tom> support this syntax, I'm pretty sure we'd have to make them both Tom> fully reserved. Andrew> I only did a quick trial but it doesn't seem to require Andrew> reserving them more strictly - just adding the obvious Andrew> productions to the grammar doesn't introduce any conflicts. Digging deeper, that's because both SIMILAR and ESCAPE have been assigned precedence. Ambiguities that exist include: ... COLNAME ! SIMILAR ( ... which could be COLNAME postfix-op SIMILAR a_expr, or COLNAME infix-op function-call. Postfix operators strike again... we really should kill those off. The ESCAPE part could in theory be ambiguous if the SIMILAR expression ends in a ... SIMILAR TO xxx operator, since then we wouldn't know whether to attach the ESCAPE to that or keep it as part of the function syntax. But I think this is probably a non-issue. More significant is that ... COLNAME ! ESCAPE ... again has postfix- vs. infix-operator ambiguities. -- Andrew (irc:RhodiumToad)
>>>>> "Andrew" == Andrew Gierth <andrew@tao11.riddles.org.uk> writes: Andrew> The ESCAPE part could in theory be ambiguous if the SIMILAR Andrew> expression ends in a ... SIMILAR TO xxx operator, since then we Andrew> wouldn't know whether to attach the ESCAPE to that or keep it Andrew> as part of the function syntax. But I think this is probably a Andrew> non-issue. More significant is that ... COLNAME ! ESCAPE ... Andrew> again has postfix- vs. infix-operator ambiguities. And this ambiguity shows up already in other contexts: select 'foo' similar to 'f' || escape escape escape from (values ('oo')) v(escape); psql: ERROR: syntax error at or near "escape" LINE 1: select 'foo' similar to 'f' || escape escape escape from (va... select 'foo' similar to 'f' || escape escape from (values ('oo')) v(escape); psql: ERROR: operator does not exist: unknown || LINE 1: select 'foo' similar to 'f' || escape escape from (values ('... I guess this happens because ESCAPE has precedence below POSTFIXOP, so the ('f' ||) gets reduced in preference to shifting in the first ESCAPE token. -- Andrew (irc:RhodiumToad)
Andrew Gierth <andrew@tao11.riddles.org.uk> writes: > "Andrew" == Andrew Gierth <andrew@tao11.riddles.org.uk> writes: > Andrew> The ESCAPE part could in theory be ambiguous if the SIMILAR > Andrew> expression ends in a ... SIMILAR TO xxx operator, since then we > Andrew> wouldn't know whether to attach the ESCAPE to that or keep it > Andrew> as part of the function syntax. But I think this is probably a > Andrew> non-issue. More significant is that ... COLNAME ! ESCAPE ... > Andrew> again has postfix- vs. infix-operator ambiguities. > And this ambiguity shows up already in other contexts: > select 'foo' similar to 'f' || escape escape escape from (values ('oo')) v(escape); > psql: ERROR: syntax error at or near "escape" > LINE 1: select 'foo' similar to 'f' || escape escape escape from (va... Hmm. Oddly, you can't fix it by adding parens: regression=# select 'foo' similar to ('f' || escape) escape escape from (values ('oo')) v(escape); psql: ERROR: syntax error at or near "escape" LINE 1: select 'foo' similar to ('f' || escape) escape escape from (... ^ Since "escape" is an unreserved word, I'd have expected that to work. Odd. The big picture here is that fixing grammar ambiguities by adding precedence is a dangerous business :-( regards, tom lane
>>>>> "Tom" == Tom Lane <tgl@sss.pgh.pa.us> writes: Tom> Hmm. Oddly, you can't fix it by adding parens: Tom> regression=# select 'foo' similar to ('f' || escape) escape escape from (values ('oo')) v(escape); Tom> psql: ERROR: syntax error at or near "escape" Tom> LINE 1: select 'foo' similar to ('f' || escape) escape escape from (... Tom> ^ Tom> Since "escape" is an unreserved word, I'd have expected that to Tom> work. Odd. Simpler cases fail too: select 'f' || escape from (values ('o')) v(escape); psql: ERROR: syntax error at or near "escape" select 1 + escape from (values (1)) v(escape); -- works select 1 & escape from (values (1)) v(escape); -- fails in short ESCAPE can't follow any generic operator, because its lower precedence forces the operator to be reduced as a postfix op instead. Tom> The big picture here is that fixing grammar ambiguities by adding Tom> precedence is a dangerous business :-( Yeah. But the alternative is usually reserving words more strictly, which has its own issues :-( Or we could kill off postfix operators... -- Andrew (irc:RhodiumToad)
[ backing up to a different sub-discussion ] Andrew Gierth <andrew@tao11.riddles.org.uk> writes: > "Tom" == Tom Lane <tgl@sss.pgh.pa.us> writes: > Tom> To support the first usage, similar_escape is non-strict, and it > Tom> takes a NULL second argument to mean '\'. This is already a SQL > Tom> spec violation, because as far as I can tell from the spec, if you > Tom> don't write an ESCAPE clause then there is *no* escape character; > Tom> there certainly is not a default of '\'. However, we document this > Tom> behavior, so I don't know if we want to change it. > This is the same spec violation that we also have for LIKE, which also > is supposed to have no escape character in the absense of an explicit > ESCAPE clause. Right. After further thought, I propose that what we ought to do is unify LIKE, SIMILAR TO, and 3-arg SUBSTRING on a single set of behaviors for the ESCAPE argument: 1. They are strict, ie a NULL value for the escape string produces a NULL result. This is per spec, and we don't document anything different, and nobody would really expect something different. (But see below about keeping similar_escape() as a legacy compatibility function.) 2. Omitting the ESCAPE option (not possible for SUBSTRING) results in a default of '\'. This is not per spec, but we've long documented it this way, and frankly I'd say that it's a far more useful default than the spec's behavior of "there is no escape character". I propose that we just document that this is not-per-spec and move on. 3. Interpret an empty ESCAPE string as meaning "there is no escape character". This is not per spec either (the spec would have us throw an error) but it's our historical behavior, and it seems like a saner approach than the way the spec wants to do it --- in particular, there's no way to get that behavior in 3-arg SUBSTRING if we don't allow this. So only point 1 represents an actual behavioral change from what we've been doing; the other two just require doc clarifications. Now, I don't have any problem with changing what happens when somebody actually writes "a LIKE b ESCAPE NULL"; it seems fairly unlikely that anyone would expect that to yield a non-null result. However, we do have a problem with the fact that the implementation is partially exposed: regression=# create view v1 as select f1 similar to 'x*' from text_tbl; CREATE VIEW regression=# \d+ v1 ... View definition: SELECT text_tbl.f1 ~ similar_escape('x*'::text, NULL::text) FROM text_tbl; If we just change similar_escape() to be strict, then this view will stop working, which is a bit hard on users who did not write anything non-spec-compliant. I propose therefore that we leave similar_escape in place with its current behavior, as a compatibility measure for cases like this. Intead, invent two new strict functions, say similar_to_escape(pattern) similar_to_escape(pattern, escape) and change the parser and the implementation of SUBSTRING() to rely on these going forward. The net effect will be to make explicit "ESCAPE NULL" spec-compliant, and to get rid of the performance problem from inlining failure for substring(). All else is just doc clarifications. Comments? regards, tom lane
On Mon, May 13, 2019 at 2:39 PM Andrew Gierth <andrew@tao11.riddles.org.uk> wrote: > Or we could kill off postfix operators... /me helps Andrew hijack the thread. We wouldn't even have to go that far. We could just restrict it to a specific list of operators that are hard-coded into the lexer and parser, like say only '!'. Even if we killed postfix operators completely, the number of users who would be affected would probably be minimal, because the only postfix operator we ship is for factorial, and realistically, that's not exactly a critical thing for most users, especially considering that our implementation is pretty slow. But the number of people using out-of-core postfix operators has got to be really tiny -- unless, maybe, there's some really popular extension like PostGIS that uses them. I think it's pretty clear that the theoretical beauty of being able to handle postfix operators is not worth the tangible cost they impose on our parser. We're losing more users as a result of SQL that other systems can accept and we cannot than we are gaining by being able to support user-defined postfix operators. The latter is not exactly a mainstream need. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > I think it's pretty clear that the theoretical beauty of being able to > handle postfix operators is not worth the tangible cost they impose on > our parser. We're losing more users as a result of SQL that other > systems can accept and we cannot than we are gaining by being able to > support user-defined postfix operators. I suppose it's possible to make such an argument, but you haven't actually made one --- just asserted something without providing evidence. If we can lay out some concrete gains that justify zapping postfix operators, I'd be willing to do it. I agree that it would likely hurt few users ... but we need to be able to explain to those few why we broke it. And show that the benefits outweigh the cost. regards, tom lane
>>>>> "Robert" == Robert Haas <robertmhaas@gmail.com> writes: Robert> But the number of people using out-of-core postfix operators Robert> has got to be really tiny -- unless, maybe, there's some really Robert> popular extension like PostGIS that uses them. If there's any extension that uses them I've so far failed to find it. For the record, the result of my Twitter poll was 29:2 in favour of removing them, for what little that's worth. -- Andrew (irc:RhodiumToad)
I wrote: > I propose therefore that we leave similar_escape in place with its > current behavior, as a compatibility measure for cases like this. > Intead, invent two new strict functions, say > similar_to_escape(pattern) > similar_to_escape(pattern, escape) > and change the parser and the implementation of SUBSTRING() to > rely on these going forward. > The net effect will be to make explicit "ESCAPE NULL" spec-compliant, > and to get rid of the performance problem from inlining failure for > substring(). All else is just doc clarifications. Here's a proposed patch for that. I think it's a bit too late to be messing with this kind of thing for v12, so I'll add this to the upcoming CF. regards, tom lane diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml index a79e7c0..7d4472a 100644 --- a/doc/src/sgml/func.sgml +++ b/doc/src/sgml/func.sgml @@ -4146,6 +4146,14 @@ cast(-44 as bit(12)) <lineannotation>111111010100</lineannotation> </para> <para> + According to the SQL standard, omitting <literal>ESCAPE</literal> + means there is no escape character (rather than defaulting to a + backslash), and a zero-length <literal>ESCAPE</literal> value is + disallowed. <productname>PostgreSQL</productname>'s behavior in + this regard is therefore slightly nonstandard. + </para> + + <para> The key word <token>ILIKE</token> can be used instead of <token>LIKE</token> to make the match case-insensitive according to the active locale. This is not in the <acronym>SQL</acronym> standard but is a @@ -4280,9 +4288,27 @@ cast(-44 as bit(12)) <lineannotation>111111010100</lineannotation> </para> <para> - As with <function>LIKE</function>, a backslash disables the special meaning - of any of these metacharacters; or a different escape character can - be specified with <literal>ESCAPE</literal>. + As with <function>LIKE</function>, a backslash disables the special + meaning of any of these metacharacters. A different escape character + can be specified with <literal>ESCAPE</literal>, or the escape + capability can be disabled by writing <literal>ESCAPE ''</literal>. + </para> + + <para> + According to the SQL standard, omitting <literal>ESCAPE</literal> + means there is no escape character (rather than defaulting to a + backslash), and a zero-length <literal>ESCAPE</literal> value is + disallowed. <productname>PostgreSQL</productname>'s behavior in + this regard is therefore slightly nonstandard. + </para> + + <para> + Another nonstandard extension is that following the escape character + with a letter or digit provides access to the same escape sequences + defined for POSIX regular expressions, below (see + <xref linkend="posix-character-entry-escapes-table"/>, + <xref linkend="posix-class-shorthand-escapes-table"/>, and + <xref linkend="posix-constraint-escapes-table"/>). </para> <para> diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y index 8311b1d..6462e13 100644 --- a/src/backend/parser/gram.y +++ b/src/backend/parser/gram.y @@ -13073,15 +13073,15 @@ a_expr: c_expr { $$ = $1; } | a_expr SIMILAR TO a_expr %prec SIMILAR { - FuncCall *n = makeFuncCall(SystemFuncName("similar_escape"), - list_make2($4, makeNullAConst(-1)), + FuncCall *n = makeFuncCall(SystemFuncName("similar_to_escape"), + list_make1($4), @2); $$ = (Node *) makeSimpleA_Expr(AEXPR_SIMILAR, "~", $1, (Node *) n, @2); } | a_expr SIMILAR TO a_expr ESCAPE a_expr %prec SIMILAR { - FuncCall *n = makeFuncCall(SystemFuncName("similar_escape"), + FuncCall *n = makeFuncCall(SystemFuncName("similar_to_escape"), list_make2($4, $6), @2); $$ = (Node *) makeSimpleA_Expr(AEXPR_SIMILAR, "~", @@ -13089,15 +13089,15 @@ a_expr: c_expr { $$ = $1; } } | a_expr NOT_LA SIMILAR TO a_expr %prec NOT_LA { - FuncCall *n = makeFuncCall(SystemFuncName("similar_escape"), - list_make2($5, makeNullAConst(-1)), + FuncCall *n = makeFuncCall(SystemFuncName("similar_to_escape"), + list_make1($5), @2); $$ = (Node *) makeSimpleA_Expr(AEXPR_SIMILAR, "!~", $1, (Node *) n, @2); } | a_expr NOT_LA SIMILAR TO a_expr ESCAPE a_expr %prec NOT_LA { - FuncCall *n = makeFuncCall(SystemFuncName("similar_escape"), + FuncCall *n = makeFuncCall(SystemFuncName("similar_to_escape"), list_make2($5, $7), @2); $$ = (Node *) makeSimpleA_Expr(AEXPR_SIMILAR, "!~", @@ -14323,9 +14323,9 @@ subquery_Op: | NOT_LA ILIKE { $$ = list_make1(makeString("!~~*")); } /* cannot put SIMILAR TO here, because SIMILAR TO is a hack. - * the regular expression is preprocessed by a function (similar_escape), + * the regular expression is preprocessed by a function (similar_to_escape), * and the ~ operator for posix regular expressions is used. - * x SIMILAR TO y -> x ~ similar_escape(y) + * x SIMILAR TO y -> x ~ similar_to_escape(y) * this transformation is made on the fly by the parser upwards. * however the SubLink structure which handles any/some/all stuff * is not ready for such a thing. diff --git a/src/backend/utils/adt/regexp.c b/src/backend/utils/adt/regexp.c index 90a9197..3d38aef 100644 --- a/src/backend/utils/adt/regexp.c +++ b/src/backend/utils/adt/regexp.c @@ -654,15 +654,18 @@ textregexreplace(PG_FUNCTION_ARGS) } /* - * similar_escape() - * Convert a SQL:2008 regexp pattern to POSIX style, so it can be used by - * our regexp engine. + * similar_to_escape(), similar_escape() + * + * Convert a SQL "SIMILAR TO" regexp pattern to POSIX style, so it can be + * used by our regexp engine. + * + * similar_escape_internal() is the common workhorse for three SQL-exposed + * functions. esc_text can be passed as NULL to select the default escape + * (which is '\'), or as an empty string to select no escape character. */ -Datum -similar_escape(PG_FUNCTION_ARGS) +static text * +similar_escape_internal(text *pat_text, text *esc_text) { - text *pat_text; - text *esc_text; text *result; char *p, *e, @@ -673,13 +676,9 @@ similar_escape(PG_FUNCTION_ARGS) bool incharclass = false; int nquotes = 0; - /* This function is not strict, so must test explicitly */ - if (PG_ARGISNULL(0)) - PG_RETURN_NULL(); - pat_text = PG_GETARG_TEXT_PP(0); p = VARDATA_ANY(pat_text); plen = VARSIZE_ANY_EXHDR(pat_text); - if (PG_ARGISNULL(1)) + if (esc_text == NULL) { /* No ESCAPE clause provided; default to backslash as escape */ e = "\\"; @@ -687,12 +686,11 @@ similar_escape(PG_FUNCTION_ARGS) } else { - esc_text = PG_GETARG_TEXT_PP(1); e = VARDATA_ANY(esc_text); elen = VARSIZE_ANY_EXHDR(esc_text); if (elen == 0) e = NULL; /* no escape character */ - else + else if (elen > 1) { int escape_mblen = pg_mbstrlen_with_len(e, elen); @@ -898,6 +896,65 @@ similar_escape(PG_FUNCTION_ARGS) SET_VARSIZE(result, r - ((char *) result)); + return result; +} + +/* + * similar_to_escape(pattern, escape) + */ +Datum +similar_to_escape_2(PG_FUNCTION_ARGS) +{ + text *pat_text = PG_GETARG_TEXT_PP(0); + text *esc_text = PG_GETARG_TEXT_PP(1); + text *result; + + result = similar_escape_internal(pat_text, esc_text); + + PG_RETURN_TEXT_P(result); +} + +/* + * similar_to_escape(pattern) + * Inserts a default escape character. + */ +Datum +similar_to_escape_1(PG_FUNCTION_ARGS) +{ + text *pat_text = PG_GETARG_TEXT_PP(0); + text *result; + + result = similar_escape_internal(pat_text, NULL); + + PG_RETURN_TEXT_P(result); +} + +/* + * similar_escape(pattern, escape) + * + * Legacy function for compatibility with views stored using the + * pre-v13 expansion of SIMILAR TO. Unlike the above functions, this + * is non-strict, which leads to not-per-spec handling of "ESCAPE NULL". + */ +Datum +similar_escape(PG_FUNCTION_ARGS) +{ + text *pat_text; + text *esc_text; + text *result; + + /* This function is not strict, so must test explicitly */ + if (PG_ARGISNULL(0)) + PG_RETURN_NULL(); + pat_text = PG_GETARG_TEXT_PP(0); + + if (PG_ARGISNULL(1)) + esc_text = NULL; /* use default escape character */ + else + esc_text = PG_GETARG_TEXT_PP(1); + + result = similar_escape_internal(pat_text, esc_text); + PG_RETURN_TEXT_P(result); } diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat index 8733524..15045d0 100644 --- a/src/include/catalog/pg_proc.dat +++ b/src/include/catalog/pg_proc.dat @@ -3321,9 +3321,15 @@ proname => 'repeat', prorettype => 'text', proargtypes => 'text int4', prosrc => 'repeat' }, -{ oid => '1623', descr => 'convert SQL99 regexp pattern to POSIX style', +{ oid => '1623', descr => 'convert SQL regexp pattern to POSIX style', proname => 'similar_escape', proisstrict => 'f', prorettype => 'text', proargtypes => 'text text', prosrc => 'similar_escape' }, +{ oid => '1986', descr => 'convert SQL regexp pattern to POSIX style', + proname => 'similar_to_escape', prorettype => 'text', + proargtypes => 'text text', prosrc => 'similar_to_escape_2' }, +{ oid => '1987', descr => 'convert SQL regexp pattern to POSIX style', + proname => 'similar_to_escape', prorettype => 'text', proargtypes => 'text', + prosrc => 'similar_to_escape_1' }, { oid => '1624', proname => 'mul_d_interval', prorettype => 'interval', @@ -5743,10 +5749,10 @@ { oid => '2073', descr => 'extract text matching regular expression', proname => 'substring', prorettype => 'text', proargtypes => 'text text', prosrc => 'textregexsubstr' }, -{ oid => '2074', descr => 'extract text matching SQL99 regular expression', +{ oid => '2074', descr => 'extract text matching SQL regular expression', proname => 'substring', prolang => 'sql', prorettype => 'text', proargtypes => 'text text text', - prosrc => 'select pg_catalog.substring($1, pg_catalog.similar_escape($2, $3))' }, + prosrc => 'select pg_catalog.substring($1, pg_catalog.similar_to_escape($2, $3))' }, { oid => '2075', descr => 'convert int8 to bitstring', proname => 'bit', prorettype => 'bit', proargtypes => 'int8 int4', diff --git a/src/test/regress/expected/strings.out b/src/test/regress/expected/strings.out index 486c00b..2483966 100644 --- a/src/test/regress/expected/strings.out +++ b/src/test/regress/expected/strings.out @@ -410,7 +410,56 @@ SELECT SUBSTRING('abcdefg' FROM 'b(.*)f') AS "cde"; cde (1 row) --- PostgreSQL extension to allow using back reference in replace string; +-- Check behavior of SIMILAR TO, which uses largely the same regexp variant +SELECT 'abcdefg' SIMILAR TO '_bcd%' AS true; + true +------ + t +(1 row) + +SELECT 'abcdefg' SIMILAR TO 'bcd%' AS false; + false +------- + f +(1 row) + +SELECT 'abcdefg' SIMILAR TO '_bcd#%' ESCAPE '#' AS false; + false +------- + f +(1 row) + +SELECT 'abcd%' SIMILAR TO '_bcd#%' ESCAPE '#' AS true; + true +------ + t +(1 row) + +-- Postgres uses '\' as the default escape character, which is not per spec +SELECT 'abcdefg' SIMILAR TO '_bcd\%' AS false; + false +------- + f +(1 row) + +-- and an empty string to mean "no escape", which is also not per spec +SELECT 'abcd\efg' SIMILAR TO '_bcd\%' ESCAPE '' AS true; + true +------ + t +(1 row) + +-- these behaviors are per spec, though: +SELECT 'abcdefg' SIMILAR TO '_bcd%' ESCAPE NULL AS null; + null +------ + +(1 row) + +SELECT 'abcdefg' SIMILAR TO '_bcd#%' ESCAPE '##' AS error; +ERROR: invalid escape string +HINT: Escape string must be empty or one character. +-- Test back reference in regexp_replace SELECT regexp_replace('1112223333', E'(\\d{3})(\\d{3})(\\d{4})', E'(\\1) \\2-\\3'); regexp_replace ---------------- diff --git a/src/test/regress/sql/strings.sql b/src/test/regress/sql/strings.sql index 5744c9f..b5e75c3 100644 --- a/src/test/regress/sql/strings.sql +++ b/src/test/regress/sql/strings.sql @@ -144,7 +144,20 @@ SELECT SUBSTRING('abcdefg' FROM 'c.e') AS "cde"; -- With a parenthesized subexpression, return only what matches the subexpr SELECT SUBSTRING('abcdefg' FROM 'b(.*)f') AS "cde"; --- PostgreSQL extension to allow using back reference in replace string; +-- Check behavior of SIMILAR TO, which uses largely the same regexp variant +SELECT 'abcdefg' SIMILAR TO '_bcd%' AS true; +SELECT 'abcdefg' SIMILAR TO 'bcd%' AS false; +SELECT 'abcdefg' SIMILAR TO '_bcd#%' ESCAPE '#' AS false; +SELECT 'abcd%' SIMILAR TO '_bcd#%' ESCAPE '#' AS true; +-- Postgres uses '\' as the default escape character, which is not per spec +SELECT 'abcdefg' SIMILAR TO '_bcd\%' AS false; +-- and an empty string to mean "no escape", which is also not per spec +SELECT 'abcd\efg' SIMILAR TO '_bcd\%' ESCAPE '' AS true; +-- these behaviors are per spec, though: +SELECT 'abcdefg' SIMILAR TO '_bcd%' ESCAPE NULL AS null; +SELECT 'abcdefg' SIMILAR TO '_bcd#%' ESCAPE '##' AS error; + +-- Test back reference in regexp_replace SELECT regexp_replace('1112223333', E'(\\d{3})(\\d{3})(\\d{4})', E'(\\1) \\2-\\3'); SELECT regexp_replace('AAA BBB CCC ', E'\\s+', ' ', 'g'); SELECT regexp_replace('AAA', '^|$', 'Z', 'g');
Re: SQL-spec incompatibilities in similar_escape() and related stuff
From
Alvaro Herrera from 2ndQuadrant
Date:
This discussion seems to have died down. Apparently we have three directions here, from three different people. Are we doing anything? -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Alvaro Herrera from 2ndQuadrant <alvherre@alvh.no-ip.org> writes: > This discussion seems to have died down. Apparently we have three > directions here, from three different people. Are we doing anything? I don't really want to do anything beyond the patch I submitted in this thread (at <32617.1558649424@sss.pgh.pa.us>). That's what the CF entry is for, IMO. I'm not excited about the change-of-keywords business, but if someone else is, they should start a new CF entry about that. regards, tom lane
>>>>> "Tom" == Tom Lane <tgl@sss.pgh.pa.us> writes: > Alvaro Herrera from 2ndQuadrant <alvherre@alvh.no-ip.org> writes: >> This discussion seems to have died down. Apparently we have three >> directions here, from three different people. Are we doing anything? Tom> I don't really want to do anything beyond the patch I submitted in Tom> this thread (at <32617.1558649424@sss.pgh.pa.us>). That's what the Tom> CF entry is for, IMO. I have no issues with this approach. Tom> I'm not excited about the change-of-keywords business, but if Tom> someone else is, they should start a new CF entry about that. It's enough of a can of worms that I don't feel inclined to mess with it absent some good reason (the spec probably isn't a good enough reason). If postfix operators should happen to go away at some point then this can be revisited. -- Andrew (irc:RhodiumToad)
Re: SQL-spec incompatibilities in similar_escape() and related stuff
From
Alvaro Herrera from 2ndQuadrant
Date:
On 2019-May-23, Tom Lane wrote: > + <para> > + Another nonstandard extension is that following the escape character > + with a letter or digit provides access to the same escape sequences > + defined for POSIX regular expressions, below (see > + <xref linkend="posix-character-entry-escapes-table"/>, > + <xref linkend="posix-class-shorthand-escapes-table"/>, and > + <xref linkend="posix-constraint-escapes-table"/>). > </para> I think the word "same" in this para is more confusing than helpful; also the tables are an integral part of this rather than just an illustration, so they should not be in parenthesis but after only a semicolon or such. So: > + Another nonstandard extension is that following the escape character > + with a letter or digit provides access to the escape sequences > + defined for POSIX regular expressions; see > + <xref linkend="posix-character-entry-escapes-table"/>, > + <xref linkend="posix-class-shorthand-escapes-table"/>, and > + <xref linkend="posix-constraint-escapes-table"/> below. I think it would be useful to provide a trivial example that illustrates this in the <para> below; say '\mabc\M' not matching "zabc". All in all, these are pretty trivial points and I would certainly not be mad if it's committed without these changes. Marked ready for committer. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Alvaro Herrera from 2ndQuadrant <alvherre@alvh.no-ip.org> writes: > Marked ready for committer. Thanks for reviewing. I adopted your doc change suggestions and pushed it. regards, tom lane