Re: ('dog$house' = quote_ident('dog$house')) is surprisingly FALSE - Mailing list pgsql-general

From Bryn Llewellyn
Subject Re: ('dog$house' = quote_ident('dog$house')) is surprisingly FALSE
Date
Msg-id 72349704-92B2-46FF-8DE1-C3E43C9B5B6A@yugabyte.com
Whole thread Raw
In response to Re: ('dog$house' = quote_ident('dog$house')) is surprisingly FALSE  ("David G. Johnston" <david.g.johnston@gmail.com>)
Responses Re: ('dog$house' = quote_ident('dog$house')) is surprisingly FALSE
Re: ('dog$house' = quote_ident('dog$house')) is surprisingly FALSE
Re: ('dog$house' = quote_ident('dog$house')) is surprisingly FALSE
List pgsql-general
david.g.johnston@gmail.com wrote:

bryn@yugabyte.com wrote:

(3) The PG doc on quote_ident says this in large friendly letters:

Quotes are added only if necessary…

Notice "only". I now know that this is very much not the case. You can compose an effectively unlimited number of different examples along these lines:

select quote_ident('redaktør'); → "redaktør"
create table redaktør(n int); → table successfully created

Yep, and that is precisely what would make for a good bug report. Pointing out that "if necessary" does not indeed match up with the behavior. I suspect it is likely to get changed - everything else being discussed just detracts attention from it.

*BRIEFLY*

What does "make for a good bug report" mean, David? Is it:

(1.1) You, David, or somebody else who has been officially recognized as a PG Contributor (https://www.postgresql.org/community/contributors/) will file the bug, granting it credibility with their imprimatur?

or (1.2) I, Bryn, should file the bug.

About "I suspect it is likely to get changed", do you mean:

(2.1) Change the doc to match quote_ident's current, unpredictable, behavior? (By all means, substitute "hard to describe accurately, precisely, and yet still tersely" for "unpredictable".)

(2.2) Change quote_ident's implementation—and then write new doc to describe the new behavior precisely and accurately? And for this option, the next question is "What's the spec of the changed implementation?"

Notice that the issue is broader than just quote_ident, as this test shows:

prepare x(text) as select format('« %s », gives « %I ».', $1::text, $1::text);
execute x('dog');
execute x('Dog');
execute x('农民');

The same over-zealous double-quoting that quote_ident shows for 农民 is shown by format. Presumably they share the same underlying implementation (but, surprisingly, don't re-use the actual SQL parser code). Option 2.1 implies using the same wording for what provokes double-quoting for each function. I'd make a similar argument for option 2.2.

*MORE DETAIL*

About option (2.2), I mentioned that ORCL's equivalent to quote_ident implements a simpler rule: the text of the identifier that it returns is always surrounded with double quotes, whether or not doing this is necessary. The ORCL scheme relies on the fact that double-quoting when this isn't necessary is in no way harmful. Here’s pseudocode (presented as tested PL/pgSQL) for what a patched C implementation of quote_ident might do:

create function quote_ident_2(name in text)
  returns text
  language plpgsql
as $body$
declare
  i0     text not null := quote_ident(name);
  i1     text not null := regexp_replace(regexp_replace(i0, '^"', ''), '"$', '');
  ident  text not null := case(i1 = i0)
                            when true then '"'||i0||'"'
                            else                i0
                          end case;
begin
  return ident;
end;
$body$;

Re David’s

everything else being discussed just detracts attention from it.

I’m not convinced. The discussion has shown that some people are somewhat confused. For example, it was suggested that a name like this:

农民

ought to be double-quoted. A simple test shows that this isn’t the case. And it helps if everybody is clear about that.

There's also the question of use-cases. I've been forced to think a lot about SQL injection over the years. It's immediately obvious from reading any of the skimpiest blogs on the topic that the root cause is always faulty code that's been written by a confused developer. But it's very rare to see an account of the root cause whose wording uses carefully defined terms of art. (In fact, the notion of defining and using such terms of art has been resisted by contributors to this list.) If a future PG shipped with a built-in function like this:

function is_exotic(name in text) returns boolean

...with, of course, excellent documentation, then an outfit that decided to outlaw the use of exotic names (and this is almost always how the outcome emerges) could police adherence to the rule, database wide, (and cluster wide for global phenomena) with a few trivial tests against the relevant catalog relations, like this:

do $body$
begin
  assert not exists (select 1 from pg_class where is_exotic(relname));
end;
$body$;

A shipped "is_exotic" function would bring the secondary, but by no means insignificant, benefit, that the case for using “format” with "%I" or its "quote_ident" cousin could be grounded upon solidly defined, and named, notions. Like this example shows:

do $body$
declare
  stmt constant  text   not null := 'create table %s(n int);';
  n              text   not null := '';

  simple_names constant text[] not null := array['farmers', 'bønder', '农民', '农民、儿童', '农民,儿童'];

  exotic_names constant text[] not null := array['T', 'farmers and children', '"x', 'y"', '农民 & 儿童'];
begin
  foreach n in array simple_names loop
    assert not is_exotic(n), 'Problem with '||n;
  end loop;

  foreach n in array exotic_names loop
    assert is_exotic(n), 'Problem with '||n;
  end loop;
end;
$body$;

Who knew that using Chinese punctuation characters in a name (they have their own code points, and look a bit different when you look closely) don't make the name exotic...

*THE OTHER CHOICE (for option 2.2): make it true that quote_ident and format with %I surround the input with double quotes exactly and only when this is necessary*

Choosing between the two alternatives (the ORCL rule or dependable parsimony) wouldn't matter much if PG comes with a built-in is_exotic function. But if the user has to implement their own, then the implementation would be far, far simpler and therefore more reliable if it were decided to implement dependable parsimony.

The following code shows what I mean.

create function is_exotic(name in text)
  returns boolean
  language plpgsql
as $body$
declare
  ident          constant text not null := quote_ident_2(name);
  stripped_ident constant text not null := regexp_replace(regexp_replace(ident, '^"', ''), '"$', '');
  stmt           constant text not null := 'deallocate %s';
begin
  /*
    This is the easy special case. The input is entirely letters inascii(7), or some
    other character set that distinguishes between upper and lower case, and has at
    least one upper case letter. This literal actual argument 'Dogs' is a sufficient
    example. Here, the value of "stripped_ident" will be equal to the value of the
    input. It's essential to test for this special case because the general test,
    below, is (famously) happy to execute "deallocate Dog" and the like.
  */
  if lower(name) <> stripped_ident then
    return true;
  end if;

  /*
    This is the harder general case. The input is anything at all that passes the
    first test and that causes no error when used "as is" in the role of a SQL identifier.
    Without access to the actual implementation of the SQL parser, the simplest practical
    way to check this property is to use an empirical test. The "deallocate" statement is
    fairly lightweight. The approach seems to bring no measureable performance penalty.
  */
  begin
    execute format(stmt, stripped_ident);
  exception when invalid_sql_statement_name then null; end;
  return false;

exception when syntax_error then
  return true;
end;
$body$;

Compare this with the implementation that I thought, at first, that I could use when I simply believed the doc. (The subject line of this thread hits at the trivial SQL statement that would implement the "language SQL" function.) ANd if that's all there is to it, then when not ship is as a built-in?

pgsql-general by date:

Previous
From: Adrian Klaver
Date:
Subject: Re: pg_restore creates public schema?
Next
From: Adrian Klaver
Date:
Subject: Re: ('dog$house' = quote_ident('dog$house')) is surprisingly FALSE