Re: Why format() adds double quote? - Mailing list pgsql-hackers

From Tatsuo Ishii
Subject Re: Why format() adds double quote?
Date
Msg-id 20160128.090029.781286852790195741.t-ishii@sraoss.co.jp
In response to Re: Why format() adds double quote?  ("Daniel Verite" <daniel@manitou-mail.org>)
List pgsql-hackers
> I've used white space in the example, but I'm concerned about
> punctuation too.
> 
> unicode.org has this helpful paper:
> http://www.unicode.org/L2/L2000/00260-sql.pdf
> which studies Unicode in SQL-99 identifiers.
> 
> The relevant BNF they extracted from the standard looks like this:
> 
> <identifier body> ::=
>    <identifier start>
>    [ { <underscore> | <identifier part> }... ]
> 
> <identifier start> ::=
>    <initial alphabetic character>
>    | <ideographic character>
> 
> <identifier part> ::=
>     <alphabetic character>
>     | <ideographic character>
>     | <decimal digit character>
>     | <identifier combining character>
>     | <underscore>
>     | <alternate underscore>
>     | <extender character>
>     | <identifier ignorable character>
>     | <connector character>
> 
> <delimited identifier> ::=
>    <double quote> <delimited identifier body> <double quote>
> 
> <delimited identifier body> ::= <delimited identifier part>...
> 
> <delimited identifier part> ::=
>    <nondoublequote character>
>    | <doublequote symbol>
> 
> ========
> 
> The current version of quote_ident() plays it safe by implementing
> the rule that, as soon as it encounters a character outside
> of US-ASCII, it surrounds the identifier with double quotes, no matter
> to which category or block this character belongs.
> So its output is guaranteed to be compatible with the above grammar.
> 
> The change in the patch is that multibyte characters just don't imply
> quoting.
> 
> But according to points 1 and 2 of the paper, the first character
> must have the Unicode alphabetic property, and it must not
> have the Unicode combining property.

Good point.
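
To make the first-character rule concrete, here is a rough sketch in
Python (not PostgreSQL code). The function name first_char_ok is made up
for illustration, and it only approximates the Unicode Alphabetic property
with the general category "L*", because Python's unicodedata module does
not expose the Alphabetic property directly.

```python
import unicodedata


def first_char_ok(ident: str) -> bool:
    """Approximate check of the first-character rule quoted above.

    Assumption: "alphabetic" is approximated by general category L*
    (letters), which is narrower than the real Alphabetic property.
    """
    if not ident:
        return False
    ch = ident[0]
    category = unicodedata.category(ch)
    is_letter = category.startswith("L")
    # Combining marks have category M*; a nonzero canonical combining
    # class is checked as well to be conservative.
    is_combining = category.startswith("M") or unicodedata.combining(ch) != 0
    return is_letter and not is_combining


if __name__ == "__main__":
    for name in ["\u00e9t\u00e9", "\u0301abc", "1abc", "_x"]:
        print(repr(name), first_char_ok(name))
```

Note that this rejects a leading underscore, which the grammar above does
not allow as <identifier start> but which PostgreSQL accepts.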

> I'm mostly ignorant in Unicode so I'm not sure of the precise
> implications of having such Unicode properties, but still my
> understanding is that the new quote_ident() ignores these rules,
> so in this sense it could produce outputs that wouldn't be
> compatible with SQL-99.
> 
> Also, here's what we say in the manual about non quoted identifiers:
> http://www.postgresql.org/docs/current/static/sql-syntax-lexical.html
> 
> "SQL identifiers and key words must begin with a letter (a-z, but also
> letters with diacritical marks and non-Latin letters) or an underscore
> (_). Subsequent characters in an identifier or key word can be
> letters, underscores, digits (0-9), or dollar signs ($)"
> 
> So it explicitly allows letters in general  (and also seems less
> strict than SQL-99 about underscore), but it makes no promise about
> Unicode punctuation or spaces, for instance, even though in practice
> the parser seems to accept them just fine.

You could extend your point arbitrarily, beyond Unicode punctuation
or spaces. There are a number of characters in Unicode that look like
"-", for example. Do we want to treat them like ASCII "-"?

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp