Re: Why format() adds double quote? - Mailing list pgsql-hackers

From Daniel Verite
Subject Re: Why format() adds double quote?
Msg-id e3788f38-83e5-4036-9fd7-faa6ea32b774@mm
In response to Re: Why format() adds double quote?  (Tatsuo Ishii <ishii@postgresql.org>)
Responses Re: Why format() adds double quote?
List pgsql-hackers
Tatsuo Ishii wrote:

> 2) What does the SQL standard say? Do they say that non ASCII white
>   spaces should be treated as ASCII white spaces?

I've used white space in the example, but I'm concerned about
punctuation too.

unicode.org has this helpful paper:
http://www.unicode.org/L2/L2000/00260-sql.pdf
which studies Unicode in SQL-99 identifiers.

The relevant BNF they extracted from the standard looks like this:

<identifier body> ::=  <identifier start>  [ { <underscore> | <identifier part> }... ]

<identifier start> ::=  <initial alphabetic character>  | <ideographic character>

<identifier part> ::=  <alphabetic character>  | <ideographic character>
    | <decimal digit character>  | <identifier combining character>
    | <underscore>  | <alternate underscore>  | <extender character>
    | <identifier ignorable character>  | <connector character>

<delimited identifier> ::=  <double quote> <delimited identifier body> <double quote>

<delimited identifier body> ::= <delimited identifier part>...

<delimited identifier part> ::=  <nondoublequote character>  | <doublequote symbol>

========

The current version of quote_ident() plays it safe by implementing
the rule that, as soon as it encounters a character outside
of US-ASCII, it surrounds the identifier with double quotes, no matter
to which category or block this character belongs.
So its output is guaranteed to be compatible with the above grammar.
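
To make the rule concrete, here is a rough sketch of it, written in Python for brevity. It is not the actual C code: the function name is made up for illustration, and the keyword check that the real quote_ident() also performs is deliberately left out. The assumption is that only lower-case ASCII letters, digits and underscore are considered safe without quotes.

```python
def quote_ident_ascii_only(ident: str) -> str:
    """Simplified model of the current rule (keyword check omitted).

    Hypothetical helper for illustration, not PostgreSQL source code.
    """
    def is_safe_start(ch: str) -> bool:
        # Only lower-case ASCII letters and underscore may start
        # an unquoted identifier.
        return "a" <= ch <= "z" or ch == "_"

    def is_safe_part(ch: str) -> bool:
        # Later characters may also be ASCII digits.
        return is_safe_start(ch) or "0" <= ch <= "9"

    safe = (
        bool(ident)
        and is_safe_start(ident[0])
        and all(is_safe_part(ch) for ch in ident[1:])
    )
    if safe:
        return ident
    # Anything else, including every non-ASCII character, forces quoting;
    # embedded double quotes are doubled.
    return '"' + ident.replace('"', '""') + '"'
```

Since any character outside that small ASCII set triggers the quoted form, the result is always either a plain ASCII identifier or a delimited identifier.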

The change in the patch is that multibyte characters no longer imply
quoting by themselves.

But according to points 1 and 2 of the paper, the first character
must have the Unicode alphabetic property, and it must not
have the Unicode combining property.
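
A check along these lines could be sketched as follows, again in Python. This is only an approximation under stated assumptions: Python's unicodedata module does not expose the derived Alphabetic property, so str.isalpha() (general categories Lu, Ll, Lt, Lm, Lo) stands in for it, and "combining" is taken to mean a mark category (M*) or a non-zero canonical combining class. The function name is hypothetical.

```python
import unicodedata


def first_char_ok(ident: str) -> bool:
    """Approximate points 1 and 2 for the first character.

    Hypothetical helper; isalpha() is a stand-in for the Unicode
    Alphabetic property, which Python does not expose directly.
    """
    if not ident:
        return False
    first = ident[0]
    # Point 1 (approximated): the character is a letter.
    is_letter = first.isalpha()
    # Point 2 (approximated): the character is not a combining mark.
    # With isalpha() as the letter test this is redundant, but it keeps
    # the two conditions visible separately.
    is_combining = (
        unicodedata.category(first).startswith("M")
        or unicodedata.combining(first) != 0
    )
    return is_letter and not is_combining
```

Under this approximation an accented letter or an ideograph passes as a first character, while a combining accent, a no-break space or an underscore does not.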

I know little about Unicode, so I'm not sure of the precise
implications of having such Unicode properties, but still my
understanding is that the new quote_ident() ignores these rules,
so in this sense it could produce outputs that wouldn't be
compatible with SQL-99.

Also, here's what we say in the manual about unquoted identifiers:
http://www.postgresql.org/docs/current/static/sql-syntax-lexical.html

"SQL identifiers and key words must begin with a letter (a-z, but also
letters with diacritical marks and non-Latin letters) or an underscore
(_). Subsequent characters in an identifier or key word can be
letters, underscores, digits (0-9), or dollar signs ($)"

So it explicitly allows letters in general (and also seems less
strict than SQL-99 about underscore), but it makes no promise about
Unicode punctuation or spaces, for instance, even though in practice
the parser seems to accept them just fine.
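
One plausible explanation for that behavior, stated here as an assumption rather than checked against the lexer source: the identifier rules work on bytes and treat every byte with the high bit set as an identifier character, without looking at what the decoded character is. A minimal Python model of such a byte-based rule (hypothetical function name, UTF-8 encoding assumed):

```python
def lexer_accepts_unquoted(ident: str) -> bool:
    """Model of a byte-based identifier rule (assumed, not verified).

    Any byte >= 0x80 counts as an identifier character, so multibyte
    punctuation and spaces pass just like multibyte letters.
    """
    data = ident.encode("utf-8")
    if not data:
        return False

    def is_start(b: int) -> bool:
        # ASCII letters, underscore, or any high-bit byte.
        return 65 <= b <= 90 or 97 <= b <= 122 or b == 0x5F or b >= 0x80

    def is_cont(b: int) -> bool:
        # Same as start, plus ASCII digits and the dollar sign.
        return is_start(b) or 48 <= b <= 57 or b == 0x24

    return is_start(data[0]) and all(is_cont(b) for b in data[1:])
```

With this model an ASCII space ends the identifier, but a no-break space (U+00A0) or a guillemet (U+00AB) is accepted, which matches the observation that such characters work unquoted even though the manual does not promise it.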


Best regards,
--
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite


