Re: Why format() adds double quote? - Mailing list pgsql-hackers

From	Daniel Verite
Subject	Re: Why format() adds double quote?
Date	January 27, 2016 15:37:59
Msg-id	e3788f38-83e5-4036-9fd7-faa6ea32b774@mm
In response to	Re: Why format() adds double quote? (Tatsuo Ishii <ishii@postgresql.org>)
Responses	Re: Why format() adds double quote?
List	pgsql-hackers

Tatsuo Ishii wrote:

> 2) What does the SQL standard say? Do they say that non ASCII white
>   spaces should be treated as ASCII white spaces?

I've used white space in the example, but I'm concerned about
punctuation too.

unicode.org has this helpful paper:
http://www.unicode.org/L2/L2000/00260-sql.pdf
which studies Unicode in SQL-99 identifiers.

The relevant BNF they extracted from the standard looks like this:

<identifier body> ::=  <identifier start>  [ { <underscore> | <identifier part> }... ]

<identifier start> ::=  <initial alphabetic character>  | <ideographic character>

<identifier part> ::=   <alphabetic character>   | <ideographic character>   | <decimal digit character>
   | <identifier combining character>   | <underscore>   | <alternate underscore>   | <extender character>
   | <identifier ignorable character>   | <connector character>

<delimited identifier> ::=  <double quote> <delimited identifier body> <double quote>

<delimited identifier body> ::= <delimited identifier part>...

<delimited identifier part> ::=  <nondoublequote character>  | <doublequote symbol>

========

The current version of quote_ident() plays it safe by implementing
the rule that, as soon as it encounters a character outside
of US-ASCII, it surrounds the identifier with double quotes, no matter
which category or block this character belongs to.
So its output is guaranteed to be compatible with the above grammar.
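
To make the rule concrete, here is a minimal sketch of that behavior in
Python. It is a model written for illustration, not the actual C code:
the function name is hypothetical, and the keyword check that the real
function also performs is deliberately left out.

```python
def quote_ident_ascii_only(ident: str) -> str:
    """Hypothetical model of the conservative rule: quote unless every
    character is a lowercase ASCII letter, a digit or an underscore.
    The reserved-keyword check of the real function is omitted here."""
    first = ident[:1]
    # First character: lowercase ASCII letter or underscore only.
    first_ok = first == "_" or "a" <= first <= "z"
    # Remaining characters: lowercase ASCII letters, digits, underscore.
    rest_ok = all(
        c == "_" or "a" <= c <= "z" or "0" <= c <= "9" for c in ident[1:]
    )
    if first_ok and rest_ok:
        return ident
    # Delimited identifier: embedded double quotes are doubled.
    return '"' + ident.replace('"', '""') + '"'
```

With this model, any identifier containing a non-ASCII character, whether
a letter, a space or punctuation, comes back as a delimited identifier,
which always matches the <delimited identifier> production.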

The change in the patch is that multibyte characters, by themselves,
no longer imply quoting.

But according to points 1 and 2 of the paper, the first character
must have the Unicode alphabetic property, and it must not
have the Unicode combining property.

I know little about Unicode, so I'm not sure of the precise
implications of these Unicode properties, but my
understanding is that the new quote_ident() ignores these rules,
so in this sense it could produce output that isn't
compatible with SQL-99.
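
As a rough illustration of those two points, the check on the first
character could be sketched as follows. This is an approximation under
stated assumptions: Python's str.isalpha() tests for the letter general
categories, which is close to but not the same as the Unicode Alphabetic
property, and "combining" is approximated by the mark categories (Mn, Mc,
Me). The function name is hypothetical.

```python
import unicodedata


def first_char_ok(ident: str) -> bool:
    """Approximate check of the two rules on the first character."""
    if not ident:
        return False
    ch = ident[0]
    # Approximation of "has the alphabetic property": letter categories.
    is_alpha = ch.isalpha()
    # Approximation of "has the combining property": mark categories.
    # With the isalpha() approximation this is already implied, but it
    # is kept explicit to mirror the two separate points of the paper.
    is_combining = unicodedata.category(ch).startswith("M")
    return is_alpha and not is_combining
```

Under this approximation, an identifier starting with U+3000 (ideographic
space) or U+0301 (combining acute accent) fails the check, whereas one
starting with an accented or non-Latin letter passes.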

Also, here's what we say in the manual about non quoted identifiers:
http://www.postgresql.org/docs/current/static/sql-syntax-lexical.html

"SQL identifiers and key words must begin with a letter (a-z, but also
letters with diacritical marks and non-Latin letters) or an underscore
(_). Subsequent characters in an identifier or key word can be
letters, underscores, digits (0-9), or dollar signs ($)"

So it explicitly allows letters in general (and also seems less
strict than SQL-99 about underscore), but it makes no promise about
Unicode punctuation or spaces, for instance, even though in practice
the parser seems to accept them just fine.
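
A likely explanation for the parser's behavior, sketched below, is that
the lexer works on bytes and treats every byte with the high bit set as
an identifier character, so it never looks at Unicode categories. The
pattern is my reading of the scanner's identifier rule and should be
taken as an assumption; the function name is hypothetical, and UTF-8 is
assumed as the encoding.

```python
import re

# Assumed byte-level identifier rule: ASCII letters, underscore and any
# byte >= 0x80 may start an identifier; digits and '$' may follow.
IDENT_RE = re.compile(rb"[A-Za-z\x80-\xff_][A-Za-z\x80-\xff_0-9$]*")


def lexes_as_bare_identifier(ident: str) -> bool:
    """True if the whole string matches the assumed byte-level rule."""
    return IDENT_RE.fullmatch(ident.encode("utf-8")) is not None
```

Under that assumption, "foo" followed by U+3000 and "bar" is a single
unquoted identifier, because U+3000 encodes to three high-bit bytes,
while the same string with an ASCII space is not.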

Best regards,
--
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite
