Re: Unicode support - Mailing list pgsql-hackers

From Andrew Gierth
Subject Re: Unicode support
Date
Msg-id 87r5zualin.fsf@news-spur.riddles.org.uk
In response to Re: Unicode support  (Peter Eisentraut <peter_e@gmx.net>)
List pgsql-hackers
>>>>> "Peter" == Peter Eisentraut <peter_e@gmx.net> writes:
> On Tuesday 14 April 2009 07:07:27 Andrew Gierth wrote:
>> FWIW, the SQL spec puts the onus of normalization squarely on the
>> application; the database is allowed to assume that Unicode
>> strings are already normalized, is allowed to behave in
>> implementation-defined ways when presented with strings that
>> aren't normalized, and provision of normalization functions and
>> predicates is just another optional feature.
 
Peter> Can you name chapter and verse on that?

4.2.8 Universal character sets
 A UCS string is a character string whose character repertoire is UCS and whose character encoding form is one of UTF8,
UTF16, or UTF32. Any two UCS strings are comparable.
 
 An SQL-implementation may assume that all UCS strings are normalized in one of Normalization Form C (NFC),
Normalization Form D (NFD), Normalization Form KC (NFKC), or Normalization Form KD (NFKD), as specified by [Unicode15].
<normalized predicate> may be used to verify the normalization form to which a particular UCS string conforms.
Applications may also use <normalize function> to enforce a particular <normal form>. With the exception of <normalize
function> and <normalized predicate>, the result of any operation on an unnormalized UCS string is
implementation-defined.
 Conversion of UCS strings from one character set to another is automatic.
 Detection of a noncharacter in a UCS-string causes an exception condition to be raised. The detection of an unassigned
codepoint does not.
 

[Obviously there are things here that we don't conform to anyway (we
don't raise exceptions for noncharacters, for example). We don't claim
conformance to T061.]
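
To make the normalization forms concrete outside SQL, here is a minimal
sketch using Python's standard unicodedata module. It is only an analogy
for <normalize function> and <normalized predicate>, not a description of
any SQL implementation; the variable names are made up for illustration,
and is_normalized() assumes Python 3.8 or later.

```python
import unicodedata

# "é" written two ways: precomposed U+00E9, and "e" followed by
# U+0301 COMBINING ACUTE ACCENT.
composed = "\u00e9"
decomposed = "e\u0301"

# Rough analogue of <normalize function>: convert to a chosen normal form.
nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", composed)

# Rough analogue of <normalized predicate> (Python 3.8+).
decomposed_is_nfc = unicodedata.is_normalized("NFC", decomposed)

# Without normalization, the two spellings compare as different strings.
raw_equal = composed == decomposed
normalized_equal = nfc == unicodedata.normalize("NFC", composed)

print(decomposed_is_nfc, raw_equal, normalized_equal)
```

The point of the sketch is that comparison only behaves intuitively once
both inputs are in the same normal form, which is the burden the spec
leaves to the application.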

<normalized predicate> ::= <row value predicand> <normalized predicate part 2>
<normalized predicate part 2> ::= IS [ NOT ] [ <normal form> ] NORMALIZED

1) Without Feature T061, "UCS support", conforming SQL language shall not contain a <normalized predicate>.

2) Without Feature F394, "Optional normal form specification", conforming SQL language shall not contain <normal form>.

<normalize function> ::= NORMALIZE <left paren> <character value expression>
    [ <comma> <normal form> [ <comma> <normalize function result length> ] ] <right paren>
 

<normal form> ::=   NFC | NFD | NFKC | NFKD

7) Without Feature T061, "UCS support", conforming SQL language shall not contain a <normalize function>.

9) Without Feature F394, "Optional normal form specification", conforming SQL language shall not contain <normal form>.
Peter> I see this, for example,
Peter> 6.27 <numeric value function>
Peter> [...]
Peter> So SQL redirects the question of character length to the Unicode
Peter> standard. I have not been able to find anything there on a
Peter> quick look, but I'm sure the Unicode standard has some very
Peter> specific ideas on this.  Note that the matter of normalization
Peter> is not mentioned here.
 

I've taken a not-so-quick look at the Unicode standard (though I don't
claim to be any sort of expert on it), and I certainly can't see any
definitive indication of what the length is supposed to be; however, the
use of terminology such as "combining character sequence" (meaning a
series of codepoints that combine to make a single glyph) certainly
seems to strongly imply that our interpretation is correct and that
the OP's is not.
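
As a rough illustration of the distinction, the Python sketch below
counts the same string both ways. The helper combining_sequences() is
hypothetical and deliberately simplified: it groups a base code point
with any following code points that have a nonzero canonical combining
class, which is not the full grapheme cluster algorithm of UAX #29.

```python
import unicodedata

def combining_sequences(s):
    """Group each base code point with the combining marks that follow it.

    Simplified assumption: a code point is a combining mark if its
    canonical combining class is nonzero.
    """
    groups = []
    for ch in s:
        if groups and unicodedata.combining(ch) != 0:
            groups[-1] += ch      # attach the mark to the previous base
        else:
            groups.append(ch)     # start a new sequence
    return groups

s = "re\u0301sume\u0301"          # "résumé" with decomposed accents

code_point_count = len(s)         # length in code points
sequences = combining_sequences(s)
sequence_count = len(sequences)   # length in combining sequences

print(code_point_count, sequence_count)
```

Under the code-point interpretation the length is 8; counting combining
sequences would give 6.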

Other indications: the units used by length() must be the same as the
units used by position() and substring() (in the spec, when USING
CHARACTERS is specified), and it would not make sense to use a
definition of "character" that did not allow you to look inside a
combining sequence.
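
A small Python sketch of that consistency argument, assuming code points
as the unit (Python's find() and slicing stand in for position() and
substring() here, and are 0-based rather than 1-based as in SQL):

```python
s = "cafe\u0301 au lait"       # decomposed "café au lait"

length = len(s)                # length, counted in code points
pos = s.find("\u0301")         # position of the combining accent
base = s[pos - 1:pos]          # substring: the base letter before it
mark = s[pos:pos + 1]          # substring: the combining mark alone

# All three operations agree on the unit, so the offsets returned by
# one can be fed directly to the others.
print(length, pos, base == "e", mark == "\u0301")
```

If a whole combining sequence counted as one character, there would be
no position at which the accent could be found or extracted on its own.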

I've also failed so far to find any examples of other programming
languages in which a combining character sequence is taken to be a
single character for purposes of length or position specification.
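
For instance, Python 3's len() counts code points, while Java's
String.length() and JavaScript's length property count UTF-16 code
units. The sketch below computes those units for one combining sequence
from within Python (the UTF-16 figure is derived by encoding, as a
stand-in for what those other languages would report):

```python
s = "e\u0301"                                   # one glyph, two code points

code_points = len(s)                            # Python's notion of length
utf16_units = len(s.encode("utf-16-le")) // 2   # 2 bytes per UTF-16 code unit
utf8_bytes = len(s.encode("utf-8"))             # octet length

print(code_points, utf16_units, utf8_bytes)
```

None of these units yields 1 for the sequence.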

-- 
Andrew (irc:RhodiumToad)

