Thread: is there a deep unyielding reason to limit U&'' literals to ASCII?
I see in the documentation (and confirm in practice) that a Unicode character string literal U&'...' is only allowed to have <Unicode escape value>s representing Unicode characters if the server encoding is, exactly and only, UTF8.

Otherwise, it can still have <Unicode escape value>s, but they can only be in the range \+000001 to \+00007f and can only represent ASCII characters ... and this isn't just for an ASCII server encoding but for _any server encoding other than UTF8_.

I'm a newcomer here, so maybe there was an existing long conversation where that was determined to be necessary for some deep reason, and I just need to be pointed to it.

What I would have expected would be to allow <Unicode escape value>s for any Unicode codepoint that's representable in the server encoding, whatever encoding that is. Indeed, that's how I read the SQL standard (or my scrounged 2006 draft of it, anyway). The standard even lets you precede U& with _charsetname and have the escapes be allowed to be any character representable in the specified charset. *That*, I assume, would be tough to implement in PostgreSQL, since strings don't walk around with their own personal charsets attached. But what's the reason for not being able to mention characters available in the server encoding?

-Chap
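For readers following along, the documented restriction can be modeled in a few lines. This is a simplified sketch only: the function name and `server_encoding` parameter are illustrative, not PostgreSQL internals, and the backend's real lexer handles more (the UESCAPE clause, surrogate pairs, doubled escape characters).

```python
import re

def check_unicode_escapes(body, server_encoding):
    """Validate the <Unicode escape value>s in the body of a U&'...' literal.

    Escapes are \\XXXX (4 hex digits) or \\+XXXXXX (6 hex digits).
    Models the documented rule: any Unicode code point is accepted only
    when server_encoding is UTF8; under every other server encoding,
    only \\+000001 .. \\+00007F (i.e. ASCII) is allowed.
    """
    for m in re.finditer(r'\\(?:\+([0-9A-Fa-f]{6})|([0-9A-Fa-f]{4}))', body):
        cp = int(m.group(1) or m.group(2), 16)
        if server_encoding != 'UTF8' and not (0x01 <= cp <= 0x7F):
            raise ValueError(
                f'escape \\+{cp:06X} above U+007F requires UTF8 '
                f'server encoding (current: {server_encoding})')
    return True

check_unicode_escapes(r'd\0061t\+000061', 'LATIN1')  # ASCII escapes: OK anywhere
check_unicode_escapes(r'\+01D56B', 'UTF8')           # non-ASCII: OK only under UTF8
```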
On Sat, Jan 23, 2016 at 11:27 PM, Chapman Flack <chap@anastigmatix.net> wrote:
> I see in the documentation (and confirm in practice) that a
> Unicode character string literal U&'...' is only allowed to have
> <Unicode escape value>s representing Unicode characters if the
> server encoding is, exactly and only, UTF8.
>
> Otherwise, it can still have <Unicode escape value>s, but they can only
> be in the range \+000001 to \+00007f and can only represent ASCII characters
> ... and this isn't just for an ASCII server encoding but for _any server
> encoding other than UTF8_.
>
> I'm a newcomer here, so maybe there was an existing long conversation
> where that was determined to be necessary for some deep reason, and I
> just need to be pointed to it.
>
> What I would have expected would be to allow <Unicode escape value>s
> for any Unicode codepoint that's representable in the server encoding,
> whatever encoding that is. Indeed, that's how I read the SQL standard
> (or my scrounged 2006 draft of it, anyway). The standard even lets
> you precede U& with _charsetname and have the escapes be allowed to
> be any character representable in the specified charset. *That*, I assume,
> would be tough to implement in PostgreSQL, since strings don't walk
> around with their own personal charsets attached. But what's the reason
> for not being able to mention characters available in the server encoding?

I don't know anything for sure here, but I wonder if it would make validating string literals in non-UTF8 encodings significantly more costly. When the encoding is UTF-8, the test as to whether the escape sequence forms a legal code point doesn't require any table lookups.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
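The cheap check being alluded to can be sketched as pure range arithmetic (the function name is mine; the real validation lives in the backend's C scanner):

```python
def is_legal_utf8_codepoint(cp: int) -> bool:
    """Decide whether an escape denotes a legal code point, UTF-8 case.

    Pure range arithmetic: inside the Unicode code space and not a
    UTF-16 surrogate. No conversion table is consulted, which is why
    this test is cheap for the lexer.
    """
    return 0 < cp <= 0x10FFFF and not (0xD800 <= cp <= 0xDFFF)

print(is_legal_utf8_codepoint(0x1F600))   # True
print(is_legal_utf8_codepoint(0xD800))    # False: surrogate
print(is_legal_utf8_codepoint(0x110000))  # False: beyond U+10FFFF
```

Any other server encoding has no such shortcut: deciding whether a given code point exists in it means consulting a conversion table per character.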
Robert Haas <robertmhaas@gmail.com> writes:
> On Sat, Jan 23, 2016 at 11:27 PM, Chapman Flack <chap@anastigmatix.net> wrote:
>> What I would have expected would be to allow <Unicode escape value>s
>> for any Unicode codepoint that's representable in the server encoding,
>> whatever encoding that is.

> I don't know anything for sure here, but I wonder if it would make
> validating string literals in non-UTF8 encodings significantly more
> costly.

I think it would, and it would likely also require function calls to loadable functions (at least given the current design whereby encoding conversions are farmed out to loadable libraries). I do not especially want the lexer doing that; it will open all sorts of fun questions involving what we can lex in an already-failed transaction.

It may well be that these issues are surmountable with some sweat, but it doesn't sound like an easy patch to me. And how big is the use-case, really?

			regards, tom lane
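For concreteness, the kind of lookup under discussion can be imitated with Python's codec machinery standing in for PostgreSQL's loadable conversion procs (an analogy only, not the backend's actual mechanism): deciding whether a code point exists in a non-UTF8 encoding means actually attempting the conversion.

```python
def representable_in(cp: int, server_encoding: str) -> bool:
    """Can this Unicode code point be stored in the given encoding?

    Unlike the UTF-8 case, there is no arithmetic shortcut: the
    character must be run through a conversion table to see whether
    it maps. (Python codecs here play the role of PostgreSQL's
    conversion libraries; this is an illustration, not backend code.)
    """
    try:
        chr(cp).encode(server_encoding)
        return True
    except UnicodeEncodeError:
        return False

print(representable_in(0x00E9, 'latin-1'))  # True: e-acute exists in LATIN1
print(representable_in(0x0441, 'latin-1'))  # False: no Cyrillic in LATIN1
print(representable_in(0x0441, 'koi8-r'))   # True: Cyrillic fits in KOI8-R
```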
Re: [HACKERS] is there a deep unyielding reason to limit U&'' literals to ASCII?
From: Chapman Flack
On 1/25/16 12:52 PM, Tom Lane wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Sat, Jan 23, 2016 at 11:27 PM, Chapman Flack <chap@anastigmatix.net> wrote:
>>> What I would have expected would be to allow <Unicode escape value>s
>>> for any Unicode codepoint that's representable in the server encoding,
>>> whatever encoding that is.
>
>> I don't know anything for sure here, but I wonder if it would make
>> validating string literals in non-UTF8 encodings significantly more
>> costly.
>
> I think it would, and it would likely also require function calls to
> loadable functions (at least given the current design whereby encoding
> conversions are farmed out to loadable libraries). I do not especially
> want the lexer doing that; it will open all sorts of fun questions
> involving what we can lex in an already-failed transaction.

How outlandish would it be (not for v12, obviously!) to decree that the lexer produces UTF-8 representations of string and identifier literals unconditionally, and that in some later stage of processing the parse tree, those get munged to the server encoding if different?

That would keep the lexer simple, and I think it's in principle the 'correct' view, if there is such a thing; the choice of encoding doesn't change what counts as valid lexical form for a U&'...' or U&"..." literal, but only whether a literal thus created happens to fit in your encoding. If it doesn't, I think that's technically a data error (22021) rather than one of syntax or lexical form.

> It may well be that these issues are surmountable with some sweat,
> but it doesn't sound like an easy patch to me. And how big is the
> use-case, really?

Hmm, other than the benefit of not having to explain why it /doesn't/ work? One could imagine a tool generating SQL output that'll be saved and run in a database through client or server encodings not known in advance, adopting the simple strategy of producing only 7-bit ASCII output and using U& literals for whatever ain't ASCII. That would be, in principle, about the most bulletproof way for such a tool to work, but it's exactly what won't work in PostgreSQL unless the encoding is UTF-8 (which is the one scenario where there's no need for such machinations, since the literals could appear directly!).

I'm a maintainer of one such SQL-generating tool, so I know the set of use cases would have at least one element, if only it would work.

Regards,
-Chap
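The "emit pure 7-bit ASCII, spell everything else with U& escapes" strategy described above is easy to sketch (this helper is hypothetical, not taken from any actual tool):

```python
def to_uescape_literal(s: str) -> str:
    """Render s as a SQL U&'...' literal containing only 7-bit ASCII.

    ASCII characters pass through (with '' doubling for quotes and
    \\\\ doubling for the escape character itself); everything else
    becomes \\XXXX or, beyond the BMP, \\+XXXXXX.
    """
    out = []
    for ch in s:
        cp = ord(ch)
        if ch == "'":
            out.append("''")          # quote doubling, as in any SQL literal
        elif ch == '\\':
            out.append('\\\\')        # the default escape char must be doubled
        elif cp < 0x80:
            out.append(ch)
        elif cp <= 0xFFFF:
            out.append(f'\\{cp:04X}')
        else:
            out.append(f'\\+{cp:06X}')
    return "U&'" + ''.join(out) + "'"

print(to_uescape_literal('na\u00efve \ufb00'))  # U&'na\00EFve \FB00'
```

Output of such a tool would survive any client/server encoding on the wire, which is exactly why the UTF8-only restriction on the escapes is the sticking point.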