Thread: Unicode string literals versus the world
So I started to look at what might be involved in teaching plpgsql about standard_conforming_strings, and was soon dismayed by the sheer epic nature of its failure to act like the core lexer. It was shaky enough before, but the recent introduction of Unicode strings and identifiers into the core has left plpgsql hopelessly behind. I can see two basic approaches to making things work: copy-and-paste practically all of parser/scan.l into plpgsql's lexer (certainly all of it that involves exclusive states); or throw out plpgsql's lexer altogether in favor of somehow using the core lexer directly. Neither one looks very attractive. It gets worse though: I have seldom seen such a badly designed piece of syntax as the Unicode string syntax --- see http://developer.postgresql.org/pgdocs/postgres/sql-syntax-lexical.html#SQL-SYNTAX-STRINGS-UESCAPE You scan the string, and then after that they tell you what the escape character is!? Not to mention the obvious ambiguity with & as an operator. If we let this go into 8.4, our previous rounds with security holes caused by careless string parsing will look like a day at the beach. No frontend that isn't fully cognizant of the Unicode string syntax is going to parse such things correctly --- it's going to be trivial for a bad guy to confuse a quoting mechanism as to what's an escape and what isn't. I think we need to give very serious consideration to ripping out that "feature". regards, tom lane
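For reference while reading the thread: the escape forms described on the page Tom links are the escape character (backslash by default, or whatever a trailing UESCAPE clause names) followed by four hex digits, or by a plus sign and six hex digits, with a doubled escape character standing for itself. The following Python sketch is purely illustrative (the function is hypothetical, not PostgreSQL code); it shows the two-pass shape Tom objects to, since the escape character is only supplied after the string body has been scanned:

    import re

    def decode_uescape(body, esc='\\'):
        # 'body' is the text between the quotes of a U&'...' literal, already
        # isolated by an earlier scanning pass; 'esc' is whatever the optional
        # trailing UESCAPE clause named -- i.e. it becomes known only *after*
        # the string has been scanned.
        e = re.escape(esc)
        pat = re.compile(e + r'(?:\+([0-9A-Fa-f]{6})|([0-9A-Fa-f]{4})|(' + e + r'))')
        def repl(m):
            if m.group(3) is not None:      # doubled escape character -> itself
                return esc
            return chr(int(m.group(1) or m.group(2), 16))
        return pat.sub(repl, body)

    # The two spellings given in the documentation both decode to 'data':
    assert decode_uescape(r'd\0061t\+000061') == 'data'
    assert decode_uescape('d!0061t!+000061', esc='!') == 'data'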
On Fri, Apr 10, 2009 at 05:54:25PM -0400, Tom Lane wrote: > It gets worse though: I have seldom seen such a badly designed piece of > syntax as the Unicode string syntax --- see > http://developer.postgresql.org/pgdocs/postgres/sql-syntax-lexical.html#SQL-SYNTAX-STRINGS-UESCAPE > > I think we need to give very serious consideration to ripping out that > "feature". I'd agree it's comically bad syntax; I just had a look in the archives and it was only put in a few months ago: http://archives.postgresql.org/pgsql-hackers/2008-10/msg01169.php I can't see much support in the other database engines; searched for Oracle, MS-SQL, DB2 and Firebird. MySQL has it planned for 7.1, so not for a while. http://forge.mysql.com/worklog/task.php?id=3529 -- Sam http://samason.me.uk/
Tom Lane wrote: > > It gets worse though: I have seldom seen such a badly designed piece of > syntax as the Unicode string syntax --- see > http://developer.postgresql.org/pgdocs/postgres/sql-syntax-lexical.html#SQL-SYNTAX-STRINGS-UESCAPE > > You scan the string, and then after that they tell you what the escape > character is!? Not to mention the obvious ambiguity with & as an > operator. > > If we let this go into 8.4, our previous rounds with security holes > caused by careless string parsing will look like a day at the beach. > No frontend that isn't fully cognizant of the Unicode string syntax is > going to parse such things correctly --- it's going to be trivial for > a bad guy to confuse a quoting mechanism as to what's an escape and what > isn't. > > I think we need to give very serious consideration to ripping out that > "feature". > > > +1 I don't recall a great deal of discussion about it, and it certainly looks pretty horrible now you point it out. cheers andrew
On 4/11/09, Tom Lane <tgl@sss.pgh.pa.us> wrote: > It gets worse though: I have seldom seen such a badly designed piece of > syntax as the Unicode string syntax --- see > http://developer.postgresql.org/pgdocs/postgres/sql-syntax-lexical.html#SQL-SYNTAX-STRINGS-UESCAPE > > You scan the string, and then after that they tell you what the escape > character is!? Not to mention the obvious ambiguity with & as an > operator. > > If we let this go into 8.4, our previous rounds with security holes > caused by careless string parsing will look like a day at the beach. > No frontend that isn't fully cognizant of the Unicode string syntax is > going to parse such things correctly --- it's going to be trivial for > a bad guy to confuse a quoting mechanism as to what's an escape and what > isn't. > > I think we need to give very serious consideration to ripping out that > "feature". Ugh, it's rather dubious indeed. Especially when we are already in the middle of seriously confusing conversion from stdstr=off -> on. Is it really OK to introduce even more complexity in the mix? Alternative proposal - maybe it would be saner to introduce \uXXXX escape to E'' strings as a non-standard way for quoting unicode. Later when the standard quoting is our only quoting method we can play with standard extensions? -- marko
On 4/11/09 11:47 AM, Marko Kreen wrote: > On 4/11/09, Tom Lane<tgl@sss.pgh.pa.us> wrote: >> It gets worse though: I have seldom seen such a badly designed piece of >> syntax as the Unicode string syntax --- see >> http://developer.postgresql.org/pgdocs/postgres/sql-syntax-lexical.html#SQL-SYNTAX-STRINGS-UESCAPE WTF? Whose feature is this? What's the use case? Speak up, submitter, please. -- Josh Berkus PostgreSQL Experts Inc. www.pgexperts.com
Josh Berkus <josh@agliodbs.com> writes: >> On 4/11/09, Tom Lane<tgl@sss.pgh.pa.us> wrote: >>> http://developer.postgresql.org/pgdocs/postgres/sql-syntax-lexical.html#SQL-SYNTAX-STRINGS-UESCAPE > WTF? Whose feature is this? What's the use case? Peter put it in, I think. It is in the SQL:2008 spec, but that doesn't change the fact that it's a horribly bad piece of design. regards, tom lane
> Peter put it in, I think. It is in the SQL:2008 spec, but that doesn't > change the fact that it's a horribly bad piece of design. Hmmm. We're not going to implement *everything* in the spec; nobody does, even IBM. I think maybe these kinds of additions need to be hashed out for value so we don't waste Peter's work in the future. Peter? -- Josh Berkus PostgreSQL Experts Inc. www.pgexperts.com
On Saturday 11 April 2009 00:54:25 Tom Lane wrote: > It gets worse though: I have seldom seen such a badly designed piece of > syntax as the Unicode string syntax --- see > http://developer.postgresql.org/pgdocs/postgres/sql-syntax-lexical.html#SQL >-SYNTAX-STRINGS-UESCAPE > > You scan the string, and then after that they tell you what the escape > character is!? Not to mention the obvious ambiguity with & as an > operator. > > If we let this go into 8.4, our previous rounds with security holes > caused by careless string parsing will look like a day at the beach. > No frontend that isn't fully cognizant of the Unicode string syntax is > going to parse such things correctly --- it's going to be trivial for > a bad guy to confuse a quoting mechanism as to what's an escape and what > isn't. Note that the escape character marks the Unicode escapes; it doesn't affect the quote characters that delimit the string. So offhand I can't see any potential for quote confusion/SQL injection type problems. Please elaborate if you see a problem. If there are problems, we could consider getting rid of the UESCAPE clause. Without it, the U&'' strings would behave much like the E'' strings. But I'd like to understand the problem first.
On Saturday 11 April 2009 21:50:29 Josh Berkus wrote: > On 4/11/09 11:47 AM, Marko Kreen wrote: > > On 4/11/09, Tom Lane<tgl@sss.pgh.pa.us> wrote: > >> It gets worse though: I have seldom seen such a badly designed piece > >> of syntax as the Unicode string syntax --- see > >> > >> http://developer.postgresql.org/pgdocs/postgres/sql-syntax-lexical.html# > >>SQL-SYNTAX-STRINGS-UESCAPE > > WTF? Whose feature is this? What's the use case? The use case is approximately the same as &#xxx; in HTML: entering Unicode characters that your screen or keyboard cannot easily produce. It's a desperately needed feature for me.
On 4/14/09, Peter Eisentraut <peter_e@gmx.net> wrote: > On Saturday 11 April 2009 00:54:25 Tom Lane wrote: > > It gets worse though: I have seldom seen such a badly designed piece of > > syntax as the Unicode string syntax --- see > > http://developer.postgresql.org/pgdocs/postgres/sql-syntax-lexical.html#SQL > >-SYNTAX-STRINGS-UESCAPE > > > > You scan the string, and then after that they tell you what the escape > > character is!? Not to mention the obvious ambiguity with & as an > > operator. > > > > If we let this go into 8.4, our previous rounds with security holes > > caused by careless string parsing will look like a day at the beach. > > No frontend that isn't fully cognizant of the Unicode string syntax is > > going to parse such things correctly --- it's going to be trivial for > > a bad guy to confuse a quoting mechanism as to what's an escape and what > > isn't. > > > Note that the escape character marks the Unicode escapes; it doesn't affect the > quote characters that delimit the string. So offhand I can't see any potential > for quote confusion/SQL injection type problems. Please elaborate if you see > a problem. > > If there are problems, we could consider getting rid of the UESCAPE clause. > Without it, the U&'' strings would behave much like the E'' strings. But I'd > like to understand the problem first. I think the problem is that they should not act like E'' strings, but they should act like plain '' strings - they should follow stdstr setting. That way existing tools that may (or may not..) understand E'' and stdstr settings, but definitely have not heard about U&'' strings can still parse the SQL without new surprises. If they already act that way then keeping U& should be fine. And if UESCAPE does not affect main string parsing, but is handled in second pass going over parsed string - like bytea \ - then that should also be fine and should not cause any new surprises. But if not, it must go. I would prefer that such quoting extensions would wait until stdstr=on setting is the only mode Postgres will operate. Fitting new quoting ways to environment with flippable stdstr setting will be rather painful for everyone. I still stand on my proposal, how about extending E'' strings with unicode escapes (eg. \uXXXX)? The E'' strings are already more clearly defined than '' and they are our "own", we don't need to consider random standards, but can consider our sanity. -- marko
On 4/14/09, Peter Eisentraut <peter_e@gmx.net> wrote: > On Tuesday 14 April 2009 14:38:38 Marko Kreen wrote: > > I think the problem is that they should not act like E'' strings, but they > > should act like plain '' strings - they should follow stdstr setting. > > > > That way existing tools that may (or may not..) understand E'' and stdstr > > settings, but definitely have not heard about U&'' strings can still > > parse the SQL without new surprises. > > > Can you be more specific in what "surprises" you expect? What algorithms do > you suppose those "existing tools" use and what expectations do they have? If the parsing does not happen in 2 passes and it does not take account of stdstr setting then the default breakage would be: stdstr=off, U&' \' UESCAPE '!'. And anything, whose security or functionality depends on parsing SQL can be broken that way. Broken functionality would be eg. Slony (or other replication solution) distributing developer-written SQL code to bunch of nodes. It needs to parse text file to SQL statements and execute them separately. There are probably other solutions who expect to understand SQL at least token level to function correctly. (pgpool, java has probably something depending on it, etc.) > > I still stand on my proposal, how about extending E'' strings with > > unicode escapes (eg. \uXXXX)? The E'' strings are already more > > clearly defined than '' and they are our "own", we don't need to > > consider random standards, but can consider our sanity. > > > This doesn't excite me. I think the tendency should be to get rid of E'' > usage, because its definition of escape sequences is single-byte and ASCII > centric and thus overall a legacy construct. Why are you concentrating only on \0xx escapes? The \\, \n, etc seem standard and forward looking enough. Yes, unicode escapes are missing but we can add them without breaking anything. > Certainly, we will want to keep > around E'' for a long time or forever, but it is a legitimate goal for > application writers to not use it, which is after all the reason behind this > whole standards-conforming strings project. I wouldn't want to have a > forward-looking feature such as the Unicode escapes be burdened with that kind > of legacy behavior. > > Also note that Unicode escapes are also available for identifiers, for which > there is no existing E"" that you can add it to. Well, I was not rejecting the standard quoting, but suggesting postponing until the stdstr mess is sorted out. We can use \uXX in meantime and I think most Postgres users would prefer to keep using it... -- marko
Peter Eisentraut <peter_e@gmx.net> writes: > On Saturday 11 April 2009 00:54:25 Tom Lane wrote: >> If we let this go into 8.4, our previous rounds with security holes >> caused by careless string parsing will look like a day at the beach. > Note that the escape character marks the Unicode escapes; it doesn't > affect the quote characters that delimit the string. So offhand I > can't see any potential for quote confusion/SQL injection type > problems. Please elaborate if you see a problem. The problem is the interaction with non-standard-conforming strings. Case 1: select u&'foo\' uescape ',' ... The backend will see the backslash as just a data character, and will think that "..." is live SQL text. A non-Unicode-literal-aware frontend will think that the backslash escapes the second quote, the comma is live SQL text, and the ... is quoted material. Construction of an actual SQL injection attack is left as an exercise for the reader, but certainly the raw material is here. Case 2: select u&'foo' uescape '\' ... Again, any existing frontend code will think that the backslash quotes the final quote and the ... is quoted material. This one is particularly nasty because we allow arbitrary amounts of whitespace and commenting on either side of "uescape": select u&'foo' /* hello joe, do you /* understand nested comments today?*/ -- yes, this one too */uescape-- but notthis one /*'\' ... I suspect that it's actually impossible to parse such a thing correctly without a full-fledged flex lexer or something of equivalent complexity. Certainly it's a couple of orders of magnitude harder than it is for either standard-conforming or E'' literals. Case 3: select u&'foo\' uescape ',' ... select u & 'foo\' uescape ',' ... In the first form the ... is live SQL, in the second form it is quoted material. This means that you might correctly validate a query and then have your results invalidated by later processing that innocently adds or removes whitespace. (This is particularly nasty in a standard that demands we parse "x/-1" and "x / -1" the same ...) So what we've got here is a whole new set of potential SQL injection attacks by confusing frontend literal-syntax checking, plus a truly staggering increase in the minimum *required* complexity of such checking. I understand the usefulness of being able to write Unicode code points, but they're not useful enough to justify this syntax. This thread has already mentioned a couple of ways we could add the same facility without making any lexical-level changes, at least for data values. I admit that the SQL:2008 way also covers Unicode code points in identifiers, which we can't emulate without a lexical change; but frankly I think the use-case for that is so thin as to be almost nonexistent. Who is going to choose identifiers that they cannot easily type on their keyboards? regards, tom lane
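To make Case 1 concrete, here is a toy Python comparison (illustrative only, not code from psql, libpq, or any real driver) of where the backend ends the literal versus where a frontend scanner that still assumes backslash-escaped quotes (the historical stdstr=off rule) would end it. No complete exploit is constructed; the point is only that the two disagree about which part of the query is quoted material.

    def literal_end_backend(sql, start):
        # Body of a U&'...' literal: the backend scans to the next single quote.
        # The Unicode escape character never hides a quote, and may even be
        # redefined by a UESCAPE clause that follows the string.
        return sql.index("'", start + 1)

    def literal_end_naive(sql, start):
        # Frontend written for historical literals: assumes a backslash escapes
        # the following character, including a quote.
        i = start + 1
        while i < len(sql):
            if sql[i] == '\\':
                i += 2
                continue
            if sql[i] == "'":
                return i
            i += 1
        raise ValueError('unterminated literal')

    sql = r"select u&'foo\' uescape ',' ..."                 # Tom's Case 1
    start = sql.index("'")
    print(sql[start:literal_end_backend(sql, start) + 1])    # 'foo\'            (rest is live SQL)
    print(sql[start:literal_end_naive(sql, start) + 1])      # 'foo\' uescape '  (rest looks quoted)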
Marko Kreen wrote: > I still stand on my proposal, how about extending E'' strings with > unicode escapes (eg. \uXXXX)? The E'' strings are already more > clearly defined than '' and they are our "own", we don't need to > consider random standards, but can consider our sanity. > > I suspect there would be lots more support in the user community, where \uXXXX is well understood in a number of contexts (Java and ECMAScript, for example). It's also tolerably sane. cheers andrew
Marko Kreen <markokr@gmail.com> writes: > I would prefer that such quoting extensions would wait until > stdstr=on setting is the only mode Postgres will operate. > Fitting new quoting ways to environment with flippable stdstr setting > will be rather painful for everyone. It would certainly be a lot safer to wait until non-standard-conforming strings don't exist anymore. The problem is that that may never happen, and is certainly not on the roadmap to happen in the foreseeable future. > I still stand on my proposal, how about extending E'' strings with > unicode escapes (eg. \uXXXX)? The E'' strings are already more > clearly defined than '' and they are our "own", we don't need to > consider random standards, but can consider our sanity. That's one way we could proceed. The other proposal that seemed attractive to me was a decode-like function: uescape('foo\00e9bar') uescape('foo\00e9bar', '\') (double all the backslashes if you assume not standard_conforming_strings). The arguments in favor of this one are (1) you can apply it to the result of an expression, it's not strictly tied to literals; and (2) it's a lot lower-footprint solution since it doesn't affect basic literal handling. If you wish to suppose that this is only a stopgap until someday when we can implement the SQL standard syntax more safely, then low footprint is good. One could even imagine back-porting this into existing releases as a user-defined function. The solution with \u in extended literals is probably workable too. I'm slightly worried about the possibility of issues with code that thinks it knows what an E-literal means but doesn't really. In particular something might think it knows that "\u" just means "u", and proceed to strip the backslash. I don't see a path for that to become a security hole though, only a garden-variety bug. So I could live with that one on the grounds of being easier to use (which it would be, because of less typing compared to uescape()). regards, tom lane
On Tuesday 14 April 2009 14:38:38 Marko Kreen wrote: > I think the problem is that they should not act like E'' strings, but they > should act like plain '' strings - they should follow stdstr setting. > > That way existing tools that may (or may not..) understand E'' and stdstr > settings, but definitely have not heard about U&'' strings can still > parse the SQL without new surprises. Can you be more specific in what "surprises" you expect? What algorithms do you suppose those "existing tools" use and what expectations do they have? > I still stand on my proposal, how about extending E'' strings with > unicode escapes (eg. \uXXXX)? The E'' strings are already more > clearly defined than '' and they are our "own", we don't need to > consider random standards, but can consider our sanity. This doesn't excite me. I think the tendency should be to get rid of E'' usage, because its definition of escape sequences is single-byte and ASCII centric and thus overall a legacy construct. Certainly, we will want to keep around E'' for a long time or forever, but it is a legitimate goal for application writers to not use it, which is after all the reason behind this whole standards-conforming strings project. I wouldn't want to have a forward-looking feature such as the Unicode escapes be burdened with that kind of legacy behavior. Also note that Unicode escapes are also available for identifiers, for which there is no existing E"" that you can add it to.
On Saturday 11 April 2009 18:20:47 Sam Mason wrote: > I can't see much support in the other database engines; searched for > Oracle, MS-SQL, DB2 and Firebird. MySQL has it planned for 7.1, so not > for a while. DB2 supports it, as far as I know.
On Tue, Apr 14, 2009 at 8:53 AM, Peter Eisentraut <peter_e@gmx.net> wrote: > This doesn't excite me. I think the tendency should be to get rid of E'' > usage, because its definition of escape sequences is single-byte and ASCII > centric and thus overall a legacy construct. Certainly, we will want to keep > around E'' for a long time or forever, but it is a legitimate goal for > application writers to not use it, which is after all the reason behind this > whole standards-conforming strings project. I wouldn't want to have a > forward-looking feature such as the Unicode escapes be burdened with that kind > of legacy behavior. > > Also note that Unicode escapes are also available for identifiers, for which > there is no existing E"" that you can add it to. Maybe I've just got my head deeply in the sand, but I don't understand what the alternative to E'' supposedly is. How am I supposed to write the equivalent of E'\t\n\f' without using E''? The standard_conforming_strings syntax apparently supports no escapes of any kind, which seems so hideously inconvenient that I can't even imagine why someone wants that behavior. ...Robert
Robert Haas <robertmhaas@gmail.com> writes: > Maybe I've just got my head deeply in the sand, but I don't understand > what the alternative to E'' supposedly is. How am I supposed to write > the equivalent of E'\t\n\f' without using E''? The > standard_conforming_strings syntax apparently supports no escapes of > any kind, which seems so hideously inconvenient that I can't even > imagine why someone wants that behavior. Well, quite aside from issues of compatibility with standards and other databases, I'm sure there are lots of Windows users who are more interested in being able to store a Windows pathname without doubling their backslashes than they are in being able to type readable names for ASCII control characters. After all, in most cases you can get those characters into a string just by typing them (especially if you aren't using readline or something like it). BTW, does anyone know whether Unicode includes the ASCII control characters ... ie, is \u0009 a name for tab? If so, maybe this syntax is in part an attempt to cover that use-case in the standard. regards, tom lane
On Tuesday 14 April 2009 20:35:21 Robert Haas wrote: > Maybe I've just got my head deeply in the sand, but I don't understand > what the alternative to E'' supposedly is. How am I supposed to write > the equivalent of E'\t\n\f' without using E''? Well, the first alternative is to type those characters in literally. The second alternative is the U&'' syntax. ;-) The third alternative is to design applications that don't need this, because the processing behavior of those characters is quite unportable. But of course in some cases using the E'' syntax is the most convenient.
On Apr 14, 2009, at 11:22 AM, Tom Lane wrote: > BTW, does anyone know whether Unicode includes the ASCII control > characters ... ie, is \u0009 a name for tab? If so, maybe this > syntax is in part an attempt to cover that use-case in the standard. Yes, you can use, e.g., &#x0009; in HTML to represent a tab character. Best, David
On Tuesday 14 April 2009 17:13:00 Marko Kreen wrote: > If the parsing does not happen in 2 passes and it does not take account > of stdstr setting then the default breakage would be: > > stdstr=off, U&' \' UESCAPE '!'. I think we can handle that and the cases Tom presents by erroring out when the U& syntax is used with stdstr off.
On Tuesday 14 April 2009 21:22:29 Tom Lane wrote: > BTW, does anyone know whether Unicode includes the ASCII control > characters ... ie, is \u0009 a name for tab? If so, maybe this > syntax is in part an attempt to cover that use-case in the standard. Yes on both.
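A quick Python check of both points (Unicode keeps the ASCII control characters at their ASCII code points):

    assert "\u0009" == "\t" and ord("\t") == 9    # U+0009 is TAB
    assert "\u000a" == "\n" and "\u000c" == "\f"  # likewise newline and form feed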
On Tuesday 14 April 2009 18:54:33 Tom Lane wrote: > The other proposal that seemed > attractive to me was a decode-like function: > > uescape('foo\00e9bar') > uescape('foo\00e9bar', '\') This was discussed previously, but rejected with the following argument: There are some other disadvantages for making a function call. You couldn't use that kind of literal in any other place where the parser calls for a string constant: role names, tablespace locations, passwords, copy delimiters, enum values, function body, file names. Some of those appear to be very plausible use cases.
On Tuesday 14 April 2009 17:32:00 Tom Lane wrote: > I admit that the SQL:2008 way also covers Unicode code > points in identifiers, which we can't emulate without a lexical change; > but frankly I think the use-case for that is so thin as to be almost > nonexistent. Who is going to choose identifiers that they cannot easily > type on their keyboards? For example, table names are automatically generated, or you write a test case for weird looking names, or you want to add special characters in an identifier that will later be displayed somewhere, or in general you are writing an application for a foreign language.
Peter Eisentraut <peter_e@gmx.net> writes: > I think we can handle that and the cases Tom presents by erroring out > when the U& syntax is used with stdstr off. I think you're missing the point --- this is not about whether the syntax is unambiguous (it is already) but about whether a frontend that doesn't understand it 100% will be secure against subversion. I have no confidence in the latter assumption. regards, tom lane
On Tue, Apr 14, 2009 at 2:22 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> Maybe I've just got my head deeply in the sand, but I don't understand >> what the alternative to E'' supposedly is. How am I supposed to write >> the equivalent of E'\t\n\f' without using E''? The >> standard_conforming_strings syntax apparently supports no escapes of >> any kind, which seems so hideously inconvenient that I can't even >> imagine why someone wants that behavior. > > Well, quite aside from issues of compatibility with standards and other > databases, I'm sure there are lots of Windows users who are more > interested in being able to store a Windows pathname without doubling > their backslashes than they are in being able to type readable names > for ASCII control characters. After all, in most cases you can get > those characters into a string just by typing them (especially if you > aren't using readline or something like it). Well, that's fine, but that's a long way from Peter's statement that "I think the tendency should be to get rid of E'' usage". It's only been a minor inconvenience for me to change my applications to use E'', but I certainly don't see how I could get by without it, and it's far more like other programming languages that I use (e.g. C) than the standard syntax is. ...Robert
Peter Eisentraut <peter_e@gmx.net> writes: > On Tuesday 14 April 2009 18:54:33 Tom Lane wrote: >> The other proposal that seemed >> attractive to me was a decode-like function: >> >> uescape('foo\00e9bar') >> uescape('foo\00e9bar', '\') > This was discussed previously, but rejected with the following argument: > There are some other disadvantages for making a function call. You > couldn't use that kind of literal in any other place where the parser > calls for a string constant: role names, tablespace locations, > passwords, copy delimiters, enum values, function body, file names. I'm less than convinced that those are really plausible use-cases for characters that one is unable to type directly. However, I'll grant the point. So that narrows us down to considering the \u extension to E'' strings as a saner and safer alternative to the spec's syntax. regards, tom lane
Robert Haas <robertmhaas@gmail.com> writes: > Well, that's fine, but that's a long way from Peter's statement that > "I think the tendency should be to get rid of E'' usage". Bear in mind that that's Peter's opinion; it's not necessarily shared by anyone else. I was just responding to your assertion of the diametrically opposed position that non-E strings are useless (which I don't share either). regards, tom lane
On Tuesday 14 April 2009 21:48:12 Tom Lane wrote: > Peter Eisentraut <peter_e@gmx.net> writes: > > I think we can handle that and the cases Tom presents by erroring out > > when the U& syntax is used with stdstr off. > > I think you're missing the point --- this is not about whether the > syntax is unambiguous (it is already) but about whether a frontend that > doesn't understand it 100% will be secure against subversion. I have no > confidence in the latter assumption. I think I am getting the point quite well. Do you have an example how this can be subverted?
"Meredith L. Patterson" <mlp@osogato.com> writes: > Tom Lane wrote: >> I suspect that it's actually impossible to parse such a thing correctly >> without a full-fledged flex lexer or something of equivalent complexity. > Is there a reason not to use a full-fledged flex lexer? The point is that that's a pretty large imposition on client code that we don't control or maintain, in order to get a feature that could be gotten in much less dangerous ways that don't impact any code outside PG. > I'd be willing to take a crack at such a thing, but I'm working 80-hour > weeks through the end of June and likely wouldn't be able to put in any > time on it till then. So I definitely couldn't promise anything for 8.4, > but if putting it off till 8.5 works, sign me up. Shall we pass your name on to every package using Postgres, then? This is *not* about code within Postgres. regards, tom lane
Tom Lane wrote: > I suspect that it's actually impossible to parse such a thing correctly > without a full-fledged flex lexer or something of equivalent complexity. > Certainly it's a couple of orders of magnitude harder than it is for > either standard-conforming or E'' literals. Is there a reason not to use a full-fledged flex lexer? I'd be willing to take a crack at such a thing, but I'm working 80-hour weeks through the end of June and likely wouldn't be able to put in any time on it till then. So I definitely couldn't promise anything for 8.4, but if putting it off till 8.5 works, sign me up. Cheers, --mlp _____________________ Meredith L. Patterson Founder and CTO Osogato, Inc.
Tom Lane wrote: > This is *not* about code within Postgres. One typically provides libraries for this sort of thing, but your point is taken; suggestion withdrawn. --mlp _____________________ Meredith L. Patterson Founder and CTO Osogato, Inc.
On 4/14/09, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Peter Eisentraut <peter_e@gmx.net> writes: > > On Tuesday 14 April 2009 18:54:33 Tom Lane wrote: > >> The other proposal that seemed > >> attractive to me was a decode-like function: > >> > >> uescape('foo\00e9bar') > >> uescape('foo\00e9bar', '\') > > > This was discussed previously, but rejected with the following argument: > > > There are some other disadvantages for making a function call. You > > couldn't use that kind of literal in any other place where the parser > > calls for a string constant: role names, tablespace locations, > > passwords, copy delimiters, enum values, function body, file names. > > > I'm less than convinced that those are really plausible use-cases for > characters that one is unable to type directly. However, I'll grant the > point. So that narrows us down to considering the \u extension to E'' > strings as a saner and safer alternative to the spec's syntax. My vote would go to \u. The U& may be "sql standard" but it's different from any established practical standard. Alternative would be to make U& follow stdstr setting: stdstr=on -> you get fully standard-conforming syntax: U&'\xxx' UESCAPE '\' stdstr=off -> you need to follow old quoting rules: U&'\\xxx' UESCAPE '\\' This would result in safe, and when stdstr=on, fully standard compliant syntax. Only downside would be that in practice - stdstr=off - it would be unusable. Third alternative would be to do both of them - \u as a usable method and safe-U& to mark the checkbox for SQL-standard compliance. If we do want U&, I would prefer that to U&-only syntax. -- marko
On Tue, Apr 14, 2009 at 2:55 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> Well, that's fine, but that's a long way from Peter's statement that >> "I think the tendency should be to get rid of E'' usage". > > Bear in mind that that's Peter's opinion; it's not necessarily shared > by anyone else. I was just responding to your assertion of the > diametrically opposed position that non-E strings are useless (which > I don't share either). Useless might be an overstatement, but I certainly have gotten no benefit out of them. It seems decidedly odd to me to propose that users embed literal control characters in their code/SQL scripts in lieu of using escape sequences. If that were a normal and reasonable thing to do then I expect C, C++, Perl, Python, Ruby, Javascript, and countless other languages wouldn't provide this functionality. In reality, most of them do provide it, sometimes as the only option. And as to your point about Windows pathnames, if you're trying to avoid escaping lots of backslashes, dollar-quoting is a perfectly adequate solution. The real motivation for standard_conforming_strings is the one embedded in the name. And I'm fine with that. But I don't see E'' disappearing from my code any time soon. I periodically have a need for things like foo || E'\n' || bar and it screws up the formatting of the code to write foo || ' ' || bar. Writing it that way, it's also entirely non-obvious whether there are any spaces or tabs after the opening quote and before the newline, and if so whether or not they are intentional. But I don't mind if someone ELSE likes the new, standard strings - it just isn't me. :-) ...Robert
On Tue, Apr 14, 2009 at 04:01:48PM +0300, Peter Eisentraut wrote: > On Saturday 11 April 2009 18:20:47 Sam Mason wrote: > > I can't see much support in the other database engines; searched for > > Oracle, MS-SQL, DB2 and Firebird. MySQL has it planned for 7.1, so not > > for a while. > > DB2 supports it, as far as I know. Doh, yes it does doesn't it. Sorry I searched for a bit and failed to find anything before. Looks as though the signal to noise ratio was far too low as I've just searched again and found a (single) reference to their docs describing the feature[1]. I've also just noticed that the MySQL todo item points to several other implementations and how they handle Unicode escape sequences. The most common option (bearing in mind that this is a sample of mainly FOSS databases) seems to be doing some variant of '\u0123', as in the style of Python. This is only supported for literals and no support for identifiers appears to be provided. -- Sam http://samason.me.uk/ [1] http://publib.boulder.ibm.com/infocenter/db2luw/v9r5/topic/com.ibm.db2.luw.sql.ref.doc/doc/r0000731.html
On Apr 15, 2009, at 4:45 AM, Sam Mason wrote: > Doh, yes it does doesn't it. Sorry I searched for a bit and failed to > find anything before. Looks as though the signal to noise ratio was > far > too low as I've just searched again and found a (single) reference to > their docs describing the feature[1]. This is ugly, but not completely unpalatable: U&'\0141ód\017A is a city in Poland' U&'c:\\temp' U&'@+01D11E' UESCAPE '@' Wouldn't we just then say that U&'' strings are always standard-conforming? Best, David
On Tue, Apr 14, 2009 at 08:10:54AM -0400, Andrew Dunstan wrote: > Marko Kreen wrote: > >I still stand on my proposal, how about extending E'' strings with > >unicode escapes (eg. \uXXXX)? The E'' strings are already more > >clearly defined than '' and they are our "own", we don't need to > >consider random standards, but can consider our sanity. > > I suspect there would be lots more support in the user community, where > \uXXXX is well understood in a number of contexts (Java and ECMAScript, > for example). It's also tolerably sane. By the way, that's an example of how to do it wrong, there are more than 2^16 unicode characters, you want to be able to support the full 21-bit range if you're going to do it right. FWIW, I prefer the perl syntax which simply extends \x: \x{1344}, which makes it clear it's hex and doesn't make assumptions as to how many characters are used. Have a nice day, -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > Please line up in a tree and maintain the heap invariant while > boarding. Thank you for flying nlogn airlines.
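Martijn's point, restated as a small Python 3 illustration: a fixed four-digit \uXXXX form tops out at U+FFFF, so anything above the BMP needs either an eight-digit \U form or a variable-length form like Perl's \x{...}.

    s = "\U00010302"            # a code point above U+FFFF (the example used later in the thread)
    print(len(s), hex(ord(s)))  # 1 0x10302
    # Four fixed digits cannot reach it; extra digits are just literal text:
    print(len("\u10302"))       # 2 -- parsed as U+1030 followed by the character '2'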
"David E. Wheeler" <david@kineticode.com> writes: > Wouldn't we just then say that U&'' strings are always standard- > conforming? That's exactly what's causing the problem --- they are, but there is lots of software that won't know it. regards, tom lane
On Wed, Apr 15, 2009 at 6:23 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Wouldn't we just then say that U&'' strings are always standard- >> conforming? > > That's exactly what's causing the problem --- they are, but there > is lots of software that won't know it. We could say U&'' escapes only work if you have standard_conforming_strings set to true. -- greg
On Wed, Apr 15, 2009 at 6:52 PM, Greg Stark <stark@enterprisedb.com> wrote: > On Wed, Apr 15, 2009 at 6:23 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>> Wouldn't we just then say that U&'' strings are always standard- >>> conforming? >> >> That's exactly what's causing the problem --- they are, but there >> is lots of software that won't know it. > > > We could say U&'' escapes only work if you have > standard_conforming_strings set to true. Or say that if you have standard_conforming_strings false then any string which contains a literal \ or ' is an error. You shouldn't ever really need either since you could use the unicode escape for either after all. -- greg
On 4/15/09, Greg Stark <stark@enterprisedb.com> wrote: > On Wed, Apr 15, 2009 at 6:23 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > >> Wouldn't we just then say that U&'' strings are always standard- > >> conforming? > > > > That's exactly what's causing the problem --- they are, but there > > is lots of software that won't know it. > > We could say U&'' escapes only work if you have > standard_conforming_strings set to true. What's wrong with requiring U& to conform with stdstr=off quoting rules? You can use UESCAPE if you don't want to double backslashes. -- marko
Marko Kreen <markokr@gmail.com> writes: > Whats wrong with requiring U& to conform with stdstr=off quoting rules? The sole and only excuse for that misbegotten syntax is to be exactly SQL spec compliant --- otherwise we might as well pick something saner. So it needs to work like stdstr=on. I thought Peter's proposal of rejecting it altogether when stdstr=off might be reasonable. The space sensitivity around the & still sucks, but I have not (yet) thought of a credible security exploit for that. regards, tom lane
On 4/15/09, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Marko Kreen <markokr@gmail.com> writes: > > Whats wrong with requiring U& to conform with stdstr=off quoting rules? > > The sole and only excuse for that misbegotten syntax is to be exactly > SQL spec compliant --- otherwise we might as well pick something saner. > So it needs to work like stdstr=on. I thought Peter's proposal of > rejecting it altogether when stdstr=off might be reasonable. The space > sensitivity around the & still sucks, but I have not (yet) thought of > a credible security exploit for that. So the U& syntax is only available if stdstr=on? Sort of makes sense. As both this and the doubling-\\ way would mean we should have usable alternative in case of stdstr=off also, so in the end we have agreed to accept \u also? -- marko
Martijn van Oosterhout wrote: > On Tue, Apr 14, 2009 at 08:10:54AM -0400, Andrew Dunstan wrote: > >> Marko Kreen wrote: >> >>> I still stand on my proposal, how about extending E'' strings with >>> unicode escapes (eg. \uXXXX)? The E'' strings are already more >>> clearly defined than '' and they are our "own", we don't need to >>> consider random standards, but can consider our sanity. >>> >> I suspect there would be lots more support in the user community, where >> \uXXXX is well understood in a number of contexts (Java and ECMAScript, >> for example). It's also tolerably sane. >> > > By the way, that's an example of how to do it wrong, there are more > than 2^16 unicode characters, you want to be able to support the full > 21-bit range if you're going to do it right. > > FWIW, I prefer the perl syntax which simply extends \x: \x{1344}, which > makes it clear it's hex and doesn't make assumptions as to how many > characters are used. > I could live with either. Wikipedia says: "The characters outside the first plane usually have very specialized or rare use." For years we rejected all characters beyond the first plane, and while that's fixed now, the volume of complaints wasn't huge. cheers andrew > Have a nice day, >
Marko Kreen <markokr@gmail.com> writes: > As both this and the doubling-\\ way would mean we should have usable > alternative in case of stdstr=off also, so in the end we have agreed > to accept \u also? Given Martijn's complaint about more-than-16-bit code points, I think the \u proposal is not mature enough to go into 8.4. We can think about some version of that later, if there's enough interest. regards, tom lane
On 4/15/09, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Marko Kreen <markokr@gmail.com> writes: > > As both this and the doubling-\\ way would mean we should have usable > > alternative in case of stdstr=off also, so in the end we have agreed > > to accept \u also? > > Given Martijn's complaint about more-than-16-bit code points, I think > the \u proposal is not mature enough to go into 8.4. We can think > about some version of that later, if there's enough interest. I think it would be a good idea. Basically we should pick one from a couple of pre-existing sane schemes. Here is a quick summary of Python, Perl and Java: Python [1]: \uXXXX - 16-bit codepoint \UXXXXXXXX - 32-bit codepoint \N{char-name} - Character by name Perl [2]: \x{XXXX..} - {} contains hexadecimal codepoint \N{char-name} - Unicode char name Java [3]: \uXXXX - 16-bit codepoint Perl is OK, but the \x makes one think of literal hexadecimal byte escapes, and that's because they have extended their literal byte-escapes to support unicode. So I doubt we should promote it more. \u{} would be nicer, but that would not be an established standard. Both Python and Java allow using \u to encode higher codepoints with surrogate pairs. Which sort of makes it a standard, and Python a superset of Java. (Obviously that does not mean you need to store them as surrogate pairs.) A problem with having only \u would be that it would make it hard to enter higher codepoints manually. So \U would also be good to have. There is no hurry with \N{} but if it were possible, it would be the main reason to have custom unicode escaping. So my proposal would be Python escapes without \N{}. Whether we should have it in 8.4, I don't know. If we are going to have U&, but it does not work with stdstr=off, this seems to hint we should have some other escaping method available in 8.4 for stdstr=off users? -- marko [1] http://docs.python.org/reference/lexical_analysis.html#string-literals [2] http://perldoc.perl.org/perluniintro.html [3] http://java.sun.com/docs/books/jls/third_edition/html/lexical.html
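A quick Python 3 check of the forms in Marko's summary (the \N name comes from the Unicode character database):

    assert "\u00e9" == "\N{LATIN SMALL LETTER E WITH ACUTE}" == chr(0xE9)
    assert "\U00010302" == chr(0x10302)   # non-BMP: reachable via \U, or via a
                                          # UTF-16 surrogate pair in source text
                                          # where the implementation accepts one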
> >>> I still stand on my proposal, how about extending E'' strings with > >>> unicode escapes (eg. \uXXXX)? The E'' strings are already more > >>> clearly defined than '' and they are our "own", we don't need to > >>> consider random standards, but can consider our sanity. > >>> > >> I suspect there would be lots more support in the user community, where > >> \uXXXX is well understood in a number of contexts (Java and ECMAScript, > >> for example). It's also tolerably sane. > >> > > > > By the way, that's an example of how to do it wrong, there are more > > than 2^16 unicode characters, you want to be able to support the full > > 21-bit range if you're going to do it right. > > > > FWIW, I prefer the perl syntax which simply extends \x: \x{1344}, which > > makes it clear it's hex and doesn't make assumptions as to how many > > characters are used. > > I could live with either. Wikipedia says: "The characters outside the > first plane usually have very specialized or rare use." For years we > rejected all characters beyond the first plane, and while that's fixed > now, the volume of complaints wasn't huge. If you mean "first plane" as BMP (i.e. 16bit range), above is not true for PostgreSQL 7.3 or later at least. -- Tatsuo Ishii SRA OSS, Inc. Japan
> > >>> I still stand on my proposal, how about extending E'' strings with > > >>> unicode escapes (eg. \uXXXX)? The E'' strings are already more > > >>> clearly defined than '' and they are our "own", we don't need to > > >>> consider random standards, but can consider our sanity. > > >>> > > >> I suspect there would be lots more support in the user community, where > > >> \uXXXX is well understood in a number of contexts (Java and ECMAScript, > > >> for example). It's also tolerably sane. > > >> > > > > > > By the way, that's an example of how to do it wrong, there are more > > > than 2^16 unicode characters, you want to be able to support the full > > > 21-bit range if you're going to do it right. > > > > > > FWIW, I prefer the perl syntax which simply extends \x: \x{1344}, which > > > makes it clear it's hex and doesn't make assumptions as to how many > > > characters are used. > > > > > > > I could live with either. Wikipedia says: "The characters outside the > > first plane usually have very specialized or rare use." For years we > > rejected all characters beyond the first plane, and while that's fixed > > now, the volume of complaints wasn't huge. > > I you mean "first plane" as BMP (i.e. 16bit range), above is not true > for PostgreSQL 7.3 or later at least. Oops. I meant 8.2 or later. -- Tatsuo Ishii SRA OSS, Inc. Japan
On Wed, Apr 15, 2009 at 11:19:42PM +0300, Marko Kreen wrote: > On 4/15/09, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Given Martijn's complaint about more-than-16-bit code points, I think > > the \u proposal is not mature enough to go into 8.4. We can think > > about some version of that later, if there's enough interest. > > I think it would be good idea. Basically we should pick one from > couple of pre-existing sane schemes. Here is quick summary > of Python, Perl and Java: > > Python [1]: > > \uXXXX - 16-bit codepoint > \UXXXXXXXX - 32-bit codepoint > \N{char-name} - Characted by name Microsoft have also gone this way in C#, named code points are not supported however. > Perl [2]: > > \x{XXXX..} - {} contains hexadecimal codepoint > \N{char-name} - Unicode char name Looks OK, but the 'x' seems somewhat redundant. Why not just: \{xxxx} This would be following the BitC[2] project, especially if it was more like: \{U+xxxx} e.g. \{U+03BB} would be the lowercase lambda character. Added appeal is in the fact that this (i.e. U+03BB) is how the Unicode consortium spells code points. > Java [3]: > > \uXXXX - 16-bit codepoint AFAIK, Java isn't the best reference to choose; it assumed from an early point in its design that Unicode characters were at most 16bits and hence had to switch its internal representation to UTF-16. I don't program much Java these days to know how it's all worked out, but it would be interesting to hear from people who regularly have to deal with characters outside the BMP (i.e. code points greater than 65535). -- Sam http://samason.me.uk/ [1] http://msdn.microsoft.com/en-us/library/aa664669(VS.71).aspx [2] http://www.bitc-lang.org/docs/bitc/spec.html#stringlit
On 4/16/09, Sam Mason <sam@samason.me.uk> wrote: > On Wed, Apr 15, 2009 at 11:19:42PM +0300, Marko Kreen wrote: > > On 4/15/09, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > > > Given Martijn's complaint about more-than-16-bit code points, I think > > > the \u proposal is not mature enough to go into 8.4. We can think > > > about some version of that later, if there's enough interest. > > > > I think it would be good idea. Basically we should pick one from > > couple of pre-existing sane schemes. Here is quick summary > > of Python, Perl and Java: > > > > Python [1]: > > > > \uXXXX - 16-bit codepoint > > \UXXXXXXXX - 32-bit codepoint > > \N{char-name} - Characted by name > > > Microsoft have also gone this way in C#, named code points are not > supported however. And it handles also non-BMP codepoints with \u escape similarly: http://en.csharp-online.net/ECMA-334:_9.4.1_Unicode_escape_sequences This makes it even more standard. > > Perl [2]: > > > > \x{XXXX..} - {} contains hexadecimal codepoint > > \N{char-name} - Unicode char name > > > Looks OK, but the 'x' seems somewhat redundant. Why not just: > > \{xxxx} > > This would be following the BitC[2] project, especially if it was more > like: > > \{U+xxxx} > > e.g. > > \{U+03BB} > > would be the lowercase lambda character. Added appeal is in the fact > that this (i.e. U+03BB) is how the Unicode consortium spells code > points. We already got yet-another-unique-way-of-escaping-unicode with U&. Now let's try to support some actual standard also. > > Java [3]: > > > > \uXXXX - 16-bit codepoint > > > AFAIK, Java isn't the best reference to choose; it assumed from an early > point in its design that Unicode characters were at most 16bits and > hence had to switch its internal representation to UTF-16. I don't > program much Java these days to know how it's all worked out, but it > would be interesting to hear from people who regularly have to deal with > characters outside the BMP (i.e. code points greater than 65535). You did not read my mail carefully enough - the Java and also Python/C# already support non-BMP chars with '\u' and exactly the same (utf16) way. -- marko
Tatsuo Ishii wrote: >>> I could live with either. Wikipedia says: "The characters outside the >>> first plane usually have very specialized or rare use." For years we >>> rejected all characters beyond the first plane, and while that's fixed >>> now, the volume of complaints wasn't huge. >>> >> I you mean "first plane" as BMP (i.e. 16bit range), above is not true >> for PostgreSQL 7.3 or later at least. >> > > Oops. I meant 8.2 or later. > > Umm, that's what I said. We used to do it like that, but we don't any more. cheers andrew
On Thu, Apr 16, 2009 at 02:47:20PM +0300, Marko Kreen wrote: > On 4/16/09, Sam Mason <sam@samason.me.uk> wrote: > > Microsoft have also gone this way in C#, named code points are not > > supported however. > > And it handles also non-BMP codepoints with \u escape similarly: > > http://en.csharp-online.net/ECMA-334:_9.4.1_Unicode_escape_sequences > > This makes it even more standard. I fail to see what you're pointing out here; as far as I understand it, \u is for BMP code points and \U extends the range out to 32bit code points. I can't see anything about non-BMP and \u in the above link, you appear free to write your own surrogate pairs but that seems like an independent issue. I'd not realised before that C# is specified to use UTF-16 as its internal encoding. > > This would be following the BitC[2] project, especially if it was more > > like: > > > > \{U+xxxx} > > We already got yet-another-unique-way-of-escaping-unicode with U&. > > Now let's try to support some actual standard also. That comes across *very* negatively; I hope it's just a language issue. I read your parent post as soliciting opinions on possible ways to encode Unicode characters in PG's literals. The U&'lit' was criticised, you posted some suggestions, I followed up with what I hoped to be a useful addition. It seems useful here to separate "de jure" from "de facto" standards; implementing U&'lit' would be following the de jure standard, anything else would be de facto. A survey of existing SQL implementations would seem to be more appropriate as well: Oracle: UNISTR(string-literal) and \xxxx It looks as though Oracle originally used UCS-2 internally (i.e. BMP only) but more recently Unicode support has been improvedto allow other planes. MS-SQL Server: can't find anything remotely useful; best seems to be to use NCHAR(integer-expression) which looks somewhat unmaintainable. DB2: U&string-literal and \xxxxxx i.e. it follows the SQL-2003 spec FireBird: can't find much either; support looks somewhat low on the ground MySQL: same again, seems to assume query is encoded in UTF-8 Summary seems to be that either I'm bad at searching or support for Unicode doesn't seem very complete in the database world and people work around it somehow. > You did not read my mail carefully enough - the Java and also Python/C# > already support non-BMP chars with '\u' and exactly the same (utf16) way. Again, I think this may be a language issue; if not then more verbose explanations help, maybe something like "sorry, I obviously didn't explain that very well". You will of course felt you explained it perfectly well, but everybody enters a discussion with different intuitions and biases, email has a nasty habit of accentuating these differences and compounding them with language problems. I'd never heard of UTF-16 surrogate pairs before this discussion and hence didn't realise that it's valid to have a surrogate pair in place of a single code point. The docs say that <D800 DF02> corresponds to U+10302, Python would appear to follow my intuitions in that: ord(u'\uD800\uDF02') results in an error instead of giving back 66306, as I'd expect. Is this a bug in Python, my understanding, or something else? -- Sam http://samason.me.uk/
Sam Mason <sam@samason.me.uk> writes: > I'd never heard of UTF-16 surrogate pairs before this discussion and > hence didn't realise that it's valid to have a surrogate pair in place > of a single code point. The docs say that <D800 DF02> corresponds to > U+10302, Python would appear to follow my intuitions in that: > ord(u'\uD800\uDF02') > results in an error instead of giving back 66306, as I'd expect. Is > this a bug in Python, my understanding, or something else? I might be wrong, but I think surrogate pairs are expressly forbidden in all representations other than UTF16/UCS2. We definitely forbid them when validating UTF-8 strings --- that's per an RFC recommendation. It sounds like Python is doing the same. regards, tom lane
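For reference, the <D800 DF02> to U+10302 correspondence Sam quotes is just the standard UTF-16 decoding rule; outside UTF-16 the pair has no standing on its own, which lines up with Tom's point about UTF-8 validation. A small Python check of the arithmetic:

    hi, lo = 0xD800, 0xDF02
    cp = 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)
    print(hex(cp), cp)          # 0x10302 66306 -- the value Sam expected from ord()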
On Thu, Apr 16, 2009 at 10:54:16AM -0400, Tom Lane wrote: > Sam Mason <sam@samason.me.uk> writes: > > I'd never heard of UTF-16 surrogate pairs before this discussion and > > hence didn't realise that it's valid to have a surrogate pair in place > > of a single code point. The docs say that <D800 DF02> corresponds to > > U+10302, Python would appear to follow my intuitions in that: > > > ord(u'\uD800\uDF02') > > > results in an error instead of giving back 66306, as I'd expect. Is > > this a bug in Python, my understanding, or something else? > > I might be wrong, but I think surrogate pairs are expressly forbidden in > all representations other than UTF16/UCS2. We definitely forbid them > when validating UTF-8 strings --- that's per an RFC recommendation. > It sounds like Python is doing the same. OK, that's good. I thought I was missing something. A minor point is that in UCS2 each 16bit value is exactly one character and characters outside the BMP aren't supported, hence the need for UTF-16. I've failed to keep up with the discussion so I'm not sure where this conversation has got to! Is the consensus for 8.4 to enable SQL2003 style U&lit escaped literals if and only if standard_conforming_strings is set? This seems easiest for client code as it can use this exclusively for knowing what to do with backslashes. -- Sam http://samason.me.uk/
On 4/16/09, Sam Mason <sam@samason.me.uk> wrote: > On Thu, Apr 16, 2009 at 02:47:20PM +0300, Marko Kreen wrote: > > On 4/16/09, Sam Mason <sam@samason.me.uk> wrote: > > > Microsoft have also gone this way in C#, named code points are not > > > supported however. > > > > And it handles also non-BMP codepoints with \u escape similarly: > > > > http://en.csharp-online.net/ECMA-334:_9.4.1_Unicode_escape_sequences > > > > This makes it even more standard. > > > I fail to see what you're pointing out here; as far as I understand it, > \u is for BMP code points and \U extends the range out to 32bit code > points. I can't see anything about non-BMP and \u in the above link, > you appear free to write your own surrogate pairs but that seems like an > independent issue. Ok, maybe I glanced too quickly over that page. I can't find definite deference only hint on several pages: \U \Unnnnnnnn Unicode escape sequence for surrogate pairs. Which hints that you can aswell enter the pairs directly: \uxx\uxx. If I'd be language designer, I would not see any reason to disallow it. And anyway, at least mono seems to support it: using System; public class HelloWorld { public static void Main() { Console.WriteLine("<\uD800\uDF02>\n"); } } It will output single UTF8 character. I think this should settle it. > I'd not realised before that C# is specified to use UTF-16 as its > internal encoding. > > > > This would be following the BitC[2] project, especially if it was more > > > like: > > > > > > \{U+xxxx} > > > > > We already got yet-another-unique-way-of-escaping-unicode with U&. > > > > Now let's try to support some actual standard also. > > > That comes across *very* negatively; I hope it's just a language issue. > > I read your parent post as soliciting opinions on possible ways to > encode Unicode characters in PG's literals. The U&'lit' was criticised, > you posted some suggestions, I followed up with what I hoped to be a > useful addition. It seems useful here to separate "de jure" from "de > facto" standards; implementing U&'lit' would be following the de jure > standard, anything else would be de facto. > > A survey of existing SQL implementations would seem to be more appropriate > as well: > > Oracle: UNISTR(string-literal) and \xxxx > > It looks as though Oracle originally used UCS-2 internally (i.e. BMP > only) but more recently Unicode support has been improved to allow > other planes. > > MS-SQL Server: > > can't find anything remotely useful; best seems to be to use > NCHAR(integer-expression) which looks somewhat unmaintainable. > > DB2: U&string-literal and \xxxxxx > > i.e. it follows the SQL-2003 spec > > FireBird: > > can't find much either; support looks somewhat low on the ground > > MySQL: > > same again, seems to assume query is encoded in UTF-8 > > Summary seems to be that either I'm bad at searching or support for > Unicode doesn't seem very complete in the database world and people work > around it somehow. The de-facto about Postgres is stdstr=off. Even if not, E'' strings are still better for various things, so it would be good if they also aquired unicode-capabilities. > > You did not read my mail carefully enough - the Java and also Python/C# > > already support non-BMP chars with '\u' and exactly the same (utf16) way. > > > Again, I think this may be a language issue; if not then more verbose > explanations help, maybe something like "sorry, I obviously didn't > explain that very well". 
You will of course feel you explained it > perfectly well, but everybody enters a discussion with different > intuitions and biases, email has a nasty habit of accentuating these > differences and compounding them with language problems. > > I'd never heard of UTF-16 surrogate pairs before this discussion and > hence didn't realise that it's valid to have a surrogate pair in place > of a single code point. The docs say that <D800 DF02> corresponds to > U+10302, Python would appear to follow my intuitions in that: > > ord(u'\uD800\uDF02') > > results in an error instead of giving back 66306, as I'd expect. Is > this a bug in Python, my understanding, or something else? Python's internal representation is *not* UTF-16, but plain UCS2/UCS4, that is - plain 16 or 32-bit values. Seems your python is compiled with UCS2, not UCS4. As I understand, in UCS2 mode it simply takes surrogate values as-is. From ord() docs: If a unicode argument is given and Python was built with UCS2 Unicode, then the character’s code point must be in the range [0..65535] inclusive; otherwise the string length is two, and a TypeError will be raised. So only in UCS4 mode it detects surrogates and converts them to internal representation. (Which in Postgres case would be UTF8.) Or perhaps it is partially UTF16 aware - e.g. I/O routines do understand UTF16 but low-level string routines do not: print "<%s>" % u'\uD800\uDF02' seems to handle it properly. -- marko
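For anyone wanting to check which flavour their interpreter is, sys.maxunicode tells narrow from wide builds; a small sketch assuming Python 2.x, where both build types exist:

import sys

# 0xFFFF on a narrow (UCS2) build, 0x10FFFF on a wide (UCS4) build
if sys.maxunicode == 0xFFFF:
    print "narrow (UCS2) build"
else:
    print "wide (UCS4) build"

# One code unit on a wide build (ord() gives 66306); on a narrow build it
# is stored as a surrogate pair of length two.
print len(u'\U00010302')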
Tom Lane wrote: > Sam Mason <sam@samason.me.uk> writes: > >> I'd never heard of UTF-16 surrogate pairs before this discussion and >> hence didn't realise that it's valid to have a surrogate pair in place >> of a single code point. The docs say that <D800 DF02> corresponds to >> U+10302, Python would appear to follow my intuitions in that: >> > > >> ord(u'\uD800\uDF02') >> > > >> results in an error instead of giving back 66306, as I'd expect. Is >> this a bug in Python, my understanding, or something else? >> > > I might be wrong, but I think surrogate pairs are expressly forbidden in > all representations other than UTF16/UCS2. We definitely forbid them > when validating UTF-8 strings --- that's per an RFC recommendation. > It sounds like Python is doing the same. > > > You mustn't encode the surrogate, but it's up to us how we allow people to designate a given code point. Frankly, I think we shouldn't provide for using surrogates at all. I would prefer something like \uXXXX for BMP items and \UXXXXXXXX as the straight 32bit designation of a higher codepoint. cheers andrew
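A rough sketch of how such a rule could be checked, using a hypothetical check_escapes helper (nothing that exists in Postgres): \uXXXX and \UXXXXXXXX are accepted, surrogate code points are rejected outright rather than paired up.

import re

# \u takes exactly four hex digits (BMP only), \U takes eight for anything
# higher; surrogate code points are rejected instead of being paired up.
_escape = re.compile(r'\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})')

def check_escapes(text):
    """Return the code points named by the escapes, rejecting surrogates."""
    points = []
    for m in _escape.finditer(text):
        cp = int(m.group(1) or m.group(2), 16)
        if 0xD800 <= cp <= 0xDFFF:
            raise ValueError("surrogate code point U+%04X not allowed" % cp)
        if cp > 0x10FFFF:
            raise ValueError("code point %X out of range" % cp)
        points.append(cp)
    return points

print(check_escapes(r'<\U00010302>'))      # [66306]
try:
    check_escapes(r'\uD800\uDF02')
except ValueError as e:
    print(e)                               # surrogate code point U+D800 not allowed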
On 4/16/09, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Sam Mason <sam@samason.me.uk> writes: > > I'd never heard of UTF-16 surrogate pairs before this discussion and > > hence didn't realise that it's valid to have a surrogate pair in place > > of a single code point. The docs say that <D800 DF02> corresponds to > > U+10302, Python would appear to follow my intuitions in that: > > > ord(u'\uD800\uDF02') > > > results in an error instead of giving back 66306, as I'd expect. Is > > this a bug in Python, my understanding, or something else? > > > I might be wrong, but I think surrogate pairs are expressly forbidden in > all representations other than UTF16/UCS2. We definitely forbid them > when validating UTF-8 strings --- that's per an RFC recommendation. > It sounds like Python is doing the same. The point here is that Python/Java/C# allow them for escaping non-BMP Unicode values, irrespective of their internal encoding. -- marko
On Thu, Apr 16, 2009 at 06:34:06PM +0300, Marko Kreen wrote: > Which hints that you can just as well enter the pairs directly: \uxxxx\uxxxx. > If I were a language designer, I would not see any reason to disallow it. > > And anyway, at least mono seems to support it: > > using System; > public class HelloWorld { > public static void Main() { > Console.WriteLine("<\uD800\uDF02>\n"); > } > } > > It will output a single UTF-8 character. I think this should settle it. I don't have any .net stuff installed so can't test; but C# is defined to use UTF-16 as its internal representation so it would make sense if the above gets treated as a single character internally. However, if it used any other encoding the above should be treated as an error. > The de facto standard for Postgres is stdstr=off. Even if not, E'' strings > are still better for various things, so it would be good if they also > acquired Unicode capabilities. OK, this seems independent of the U&'lit' discussion that started the thread. Note that PG already supports UTF8; if you want the character I've been using in my examples up-thread, you can do: SELECT E'\xF0\x90\x8C\x82'; I have a feeling that this is predicated on the server_encoding being set to "utf8" and this can only be done at database creation time. Another alternative would be to use the convert_from function, i.e.: SELECT convert_from(E'\xF0\x90\x8C\x82', 'UTF8'); Never had to do this though, so there may be better options available. > Python's internal representation is *not* UTF-16, but plain UCS2/UCS4, > that is - plain 16 or 32-bit values. Seems your python is compiled with > UCS2, not UCS4. Cool, I didn't know that. I believe mine is UCS4 as I can do: ord(u'\U00010302') and I get 66306 back rather than an error. > As I understand, in UCS2 mode it simply takes surrogate > values as-is. UCS2 doesn't have surrogate pairs, or at least I believe it's considered a bug if you don't get an error when you present it with one. > From ord() docs: > > If a unicode argument is given and Python was built with UCS2 Unicode, > then the character’s code point must be in the range [0..65535] > inclusive; otherwise the string length is two, and a TypeError will > be raised. > > So only in UCS4 mode it detects surrogates and converts them to internal > representation. (Which in Postgres case would be UTF8.) I think you mean UTF-16 instead of UCS4; but otherwise, yes. > Or perhaps it is partially UTF16 aware - e.g. I/O routines do understand > UTF16 but low-level string routines do not: > > print "<%s>" % u'\uD800\uDF02' > > seems to handle it properly. Yes, I get this as well. It's all a bit weird, which is why I was asking if "this a bug in Python, my understanding, or something else". When I do: python <<EOF | hexdump -C print u"\uD800\uDF02" EOF to see what it's doing I get an error which I'm not expecting, hence I think it's probably my understanding. -- Sam http://samason.me.uk/
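The E'\xF0\x90\x8C\x82' bytes above are easy to double-check; a quick Python sketch of the UTF-8 encoding of U+10302:

# Encode U+10302 by hand with the four-byte UTF-8 pattern
# 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx.
cp = 0x10302
utf8 = [0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F)]
print(' '.join('%02X' % b for b in utf8))        # F0 90 8C 82

# Same thing via the standard library.
print(repr(u'\U00010302'.encode('utf-8')))       # \xf0\x90\x8c\x82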
Sam Mason <sam@samason.me.uk> writes: > I've failed to keep up with the discussion so I'm not sure where this > conversation has got to! Is the consensus for 8.4 to enable SQL2003 > style U&lit escaped literals if and only if standard_conforming_strings > is set? That was Peter's proposal, and no one's shot a hole in it yet ... I think the discussion about whether/how to add a Unicode extension to E''-style literals is 8.5 material. We are in beta so now is not the time to add new features, especially ones that weren't even on the TODO list before. regards, tom lane
On Thu, Apr 16, 2009 at 12:08:37PM -0400, Tom Lane wrote: > Sam Mason <sam@samason.me.uk> writes: > > I've failed to keep up with the discussion so I'm not sure where this > > conversation has got to! Is the consensus for 8.4 to enable SQL2003 > > style U&lit escaped literals if and only if standard_conforming_strings > > is set? > > That was Peter's proposal, and no one's shot a hole in it yet ... Just noticed that the spec only supports four hex digits; this would imply that support for anything outside the BMP would have to be done by encoding the character as a surrogate pair. If the code doesn't do this already (the original patch didn't seem to) these should be normalised back to a single character in a similar manner to Marko's recent patch[1]. > I think the discussion about whether/how to add a Unicode extension to > E''-style literals is 8.5 material. We are in beta so now is not > the time to add new features, especially ones that weren't even on the > TODO list before. OK, sounds reasonable. -- Sam http://samason.me.uk/ [1] http://archives.postgresql.org//pgsql-hackers/2009-04/msg00904.php
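Something along the lines of what that normalisation pass has to do, as a hypothetical sketch (not Marko's actual patch): once the \xxxx escapes have been turned into 16-bit values, adjacent high/low surrogates collapse into one code point and stray surrogates are rejected.

def normalize(code_units):
    out = []
    units = iter(code_units)
    for u in units:
        if 0xD800 <= u <= 0xDBFF:                   # high surrogate
            low = next(units, None)
            if low is None or not 0xDC00 <= low <= 0xDFFF:
                raise ValueError("high surrogate not followed by low surrogate")
            out.append(0x10000 + ((u - 0xD800) << 10) + (low - 0xDC00))
        elif 0xDC00 <= u <= 0xDFFF:                 # stray low surrogate
            raise ValueError("low surrogate without preceding high surrogate")
        else:
            out.append(u)
    return out

print([hex(cp) for cp in normalize([0xD800, 0xDF02, 0x68EE])])   # ['0x10302', '0x68ee']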
Sam Mason <sam@samason.me.uk> writes: > Just noticed that the spec only supports four hex digits; Better read it again. regards, tom lane
On Fri, Apr 17, 2009 at 10:15:57AM -0400, Tom Lane wrote: > Sam Mason <sam@samason.me.uk> writes: > > Just noticed that the spec only supports four hex digits; > > Better read it again. You're right of course. My ability to read patches seems not to be very good. -- Sam http://samason.me.uk/
On Tuesday 14 April 2009 21:34:51 Peter Eisentraut wrote: > On Tuesday 14 April 2009 17:13:00 Marko Kreen wrote: > > If the parsing does not happen in 2 passes and it does not take account > > of stdstr setting then the default breakage would be: > > > > stdstr=off, U&' \' UESCAPE '!'. > > I think we can handle that and the cases Tom presents by erroring out when > the U& syntax is used with stdstr off. Proposed patch for that attached.
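To see why that example is nasty for clients, here is a sketch of a naive client-side literal scanner (hypothetical code, not psql's actual lexer) applied to the U&' \' UESCAPE '!' fragment under the two quoting rules:

def scan_literal(text, start, backslash_escapes):
    """Return the index just past the closing quote of the literal opening
    at text[start]; backslash_escapes mimics the stdstr=off E'' rule."""
    assert text[start] == "'"
    i = start + 1
    while i < len(text):
        c = text[i]
        if backslash_escapes and c == '\\':
            i += 2                    # treat \x as an escaped character
            continue
        if c == "'":
            if i + 1 < len(text) and text[i + 1] == "'":
                i += 2                # doubled quote stays inside the literal
                continue
            return i + 1              # closing quote found
        i += 1
    raise ValueError("unterminated literal")

sql = "U&' \\' UESCAPE '!' AND injected_text"   # the fragment from above
start = sql.index("'")
print(scan_literal(sql, start, backslash_escapes=True))    # 16: \' taken as escaped,
                                                           # so " UESCAPE " is swallowed
print(scan_literal(sql, start, backslash_escapes=False))   # 6: literal ends right
                                                           # after the backslash

The two rules carve the same input into different literal and non-literal regions, which is exactly the confusion that erroring out is meant to prevent.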
Peter Eisentraut <peter_e@gmx.net> writes: > On Tuesday 14 April 2009 21:34:51 Peter Eisentraut wrote: >> I think we can handle that and the cases Tom presents by erroring out when >> the U& syntax is used with stdstr off. > Proposed patch for that attached. I have not been able to think of any security hole in that proposal, so this patch seems acceptable to me. I wonder though whether any corresponding change is needed in psql's lexer, and if so how should it react exactly to the rejection case. regards, tom lane
Hi. Here is a quick test of the great patch! == SCRIPT == set CLIENT_ENCODING to 'UTF-8'; DROP TABLE ucheck CASCADE; CREATE TABLE ucheck (key VARCHAR(10) PRIMARY KEY, data NCHAR(50)); set STANDARD_CONFORMING_STRINGS to on; INSERT INTO ucheck VALUES('ucheck1',u&'\68ee\9dd7\5916'); SELECT * FROM ucheck; set CLIENT_ENCODING to 'SHIFTJIS2004'; SELECT * FROM ucheck; set STANDARD_CONFORMING_STRINGS to off; INSERT INTO ucheck VALUES('ucheck2',u&'\68ee\9dd7\5916'); === As for psql === This shows that the Japanese Windows XP console cannot display the characters. C:\work>psql psql (8.4beta1) "help" でヘルプを表示します. HIROSHI=# set CLIENT_ENCODING to 'UTF-8'; SET HIROSHI=# DROP TABLE ucheck CASCADE; DROP TABLE HIROSHI=# CREATE TABLE ucheck (key VARCHAR(10) PRIMARY KEY, data NCHAR(50)); NOTICE: CREATE TABLE / PRIMARY KEY will create implicit index "ucheck_pkey" for table "ucheck" CREATE TABLE HIROSHI=# set STANDARD_CONFORMING_STRINGS to on; SET HIROSHI=# INSERT INTO ucheck VALUES('ucheck',u&'\68ee\9dd7\5916'); INSERT 0 1 HIROSHI=# SELECT * FROM ucheck; key | data --------+------------------------------------------------------- ucheck | 譽ョ鮃怜、・ (1 行) HIROSHI=# set CLIENT_ENCODING to 'SHIFTJIS2004'; SET HIROSHI=# SELECT * FROM ucheck; key | data --------+------------------------------------------------------- ucheck | 森・外 (1 行) HIROSHI=# set STANDARD_CONFORMING_STRINGS to off; SET HIROSHI=# INSERT INTO ucheck VALUES('ucheck2',u&'\68ee\9dd7\5916'); ERROR: unsafe use of string constant with Unicode escapes DETAIL: String constants with Unicode escapes cannot be used when standard_conforming_strings is off. === As for pgAdminIII === pgAdmin displays it very well!! :-) http://winpg.jp/~saito/pg_work/ucheck/jisx0213.png P.S.) This test obtains the same result with MimerDB. Regards, Hiroshi Saito ----- Original Message ----- From: "Tom Lane" <tgl@sss.pgh.pa.us> > Peter Eisentraut <peter_e@gmx.net> writes: >> On Tuesday 14 April 2009 21:34:51 Peter Eisentraut wrote: >>> I think we can handle that and the cases Tom presents by erroring out when >>> the U& syntax is used with stdstr off. > >> Proposed patch for that attached. > > I have not been able to think of any security hole in that proposal, > so this patch seems acceptable to me. I wonder though whether any > corresponding change is needed in psql's lexer, and if so how should > it react exactly to the rejection case. > > regards, tom lane
On Tuesday 05 May 2009 03:01:05 Tom Lane wrote: > Peter Eisentraut <peter_e@gmx.net> writes: > > On Tuesday 14 April 2009 21:34:51 Peter Eisentraut wrote: > >> I think we can handle that and the cases Tom presents by erroring out > >> when the U& syntax is used with stdstr off. > > > > Proposed patch for that attached. > > I have not been able to think of any security hole in that proposal, > so this patch seems acceptable to me. I wonder though whether any > corresponding change is needed in psql's lexer, and if so how should > it react exactly to the rejection case. I had thought about that as well, but concluded that no additional change is necessary. Note that the *corresponding* change would be psql complaining "I don't like what you entered", versus the just-committed behavior that psql is indifferent and the server complains "I don't like what you sent me". In any case, the point of the change is to prevent confusion in client programs, so if we had to patch psql to make sense, then the change would have been pointless in the first place.
Peter Eisentraut wrote: > On Tuesday 05 May 2009 03:01:05 Tom Lane wrote: > > Peter Eisentraut <peter_e@gmx.net> writes: > > > On Tuesday 14 April 2009 21:34:51 Peter Eisentraut wrote: > > >> I think we can handle that and the cases Tom presents by erroring out > > >> when the U& syntax is used with stdstr off. > > > > > > Proposed patch for that attached. > > > > I have not been able to think of any security hole in that proposal, > > so this patch seems acceptable to me. I wonder though whether any > > corresponding change is needed in psql's lexer, and if so how should > > it react exactly to the rejection case. > > I had thought about that as well, but concluded that no additional change is > necessary. > > Note that the *corresponding* change would be psql complaining "I don't like > what you entered", versus the just-committed behavior that psql is indifferent > and the server complains "I don't like what you sent me". > > In any case, the point of the change is to prevent confusion in client > programs, so if we had to patch psql to make sense, then the change would have > been pointless in the first place. I assume there is no TODO here. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
On Friday 29 May 2009 06:31:23 Bruce Momjian wrote: > Peter Eisentraut wrote: > > On Tuesday 05 May 2009 03:01:05 Tom Lane wrote: > > > Peter Eisentraut <peter_e@gmx.net> writes: > > > > On Tuesday 14 April 2009 21:34:51 Peter Eisentraut wrote: > > > >> I think we can handle that and the cases Tom presents by erroring > > > >> out when the U& syntax is used with stdstr off. > > > > > > > > Proposed patch for that attached. > > > > > > I have not been able to think of any security hole in that proposal, > > > so this patch seems acceptable to me. I wonder though whether any > > > corresponding change is needed in psql's lexer, and if so how should > > > it react exactly to the rejection case. > > > > I had thought about that as well, but concluded that no additional change > > is necessary. > > > > Note that the *corresponding* change would be psql complaining "I don't > > like what you entered", versus the just-committed behavior that psql is > > indifferent and the server complains "I don't like what you sent me". > > > > In any case, the point of the change is to prevent confusion in client > > programs, so if we had to patch psql to make sense, then the change would > > have been pointless in the first place. > > I assume there is no TODO here. No, it should be fine.