Thread: Unicode string literals versus the world

Unicode string literals versus the world

From
Tom Lane
Date:
So I started to look at what might be involved in teaching plpgsql about
standard_conforming_strings, and was soon dismayed by the sheer epic
nature of its failure to act like the core lexer.  It was shaky enough
before, but the recent introduction of Unicode strings and identifiers
into the core has left plpgsql hopelessly behind.

I can see two basic approaches to making things work: copy-and-paste
practically all of parser/scan.l into plpgsql's lexer (certainly all of
it that involves exclusive states); or throw out plpgsql's lexer
altogether in favor of somehow using the core lexer directly.  Neither
one looks very attractive.

It gets worse though: I have seldom seen such a badly designed piece of
syntax as the Unicode string syntax --- see
http://developer.postgresql.org/pgdocs/postgres/sql-syntax-lexical.html#SQL-SYNTAX-STRINGS-UESCAPE

You scan the string, and then after that they tell you what the escape
character is!?  Not to mention the obvious ambiguity with & as an
operator.

If we let this go into 8.4, our previous rounds with security holes
caused by careless string parsing will look like a day at the beach.
No frontend that isn't fully cognizant of the Unicode string syntax is
going to parse such things correctly --- it's going to be trivial for
a bad guy to confuse a quoting mechanism as to what's an escape and what
isn't.

I think we need to give very serious consideration to ripping out that
"feature".
        regards, tom lane


Re: Unicode string literals versus the world

From
Sam Mason
Date:
On Fri, Apr 10, 2009 at 05:54:25PM -0400, Tom Lane wrote:
> It gets worse though: I have seldom seen such a badly designed piece of
> syntax as the Unicode string syntax --- see
> http://developer.postgresql.org/pgdocs/postgres/sql-syntax-lexical.html#SQL-SYNTAX-STRINGS-UESCAPE
> 
> I think we need to give very serious consideration to ripping out that
> "feature".

I'd agree it's comically bad syntax; I just had a look in the archives
and it was only put in a few months ago:
 http://archives.postgresql.org/pgsql-hackers/2008-10/msg01169.php

I can't see much support in the other database engines; searched for
Oracle, MS-SQL, DB2 and Firebird.  MySQL has it planned for 7.1, so not
for a while.
 http://forge.mysql.com/worklog/task.php?id=3529

--  Sam  http://samason.me.uk/


Re: Unicode string literals versus the world

From
Andrew Dunstan
Date:
Tom Lane wrote:
>
> It gets worse though: I have seldom seen such a badly designed piece of
> syntax as the Unicode string syntax --- see
> http://developer.postgresql.org/pgdocs/postgres/sql-syntax-lexical.html#SQL-SYNTAX-STRINGS-UESCAPE
>
> You scan the string, and then after that they tell you what the escape
> character is!?  Not to mention the obvious ambiguity with & as an
> operator.
>
> If we let this go into 8.4, our previous rounds with security holes
> caused by careless string parsing will look like a day at the beach.
> No frontend that isn't fully cognizant of the Unicode string syntax is
> going to parse such things correctly --- it's going to be trivial for
> a bad guy to confuse a quoting mechanism as to what's an escape and what
> isn't.
>
> I think we need to give very serious consideration to ripping out that
> "feature".
>
>             
>   

+1

I don't recall a great deal of discussion about it, and it certainly 
looks pretty horrible now you point it out.

cheers

andrew


Re: Unicode string literals versus the world

From
Marko Kreen
Date:
On 4/11/09, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>  It gets worse though: I have seldom seen such a badly designed piece of
>  syntax as the Unicode string syntax --- see
>  http://developer.postgresql.org/pgdocs/postgres/sql-syntax-lexical.html#SQL-SYNTAX-STRINGS-UESCAPE
>
>  You scan the string, and then after that they tell you what the escape
>  character is!?  Not to mention the obvious ambiguity with & as an
>  operator.
>
>  If we let this go into 8.4, our previous rounds with security holes
>  caused by careless string parsing will look like a day at the beach.
>  No frontend that isn't fully cognizant of the Unicode string syntax is
>  going to parse such things correctly --- it's going to be trivial for
>  a bad guy to confuse a quoting mechanism as to what's an escape and what
>  isn't.
>
>  I think we need to give very serious consideration to ripping out that
>  "feature".

Ugh, it's rather dubious indeed.  Especially when we are already in
the middle of the seriously confusing conversion from stdstr=off to on.
Is it really OK to throw even more complexity into the mix?

Alternative proposal - maybe it would be saner to introduce a \uXXXX
escape in E'' strings as a non-standard way of quoting Unicode.

Later, when standard quoting is our only quoting method, we can play
with standard extensions?

-- 
marko


Re: Unicode string literals versus the world

From
Josh Berkus
Date:
On 4/11/09 11:47 AM, Marko Kreen wrote:
> On 4/11/09, Tom Lane<tgl@sss.pgh.pa.us>  wrote:
>>   It gets worse though: I have seldom seen such a badly designed piece of
>>   syntax as the Unicode string syntax --- see
>>   http://developer.postgresql.org/pgdocs/postgres/sql-syntax-lexical.html#SQL-SYNTAX-STRINGS-UESCAPE

WTF?  Whose feature is this?  What's the use case?

Speak up, submitter, please.

-- 
Josh Berkus
PostgreSQL Experts Inc.
www.pgexperts.com


Re: Unicode string literals versus the world

From
Tom Lane
Date:
Josh Berkus <josh@agliodbs.com> writes:
>> On 4/11/09, Tom Lane<tgl@sss.pgh.pa.us>  wrote:
>>> http://developer.postgresql.org/pgdocs/postgres/sql-syntax-lexical.html#SQL-SYNTAX-STRINGS-UESCAPE

> WTF?  Whose feature is this?  What's the use case?

Peter put it in, I think.  It is in the SQL:2008 spec, but that doesn't
change the fact that it's a horribly bad piece of design.
        regards, tom lane


Re: Unicode string literals versus the world

From
Josh Berkus
Date:
> Peter put it in, I think.  It is in the SQL:2008 spec, but that doesn't
> change the fact that it's a horribly bad piece of design.

Hmmm.  We're not going to implement *everything* in the spec; nobody 
does, even IBM.  I think maybe these kinds of additions need to be 
hashed out for value so we don't waste Peter's work in the future.  Peter?

-- 
Josh Berkus
PostgreSQL Experts Inc.
www.pgexperts.com


Re: Unicode string literals versus the world

From
Peter Eisentraut
Date:
On Saturday 11 April 2009 00:54:25 Tom Lane wrote:
> It gets worse though: I have seldom seen such a badly designed piece of
> syntax as the Unicode string syntax --- see
> http://developer.postgresql.org/pgdocs/postgres/sql-syntax-lexical.html#SQL
>-SYNTAX-STRINGS-UESCAPE
>
> You scan the string, and then after that they tell you what the escape
> character is!?  Not to mention the obvious ambiguity with & as an
> operator.
>
> If we let this go into 8.4, our previous rounds with security holes
> caused by careless string parsing will look like a day at the beach.
> No frontend that isn't fully cognizant of the Unicode string syntax is
> going to parse such things correctly --- it's going to be trivial for
> a bad guy to confuse a quoting mechanism as to what's an escape and what
> isn't.

Note that the escape character marks the Unicode escapes; it doesn't affect the 
quote characters that delimit the string.  So offhand I can't see any potential 
for quote confusion/SQL injection type problems.  Please elaborate if you see 
a problem.

If there are problems, we could consider getting rid of the UESCAPE clause.  
Without it, the U&'' strings would behave much like the E'' strings.  But I'd 
like to understand the problem first.


Re: Unicode string literals versus the world

From
Peter Eisentraut
Date:
On Saturday 11 April 2009 21:50:29 Josh Berkus wrote:
> On 4/11/09 11:47 AM, Marko Kreen wrote:
> > On 4/11/09, Tom Lane<tgl@sss.pgh.pa.us>  wrote:
> >>   It gets worse though: I have seldom seen such a badly designed piece
> >> of syntax as the Unicode string syntax --- see
> >>  
> >> http://developer.postgresql.org/pgdocs/postgres/sql-syntax-lexical.html#
> >>SQL-SYNTAX-STRINGS-UESCAPE
>
> WTF?  Whose feature is this?  What's the use case?

The use case is approximately the same as &#xxxx; in HTML: entering Unicode 
characters that your screen or keyboard cannot easily produce.  It's a 
desperately needed feature for me.
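
For concreteness, here is what it looks like, along the lines of the
examples in the documentation page cited upthread (assuming
standard_conforming_strings is on):

  U&'d\0061t\+000061'                 -- same as 'data'
  U&'d!0061t!+000061' UESCAPE '!'     -- same thing with an explicit escape character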



Re: Unicode string literals versus the world

From
Marko Kreen
Date:
On 4/14/09, Peter Eisentraut <peter_e@gmx.net> wrote:
> On Saturday 11 April 2009 00:54:25 Tom Lane wrote:
>  > It gets worse though: I have seldom seen such a badly designed piece of
>  > syntax as the Unicode string syntax --- see
>  > http://developer.postgresql.org/pgdocs/postgres/sql-syntax-lexical.html#SQL
>  >-SYNTAX-STRINGS-UESCAPE
>  >
>  > You scan the string, and then after that they tell you what the escape
>  > character is!?  Not to mention the obvious ambiguity with & as an
>  > operator.
>  >
>  > If we let this go into 8.4, our previous rounds with security holes
>  > caused by careless string parsing will look like a day at the beach.
>  > No frontend that isn't fully cognizant of the Unicode string syntax is
>  > going to parse such things correctly --- it's going to be trivial for
>  > a bad guy to confuse a quoting mechanism as to what's an escape and what
>  > isn't.
>
>
> Note that the escape character marks the Unicode escapes; it doesn't affect the
>  quote characters that delimit the string.  So offhand I can't see any potential
>  for quote confusion/SQL injection type problems.  Please elaborate if you see
>  a problem.
>
>  If there are problems, we could consider getting rid of the UESCAPE clause.
>  Without it, the U&'' strings would behave much like the E'' strings.  But I'd
>  like to understand the problem first.

I think the problem is that they should not act like E'' strings, but they
should act like plain '' strings - they should follow stdstr setting.

That way existing tools that may (or may not..) understand E'' and stdstr
settings, but definitely have not heard about U&'' strings can still
parse the SQL without new surprises.

If they already act that way then keeping U& should be fine.

And if UESCAPE does not affect the main string parsing, but is handled in
a second pass over the parsed string - like bytea's \ escapes - then that
should also be fine and should not cause any new surprises.
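
For comparison, that is how bytea input already works: the string-literal
parser runs first, and the bytea input function does its own unescaping in
a second pass over the result, e.g.:

  SELECT E'\\101'::bytea;   -- literal parser produces \101, bytea input then yields the single byte 0x41 ('A')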

But if not, it must go.

I would prefer that such quoting extensions wait until the
stdstr=on setting is the only mode Postgres operates in.
Fitting new quoting schemes into an environment with a flippable stdstr
setting will be rather painful for everyone.

I still stand on my proposal, how about extending E'' strings with
unicode escapes (eg. \uXXXX)?  The E'' strings are already more
clearly defined than '' and they are our "own", we don't need to
consider random standards, but can consider our sanity.

-- 
marko


Re: Unicode string literals versus the world

From
Marko Kreen
Date:
On 4/14/09, Peter Eisentraut <peter_e@gmx.net> wrote:
> On Tuesday 14 April 2009 14:38:38 Marko Kreen wrote:
>  > I think the problem is that they should not act like E'' strings, but they
>  > should act like plain '' strings - they should follow stdstr setting.
>  >
>  > That way existing tools that may (or may not..) understand E'' and stdstr
>  > settings, but definitely have not heard about U&'' strings can still
>  > parse the SQL without new surprises.
>
>
> Can you be more specific in what "surprises" you expect?  What algorithms do
>  you suppose those "existing tools" use and what expectations do they have?

If the parsing does not happen in 2 passes and it does not take the
stdstr setting into account, then the default breakage would be:
  stdstr=off, U&' \' UESCAPE '!'.

And anything, whose security or functionality depends on parsing SQL
can be broken that way.

Broken functionality would be e.g. Slony (or another replication solution)
distributing developer-written SQL code to a bunch of nodes.  It needs to
parse a text file into SQL statements and execute them separately.

There are probably other solutions that expect to understand SQL at least
at the token level to function correctly.  (pgpool, Java probably has
something depending on it, etc.)

>  > I still stand on my proposal, how about extending E'' strings with
>  > unicode escapes (eg. \uXXXX)?  The E'' strings are already more
>  > clearly defined than '' and they are our "own", we don't need to
>  > consider random standards, but can consider our sanity.
>
>
> This doesn't excite me.  I think the tendency should be to get rid of E''
>  usage, because its definition of escape sequences is single-byte and ASCII
>  centric and thus overall a legacy construct.

Why are you concentrating only on the \0xx escapes?  The \\, \n, etc.
seem standard and forward-looking enough.  Yes, Unicode escapes are
missing, but we can add them without breaking anything.

>  Certainly, we will want to keep
>  around E'' for a long time or forever, but it is a legitimate goal for
>  application writers to not use it, which is after all the reason behind this
>  whole standards-conforming strings project.  I wouldn't want to have a
>  forward-looking feature such as the Unicode escapes be burdened with that kind
>  of legacy behavior.
>
>  Also note that Unicode escapes are also available for identifiers, for which
>  there is no existing E"" that you can add it to.

Well, I was not rejecting the standard quoting, but suggesting we
postpone it until the stdstr mess is sorted out.  We can use \uXX
in the meantime, and I think most Postgres users would prefer to keep
using it...

-- 
marko


Re: Unicode string literals versus the world

From
Tom Lane
Date:
Peter Eisentraut <peter_e@gmx.net> writes:
> On Saturday 11 April 2009 00:54:25 Tom Lane wrote:
>> If we let this go into 8.4, our previous rounds with security holes
>> caused by careless string parsing will look like a day at the beach.

> Note that the escape character marks the Unicode escapes; it doesn't
> affect the quote characters that delimit the string.  So offhand I
> can't see any potential for quote confusion/SQL injection type
> problems.  Please elaborate if you see a problem.

The problem is the interaction with non-standard-conforming strings.

Case 1:
select u&'foo\' uescape ',' ...

The backend will see the backslash as just a data character, and
will think that "..." is live SQL text.  A non-Unicode-literal-aware
frontend will think that the backslash escapes the second quote, the
comma is live SQL text, and the ... is quoted material.  Construction
of an actual SQL injection attack is left as an exercise for the reader,
but certainly the raw material is here.

Case 2:
select u&'foo' uescape '\' ...

Again, any existing frontend code will think that the backslash quotes
the final quote and the ... is quoted material.  This one is
particularly nasty because we allow arbitrary amounts of whitespace and
commenting on either side of "uescape":

select u&'foo' /* hello joe, do you /* understand nested comments today?*/       -- yes, this one too */uescape-- but not
this one /*'\' ...

I suspect that it's actually impossible to parse such a thing correctly
without a full-fledged flex lexer or something of equivalent complexity.
Certainly it's a couple of orders of magnitude harder than it is for
either standard-conforming or E'' literals.

Case 3:
select u&'foo\' uescape ',' ...
select u & 'foo\' uescape ',' ...

In the first form the ... is live SQL, in the second form it is quoted
material.  This means that you might correctly validate a query and then
have your results invalidated by later processing that innocently adds
or removes whitespace.  (This is particularly nasty in a standard that
demands we parse "x/-1" and "x / -1" the same ...)

So what we've got here is a whole new set of potential SQL injection
attacks by confusing frontend literal-syntax checking, plus a truly
staggering increase in the minimum *required* complexity of such
checking.  I understand the usefulness of being able to write Unicode
code points, but they're not useful enough to justify this syntax.

This thread has already mentioned a couple of ways we could add the
same facility without making any lexical-level changes, at least for
data values.  I admit that the SQL:2008 way also covers Unicode code
points in identifiers, which we can't emulate without a lexical change;
but frankly I think the use-case for that is so thin as to be almost
nonexistent.  Who is going to choose identifiers that they cannot easily
type on their keyboards?
        regards, tom lane


Re: Unicode string literals versus the world

From
Andrew Dunstan
Date:

Marko Kreen wrote:
> I still stand on my proposal, how about extending E'' strings with
> unicode escapes (eg. \uXXXX)?  The E'' strings are already more
> clearly defined than '' and they are our "own", we don't need to
> consider random standards, but can consider our sanity.
>
>   

I suspect there would be lots more support in the user community, where 
\uXXXX is well understood in a number of contexts (Java and ECMAScript, 
for example). It's also tolerably sane.

cheers

andrew


Re: Unicode string literals versus the world

From
Tom Lane
Date:
Marko Kreen <markokr@gmail.com> writes:
> I would prefer that such quoting extensions wait until the
> stdstr=on setting is the only mode Postgres operates in.
> Fitting new quoting schemes into an environment with a flippable stdstr
> setting will be rather painful for everyone.

It would certainly be a lot safer to wait until non-standard-conforming
strings don't exist anymore.  The problem is that that may never happen,
and is certainly not on the roadmap to happen in the foreseeable future.

> I still stand on my proposal, how about extending E'' strings with
> unicode escapes (eg. \uXXXX)?  The E'' strings are already more
> clearly defined than '' and they are our "own", we don't need to
> consider random standards, but can consider our sanity.

That's one way we could proceed.  The other proposal that seemed
attractive to me was a decode-like function:
        uescape('foo\00e9bar')
        uescape('foo\00e9bar', '\')

(double all the backslashes if you assume not
standard_conforming_strings).  The arguments in favor of this one
are (1) you can apply it to the result of an expression, it's not
strictly tied to literals; and (2) it's a lot lower-footprint solution
since it doesn't affect basic literal handling.  If you wish to suppose
that this is only a stopgap until someday when we can implement the SQL
standard syntax more safely, then low footprint is good.  One could
even imagine back-porting this into existing releases as a user-defined
function.
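
For illustration, usage would look about like this (hypothetical, since no
such function exists today):

  SELECT uescape('foo\00e9bar');        -- would yield 'fooébar'
  SELECT uescape('foo!00e9bar', '!');   -- same, with an explicit escape character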

The solution with \u in extended literals is probably workable too.
I'm slightly worried about the possibility of issues with code that
thinks it knows what an E-literal means but doesn't really.  In
particular something might think it knows that "\u" just means "u",
and proceed to strip the backslash.  I don't see a path for that to
become a security hole though, only a garden-variety bug.  So I could
live with that one on the grounds of being easier to use (which it
would be, because of less typing compared to uescape()).
        regards, tom lane


Re: Unicode string literals versus the world

From
Peter Eisentraut
Date:
On Tuesday 14 April 2009 14:38:38 Marko Kreen wrote:
> I think the problem is that they should not act like E'' strings, but they
> should act like plain '' strings - they should follow stdstr setting.
>
> That way existing tools that may (or may not..) understand E'' and stdstr
> settings, but definitely have not heard about U&'' strings can still
> parse the SQL without new surprises.

Can you be more specific in what "surprises" you expect?  What algorithms do 
you suppose those "existing tools" use and what expectations do they have?

> I still stand on my proposal, how about extending E'' strings with
> unicode escapes (eg. \uXXXX)?  The E'' strings are already more
> clearly defined than '' and they are our "own", we don't need to
> consider random standards, but can consider our sanity.

This doesn't excite me.  I think the tendency should be to get rid of E'' 
usage, because its definition of escape sequences is single-byte and ASCII 
centric and thus overall a legacy construct.  Certainly, we will want to keep 
around E'' for a long time or forever, but it is a legitimate goal for 
application writers to not use it, which is after all the reason behind this 
whole standards-conforming strings project.  I wouldn't want to have a 
forward-looking feature such as the Unicode escapes be burdened with that kind 
of legacy behavior.

Also note that Unicode escapes are also available for identifiers, for which 
there is no existing E"" that you can add it to.
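
For example, borrowing the identifier example from the lexical-structure
documentation (hypothetical table, for illustration only):

  CREATE TABLE U&"\0441\043B\043E\043D" (id int);   -- table name is the Cyrillic word for "elephant"
  SELECT id FROM U&"\0441\043B\043E\043D";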


Re: Unicode string literals versus the world

From
Peter Eisentraut
Date:
On Saturday 11 April 2009 18:20:47 Sam Mason wrote:
> I can't see much support in the other database engines; searched for
> Oracle, MS-SQL, DB2 and Firebird.  MySQL has it planned for 7.1, so not
> for a while.

DB2 supports it, as far as I know.


Re: Unicode string literals versus the world

From
Robert Haas
Date:
On Tue, Apr 14, 2009 at 8:53 AM, Peter Eisentraut <peter_e@gmx.net> wrote:
> This doesn't excite me.  I think the tendency should be to get rid of E''
> usage, because its definition of escape sequences is single-byte and ASCII
> centric and thus overall a legacy construct.  Certainly, we will want to keep
> around E'' for a long time or forever, but it is a legitimate goal for
> application writers to not use it, which is after all the reason behind this
> whole standards-conforming strings project.  I wouldn't want to have a
> forward-looking feature such as the Unicode escapes be burdened with that kind
> of legacy behavior.
>
> Also note that Unicode escapes are also available for identifiers, for which
> there is no existing E"" that you can add it to.

Maybe I've just got my head deeply in the sand, but I don't understand
what the alternative to E'' supposedly is.  How am I supposed to write
the equivalent of E'\t\n\f' without using E''?  The
standard_conforming_strings syntax apparently supports no escapes of
any kind, which seems so hideously inconvenient that I can't even
imagine why someone wants that behavior.

...Robert


Re: Unicode string literals versus the world

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> Maybe I've just got my head deeply in the sand, but I don't understand
> what the alternative to E'' supposedly is.  How am I supposed to write
> the equivalent of E'\t\n\f' without using E''?  The
> standard_conforming_strings syntax apparently supports no escapes of
> any kind, which seems so hideously inconvenient that I can't even
> imagine why someone wants that behavior.

Well, quite aside from issues of compatibility with standards and other
databases, I'm sure there are lots of Windows users who are more
interested in being able to store a Windows pathname without doubling
their backslashes than they are in being able to type readable names
for ASCII control characters.  After all, in most cases you can get
those characters into a string just by typing them (especially if you
aren't using readline or something like it).
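
That is, with standard_conforming_strings on you can just write the path
(made-up path, for illustration):

  SELECT 'C:\Program Files\PostgreSQL\8.4\data';      -- backslashes are plain data
  SELECT E'C:\\Program Files\\PostgreSQL\\8.4\\data'; -- the doubled form works in either mode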

BTW, does anyone know whether Unicode includes the ASCII control
characters ... ie, is \u0009 a name for tab?  If so, maybe this
syntax is in part an attempt to cover that use-case in the standard.
        regards, tom lane


Re: Unicode string literals versus the world

From
Peter Eisentraut
Date:
On Tuesday 14 April 2009 20:35:21 Robert Haas wrote:
> Maybe I've just got my head deeply in the sand, but I don't understand
> what the alternative to E'' supposedly is.  How am I supposed to write
> the equivalent of E'\t\n\f' without using E''?

Well, the first alternative is to type those characters in literally.  The 
second alternative is the U&'' syntax. ;-)  The third alternative is to design 
applications that don't need this, because the processing behavior of those 
characters is quite unportable.  But of course in some cases using the E'' 
syntax is the most convenient.



Re: Unicode string literals versus the world

From
"David E. Wheeler"
Date:
On Apr 14, 2009, at 11:22 AM, Tom Lane wrote:

> BTW, does anyone know whether Unicode includes the ASCII control
> characters ... ie, is \u0009 a name for tab?  If so, maybe this
> syntax is in part an attempt to cover that use-case in the standard.

Yes, you can use, e.g., &#x09; in HTML to represent a tab character.

Best,

David


Re: Unicode string literals versus the world

From
Peter Eisentraut
Date:
On Tuesday 14 April 2009 17:13:00 Marko Kreen wrote:
> If the parsing does not happen in 2 passes and it does not take the
> stdstr setting into account, then the default breakage would be:
>
>    stdstr=off, U&' \' UESCAPE '!'.

I think we can handle that and the cases Tom presents by erroring out when the 
U& syntax is used with stdstr off.
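
That is, roughly (a sketch of the behavior, nothing is committed yet):

  SET standard_conforming_strings = off;
  SELECT U&'\0441\043B\043E\043D';   -- would be rejected with an error

  SET standard_conforming_strings = on;
  SELECT U&'\0441\043B\043E\043D';   -- ok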



Re: Unicode string literals versus the world

From
Peter Eisentraut
Date:
On Tuesday 14 April 2009 21:22:29 Tom Lane wrote:
> BTW, does anyone know whether Unicode includes the ASCII control
> characters ... ie, is \u0009 a name for tab?  If so, maybe this
> syntax is in part an attempt to cover that use-case in the standard.

Yes on both.


Re: Unicode string literals versus the world

From
Peter Eisentraut
Date:
On Tuesday 14 April 2009 18:54:33 Tom Lane wrote:
> The other proposal that seemed
> attractive to me was a decode-like function:
>
>         uescape('foo\00e9bar')
>         uescape('foo\00e9bar', '\')

This was discussed previously, but rejected with the following argument:

There are some other disadvantages for making a function call.  You 
couldn't use that kind of literal in any other place where the parser 
calls for a string constant: role names, tablespace locations, 
passwords, copy delimiters, enum values, function body, file names.

Some of those appear to be very plausible use cases.


Re: Unicode string literals versus the world

From
Peter Eisentraut
Date:
On Tuesday 14 April 2009 17:32:00 Tom Lane wrote:
>  I admit that the SQL:2008 way also covers Unicode code
> points in identifiers, which we can't emulate without a lexical change;
> but frankly I think the use-case for that is so thin as to be almost
> nonexistent.  Who is going to choose identifiers that they cannot easily
> type on their keyboards?

For example, table names are automatically generated, or you write a test case 
for weird looking names, or you want to add special characters in an 
identifier that will later be displayed somewhere, or in general you are 
writing an application for a foreign language.


Re: Unicode string literals versus the world

From
Tom Lane
Date:
Peter Eisentraut <peter_e@gmx.net> writes:
> I think we can handle that and the cases Tom presents by erroring out
> when the U& syntax is used with stdstr off.

I think you're missing the point --- this is not about whether the
syntax is unambiguous (it is already) but about whether a frontend that
doesn't understand it 100% will be secure against subversion.  I have no
confidence in the latter assumption.
        regards, tom lane


Re: Unicode string literals versus the world

From
Robert Haas
Date:
On Tue, Apr 14, 2009 at 2:22 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> Maybe I've just got my head deeply in the sand, but I don't understand
>> what the alternative to E'' supposedly is.  How am I supposed to write
>> the equivalent of E'\t\n\f' without using E''?  The
>> standard_conforming_strings syntax apparently supports no escapes of
>> any kind, which seems so hideously inconvenient that I can't even
>> imagine why someone wants that behavior.
>
> Well, quite aside from issues of compatibility with standards and other
> databases, I'm sure there are lots of Windows users who are more
> interested in being able to store a Windows pathname without doubling
> their backslashes than they are in being able to type readable names
> for ASCII control characters.  After all, in most cases you can get
> those characters into a string just by typing them (especially if you
> aren't using readline or something like it).

Well, that's fine, but that's a long way from Peter's statement that
"I think the tendency should be to get rid of E'' usage".  It's only
been a minor inconvenience for me to change my applications to use
E'', but I certainly don't see how I could get by without it, and it's
far more like other programming languages that I use (e.g. C) than the
standard syntax is.

...Robert


Re: Unicode string literals versus the world

From
Tom Lane
Date:
Peter Eisentraut <peter_e@gmx.net> writes:
> On Tuesday 14 April 2009 18:54:33 Tom Lane wrote:
>> The other proposal that seemed
>> attractive to me was a decode-like function:
>> 
>> uescape('foo\00e9bar')
>> uescape('foo\00e9bar', '\')

> This was discussed previously, but rejected with the following argument:

> There are some other disadvantages for making a function call.  You 
> couldn't use that kind of literal in any other place where the parser 
> calls for a string constant: role names, tablespace locations, 
> passwords, copy delimiters, enum values, function body, file names.

I'm less than convinced that those are really plausible use-cases for
characters that one is unable to type directly.  However, I'll grant the
point.  So that narrows us down to considering the \u extension to E''
strings as a saner and safer alternative to the spec's syntax.
        regards, tom lane


Re: Unicode string literals versus the world

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> Well, that's fine, but that's a long way from Peter's statement that
> "I think the tendency should be to get rid of E'' usage".

Bear in mind that that's Peter's opinion; it's not necessarily shared
by anyone else.  I was just responding to your assertion of the
diametrically opposed position that non-E strings are useless (which
I don't share either).
        regards, tom lane


Re: Unicode string literals versus the world

From
Peter Eisentraut
Date:
On Tuesday 14 April 2009 21:48:12 Tom Lane wrote:
> Peter Eisentraut <peter_e@gmx.net> writes:
> > I think we can handle that and the cases Tom presents by erroring out
> > when the U& syntax is used with stdstr off.
>
> I think you're missing the point --- this is not about whether the
> syntax is unambiguous (it is already) but about whether a frontend that
> doesn't understand it 100% will be secure against subversion.  I have no
> confidence in the latter assumption.

I think I am getting the point quite well.  Do you have an example of how
this can be subverted?



Re: Unicode string literals versus the world

From
Tom Lane
Date:
"Meredith L. Patterson" <mlp@osogato.com> writes:
> Tom Lane wrote:
>> I suspect that it's actually impossible to parse such a thing correctly
>> without a full-fledged flex lexer or something of equivalent complexity.

> Is there a reason not to use a full-fledged flex lexer?

The point is that that's a pretty large imposition on client code that
we don't control or maintain, in order to get a feature that could be
gotten in much less dangerous ways that don't impact any code outside
PG.

> I'd be willing to take a crack at such a thing, but I'm working 80-hour
> weeks through the end of June and likely wouldn't be able to put in any
> time on it till then. So I definitely couldn't promise anything for 8.4,
> but if putting it off till 8.5 works, sign me up.

Shall we pass your name on to every package using Postgres, then?  This
is *not* about code within Postgres.
        regards, tom lane


Re: Unicode string literals versus the world

From
"Meredith L. Patterson"
Date:
Tom Lane wrote:
> I suspect that it's actually impossible to parse such a thing correctly
> without a full-fledged flex lexer or something of equivalent complexity.
> Certainly it's a couple of orders of magnitude harder than it is for
> either standard-conforming or E'' literals.

Is there a reason not to use a full-fledged flex lexer?

I'd be willing to take a crack at such a thing, but I'm working 80-hour
weeks through the end of June and likely wouldn't be able to put in any
time on it till then. So I definitely couldn't promise anything for 8.4,
but if putting it off till 8.5 works, sign me up.

Cheers,
--mlp
_____________________
Meredith L. Patterson
Founder and CTO
Osogato, Inc.


Re: Unicode string literals versus the world

From
"Meredith L. Patterson"
Date:
Tom Lane wrote:
> This is *not* about code within Postgres.

One typically provides libraries for this sort of thing, but your point
is taken; suggestion withdrawn.

--mlp
_____________________
Meredith L. Patterson
Founder and CTO
Osogato, Inc.


Re: Unicode string literals versus the world

From
Marko Kreen
Date:
On 4/14/09, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Peter Eisentraut <peter_e@gmx.net> writes:
>  > On Tuesday 14 April 2009 18:54:33 Tom Lane wrote:
>  >> The other proposal that seemed
>  >> attractive to me was a decode-like function:
>  >>
>  >> uescape('foo\00e9bar')
>  >> uescape('foo\00e9bar', '\')
>
>  > This was discussed previously, but rejected with the following argument:
>
>  > There are some other disadvantages for making a function call.  You
>  > couldn't use that kind of literal in any other place where the parser
>  > calls for a string constant: role names, tablespace locations,
>  > passwords, copy delimiters, enum values, function body, file names.
>
>
> I'm less than convinced that those are really plausible use-cases for
>  characters that one is unable to type directly.  However, I'll grant the
>  point.  So that narrows us down to considering the \u extension to E''
>  strings as a saner and safer alternative to the spec's syntax.

My vote would go to \u.  The U& may be "SQL standard", but it's different
from any established practical standard.


Alternative would be to make U& follow stdstr setting:

stdstr=on -> you get fully standard-conforming syntax:
 U&'\xxx' UESCAPE '\'

stdstr=off -> you need to follow old quoting rules:
 U&'\\xxx' UESCAPE '\\'

This would result in safe syntax that is fully standard-compliant when
stdstr=on.  The only downside would be that in practice - with stdstr=off -
it would be unusable.


A third alternative would be to do both of them - \u as a usable method
and safe U& to tick the checkbox for SQL-standard compliance.
If we do want U&, I would prefer that to a U&-only syntax.

-- 
marko


Re: Unicode string literals versus the world

From
Robert Haas
Date:
On Tue, Apr 14, 2009 at 2:55 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> Well, that's fine, but that's a long way from Peter's statement that
>> "I think the tendency should be to get rid of E'' usage".
>
> Bear in mind that that's Peter's opinion; it's not necessarily shared
> by anyone else.  I was just responding to your assertion of the
> diametrically opposed position that non-E strings are useless (which
> I don't share either).

Useless might be an overstatement, but I certainly have gotten no
benefit out of them.  It seems decidedly odd to me to propose that
users embed literal control characters in their code/SQL scripts in
lieu of using escape sequences.  If that were a normal and reasonable
thing to do then I expect C, C++, Perl, Python, Ruby, Javascript, and
countless other languages wouldn't provide this functionality.  In
reality, most of them do provide it, sometimes as the only option.
And as to your point about Windows pathnames, if you're trying to
avoid escaping lots of backslashes, dollar-quoting is a perfectly
adequate solution.
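
For instance (illustrative only):

  SELECT $$C:\Program Files\PostgreSQL$$;   -- no doubling needed, regardless of standard_conforming_strings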

The real motivation for standard_conforming_strings is the one
embedded in the name.  And I'm fine with that.  But I don't see E''
disappearing from my code any time soon.  I periodically have a need
for things like foo || E'\n' || bar and it screws up the formatting of
the code to write foo || '
' || bar.  Writing it that way, it's also entirely non-obvious whether
there are any spaces or tabs after the opening quote and before the
newline, and if so whether or not they are intentional.  But I don't
mind if someone ELSE likes the new, standard strings - it just isn't
me.  :-)

...Robert


Re: Unicode string literals versus the world

From
Sam Mason
Date:
On Tue, Apr 14, 2009 at 04:01:48PM +0300, Peter Eisentraut wrote:
> On Saturday 11 April 2009 18:20:47 Sam Mason wrote:
> > I can't see much support in the other database engines; searched for
> > Oracle, MS-SQL, DB2 and Firebird.  MySQL has it planned for 7.1, so not
> > for a while.
> 
> DB2 supports it, as far as I know.

Doh, yes it does, doesn't it.  Sorry, I searched for a bit and failed to
find anything before.  Looks as though the signal-to-noise ratio was far
too low, as I've just searched again and found a (single) reference to
their docs describing the feature[1].

I've also just noticed that the MySQL todo item points to several other
implementations and how they handle Unicode escape sequences.  The most
common option (bearing in mind that this is a sample of mainly FOSS
databases) seems to be doing some variant of '\u0123', as in the style
of Python.  This is only supported for literals and no support for
identifiers appears to be provided.

--  Sam  http://samason.me.uk/
[1] http://publib.boulder.ibm.com/infocenter/db2luw/v9r5/topic/com.ibm.db2.luw.sql.ref.doc/doc/r0000731.html


Re: Unicode string literals versus the world

From
"David E. Wheeler"
Date:
On Apr 15, 2009, at 4:45 AM, Sam Mason wrote:

> Doh, yes it does doesn't it.  Sorry I searched for a bit and failed to
> find anything before.  Looks as though the signal to noise ratio was
> far
> too low as I've just searched again and found a (single) reference to
> their docs describing the feature[1].

This is ugly, but not completely unpalatable:

  U&'\0141ód\017A is a city in Poland'
  U&'c:\\temp'
  U&'@+01D11E' UESCAPE '@'

Wouldn't we just then say that U&'' strings are always standard-
conforming?

Best,

David

Re: Unicode string literals versus the world

From
Martijn van Oosterhout
Date:
On Tue, Apr 14, 2009 at 08:10:54AM -0400, Andrew Dunstan wrote:
> Marko Kreen wrote:
> >I still stand on my proposal, how about extending E'' strings with
> >unicode escapes (eg. \uXXXX)?  The E'' strings are already more
> >clearly defined than '' and they are our "own", we don't need to
> >consider random standards, but can consider our sanity.
>
> I suspect there would be lots more support in the user community, where
> \uXXXX is well understood in a number of contexts (Java and ECMAScript,
> for example). It's also tolerably sane.

By the way, that's an example of how to do it wrong: there are more
than 2^16 Unicode characters, so you want to be able to support the full
21-bit range if you're going to do it right.

FWIW, I prefer the perl syntax which simply extends \x: \x{1344}, which
makes it clear it's hex and doesn't make assumptions as to how many
characters are used.

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Please line up in a tree and maintain the heap invariant while
> boarding. Thank you for flying nlogn airlines.

Re: Unicode string literals versus the world

From
Tom Lane
Date:
"David E. Wheeler" <david@kineticode.com> writes:
> Wouldn't we just then say that U&'' strings are always standard- 
> conforming?

That's exactly what's causing the problem --- they are, but there
is lots of software that won't know it.
        regards, tom lane


Re: Unicode string literals versus the world

From
Greg Stark
Date:
On Wed, Apr 15, 2009 at 6:23 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Wouldn't we just then say that U&'' strings are always standard-
>> conforming?
>
> That's exactly what's causing the problem --- they are, but there
> is lots of software that won't know it.


We could say U&'' escapes only work if you have
standard_conforming_strings set to true.


-- 
greg


Re: Unicode string literals versus the world

From
Greg Stark
Date:
On Wed, Apr 15, 2009 at 6:52 PM, Greg Stark <stark@enterprisedb.com> wrote:
> On Wed, Apr 15, 2009 at 6:23 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> Wouldn't we just then say that U&'' strings are always standard-
>>> conforming?
>>
>> That's exactly what's causing the problem --- they are, but there
>> is lots of software that won't know it.
>
>
> We could say U&'' escapes only work if you have
> standard_conforming_strings set to true.

Or say that if you have standard_conforming_strings set to false then any
U&'' string which contains a literal \ or ' is an error. You shouldn't ever
really need either, since you could use the Unicode escape for either
after all.



-- 
greg


Re: Unicode string literals versus the world

From
Marko Kreen
Date:
On 4/15/09, Greg Stark <stark@enterprisedb.com> wrote:
> On Wed, Apr 15, 2009 at 6:23 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>  >> Wouldn't we just then say that U&'' strings are always standard-
>  >> conforming?
>  >
>  > That's exactly what's causing the problem --- they are, but there
>  > is lots of software that won't know it.
>
> We could say U&'' escapes only work if you have
>  standard_conforming_strings set to true.

What's wrong with requiring U& to conform to the stdstr=off quoting rules?

You can use UESCAPE if you don't want to double backslashes.

-- 
marko


Re: Unicode string literals versus the world

From
Tom Lane
Date:
Marko Kreen <markokr@gmail.com> writes:
> What's wrong with requiring U& to conform to the stdstr=off quoting rules?

The sole and only excuse for that misbegotten syntax is to be exactly
SQL spec compliant --- otherwise we might as well pick something saner.
So it needs to work like stdstr=on.  I thought Peter's proposal of
rejecting it altogether when stdstr=off might be reasonable.  The space
sensitivity around the & still sucks, but I have not (yet) thought of
a credible security exploit for that.
        regards, tom lane


Re: Unicode string literals versus the world

From
Marko Kreen
Date:
On 4/15/09, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Marko Kreen <markokr@gmail.com> writes:
> > What's wrong with requiring U& to conform to the stdstr=off quoting rules?
>
>  The sole and only excuse for that misbegotten syntax is to be exactly
>  SQL spec compliant --- otherwise we might as well pick something saner.
>  So it needs to work like stdstr=on.  I thought Peter's proposal of
>  rejecting it altogether when stdstr=off might be reasonable.  The space
>  sensitivity around the & still sucks, but I have not (yet) thought of
>  a credible security exploit for that.

So the U& syntax is only available if stdstr=on?  Sort of makes sense.

As both this and the doubling-\\ way would mean we should have a usable
alternative in case of stdstr=off too, does that mean in the end we have agreed
to accept \u also?

-- 
marko


Re: Unicode string literals versus the world

From
Andrew Dunstan
Date:

Martijn van Oosterhout wrote:
> On Tue, Apr 14, 2009 at 08:10:54AM -0400, Andrew Dunstan wrote:
>   
>> Marko Kreen wrote:
>>     
>>> I still stand on my proposal, how about extending E'' strings with
>>> unicode escapes (eg. \uXXXX)?  The E'' strings are already more
>>> clearly defined than '' and they are our "own", we don't need to
>>> consider random standards, but can consider our sanity.
>>>       
>> I suspect there would be lots more support in the user community, where 
>> \uXXXX is well understood in a number of contexts (Java and ECMAScript, 
>> for example). It's also tolerably sane.
>>     
>
> By the way, that's an example of how to do it wrong, there are more
> than 2^16 unicode characters, you want to be able to support the full
> 21-bit range if you're going to do it right.
>
> FWIW, I prefer the perl syntax which simply extends \x: \x{1344}, which
> makes it clear it's hex and doesn't make assumptions as to how many
> characters are used.
>   

I could live with either. Wikipedia says: "The characters outside the 
first plane usually have very specialized or rare use." For years we 
rejected all characters beyond the first plane, and while that's fixed 
now, the volume of complaints wasn't huge.

cheers

andrew

> Have a nice day,
>   


Re: Unicode string literals versus the world

From
Tom Lane
Date:
Marko Kreen <markokr@gmail.com> writes:
> As both this and the doubling-\\ way would mean we should have a usable
> alternative in case of stdstr=off too, does that mean in the end we have agreed
> to accept \u also?

Given Martijn's complaint about more-than-16-bit code points, I think
the \u proposal is not mature enough to go into 8.4.  We can think
about some version of that later, if there's enough interest.
        regards, tom lane


Re: Unicode string literals versus the world

From
Marko Kreen
Date:
On 4/15/09, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Marko Kreen <markokr@gmail.com> writes:
> > As both this and the doubling-\\ way would mean we should have a usable
>  > alternative in case of stdstr=off too, does that mean in the end we have agreed
>  > to accept \u also?
>
> Given Martijn's complaint about more-than-16-bit code points, I think
>  the \u proposal is not mature enough to go into 8.4.  We can think
>  about some version of that later, if there's enough interest.

I think it would be a good idea.  Basically we should pick one of a
couple of pre-existing sane schemes.  Here is a quick summary
of Python, Perl and Java:

Python [1]:
  \uXXXX         - 16-bit codepoint
  \UXXXXXXXX     - 32-bit codepoint
  \N{char-name}  - Character by name

Perl [2]:
  \x{XXXX..}     - {} contains hexadecimal codepoint
  \N{char-name}  - Unicode char name

Java [3]:
 \uXXXX         - 16-bit codepoint

Perl is OK, but the \x makes one think of a literal hexadecimal byte, and
that's because they have extended their literal byte escapes to support
Unicode.  So I doubt we should promote it further.  \u{} would be nicer,
but that would not be an established standard.

Both Python and Java allow using \u to encode higher codepoints with
surrogate pairs.  Which sort of makes it a standard, with Python a superset
of Java.  (Obviously that does not mean you need to store them
as surrogate pairs.)

The problem with having only \u is that it would make it hard to enter
higher codepoints manually.  So \U would also be good to have.

There is no hurry with \N{}, but if it were possible, it would be the
main reason to have custom Unicode escaping.

So my proposal would be Python escapes without \N{}.
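
In other words, something like this (purely a proposal sketch, nothing of
the sort is implemented):

  SELECT E'Citro\u00ebn';       -- \u for BMP codepoints
  SELECT E'clef: \U0001D11E';   -- \U for codepoints beyond the BMP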

Whether we should have it in 8.4, I don't know.  If we will have U&,
but it does not work with stdstr=off, this seems to hint that we should
have some other escaping method available in 8.4 for stdstr=off users?

-- 
marko

[1] http://docs.python.org/reference/lexical_analysis.html#string-literals
[2] http://perldoc.perl.org/perluniintro.html
[3] http://java.sun.com/docs/books/jls/third_edition/html/lexical.html


Re: Unicode string literals versus the world

From
Tatsuo Ishii
Date:
> >>> I still stand on my proposal, how about extending E'' strings with
> >>> unicode escapes (eg. \uXXXX)?  The E'' strings are already more
> >>> clearly defined than '' and they are our "own", we don't need to
> >>> consider random standards, but can consider our sanity.
> >>>       
> >> I suspect there would be lots more support in the user community, where 
> >> \uXXXX is well understood in a number of contexts (Java and ECMAScript, 
> >> for example). It's also tolerably sane.
> >>     
> >
> > By the way, that's an example of how to do it wrong, there are more
> > than 2^16 unicode characters, you want to be able to support the full
> > 21-bit range if you're going to do it right.
> >
> > FWIW, I prefer the perl syntax which simply extends \x: \x{1344}, which
> > makes it clear it's hex and doesn't make assumptions as to how many
> > characters are used.
> >   
> 
> I could live with either. Wikipedia says: "The characters outside the 
> first plane usually have very specialized or rare use." For years we 
> rejected all characters beyond the first plane, and while that's fixed 
> now, the volume of complaints wasn't huge.

I you mean "first plane" as BMP (i.e. 16bit range), above is not true
for PostgreSQL 7.3 or later at least.
--
Tatsuo Ishii
SRA OSS, Inc. Japan


Re: Unicode string literals versus the world

From
Tatsuo Ishii
Date:
> > >>> I still stand on my proposal, how about extending E'' strings with
> > >>> unicode escapes (eg. \uXXXX)?  The E'' strings are already more
> > >>> clearly defined than '' and they are our "own", we don't need to
> > >>> consider random standards, but can consider our sanity.
> > >>>       
> > >> I suspect there would be lots more support in the user community, where 
> > >> \uXXXX is well understood in a number of contexts (Java and ECMAScript, 
> > >> for example). It's also tolerably sane.
> > >>     
> > >
> > > By the way, that's an example of how to do it wrong, there are more
> > > than 2^16 unicode characters, you want to be able to support the full
> > > 21-bit range if you're going to do it right.
> > >
> > > FWIW, I prefer the perl syntax which simply extends \x: \x{1344}, which
> > > makes it clear it's hex and doesn't make assumptions as to how many
> > > characters are used.
> > >   
> > 
> > I could live with either. Wikipedia says: "The characters outside the 
> > first plane usually have very specialized or rare use." For years we 
> > rejected all characters beyond the first plane, and while that's fixed 
> > now, the volume of complaints wasn't huge.
> 
> I you mean "first plane" as BMP (i.e. 16bit range), above is not true
> for PostgreSQL 7.3 or later at least.

Oops. I meant 8.2 or later.
--
Tatsuo Ishii
SRA OSS, Inc. Japan


Re: Unicode string literals versus the world

From
Sam Mason
Date:
On Wed, Apr 15, 2009 at 11:19:42PM +0300, Marko Kreen wrote:
> On 4/15/09, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > Given Martijn's complaint about more-than-16-bit code points, I think
> >  the \u proposal is not mature enough to go into 8.4.  We can think
> >  about some version of that later, if there's enough interest.
> 
> I think it would be good idea. Basically we should pick one from
> couple of pre-existing sane schemes.  Here is quick summary
> of Python, Perl and Java:
> 
> Python [1]:
> 
>   \uXXXX         - 16-bit codepoint
>   \UXXXXXXXX     - 32-bit codepoint
>   \N{char-name}  - Character by name

Microsoft have also gone this way in C#, named code points are not
supported however.

> Perl [2]:
> 
>   \x{XXXX..}     - {} contains hexadecimal codepoint
>   \N{char-name}  - Unicode char name

Looks OK, but the 'x' seems somewhat redundant.  Why not just:
 \{xxxx}

This would be following the BitC[2] project, especially if it was more
like:
 \{U+xxxx}

e.g.
 \{U+03BB}

would be the lowercase lambda character.  Added appeal is in the fact
that this (i.e. U+03BB) is how the Unicode consortium spells code
points.

> Java [3]:
> 
>   \uXXXX         - 16-bit codepoint

AFAIK, Java isn't the best reference to choose; it assumed from an early
point in its design that Unicode characters were at most 16bits and
hence had to switch its internal representation to UTF-16.  I don't
program much Java these days to know how it's all worked out, but it
would be interesting to hear from people who regularly have to deal with
characters outside the BMP (i.e. code points greater than 65535).

--  Sam  http://samason.me.uk/
[1] http://msdn.microsoft.com/en-us/library/aa664669(VS.71).aspx
[2] http://www.bitc-lang.org/docs/bitc/spec.html#stringlit


Re: Unicode string literals versus the world

From
Marko Kreen
Date:
On 4/16/09, Sam Mason <sam@samason.me.uk> wrote:
> On Wed, Apr 15, 2009 at 11:19:42PM +0300, Marko Kreen wrote:
>  > On 4/15/09, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> > > Given Martijn's complaint about more-than-16-bit code points, I think
>  > >  the \u proposal is not mature enough to go into 8.4.  We can think
>  > >  about some version of that later, if there's enough interest.
>  >
>  > I think it would be good idea. Basically we should pick one from
>  > couple of pre-existing sane schemes.  Here is quick summary
>  > of Python, Perl and Java:
>  >
>  > Python [1]:
>  >
>  >   \uXXXX         - 16-bit codepoint
>  >   \UXXXXXXXX     - 32-bit codepoint
>  >   \N{char-name}  - Character by name
>
>
> Microsoft have also gone this way in C#, named code points are not
>  supported however.

And it also handles non-BMP codepoints with the \u escape similarly:
 http://en.csharp-online.net/ECMA-334:_9.4.1_Unicode_escape_sequences

This makes it even more of a standard.

>  > Perl [2]:
>  >
>  >   \x{XXXX..}     - {} contains hexadecimal codepoint
>  >   \N{char-name}  - Unicode char name
>
>
> Looks OK, but the 'x' seems somewhat redundant.  Why not just:
>
>   \{xxxx}
>
>  This would be following the BitC[2] project, especially if it was more
>  like:
>
>   \{U+xxxx}
>
>  e.g.
>
>   \{U+03BB}
>
>  would be the lowercase lambda character.  Added appeal is in the fact
>  that this (i.e. U+03BB) is how the Unicode consortium spells code
>  points.

We already got yet-another-unique-way-of-escaping-unicode with U&.

Now let's try to support some actual standard also.

>  > Java [3]:
>  >
>  >   \uXXXX         - 16-bit codepoint
>
>
> AFAIK, Java isn't the best reference to choose; it assumed from an early
>  point in its design that Unicode characters were at most 16bits and
>  hence had to switch its internal representation to UTF-16.  I don't
>  program much Java these days to know how it's all worked out, but it
>  would be interesting to hear from people who regularly have to deal with
>  characters outside the BMP (i.e. code points greater than 65535).

You did not read my mail carefully enough - Java and also Python/C#
already support non-BMP chars with '\u', in exactly the same (UTF-16) way.

-- 
marko


Re: Unicode string literals versus the world

From
Andrew Dunstan
Date:

Tatsuo Ishii wrote:
>>> I could live with either. Wikipedia says: "The characters outside the 
>>> first plane usually have very specialized or rare use." For years we 
>>> rejected all characters beyond the first plane, and while that's fixed 
>>> now, the volume of complaints wasn't huge.
>>>       
>> I you mean "first plane" as BMP (i.e. 16bit range), above is not true
>> for PostgreSQL 7.3 or later at least.
>>     
>
> Oops. I meant 8.2 or later.
>
>   

Umm, that's what I said. We used to do it like that, but we don't any more.

cheers

andrew


Re: Unicode string literals versus the world

From
Sam Mason
Date:
On Thu, Apr 16, 2009 at 02:47:20PM +0300, Marko Kreen wrote:
> On 4/16/09, Sam Mason <sam@samason.me.uk> wrote:
> > Microsoft have also gone this way in C#, named code points are not
> > supported however.
> 
> And it handles also non-BMP codepoints with \u escape similarly:
> 
>   http://en.csharp-online.net/ECMA-334:_9.4.1_Unicode_escape_sequences
> 
> This makes it even more standard.

I fail to see what you're pointing out here; as far as I understand it,
\u is for BMP code points and \U extends the range out to 32-bit code
points.  I can't see anything about non-BMP and \u in the above link;
you appear free to write your own surrogate pairs, but that seems like an
independent issue.

I'd not realised before that C# is specified to use UTF-16 as its
internal encoding.

> >  This would be following the BitC[2] project, especially if it was more
> >  like:
> >
> >   \{U+xxxx}
> 
> We already got yet-another-unique-way-of-escaping-unicode with U&.
> 
> Now let's try to support some actual standard also.

That comes across *very* negatively; I hope it's just a language issue.

I read your parent post as soliciting opinions on possible ways to
encode Unicode characters in PG's literals.  The U&'lit' was criticised,
you posted some suggestions, I followed up with what I hoped to be a
useful addition.  It seems useful here to separate "de jure" from "de
facto" standards; implementing U&'lit' would be following the de jure
standard, anything else would be de facto.

A survey of existing SQL implementations would seem to be more appropriate
as well:

Oracle: UNISTR(string-literal) and \xxxx
  It looks as though Oracle originally used UCS-2 internally (i.e. BMP only)
  but more recently Unicode support has been improved to allow other planes.

MS-SQL Server: 
  can't find anything remotely useful; best seems to be to use
  NCHAR(integer-expression) which looks somewhat unmaintainable.

DB2: U&string-literal and \xxxxxx
 i.e. it follows the SQL-2003 spec

FireBird:
 can't find much either; support looks somewhat low on the ground

MySQL:
 same again, seems to assume query is encoded in UTF-8

Summary seems to be that either I'm bad at searching or support for
Unicode doesn't seem very complete in the database world and people work
around it somehow.

> You did not read my mail carefully enough - Java, and also Python and C#,
> already support non-BMP chars with '\u' in exactly the same (UTF-16) way.

Again, I think this may be a language issue; if not then more verbose
explanations help, maybe something like "sorry, I obviously didn't
explain that very well".  You will of course have felt you explained it
perfectly well, but everybody enters a discussion with different
intuitions and biases, email has a nasty habit of accentuating these
differences and compounding them with language problems.

I'd never heard of UTF-16 surrogate pairs before this discussion and
hence didn't realise that it's valid to have a surrogate pair in place
of a single code point.  The docs say that <D800 DF02> corresponds to
U+10302, Python would appear to follow my intuitions in that:
 ord(u'\uD800\uDF02')

results in an error instead of giving back 66306, as I'd expect.  Is
this a bug in Python, my understanding, or something else?

--  Sam  http://samason.me.uk/


Re: Unicode string literals versus the world

From
Tom Lane
Date:
Sam Mason <sam@samason.me.uk> writes:
> I'd never heard of UTF-16 surrogate pairs before this discussion and
> hence didn't realise that it's valid to have a surrogate pair in place
> of a single code point.  The docs say that <D800 DF02> corresponds to
> U+10302, Python would appear to follow my intuitions in that:

>   ord(u'\uD800\uDF02')

> results in an error instead of giving back 66306, as I'd expect.  Is
> this a bug in Python, my understanding, or something else?

I might be wrong, but I think surrogate pairs are expressly forbidden in
all representations other than UTF16/UCS2.  We definitely forbid them
when validating UTF-8 strings --- that's per an RFC recommendation.
It sounds like Python is doing the same.
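
A minimal Python 3 check (assuming a stock CPython, where strings hold
code points) illustrates the same rule:

    # Surrogate code points are not legal in UTF-8: encoding one must fail.
    try:
        "\ud800\udf02".encode("utf-8")
    except UnicodeEncodeError as exc:
        print("rejected:", exc.reason)   # surrogates not allowed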
        regards, tom lane


Re: Unicode string literals versus the world

From
Sam Mason
Date:
On Thu, Apr 16, 2009 at 10:54:16AM -0400, Tom Lane wrote:
> Sam Mason <sam@samason.me.uk> writes:
> > I'd never heard of UTF-16 surrogate pairs before this discussion and
> > hence didn't realise that it's valid to have a surrogate pair in place
> > of a single code point.  The docs say that <D800 DF02> corresponds to
> > U+10302, Python would appear to follow my intuitions in that:
> 
> >   ord(u'\uD800\uDF02')
> 
> > results in an error instead of giving back 66306, as I'd expect.  Is
> > this a bug in Python, my understanding, or something else?
> 
> I might be wrong, but I think surrogate pairs are expressly forbidden in
> all representations other than UTF16/UCS2.  We definitely forbid them
> when validating UTF-8 strings --- that's per an RFC recommendation.
> It sounds like Python is doing the same.

OK, that's good.  I thought I was missing something.  A minor point is
that in UCS2 each 16bit value is exactly one character and characters
outside the BMP aren't supported, hence the need for UTF-16.

I've failed to keep up with the discussion so I'm not sure where this
conversation has got to!  Is the consensus for 8.4 to enable SQL2003
style U&lit escaped literals if and only if standard_conforming_strings
is set?  This seems easiest for client code as it can use this
exclusively for knowing what to do with backslashes.
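
As a rough sketch of what that buys client code (hypothetical Python, not
psql's actual lexer), finding the end of a plain '...' literal then only
has to treat backslashes specially when stdstr is off:

    def skip_quoted_literal(sql, start, std_conforming_strings):
        # Return the index just past the closing quote of the literal that
        # opens at sql[start] == "'".  Hypothetical client-side scanning,
        # not PostgreSQL's real lexer.
        i = start + 1
        while i < len(sql):
            c = sql[i]
            if not std_conforming_strings and c == "\\":
                i += 2                    # backslash escapes the next char
            elif c == "'":
                if i + 1 < len(sql) and sql[i + 1] == "'":
                    i += 2                # doubled quote stays inside the literal
                else:
                    return i + 1          # closing quote found
            else:
                i += 1
        raise ValueError("unterminated string literal")

    q = r"SELECT 'it''s \' fine', 1"
    print(skip_quoted_literal(q, 7, True))    # 16: backslash is ordinary, quote closes here
    print(skip_quoted_literal(q, 7, False))   # 22: backslash escapes the quote, literal runs on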

--  Sam  http://samason.me.uk/


Re: Unicode string literals versus the world

From
Marko Kreen
Date:
On 4/16/09, Sam Mason <sam@samason.me.uk> wrote:
> On Thu, Apr 16, 2009 at 02:47:20PM +0300, Marko Kreen wrote:
>  > On 4/16/09, Sam Mason <sam@samason.me.uk> wrote:
> > > Microsoft have also gone this way in C#, named code points are not
>  > > supported however.
>  >
>  > And it handles also non-BMP codepoints with \u escape similarly:
>  >
>  >   http://en.csharp-online.net/ECMA-334:_9.4.1_Unicode_escape_sequences
>  >
>  > This makes it even more standard.
>
>
> I fail to see what you're pointing out here; as far as I understand it,
>  \u is for BMP code points and \U extends the range out to 32bit code
>  points.  I can't see anything about non-BMP and \u in the above link,
>  you appear free to write your own surrogate pairs but that seems like an
>  independent issue.

Ok, maybe I glanced too quickly over that page.

I can't find a definitive reference, only hints on several pages:
 \U \Unnnnnnnn Unicode escape sequence for surrogate pairs.

Which hints that you can as well enter the pairs directly: \uxx\uxx.
If I were a language designer, I would not see any reason to disallow it.

And anyway, at least mono seems to support it:

using System;
public class HelloWorld {
    public static void Main() {
        Console.WriteLine("<\uD800\uDF02>\n");
    }
}

It will output a single UTF-8 character.  I think this should settle it.
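
A quick Python 3 cross-check (stock CPython assumed) confirms that the
pair is just the UTF-16 form of U+10302:

    # The C# literal "<\uD800\uDF02>" spells out the UTF-16 form of U+10302.
    print("\U00010302".encode("utf-16-be").hex())              # d800df02
    print(hex(ord(b"\xd8\x00\xdf\x02".decode("utf-16-be"))))   # 0x10302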

>  I'd not realised before that C# is specified to use UTF-16 as its
>  internal encoding.
>
>  > >  This would be following the BitC[2] project, especially if it was more
>  > >  like:
>  > >
>  > >   \{U+xxxx}
>  >
>
> > We already got yet-another-unique-way-of-escaping-unicode with U&.
>  >
>  > Now let's try to support some actual standard also.
>
>
> That comes across *very* negatively; I hope it's just a language issue.
>
>  I read your parent post as soliciting opinions on possible ways to
>  encode Unicode characters in PG's literals.  The U&'lit' was criticised,
>  you posted some suggestions, I followed up with what I hoped to be a
>  useful addition.  It seems useful here to separate "de jure" from "de
>  facto" standards; implementing U&'lit' would be following the de jure
>  standard, anything else would be de facto.
>
>  A survey of existing SQL implementations would seem to be more appropriate
>  as well:
>
>  Oracle: UNISTR(string-literal) and \xxxx
>
>   It looks as though Oracle originally used UCS-2 internally (i.e. BMP
>   only) but more recently Unicode support has been improved to allow
>   other planes.
>
>  MS-SQL Server:
>
>   can't find anything remotely useful; best seems to be to use
>   NCHAR(integer-expression) which looks somewhat unmaintainable.
>
>  DB2: U&string-literal and \xxxxxx
>
>   i.e. it follows the SQL-2003 spec
>
>  FireBird:
>
>   can't find much either; support looks somewhat low on the ground
>
>  MySQL:
>
>   same again, seems to assume query is encoded in UTF-8
>
>  Summary seems to be that either I'm bad at searching or support for
>  Unicode doesn't seem very complete in the database world and people work
>  around it somehow.

The de-facto standard for Postgres is stdstr=off.  Even if it weren't, E''
strings are still better for various things, so it would be good if they
also acquired unicode capabilities.

>  > You did not read my mail carefully enough - Java, and also Python and C#,
>  > already support non-BMP chars with '\u' in exactly the same (UTF-16) way.
>
>
> Again, I think this may be a language issue; if not then more verbose
>  explanations help, maybe something like "sorry, I obviously didn't
>  explain that very well".  You will of course have felt you explained it
>  perfectly well, but everybody enters a discussion with different
>  intuitions and biases, email has a nasty habit of accentuating these
>  differences and compounding them with language problems.
>
>  I'd never heard of UTF-16 surrogate pairs before this discussion and
>  hence didn't realise that it's valid to have a surrogate pair in place
>  of a single code point.  The docs say that <D800 DF02> corresponds to
>  U+10302, Python would appear to follow my intuitions in that:
>
>   ord(u'\uD800\uDF02')
>
>  results in an error instead of giving back 66306, as I'd expect.  Is
>  this a bug in Python, my understanding, or something else?

Python's internal representation is *not* UTF-16, but plain UCS2/UCS4,
that is - plain 16 or 32-bit values.  Seems your python is compiled with
UCS2, not UCS4.  As I understand, in UCS2 mode it simply takes surrogate
values as-is.  From ord() docs:
  If a unicode argument is given and Python was built with UCS2 Unicode,
  then the character’s code point must be in the range [0..65535]
  inclusive; otherwise the string length is two, and a TypeError will
  be raised.

So only in UCS4 mode it detects surrogates and converts them to internal
representation.  (Which in Postgres case would be UTF8.)

Or perhaps it is partially UTF16-aware - e.g. I/O routines do understand
UTF16 but low-level string routines do not:
 print "<%s>" % u'\uD800\uDF02'

seems to handle it properly.
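
The build flavour is easy to check directly (a small sketch; on Python 2,
sys.maxunicode is 0xffff on UCS2 builds and 0x10ffff on UCS4 builds):

    import sys

    # 0xffff   -> narrow (UCS2) build: non-BMP \U escapes become surrogate pairs
    # 0x10ffff -> wide (UCS4) build: every code point is one element of the string
    print(hex(sys.maxunicode))

    if sys.maxunicode > 0xFFFF:
        print(ord(u'\U00010302'))   # 66306 on a wide build (and on Python 3)
    else:
        print(len(u'\U00010302'))   # 2 on a narrow build: stored as a pair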

--
marko


Re: Unicode string literals versus the world

From
Andrew Dunstan
Date:

Tom Lane wrote:
> Sam Mason <sam@samason.me.uk> writes:
>   
>> I'd never heard of UTF-16 surrogate pairs before this discussion and
>> hence didn't realise that it's valid to have a surrogate pair in place
>> of a single code point.  The docs say that <D800 DF02> corresponds to
>> U+10302, Python would appear to follow my intuitions in that:
>>     
>
>   
>>   ord(u'\uD800\uDF02')
>>     
>
>   
>> results in an error instead of giving back 66306, as I'd expect.  Is
>> this a bug in Python, my understanding, or something else?
>>     
>
> I might be wrong, but I think surrogate pairs are expressly forbidden in
> all representations other than UTF16/UCS2.  We definitely forbid them
> when validating UTF-8 strings --- that's per an RFC recommendation.
> It sounds like Python is doing the same.
>
>             
>   

You mustn't encode the surrogate, but it's up to us how we allow people 
to designate a given code point.

Frankly, I think we shouldn't provide for using surrogates at all. I 
would prefer something like \uXXXX for BMP items and \UXXXXXXXX as the 
straight 32bit designation of a higher codepoint.
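
Something along those lines, as a rough Python sketch of the rule (the
escape syntax here is the hypothetical \uXXXX / \UXXXXXXXX proposal, not
an actual patch):

    import re

    # Decode \uXXXX (BMP) and \UXXXXXXXX (full range) escapes, refusing
    # surrogate code points outright.
    ESCAPE = re.compile(r'\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})')

    def decode_escapes(text):
        def repl(m):
            cp = int(m.group(1) or m.group(2), 16)
            if 0xD800 <= cp <= 0xDFFF:
                raise ValueError("surrogate code point U+%04X not allowed" % cp)
            if cp > 0x10FFFF:
                raise ValueError("code point out of range")
            return chr(cp)
        return ESCAPE.sub(repl, text)

    print(decode_escapes(r'BMP: \u68EE, beyond: \U00010302'))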

cheers

andrew


Re: Unicode string literals versus the world

From
Marko Kreen
Date:
On 4/16/09, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Sam Mason <sam@samason.me.uk> writes:
>  > I'd never heard of UTF-16 surrogate pairs before this discussion and
>  > hence didn't realise that it's valid to have a surrogate pair in place
>  > of a single code point.  The docs say that <D800 DF02> corresponds to
>  > U+10302, Python would appear to follow my intuitions in that:
>
>  >   ord(u'\uD800\uDF02')
>
>  > results in an error instead of giving back 66306, as I'd expect.  Is
>  > this a bug in Python, my understanding, or something else?
>
>
> I might be wrong, but I think surrogate pairs are expressly forbidden in
>  all representations other than UTF16/UCS2.  We definitely forbid them
>  when validating UTF-8 strings --- that's per an RFC recommendation.
>  It sounds like Python is doing the same.

The point here is that Python/Java/C# allow them for escaping non-BMP
unicode values, irrespective of their internal encoding.

-- 
marko


Re: Unicode string literals versus the world

From
Sam Mason
Date:
On Thu, Apr 16, 2009 at 06:34:06PM +0300, Marko Kreen wrote:
> Which hints that you can as well enter the pairs directly: \uxx\uxx.
> If I were a language designer, I would not see any reason to disallow it.
> 
> And anyway, at least mono seems to support it:
> 
> using System;
> public class HelloWorld {
>     public static void Main() {
>         Console.WriteLine("<\uD800\uDF02>\n");
>     }
> }
> 
> It will output single UTF8 character.  I think this should settle it.

I don't have any .net stuff installed so can't test; but C# is defined
to use UTF-16 as its internal representation so it would make sense if
the above gets treated as a single character internally.  However, if it
used any other encoding the above should be treated as an error.

> The de-facto standard for Postgres is stdstr=off.  Even if it weren't, E''
> strings are still better for various things, so it would be good if they
> also acquired unicode capabilities.

OK, this seems independent of the U&'lit' discussion that started the
thread.  Note that PG already supports UTF8; if you want the character
I've been using in my examples up-thread, you can do:
 SELECT E'\xF0\x90\x8C\x82';

I have a feeling that this is predicated on the server_encoding being
set to "utf8" and this can only be done at database creation time.
Another alternative would be to use the convert_from function, i.e:
 SELECT convert_from(E'\xF0\x90\x8C\x82', 'UTF8');

Never had to do this though, so there may be better options available.
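
Those bytes are just the UTF-8 encoding of U+10302, which is easy to
double-check (a Python 3 sketch):

    # E'\xF0\x90\x8C\x82' above carries the UTF-8 encoding of U+10302.
    print("\U00010302".encode("utf-8"))                         # b'\xf0\x90\x8c\x82'
    print(b"\xf0\x90\x8c\x82".decode("utf-8") == "\U00010302")  # True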

> Python's internal representation is *not* UTF-16, but plain UCS2/UCS4,
> that is - plain 16 or 32-bit values.  Seems your python is compiled with
> UCS2, not UCS4.

Cool, I didn't know that.  I believe mine is UCS4 as I can do:
 ord(u'\U00010302')

and I get 66306 back rather than an error.

> As I understand, in UCS2 mode it simply takes surrogate
> values as-is.

UCS2 doesn't have surrogate pairs, or at least I believe it's considered
a bug if you don't get an error when you present it with one.

> From ord() docs:
> 
>   If a unicode argument is given and Python was built with UCS2 Unicode,
>   then the character’s code point must be in the range [0..65535]
>   inclusive; otherwise the string length is two, and a TypeError will
>   be raised.
> 
> So only in UCS4 mode it detects surrogates and converts them to internal
> representation.  (Which in Postgres case would be UTF8.)

I think you mean UTF-16 instead of UCS4; but otherwise, yes.

> Or perhaps it is partially UTF16-aware - e.g. I/O routines do understand
> UTF16 but low-level string routines do not:
> 
>   print "<%s>" % u'\uD800\uDF02'
> 
> seems to handle it properly.

Yes, I get this as well.  It's all a bit weird, which is why I was
asking if "this a bug in Python, my understanding, or something else".

When I do:
 python <<EOF | hexdump -C
print u"\uD800\uDF02"
EOF

to see what it's doing I get an error which I'm not expecting, hence I
think it's probably my understanding.

--  Sam  http://samason.me.uk/


Re: Unicode string literals versus the world

From
Tom Lane
Date:
Sam Mason <sam@samason.me.uk> writes:
> I've failed to keep up with the discussion so I'm not sure where this
> conversation has got to!  Is the consensus for 8.4 to enable SQL2003
> style U&lit escaped literals if and only if standard_conforming_strings
> is set?

That was Peter's proposal, and no one's shot a hole in it yet ...

I think the discussion about whether/how to add a Unicode extension to
E''-style literals is 8.5 material.  We are in beta so now is not
the time to add new features, especially ones that weren't even on the
TODO list before.
        regards, tom lane


Re: Unicode string literals versus the world

From
Sam Mason
Date:
On Thu, Apr 16, 2009 at 12:08:37PM -0400, Tom Lane wrote:
> Sam Mason <sam@samason.me.uk> writes:
> > I've failed to keep up with the discussion so I'm not sure where this
> > conversation has got to!  Is the consensus for 8.4 to enable SQL2003
> > style U&lit escaped literals if and only if standard_conforming_strings
> > is set?
> 
> That was Peter's proposal, and no one's shot a hole in it yet ...

Just noticed that the spec only supports four hex digits; this would
imply that support for anything outside the BMP would have to be done
by encoding the character as a surrogate pair.  If the code doesn't
do this already (the original patch didn't seem to) these should be
normalised back to a single character in a similar manner to Marko's
recent patch[1].
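
Roughly what that normalisation amounts to, as a Python sketch over a
list of 16-bit escape values (an illustration only, not the server-side
patch itself):

    def normalise(units):
        # Turn a sequence of 16-bit \XXXX escape values into code points,
        # pairing up surrogates along the way.
        out, i = [], 0
        while i < len(units):
            u = units[i]
            if 0xD800 <= u <= 0xDBFF and i + 1 < len(units) \
                    and 0xDC00 <= units[i + 1] <= 0xDFFF:
                out.append(0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00))
                i += 2
            elif 0xD800 <= u <= 0xDFFF:
                raise ValueError("lone surrogate U+%04X" % u)
            else:
                out.append(u)
                i += 1
        return out

    print([hex(cp) for cp in normalise([0xD800, 0xDF02, 0x68EE])])  # ['0x10302', '0x68ee']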

> I think the discussion about whether/how to add a Unicode extension to
> E''-style literals is 8.5 material.  We are in beta so now is not
> the time to add new features, especially ones that weren't even on the
> TODO list before.

OK, sounds reasonable.

--  Sam  http://samason.me.uk/
[1] http://archives.postgresql.org//pgsql-hackers/2009-04/msg00904.php


Re: Unicode string literals versus the world

From
Tom Lane
Date:
Sam Mason <sam@samason.me.uk> writes:
> Just noticed that the spec only supports four hex digits;

Better read it again.
        regards, tom lane


Re: Unicode string literals versus the world

From
Sam Mason
Date:
On Fri, Apr 17, 2009 at 10:15:57AM -0400, Tom Lane wrote:
> Sam Mason <sam@samason.me.uk> writes:
> > Just noticed that the spec only supports four hex digits;
> 
> Better read it again.

You're right of course.  My ability to read patches seems not to be very
good.

--  Sam  http://samason.me.uk/


Re: Unicode string literals versus the world

From
Peter Eisentraut
Date:
On Tuesday 14 April 2009 21:34:51 Peter Eisentraut wrote:
> On Tuesday 14 April 2009 17:13:00 Marko Kreen wrote:
> > If the parsing does not happen in 2 passes and it does not take account
> > of stdstr setting then the  default breakage would be:
> >
> >    stdstr=off, U&' \' UESCAPE '!'.
>
> I think we can handle that and the cases Tom presents by erroring out when
> the U& syntax is used with stdstr off.

Proposed patch for that attached.

Re: Unicode string literals versus the world

From
Tom Lane
Date:
Peter Eisentraut <peter_e@gmx.net> writes:
> On Tuesday 14 April 2009 21:34:51 Peter Eisentraut wrote:
>> I think we can handle that and the cases Tom presents by erroring out when
>> the U& syntax is used with stdstr off.

> Proposed patch for that attached.

I have not been able to think of any security hole in that proposal,
so this patch seems acceptable to me.  I wonder though whether any
corresponding change is needed in psql's lexer, and if so how should
it react exactly to the rejection case.
        regards, tom lane


Re: Unicode string literals versus the world

From
"Hiroshi Saito"
Date:
Hi.

A quick test of the great patch!

== SCRIPT ==
set CLIENT_ENCODING to 'UTF-8';
DROP TABLE ucheck CASCADE;
CREATE TABLE ucheck (key VARCHAR(10) PRIMARY KEY, data NCHAR(50));
set STANDARD_CONFORMING_STRINGS to on;
INSERT INTO ucheck VALUES('ucheck1',u&'\68ee\9dd7\5916');
SELECT * FROM ucheck;
set CLIENT_ENCODING to 'SHIFTJIS2004';
SELECT * FROM ucheck;
set STANDARD_CONFORMING_STRINGS to off;
INSERT INTO ucheck VALUES('ucheck2',u&'\68ee\9dd7\5916');
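
For reference, the three escaped code points above decode like this (a
quick Python 3 sketch; the middle character is the one the Windows
console below cannot display):

    # u&'\68ee\9dd7\5916' escapes the code points U+68EE U+9DD7 U+5916.
    chars = "".join(chr(int(h, 16)) for h in ("68ee", "9dd7", "5916"))
    print(chars)                         # 森鷗外
    print(chars.encode("utf-8").hex())   # e6a3aee9b797e5a496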

=== As for psql ===
Note that the console of Japanese Windows XP cannot display some of these characters.

C:\work>psql
psql (8.4beta1)
"help" でヘルプを表示します.

HIROSHI=# set CLIENT_ENCODING to 'UTF-8';
SET
HIROSHI=# DROP TABLE ucheck CASCADE;
DROP TABLE
HIROSHI=# CREATE TABLE ucheck (key VARCHAR(10) PRIMARY KEY, data NCHAR(50));
NOTICE:  CREATE TABLE / PRIMARY KEY will create implicit index "ucheck_pkey" for table "ucheck"
CREATE TABLE
HIROSHI=# set STANDARD_CONFORMING_STRINGS to on;
SET
HIROSHI=# INSERT INTO ucheck VALUES('ucheck',u&'\68ee\9dd7\5916');
INSERT 0 1
HIROSHI=# SELECT * FROM ucheck;
  key   |                         data
--------+-------------------------------------------------------
 ucheck | 譽ョ鮃怜、・
(1 行)


HIROSHI=# set CLIENT_ENCODING to 'SHIFTJIS2004';
SET
HIROSHI=# SELECT * FROM ucheck;
  key   |                         data
--------+-------------------------------------------------------
 ucheck | 森・外
(1 行)

HIROSHI=# set STANDARD_CONFORMING_STRINGS to off;
SET
HIROSHI=# INSERT INTO ucheck VALUES('ucheck2',u&'\68ee\9dd7\5916');
ERROR:  unsafe use of string constant with Unicode escapes
DETAIL:  String constants with Unicode escapes cannot be used when standard_conforming_strings is off.

=== As for pgAdminIII ===
pgAdmin displays it very well!! :-)
http://winpg.jp/~saito/pg_work/ucheck/jisx0213.png

P.S)
This test obtains the same result with MimerDB.

Regards,
Hiroshi Saito

----- Original Message ----- 
From: "Tom Lane" <tgl@sss.pgh.pa.us>


> Peter Eisentraut <peter_e@gmx.net> writes:
>> On Tuesday 14 April 2009 21:34:51 Peter Eisentraut wrote:
>>> I think we can handle that and the cases Tom presents by erroring out when
>>> the U& syntax is used with stdstr off.
> 
>> Proposed patch for that attached.
> 
> I have not been able to think of any security hole in that proposal,
> so this patch seems acceptable to me.  I wonder though whether any
> corresponding change is needed in psql's lexer, and if so how should
> it react exactly to the rejection case.
> 
> regards, tom lane
> 
> -- 
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers


Re: Unicode string literals versus the world

From
Peter Eisentraut
Date:
On Tuesday 05 May 2009 03:01:05 Tom Lane wrote:
> Peter Eisentraut <peter_e@gmx.net> writes:
> > On Tuesday 14 April 2009 21:34:51 Peter Eisentraut wrote:
> >> I think we can handle that and the cases Tom presents by erroring out
> >> when the U& syntax is used with stdstr off.
> >
> > Proposed patch for that attached.
>
> I have not been able to think of any security hole in that proposal,
> so this patch seems acceptable to me.  I wonder though whether any
> corresponding change is needed in psql's lexer, and if so how should
> it react exactly to the rejection case.

I had thought about that as well, but concluded that no additional change is 
necessary.

Note that the *corresponding* change would be psql complaining "I don't like 
what you entered", versus the just-committed behavior that psql is indifferent 
and the server complains "I don't like what you sent me".

In any case, the point of the change is to prevent confusion in client 
programs, so if we had to patch psql to make sense, then the change would have 
been pointless in the first place.


Re: Unicode string literals versus the world

From
Bruce Momjian
Date:
Peter Eisentraut wrote:
> On Tuesday 05 May 2009 03:01:05 Tom Lane wrote:
> > Peter Eisentraut <peter_e@gmx.net> writes:
> > > On Tuesday 14 April 2009 21:34:51 Peter Eisentraut wrote:
> > >> I think we can handle that and the cases Tom presents by erroring out
> > >> when the U& syntax is used with stdstr off.
> > >
> > > Proposed patch for that attached.
> >
> > I have not been able to think of any security hole in that proposal,
> > so this patch seems acceptable to me.  I wonder though whether any
> > corresponding change is needed in psql's lexer, and if so how should
> > it react exactly to the rejection case.
> 
> I had thought about that as well, but concluded that no additional change is 
> necessary.
> 
> Note that the *corresponding* change would be psql complaining "I don't like 
> what you entered", versus the just-committed behavior that psql is indifferent 
> and the server complains "I don't like what you sent me".
> 
> In any case, the point of the change is to prevent confusion in client 
> programs, so if we had to patch psql to make sense, then the change would have 
> been pointless in the first place.

I assume there is no TODO here.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +


Re: Unicode string literals versus the world

From
Peter Eisentraut
Date:
On Friday 29 May 2009 06:31:23 Bruce Momjian wrote:
> Peter Eisentraut wrote:
> > On Tuesday 05 May 2009 03:01:05 Tom Lane wrote:
> > > Peter Eisentraut <peter_e@gmx.net> writes:
> > > > On Tuesday 14 April 2009 21:34:51 Peter Eisentraut wrote:
> > > >> I think we can handle that and the cases Tom presents by erroring
> > > >> out when the U& syntax is used with stdstr off.
> > > >
> > > > Proposed patch for that attached.
> > >
> > > I have not been able to think of any security hole in that proposal,
> > > so this patch seems acceptable to me.  I wonder though whether any
> > > corresponding change is needed in psql's lexer, and if so how should
> > > it react exactly to the rejection case.
> >
> > I had thought about that as well, but concluded that no additional change
> > is necessary.
> >
> > Note that the *corresponding* change would be psql complaining "I don't
> > like what you entered", versus the just-committed behavior that psql is
> > indifferent and the server complains "I don't like what you sent me".
> >
> > In any case, the point of the change is to prevent confusion in client
> > programs, so if we had to patch psql to make sense, then the change would
> > have been pointless in the first place.
>
> I assume there is no TODO here.

No, it should be fine.