Thread: Escaped backslash in SQL constant

Escaped backslash in SQL constant

From
"CN"
Date:
Hello!

The database cluster is initialized to use UNICODE. The client encoding
is set to BIG5. The middleware escapes the backslash in the following
string before writing to TEXT/VARCHAR column in server:

a5 5c af e0

This is a string comprises Big5 characters each of 2 octets big. Note
that the second octets, 5c, of the first Big5 character clashes
backslash '\'. Thus, this string is escaped to

a5 5c 31 33 34 af e0

and is inserted to backend something like this:

INSERT INTO x VALUES ('y\134na');

Where octets 'y' and character "na" are unprintable characters here in
terms of ASCII.

Problem is that this string is stored exactly the same as the input: "a5
5c 31 33 34 af e0" instead of "a5 5c af e0". The SELECT outputs the
escaped string (7 octets) instead of the unescaped string (4 octets),
too.

However, the server manages the following string differently:

INSERT INTO x VALUES ('A\134B');

and

SELECT * FROM x;

outputs

A\B

Its size stored in the column is 3 octets. This second case is exactly
what I need. I guess strings like "C:\dir" is properly processed too
though I did not test it.

Why server treats the first string in this undesired way?

Regards,

CN

--
http://www.fastmail.fm - Email service worth paying for. Try it for free


Re: Escaped backslash in SQL constant

From
Tom Lane
Date:
"CN" <cnliou9@fastmail.fm> writes:
> The database cluster is initialized to use UNICODE. The client encoding
> is set to BIG5. The middleware escapes the backslash in the following
> string before writing to TEXT/VARCHAR column in server:

> a5 5c af e0

Seems to me that you need to fix your broken middleware --- it has no
business doing that.

> This is a string comprises Big5 characters each of 2 octets big. Note
> that the second octets, 5c, of the first Big5 character clashes
> backslash '\'. Thus, this string is escaped to

> a5 5c 31 33 34 af e0

Now you have a 2-octet Big5 character, followed by 3 1-octet ASCII
digits, followed by a 2-octet Big5 character.  All the subsequent
processing is doing what it should with this, AFAICS.

            regards, tom lane

Re: Escaped backslash in SQL constant

From
"CN"
Date:
Many thanks for the lightening fast answer!

> > The database cluster is initialized to use UNICODE. The client encoding
> > is set to BIG5. The middleware escapes the backslash in the following
> > string before writing to TEXT/VARCHAR column in server:
>
> > a5 5c af e0
>
> Seems to me that you need to fix your broken middleware --- it has no
> business doing that.

Are you suggesting that the middleware should not escape backslashes in
the first place?

The doc in
http://www.postgresql.org/docs/8.1/static/sql-syntax.html#SQL-SYNTAX-CONSTANTS
reads:

"Any other character following a backslash is taken literally. Thus, to
include a backslash in a string constant, write two backslashes."

My understanding on this statement is that escaping \ to \\ blindly for
any string and any encoding hurts nothing. Also, I thought that escaping
"\" to "\134" for TEXT column should be always equivalent to escaping it
to "\\", at least with this current (8.1) version.

> > This is a string comprises Big5 characters each of 2 octets big. Note
> > that the second octets, 5c, of the first Big5 character clashes
> > backslash '\'. Thus, this string is escaped to
>
> > a5 5c 31 33 34 af e0
>
> Now you have a 2-octet Big5 character, followed by 3 1-octet ASCII
> digits, followed by a 2-octet Big5 character.  All the subsequent
> processing is doing what it should with this, AFAICS.

What I still don't realize is that as mentioned in my first posting,
"A\134B" gets stored in TEXT with value "A\B" but the Big5 string
escaped in the same manner yeids different result - the latter string is
stored in escaped form and becomes longer than intended. I thought two
2-octet Big5 characters would be stored in backend.

Regards,

CN

--
http://www.fastmail.fm - IMAP accessible web-mail


Re: Escaped backslash in SQL constant

From
Tom Lane
Date:
"CN" <cnliou9@fastmail.fm> writes:
>> Seems to me that you need to fix your broken middleware --- it has no
>> business doing that.

> Are you suggesting that the middleware should not escape backslashes in
> the first place?

No, I'm suggesting that it shouldn't be let loose on Big5 data when it
evidently hasn't a clue about that encoding.  The byte in question
*is not* a backslash, it's not even an independent character; and so
changing it on the assumption that it is logically a backslash simply
breaks the data.

Your quickest route to a solution may be to avoid Big5 in favor of
an encoding that is ASCII-safe, such as UTF8.  You can feed that through
code that only understands ASCII with much less risk than an encoding
where second and later bytes might look like ASCII.

            regards, tom lane

Re: Escaped backslash in SQL constant

From
"CN"
Date:
> No, I'm suggesting that it shouldn't be let loose on Big5 data when it
> evidently hasn't a clue about that encoding.  The byte in question
> *is not* a backslash, it's not even an independent character; and so
> changing it on the assumption that it is logically a backslash simply
> breaks the data.

Would you please enlighten me the behavior of the backend - why

SET CLIENT_ENCODING TO Big5;
INSERT INTO y VALUES ('A\134B');

stores

A\B

while

INSERT INTO y VALUES ('y\134na');
--"y\" and "na" are two Big5 characters.

stores

y\134na

instead of

y\na

> Your quickest route to a solution may be to avoid Big5 in favor of
> an encoding that is ASCII-safe, such as UTF8.  You can feed that through
> code that only understands ASCII with much less risk than an encoding
> where second and later bytes might look like ASCII.

Are you suggesting me to implement the middleware that will translate
Big5 input to UTF8 and then escape the latter before sending it to
PostgreSQL?

SET CLIENT_ENCODING TO UTF8;
[BIG5 string from user] --> [middleware] --> [UTF8] --> [escaped UTF8]
--> PostgreSQL (initdb with -E UNICODE)

Best regards,

CN

--
http://www.fastmail.fm - Does exactly what it says on the tin


Re: Escaped backslash in SQL constant

From
Tom Lane
Date:
"CN" <cnliou9@fastmail.fm> writes:
> INSERT INTO y VALUES ('y\134na');
> --"y\" and "na" are two Big5 characters.

I'm not sure how to explain it any more clearly: the backslash in this
example is not a backslash.  It's a byte within a multibyte character,
which *entirely coincidentally* happens to have the same numeric value
as an ASCII backslash.  But it isn't a backslash.  And it won't be
processed as a backslash by any Big5-aware code.

Code that does not understand about multibyte characters is simply
unsafe to apply to data that is in Big5.  You need to fix that
middleware to understand Big5 encoding; or if that seems impractical,
switch to using another encoding for the data the middleware sees.

            regards, tom lane

Re: Escaped backslash in SQL constant

From
"CN"
Date:
Thank you again for the clarification!

> I'm not sure how to explain it any more clearly: the backslash in this
> example is not a backslash.  It's a byte within a multibyte character,
> which *entirely coincidentally* happens to have the same numeric value
> as an ASCII backslash.  But it isn't a backslash.  And it won't be
> processed as a backslash by any Big5-aware code.

If I understand this explanation correctly, then you have given me the
answer I needed!
Please correct me if my understanding that follows is wrong again:

PostgreSQL is Big5-aware code. It does not blindly unescape every
backlash it encounters in SQL literals. Instead, it sees backslash as
part of some Big5 characters as they are supposed to be when client
encoding is set to Big5 (SET CLIENT_ENCODING TO BIG5). In other words,
PostgreSQL sees the backslash in "C:\134" differently from that being
part of multi-byte characters depending on which client encoding is
used.

Regards,

CN

--
http://www.fastmail.fm - And now for something completely different�