Thread: Escaped backslash in SQL constant
Hello! The database cluster is initialized to use UNICODE. The client encoding is set to BIG5. The middleware escapes the backslash in the following string before writing to TEXT/VARCHAR column in server: a5 5c af e0 This is a string comprises Big5 characters each of 2 octets big. Note that the second octets, 5c, of the first Big5 character clashes backslash '\'. Thus, this string is escaped to a5 5c 31 33 34 af e0 and is inserted to backend something like this: INSERT INTO x VALUES ('y\134na'); Where octets 'y' and character "na" are unprintable characters here in terms of ASCII. Problem is that this string is stored exactly the same as the input: "a5 5c 31 33 34 af e0" instead of "a5 5c af e0". The SELECT outputs the escaped string (7 octets) instead of the unescaped string (4 octets), too. However, the server manages the following string differently: INSERT INTO x VALUES ('A\134B'); and SELECT * FROM x; outputs A\B Its size stored in the column is 3 octets. This second case is exactly what I need. I guess strings like "C:\dir" is properly processed too though I did not test it. Why server treats the first string in this undesired way? Regards, CN -- http://www.fastmail.fm - Email service worth paying for. Try it for free
"CN" <cnliou9@fastmail.fm> writes: > The database cluster is initialized to use UNICODE. The client encoding > is set to BIG5. The middleware escapes the backslash in the following > string before writing to TEXT/VARCHAR column in server: > a5 5c af e0 Seems to me that you need to fix your broken middleware --- it has no business doing that. > This is a string comprises Big5 characters each of 2 octets big. Note > that the second octets, 5c, of the first Big5 character clashes > backslash '\'. Thus, this string is escaped to > a5 5c 31 33 34 af e0 Now you have a 2-octet Big5 character, followed by 3 1-octet ASCII digits, followed by a 2-octet Big5 character. All the subsequent processing is doing what it should with this, AFAICS. regards, tom lane
Many thanks for the lightening fast answer! > > The database cluster is initialized to use UNICODE. The client encoding > > is set to BIG5. The middleware escapes the backslash in the following > > string before writing to TEXT/VARCHAR column in server: > > > a5 5c af e0 > > Seems to me that you need to fix your broken middleware --- it has no > business doing that. Are you suggesting that the middleware should not escape backslashes in the first place? The doc in http://www.postgresql.org/docs/8.1/static/sql-syntax.html#SQL-SYNTAX-CONSTANTS reads: "Any other character following a backslash is taken literally. Thus, to include a backslash in a string constant, write two backslashes." My understanding on this statement is that escaping \ to \\ blindly for any string and any encoding hurts nothing. Also, I thought that escaping "\" to "\134" for TEXT column should be always equivalent to escaping it to "\\", at least with this current (8.1) version. > > This is a string comprises Big5 characters each of 2 octets big. Note > > that the second octets, 5c, of the first Big5 character clashes > > backslash '\'. Thus, this string is escaped to > > > a5 5c 31 33 34 af e0 > > Now you have a 2-octet Big5 character, followed by 3 1-octet ASCII > digits, followed by a 2-octet Big5 character. All the subsequent > processing is doing what it should with this, AFAICS. What I still don't realize is that as mentioned in my first posting, "A\134B" gets stored in TEXT with value "A\B" but the Big5 string escaped in the same manner yeids different result - the latter string is stored in escaped form and becomes longer than intended. I thought two 2-octet Big5 characters would be stored in backend. Regards, CN -- http://www.fastmail.fm - IMAP accessible web-mail
"CN" <cnliou9@fastmail.fm> writes: >> Seems to me that you need to fix your broken middleware --- it has no >> business doing that. > Are you suggesting that the middleware should not escape backslashes in > the first place? No, I'm suggesting that it shouldn't be let loose on Big5 data when it evidently hasn't a clue about that encoding. The byte in question *is not* a backslash, it's not even an independent character; and so changing it on the assumption that it is logically a backslash simply breaks the data. Your quickest route to a solution may be to avoid Big5 in favor of an encoding that is ASCII-safe, such as UTF8. You can feed that through code that only understands ASCII with much less risk than an encoding where second and later bytes might look like ASCII. regards, tom lane
> No, I'm suggesting that it shouldn't be let loose on Big5 data when it > evidently hasn't a clue about that encoding. The byte in question > *is not* a backslash, it's not even an independent character; and so > changing it on the assumption that it is logically a backslash simply > breaks the data. Would you please enlighten me the behavior of the backend - why SET CLIENT_ENCODING TO Big5; INSERT INTO y VALUES ('A\134B'); stores A\B while INSERT INTO y VALUES ('y\134na'); --"y\" and "na" are two Big5 characters. stores y\134na instead of y\na > Your quickest route to a solution may be to avoid Big5 in favor of > an encoding that is ASCII-safe, such as UTF8. You can feed that through > code that only understands ASCII with much less risk than an encoding > where second and later bytes might look like ASCII. Are you suggesting me to implement the middleware that will translate Big5 input to UTF8 and then escape the latter before sending it to PostgreSQL? SET CLIENT_ENCODING TO UTF8; [BIG5 string from user] --> [middleware] --> [UTF8] --> [escaped UTF8] --> PostgreSQL (initdb with -E UNICODE) Best regards, CN -- http://www.fastmail.fm - Does exactly what it says on the tin
"CN" <cnliou9@fastmail.fm> writes: > INSERT INTO y VALUES ('y\134na'); > --"y\" and "na" are two Big5 characters. I'm not sure how to explain it any more clearly: the backslash in this example is not a backslash. It's a byte within a multibyte character, which *entirely coincidentally* happens to have the same numeric value as an ASCII backslash. But it isn't a backslash. And it won't be processed as a backslash by any Big5-aware code. Code that does not understand about multibyte characters is simply unsafe to apply to data that is in Big5. You need to fix that middleware to understand Big5 encoding; or if that seems impractical, switch to using another encoding for the data the middleware sees. regards, tom lane
Thank you again for the clarification! > I'm not sure how to explain it any more clearly: the backslash in this > example is not a backslash. It's a byte within a multibyte character, > which *entirely coincidentally* happens to have the same numeric value > as an ASCII backslash. But it isn't a backslash. And it won't be > processed as a backslash by any Big5-aware code. If I understand this explanation correctly, then you have given me the answer I needed! Please correct me if my understanding that follows is wrong again: PostgreSQL is Big5-aware code. It does not blindly unescape every backlash it encounters in SQL literals. Instead, it sees backslash as part of some Big5 characters as they are supposed to be when client encoding is set to Big5 (SET CLIENT_ENCODING TO BIG5). In other words, PostgreSQL sees the backslash in "C:\134" differently from that being part of multi-byte characters depending on which client encoding is used. Regards, CN -- http://www.fastmail.fm - And now for something completely different�