Thread: BUG #16236: Invalid escape encoding
The following bug has been logged on the website: Bug reference: 16236 Logged by: Stéphane Campinas Email address: stephane.campinas@gmail.com PostgreSQL version: 12.1 Operating system: linux 5.4.14 Description: Hi, From the documentation [0] about the encode function, the "escape" format should "convert zero bytes and high-bit-set bytes to octal sequences (\nnn) and doubles backslashes." However, executing "select encode(E'aaa\bccc', 'escape');" outputs "aaa\x08ccc", although according to the documentation I should get "aaa\010ccc". This was found in [1] while mimicking the output of encode, where you can get more details. Cheers, [0] https://www.postgresql.org/docs/12/functions-string.html [1] https://github.com/crate/crate/pull/9540
PG Bug reporting form <noreply@postgresql.org> writes: > From the documentation [0] about the encode function, the "escape" format > should "convert zero bytes and high-bit-set bytes to octal sequences (\nnn) > and doubles backslashes." > However, executing "select encode(E'aaa\bccc', 'escape');" outputs > "aaa\x08ccc", although according to the documentation I should get > "aaa\010ccc". No, I don't think so. The \b gives rise to a byte with hex value 08 (that is, control-H or backspace) in the E'' literal, which converts to the same byte value in the bytea value that gets passed to encode(). Since that's not either a zero or a high-bit-set value, encode() just repeats it literally in the text result, and you end up with the same thing as if you'd just done =# select E'aaa\bccc'::text; text ------------ aaa\x08ccc (1 row) I think it must be psql itself that's choosing to represent the backspace as \x08, because nothing in the backend does that. (pokes around ... yeah, it's pg_wcsformat() that's doing it) You could certainly make an argument that encode() ought to backslashify all ASCII control characters, not only \0. But it's behaving as documented, AFAICS. regards, tom lane
Thanks Tom for the reply!
I read once more the doc and now I better understand the "high-bit-set
value" part ;o)
myDatabaseName=# select encode('\x00017F80', 'escape');
encode
------------------
\000\x01\x7F\200
If I understand correctly, with the input "\x00017F80", I get the
outputted value above because:
- "00" is converted to "\000"
- "01" and "7F" get converted to "\x01" and "\x7F" respectively as they
are not 0 or a high-bit-set value
- "80" is converted to "\200" since it is a high-bit-set value
I remember getting confused by the fact I got hexadecimal values in
output and I didn't really get the "high-bit-set" part of the doc.
Do you know why there is this distinction between high-bit-set values
and other non-printable characters ?
Also, I still have 2 more questions.
First, the following is strange: I cannot decode what the encode method
returned
myDatabaseName=# select encode('\x00017F80', 'escape');
encode
------------------
\000\x01\x7F\200
(1 row)
myDatabaseName=# select decode('\000\x01\x7F\200', 'escape');
ERROR: invalid input syntax for type bytea
Second, as I was poking around the code, I found out about the
"bytea_output". If I set it to "escape", I still get hexadecimals. Is
that expected ?
myDatabaseName=# set bytea_output to escape;
SET
myDatabaseName=# select encode('\x00017F80', 'escape');
encode
------------------
\000\x01\x7F\200
(1 row)
Cheers,
On Mon, Jan 27, 2020 at 06:05:45PM -0500, Tom Lane wrote:
> PG Bug reporting form <noreply@postgresql.org> writes:
> > From the documentation [0] about the encode function, the "escape" format
> > should "convert zero bytes and high-bit-set bytes to octal sequences (\nnn)
> > and doubles backslashes."
> > However, executing "select encode(E'aaa\bccc', 'escape');" outputs
> > "aaa\x08ccc", although according to the documentation I should get
> > "aaa\010ccc".
>
> No, I don't think so. The \b gives rise to a byte with hex value 08
> (that is, control-H or backspace) in the E'' literal, which converts
> to the same byte value in the bytea value that gets passed to
> encode(). Since that's not either a zero or a high-bit-set value,
> encode() just repeats it literally in the text result, and you end
> up with the same thing as if you'd just done
>
> =# select E'aaa\bccc'::text;
> text
> ------------
> aaa\x08ccc
> (1 row)
>
> I think it must be psql itself that's choosing to represent the
> backspace as \x08, because nothing in the backend does that.
> (pokes around ... yeah, it's pg_wcsformat() that's doing it)
>
> You could certainly make an argument that encode() ought to
> backslashify all ASCII control characters, not only \0. But
> it's behaving as documented, AFAICS.
>
> regards, tom lane
--
Campinas Stéphane
I read once more the doc and now I better understand the "high-bit-set
value" part ;o)
myDatabaseName=# select encode('\x00017F80', 'escape');
encode
------------------
\000\x01\x7F\200
If I understand correctly, with the input "\x00017F80", I get the
outputted value above because:
- "00" is converted to "\000"
- "01" and "7F" get converted to "\x01" and "\x7F" respectively as they
are not 0 or a high-bit-set value
- "80" is converted to "\200" since it is a high-bit-set value
I remember getting confused by the fact I got hexadecimal values in
output and I didn't really get the "high-bit-set" part of the doc.
Do you know why there is this distinction between high-bit-set values
and other non-printable characters ?
Also, I still have 2 more questions.
First, the following is strange: I cannot decode what the encode method
returned
myDatabaseName=# select encode('\x00017F80', 'escape');
encode
------------------
\000\x01\x7F\200
(1 row)
myDatabaseName=# select decode('\000\x01\x7F\200', 'escape');
ERROR: invalid input syntax for type bytea
Second, as I was poking around the code, I found out about the
"bytea_output". If I set it to "escape", I still get hexadecimals. Is
that expected ?
myDatabaseName=# set bytea_output to escape;
SET
myDatabaseName=# select encode('\x00017F80', 'escape');
encode
------------------
\000\x01\x7F\200
(1 row)
Cheers,
On Mon, Jan 27, 2020 at 06:05:45PM -0500, Tom Lane wrote:
> PG Bug reporting form <noreply@postgresql.org> writes:
> > From the documentation [0] about the encode function, the "escape" format
> > should "convert zero bytes and high-bit-set bytes to octal sequences (\nnn)
> > and doubles backslashes."
> > However, executing "select encode(E'aaa\bccc', 'escape');" outputs
> > "aaa\x08ccc", although according to the documentation I should get
> > "aaa\010ccc".
>
> No, I don't think so. The \b gives rise to a byte with hex value 08
> (that is, control-H or backspace) in the E'' literal, which converts
> to the same byte value in the bytea value that gets passed to
> encode(). Since that's not either a zero or a high-bit-set value,
> encode() just repeats it literally in the text result, and you end
> up with the same thing as if you'd just done
>
> =# select E'aaa\bccc'::text;
> text
> ------------
> aaa\x08ccc
> (1 row)
>
> I think it must be psql itself that's choosing to represent the
> backspace as \x08, because nothing in the backend does that.
> (pokes around ... yeah, it's pg_wcsformat() that's doing it)
>
> You could certainly make an argument that encode() ought to
> backslashify all ASCII control characters, not only \0. But
> it's behaving as documented, AFAICS.
>
> regards, tom lane
--
Campinas Stéphane
--
Campinas Stéphane
Attachment
[ please keep the list cc'd ] =?utf-8?B?U3TDqXBoYW5l?= Campinas <stephane.campinas@gmail.com> writes: > myDatabaseName=# select encode('\x00017F80', 'escape'); > encode > ------------------ > \000\x01\x7F\200 > If I understand correctly, with the input "\x00017F80", I get the > outputted value above because: > - "00" is converted to "\000" > - "01" and "7F" get converted to "\x01" and "\x7F" respectively as they > are not 0 or a high-bit-set value > - "80" is converted to "\200" since it is a high-bit-set value The point here is that the encode function is only doing the first and last of those things. It lets the 01 and 7F bytes through as-is, because the text data type can store and transport those just fine. It's psql's table-printing code that is deciding that those bytes are nonprintable and then choosing to render them in the \x01 style. (The large distance between those bits of code helps to explain the inconsistency of style.) > Do you know why there is this distinction between high-bit-set values > and other non-printable characters ? Probably, whoever wrote the encode-as-escape code didn't see a need to escape anything that type text could store without it. That code's old enough that it might predate psql's decision to render control characters this way, too. (Type text won't store zero bytes, and it will only accept high-bit-set bytes if they form part of a validly encoded character, which limits the allowed sequences if the database encoding is, say, UTF8. So those cases *have* to be escaped in order to turn any valid bytea into a valid text object.) There's certainly an argument to be made that it'd be more friendly for encode() to escape these other byte values as well. But the code is operating as designed. > First, the following is strange: I cannot decode what the encode method > returned > myDatabaseName=# select encode('\x00017F80', 'escape'); > encode > ------------------ > \000\x01\x7F\200 > (1 row) > myDatabaseName=# select decode('\000\x01\x7F\200', 'escape'); > ERROR: invalid input syntax for type bytea That's because that's *not* what encode() returned, it's just how psql chose to print it. One way to write what encode() really returned is regression=# select octet_length(E'\\000\x01\x7F\\200'::text); octet_length -------------- 10 (1 row) regression=# select decode(E'\\000\x01\x7F\\200'::text, 'escape'); decode ------------ \x00017f80 (1 row) > Second, as I was poking around the code, I found out about the > "bytea_output". If I set it to "escape", I still get hexadecimals. Is > that expected ? Yes, because encode()'s output is type text and hence not subject to that setting. If you were looking for a way to control what psql does with these bytes, you'd have to look into its commands, probably \pset. (I don't think there is a way to control it, but if there was, that's where we'd put it.) regards, tom lane