Thread: BUG #16236: Invalid escape encoding

BUG #16236: Invalid escape encoding

From
PG Bug reporting form
Date:
The following bug has been logged on the website:

Bug reference:      16236
Logged by:          Stéphane Campinas
Email address:      stephane.campinas@gmail.com
PostgreSQL version: 12.1
Operating system:   linux 5.4.14
Description:

Hi,

From the documentation [0] about the encode function, the "escape" format
should "convert zero bytes and high-bit-set bytes to octal sequences (\nnn)
and doubles backslashes."
However, executing "select encode(E'aaa\bccc', 'escape');" outputs
"aaa\x08ccc", although according to the documentation I should get
"aaa\010ccc". This was found in [1] while mimicking the output of encode,
where you can get more details.

Cheers,

[0] https://www.postgresql.org/docs/12/functions-string.html
[1] https://github.com/crate/crate/pull/9540


Re: BUG #16236: Invalid escape encoding

From
Tom Lane
Date:
PG Bug reporting form <noreply@postgresql.org> writes:
> From the documentation [0] about the encode function, the "escape" format
> should "convert zero bytes and high-bit-set bytes to octal sequences (\nnn)
> and doubles backslashes."
> However, executing "select encode(E'aaa\bccc', 'escape');" outputs
> "aaa\x08ccc", although according to the documentation I should get
> "aaa\010ccc".

No, I don't think so.  The \b gives rise to a byte with hex value 08
(that is, control-H or backspace) in the E'' literal, which converts
to the same byte value in the bytea value that gets passed to
encode().  Since that's not either a zero or a high-bit-set value,
encode() just repeats it literally in the text result, and you end
up with the same thing as if you'd just done

=# select E'aaa\bccc'::text;     
    text    
------------
 aaa\x08ccc
(1 row)

I think it must be psql itself that's choosing to represent the
backspace as \x08, because nothing in the backend does that.
(pokes around ... yeah, it's pg_wcsformat() that's doing it)

You could certainly make an argument that encode() ought to
backslashify all ASCII control characters, not only \0.  But
it's behaving as documented, AFAICS.

            regards, tom lane



BUG #16236: Invalid escape encoding

From
Stéphane Campinas
Date:
Thanks Tom for the reply!

I read once more the doc and now I better understand the "high-bit-set
value" part ;o)

        myDatabaseName=# select encode('\x00017F80', 'escape');
              encode
        ------------------
         \000\x01\x7F\200

If I understand correctly, with the input "\x00017F80", I get the
outputted value above because:
- "00" is converted to "\000"
- "01" and "7F" get converted to "\x01" and "\x7F" respectively as they
  are not 0 or a high-bit-set value
- "80" is converted to "\200" since it is a high-bit-set value

I remember getting confused by the fact I got hexadecimal values in
output and I didn't really get the "high-bit-set" part of the doc.

Do you know why there is this distinction between high-bit-set values
and other non-printable characters ?

Also, I still have 2 more questions.

First, the following is strange: I cannot decode what the encode method
returned

        myDatabaseName=# select encode('\x00017F80', 'escape');
              encode
        ------------------
         \000\x01\x7F\200
        (1 row)

        myDatabaseName=# select decode('\000\x01\x7F\200', 'escape');
        ERROR:  invalid input syntax for type bytea

Second, as I was poking around the code, I found out about the
"bytea_output". If I set it to "escape", I still get hexadecimals. Is
that expected ?

        myDatabaseName=# set bytea_output to escape;
        SET
        myDatabaseName=# select encode('\x00017F80', 'escape');
              encode
        ------------------
         \000\x01\x7F\200
        (1 row)

Cheers,

On Mon, Jan 27, 2020 at 06:05:45PM -0500, Tom Lane wrote:
> PG Bug reporting form <noreply@postgresql.org> writes:
> > From the documentation [0] about the encode function, the "escape" format
> > should "convert zero bytes and high-bit-set bytes to octal sequences (\nnn)
> > and doubles backslashes."
> > However, executing "select encode(E'aaa\bccc', 'escape');" outputs
> > "aaa\x08ccc", although according to the documentation I should get
> > "aaa\010ccc".
>
> No, I don't think so.  The \b gives rise to a byte with hex value 08
> (that is, control-H or backspace) in the E'' literal, which converts
> to the same byte value in the bytea value that gets passed to
> encode().  Since that's not either a zero or a high-bit-set value,
> encode() just repeats it literally in the text result, and you end
> up with the same thing as if you'd just done
>
> =# select E'aaa\bccc'::text;     
>     text   
> ------------
>  aaa\x08ccc
> (1 row)
>
> I think it must be psql itself that's choosing to represent the
> backspace as \x08, because nothing in the backend does that.
> (pokes around ... yeah, it's pg_wcsformat() that's doing it)
>
> You could certainly make an argument that encode() ought to
> backslashify all ASCII control characters, not only \0.  But
> it's behaving as documented, AFAICS.
>
>                       regards, tom lane

--
Campinas Stéphane


--
Campinas Stéphane
Attachment

Re: BUG #16236: Invalid escape encoding

From
Tom Lane
Date:
[ please keep the list cc'd ]

=?utf-8?B?U3TDqXBoYW5l?= Campinas <stephane.campinas@gmail.com> writes:
>         myDatabaseName=# select encode('\x00017F80', 'escape');
>               encode
>         ------------------
>          \000\x01\x7F\200

> If I understand correctly, with the input "\x00017F80", I get the
> outputted value above because:
> - "00" is converted to "\000"
> - "01" and "7F" get converted to "\x01" and "\x7F" respectively as they
>   are not 0 or a high-bit-set value
> - "80" is converted to "\200" since it is a high-bit-set value

The point here is that the encode function is only doing the first
and last of those things.  It lets the 01 and 7F bytes through
as-is, because the text data type can store and transport those
just fine.  It's psql's table-printing code that is deciding that
those bytes are nonprintable and then choosing to render them in
the \x01 style.  (The large distance between those bits of code
helps to explain the inconsistency of style.)

> Do you know why there is this distinction between high-bit-set values
> and other non-printable characters ?

Probably, whoever wrote the encode-as-escape code didn't see a need
to escape anything that type text could store without it.  That code's
old enough that it might predate psql's decision to render control
characters this way, too.  (Type text won't store zero bytes, and
it will only accept high-bit-set bytes if they form part of a
validly encoded character, which limits the allowed sequences if the
database encoding is, say, UTF8.  So those cases *have* to be escaped
in order to turn any valid bytea into a valid text object.)

There's certainly an argument to be made that it'd be more friendly
for encode() to escape these other byte values as well.  But the
code is operating as designed.

> First, the following is strange: I cannot decode what the encode method
> returned
>         myDatabaseName=# select encode('\x00017F80', 'escape');
>               encode
>         ------------------
>          \000\x01\x7F\200
>         (1 row)
>         myDatabaseName=# select decode('\000\x01\x7F\200', 'escape');
>         ERROR:  invalid input syntax for type bytea

That's because that's *not* what encode() returned, it's just how
psql chose to print it.  One way to write what encode() really
returned is

regression=# select octet_length(E'\\000\x01\x7F\\200'::text);
 octet_length 
--------------
           10
(1 row)

regression=# select decode(E'\\000\x01\x7F\\200'::text, 'escape');
   decode   
------------
 \x00017f80
(1 row)

> Second, as I was poking around the code, I found out about the
> "bytea_output". If I set it to "escape", I still get hexadecimals. Is
> that expected ?

Yes, because encode()'s output is type text and hence not subject
to that setting.  If you were looking for a way to control what
psql does with these bytes, you'd have to look into its commands,
probably \pset.  (I don't think there is a way to control it,
but if there was, that's where we'd put it.)

            regards, tom lane