Thread: possible bug in COPY ... FROM ... NULL '\0'

possible bug in COPY ... FROM ... NULL '\0'

From
Jeff Davis
Date:
The documentation says:

"The specified null string is sent by COPY TO without adding any
backslashes; conversely, COPY FROM matches the input against the null
string before removing backslashes."
  -- http://www.postgresql.org/docs/devel/static/sql-copy.html

Those seem inconsistent with the following behavior:

  postgres=# copy foo to '/tmp/a.copy' null '\0';
  COPY 2
  postgres=# copy foo from '/tmp/a.copy' null '\0';
  ERROR:  invalid byte sequence for encoding "UTF8": 0x00
  CONTEXT:  COPY foo, line 2: "\0"

  $ cat /tmp/a.copy
  1
  \0

COPY TO seems to follow the documentation, inserting the null string
without modification into the output file. COPY FROM seems to de-escape
the input before trying to match it against the null string, leading to
the invalid byte sequence.

standard_conforming_strings is on.

Regards,
    Jeff Davis

Re: possible bug in COPY ... FROM ... NULL '\0'

From
Tom Lane
Date:
Jeff Davis <pgsql@j-davis.com> writes:
> The documentation says:
> "The specified null string is sent by COPY TO without adding any
> backslashes; conversely, COPY FROM matches the input against the null
> string before removing backslashes."
>   -- http://www.postgresql.org/docs/devel/static/sql-copy.html

> Those seem inconsistent with the following behavior:

>   postgres=# copy foo to '/tmp/a.copy' null '\0';
>   COPY 2
>   postgres=# copy foo from '/tmp/a.copy' null '\0';
>   ERROR:  invalid byte sequence for encoding "UTF8": 0x00
>   CONTEXT:  COPY foo, line 2: "\0"

>   $ cat /tmp/a.copy
>   1
>   \0

> COPY TO seems to follow the documentation, inserting the null string
> without modification into the output file. COPY FROM seems to de-escape
> the input before trying to match it against the null string, leading to
> the invalid byte sequence.

This seems to have worked as per the documentation before 8.4.  I guess
it got broken in one of the efficiency-driven rewrites of the COPY
logic.  Specifically, the issue seems to be that CopyReadAttributesText
tries to combine de-escaping with locating the end-of-field, so it does
all the de-escaping --- and the ensuing encoding validity check ---
before checking to see if the input matches the defined null string.
I'm pretty sure that the original code first broke the line into fields,
then checked the null marker, then did de-escaping.

Fortunately, it seems pretty easy to fix: we just have to reverse the
order of the null-marker check and the pg_verifymbstr call.  However,
this is dependent on the parsing loop never having a reason to throw
error, which at the very least is something that needs a comment.

            regards, tom lane