Thread: backslash-dot quoting in COPY CSV

backslash-dot quoting in COPY CSV

From
"Daniel Verite"
Date:
 Hi,

The doc on COPY CSV says about the backslash-dot sequence:

  To avoid any misinterpretation, a \. data value appearing as a
  lone entry on a line is automatically quoted on output, and on
  input, if quoted, is not interpreted as the end-of-data marker

However this quoting does not happen when \. is already part
of a quoted field. Example:

COPY (select 'somevalue', E'foo\n\\.\nbar') TO STDOUT CSV;

outputs:

somevalue,"foo
\.
bar"

which conforms to the CSV rules, by which we are not allowed
to replace \. by anything AFAICS.
The trouble is, when trying to import this back with COPY FROM,
it will error out at the backslash-dot and not import anything.
Furthermore, if these data are meant to be embedded into a
script, it creates a potential risk of SQL injection.

It is a known issue? I haven't found previous discussions on this.
It looks to me like the ability of backslash-dot to be an end-of-data
marker should be neutralizable for CSV. When the data is not embedded,
it's not needed anyway, and when it's embedded, we could surely think
of alternatives.


Best regards,
--
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite


Re: backslash-dot quoting in COPY CSV

From
Bruce Momjian
Date:
On Wed, Jan  2, 2019 at 04:58:35PM +0100, Daniel Verite wrote:
>  Hi,
> 
> The doc on COPY CSV says about the backslash-dot sequence:
> 
>   To avoid any misinterpretation, a \. data value appearing as a
>   lone entry on a line is automatically quoted on output, and on
>   input, if quoted, is not interpreted as the end-of-data marker
> 
> However this quoting does not happen when \. is already part
> of a quoted field. Example:
> 
> COPY (select 'somevalue', E'foo\n\\.\nbar') TO STDOUT CSV;
> 
> outputs:
> 
> somevalue,"foo
> \.
> bar"
> 
> which conforms to the CSV rules, by which we are not allowed
> to replace \. by anything AFAICS.
> The trouble is, when trying to import this back with COPY FROM,
> it will error out at the backslash-dot and not import anything.
> Furthermore, if these data are meant to be embedded into a
> script, it creates a potential risk of SQL injection.
> 
> It is a known issue? I haven't found previous discussions on this.
> It looks to me like the ability of backslash-dot to be an end-of-data
> marker should be neutralizable for CSV. When the data is not embedded,
> it's not needed anyway, and when it's embedded, we could surely think
> of alternatives.

I was unable to reproduce the failure here using files:

    CREATE TABLE test (x TEXT);
    INSERT INTO test VALUES (E'foo\n\\.\nbar');

    COPY test TO STDOUT CSV;
    "foo
    \.
    bar"

    COPY test TO '/u/postgres/tmp/x' CSV;
    
    COPY test FROM '/u/postgres/tmp/x' CSV;

    SELECT * FROM test;
      x
    -----
     foo+
     \. +
     bar
     foo+
     \. +
     bar

but I am able to see the failure using STDIN:

    COPY test FROM STDIN CSV;
    Enter data to be copied followed by a newline.
    End with a backslash and a period on a line by itself, or an EOF signal.
    "foo
    \.
    ERROR:  unterminated CSV quoted field
    CONTEXT:  COPY test, line 1: ""foo

This seems like a bug to me.  Looking at the code, psql issues the
prompts for STDIN, but when it sees \. alone on a line, it has no idea
you are in a quoted CSV string, so it thinks the copy is done and sends
the result to the server.  I can't see an easy way to fix this.  I guess
we could document it.

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +


Re: backslash-dot quoting in COPY CSV

From
"Daniel Verite"
Date:
    Bruce Momjian wrote:

> but I am able to see the failure using STDIN:
>
>         COPY test FROM STDIN CSV;
>         Enter data to be copied followed by a newline.
>         End with a backslash and a period on a line by itself, or an EOF
> signal.
>         "foo
>         \.
>         ERROR:  unterminated CSV quoted field
>         CONTEXT:  COPY test, line 1: ""foo
>
> This seems like a bug to me.  Looking at the code, psql issues the
> prompts for STDIN, but when it sees \. alone on a line, it has no idea
> you are in a quoted CSV string, so it thinks the copy is done and sends
> the result to the server.  I can't see an easy way to fix this.  I guess
> we could document it.

Thanks for looking into this.

\copy from file with csv is also affected since it uses COPY FROM
STDIN behind the scene. The case of embedded data looks more worrying
because psql will execute the data following \. as if they were
SQL statements.

ISTM that only ON_ERROR_STOP=on prevents the risk of SQL injection
in that scenario, but it's off by default.

What about this idea: when psql is feeding COPY data from its command
stream and an error occurs, it should act as if ON_ERROR_STOP was "on"
even if it's not.


Best regards,
--
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite


Re: backslash-dot quoting in COPY CSV

From
Michael Paquier
Date:
On Thu, Jan 24, 2019 at 10:09:30PM -0500, Bruce Momjian wrote:
> This seems like a bug to me.  Looking at the code, psql issues the
> prompts for STDIN, but when it sees \. alone on a line, it has no idea
> you are in a quoted CSV string, so it thinks the copy is done and sends
> the result to the server.  I can't see an easy way to fix this.  I guess
> we could document it.

In src/bin/psql/copy.c, handleCopyIn():

/*
 * This code erroneously assumes '\.' on a line alone
 * inside a quoted CSV string terminates the \copy.
 * http://www.postgresql.org/message-id/E1TdNVQ-0001ju-GO@wrigleys.postgresql.org
 */
if (strcmp(buf, "\\.\n") == 0 ||
    strcmp(buf, "\\.\r\n") == 0)
{
    copydone = true;
    break;
}

This story pops up from time to time..
--
Michael

Attachment

Re: backslash-dot quoting in COPY CSV

From
"Daniel Verite"
Date:
    Michael Paquier wrote:

> In src/bin/psql/copy.c, handleCopyIn():
>
> /*
> * This code erroneously assumes '\.' on a line alone
> * inside a quoted CSV string terminates the \copy.
> *
> http://www.postgresql.org/message-id/E1TdNVQ-0001ju-GO@wrigleys.postgresql.org
> */
> if (strcmp(buf, "\\.\n") == 0 ||
>    strcmp(buf, "\\.\r\n") == 0)
> {
>    copydone = true;
>    break;
> }

Indeed, it's exactly that problem.
And there's the related problem that it derails the input stream
in a way that lines of data become commands, but that one is
not specific to that particular error.

For the backslash-dot in a quoted string, the root cause is
that psql is not aware that the contents are CSV so it can't
parse them properly.
I can think of several ways of working around that, more or less
inelegant:

- the end of data could be expressed as a length (in number of lines
for instance) instead of an in-data marker.

- the end of data could be configurable, as in the MIME structure of
multipart mail messages, where a part is ended by a "boundary",
line, generally a long randomly generated string. This boundary
would have to be known to psql through setting a dedicated
variable or command.

- COPY as the SQL command could have the boundary option
for data fed through its STDIN. This could neutralize the
special role of backslash-dot in general, not just in quoted fields,
since the necessity to quote backslash-dot is a wart anyway.

- psql could be told somehow that the next piece of inline data is in
the CSV format, and then pass it through a CSV parser.


Best regards,
--
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite


Re: backslash-dot quoting in COPY CSV

From
Bruce Momjian
Date:
On Sun, Jan 27, 2019 at 10:10:36PM +0900, Michael Paquier wrote:
> On Thu, Jan 24, 2019 at 10:09:30PM -0500, Bruce Momjian wrote:
> > This seems like a bug to me.  Looking at the code, psql issues the
> > prompts for STDIN, but when it sees \. alone on a line, it has no idea
> > you are in a quoted CSV string, so it thinks the copy is done and sends
> > the result to the server.  I can't see an easy way to fix this.  I guess
> > we could document it.
> 
> In src/bin/psql/copy.c, handleCopyIn():
> 
> /*
>  * This code erroneously assumes '\.' on a line alone
>  * inside a quoted CSV string terminates the \copy.
>  * http://www.postgresql.org/message-id/E1TdNVQ-0001ju-GO@wrigleys.postgresql.org
>  */
> if (strcmp(buf, "\\.\n") == 0 ||
>     strcmp(buf, "\\.\r\n") == 0)
> {
>     copydone = true;
>     break;
> }
> 
> This story pops up from time to time..

The killer is I committed this C comment six years ago, and didn't
remember it.  :-O

    commit 361b94c4b98b85b19b850cff37be76d1f6d4f8f7
    Author: Bruce Momjian <bruce@momjian.us>
    Date:   Thu Jul 4 13:09:52 2013 -0400
    
        Add C comment about \copy bug in CSV mode
        Comment: This code erroneously assumes '\.' on a line alone inside a
        quoted CSV string terminates the \copy.
        http://www.postgresql.org/message-id/E1TdNVQ-0001ju-GO@wrigleys.postgresql.org

Glad I mentioned the URL, at least.

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +


Re: backslash-dot quoting in COPY CSV

From
Bruce Momjian
Date:
On Fri, Jan 25, 2019 at 01:01:22PM +0100, Daniel Verite wrote:
>     Bruce Momjian wrote:
> 
> > but I am able to see the failure using STDIN:
> > 
> >         COPY test FROM STDIN CSV;
> >         Enter data to be copied followed by a newline.
> >         End with a backslash and a period on a line by itself, or an EOF
> > signal.
> >         "foo
> >         \.
> >         ERROR:  unterminated CSV quoted field
> >         CONTEXT:  COPY test, line 1: ""foo
> > 
> > This seems like a bug to me.  Looking at the code, psql issues the
> > prompts for STDIN, but when it sees \. alone on a line, it has no idea
> > you are in a quoted CSV string, so it thinks the copy is done and sends
> > the result to the server.  I can't see an easy way to fix this.  I guess
> > we could document it.
> 
> Thanks for looking into this.  
> 
> \copy from file with csv is also affected since it uses COPY FROM
> STDIN behind the scene. The case of embedded data looks more worrying
> because psql will execute the data following \. as if they were
> SQL statements.
> 
> ISTM that only ON_ERROR_STOP=on prevents the risk of SQL injection
> in that scenario, but it's off by default.

You are correct that someone having data that is SQL commands would be
able to perhaps execute those commands on restore.  pg_dump doesn't use
CSV, and this only affects STDIN, not files or PROGRAM input.  I think
the question is how many people are using CSV/STDIN for insecure data
loads?

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +


Re: backslash-dot quoting in COPY CSV

From
Bruce Momjian
Date:
On Mon, Jan 28, 2019 at 04:06:17PM +0100, Daniel Verite wrote:
>     Michael Paquier wrote:
> 
> > In src/bin/psql/copy.c, handleCopyIn():
> > 
> > /*
> > * This code erroneously assumes '\.' on a line alone
> > * inside a quoted CSV string terminates the \copy.
> > *
> > http://www.postgresql.org/message-id/E1TdNVQ-0001ju-GO@wrigleys.postgresql.org
> > */
> > if (strcmp(buf, "\\.\n") == 0 ||
> >    strcmp(buf, "\\.\r\n") == 0)
> > {
> >    copydone = true;
> >    break;
> > }
> 
> Indeed, it's exactly that problem.
> And there's the related problem that it derails the input stream
> in a way that lines of data become commands, but that one is
> not specific to that particular error.
> 
> For the backslash-dot in a quoted string, the root cause is
> that psql is not aware that the contents are CSV so it can't
> parse them properly.
> I can think of several ways of working around that, more or less
> inelegant:
> 
> - the end of data could be expressed as a length (in number of lines
> for instance) instead of an in-data marker.
> 
> - the end of data could be configurable, as in the MIME structure of
> multipart mail messages, where a part is ended by a "boundary",
> line, generally a long randomly generated string. This boundary
> would have to be known to psql through setting a dedicated
> variable or command.
> 
> - COPY as the SQL command could have the boundary option
> for data fed through its STDIN. This could neutralize the
> special role of backslash-dot in general, not just in quoted fields,
> since the necessity to quote backslash-dot is a wart anyway.

Well, these all kind of require a change to the COPY format, which
hasn't changed in many years.

> - psql could be told somehow that the next piece of inline data is in
> the CSV format, and then pass it through a CSV parser.

That might be the cleanest solution, but how would we actually input
multi-line data in CSV mode with \. alone on a line?

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +


Re: backslash-dot quoting in COPY CSV

From
"Daniel Verite"
Date:
    Bruce Momjian wrote:

> > - the end of data could be expressed as a length (in number of lines
> > for instance) instead of an in-data marker.
> >
> > - the end of data could be configurable, as in the MIME structure of
> > multipart mail messages, where a part is ended by a "boundary",
> > line, generally a long randomly generated string. This boundary
> > would have to be known to psql through setting a dedicated
> > variable or command.
> >
> > - COPY as the SQL command could have the boundary option
> > for data fed through its STDIN. This could neutralize the
> > special role of backslash-dot in general, not just in quoted fields,
> > since the necessity to quote backslash-dot is a wart anyway.
>
> Well, these all kind of require a change to the COPY format, which
> hasn't changed in many years.

Not for the first two. As an example of solution #2, it could look like this:

=# \set INLINE_COPY_BOUNDARY ==JuQW3gc2mQjXuvmJ32TlOLhJ3F2Eh2LcsBup0oH7==
=# COPY table FROM STDIN CSV;
somevalue,"foo
\.
bar"
==JuQW3gc2mQjXuvmJ32TlOLhJ3F2Eh2LcsBup0oH7==

Instead of looking for \. on a line by itself, psql would look for the
boundary to know where the data ends.
The boundary is not transmitted to the server, it has no need to know
about it.

> > - psql could be told somehow that the next piece of inline data is in
> > the CSV format, and then pass it through a CSV parser.
>
> That might be the cleanest solution, but how would we actually input
> multi-line data in CSV mode with \. alone on a line?

With this solution, the content doesn't change at all.
The weird part would be the user interface, because the information
psql needs is not only "CSV", it's also the options DELIMITER, QUOTE,
ESCAPE and possibly ENCODING. Currently it doesn't know any of these,
they're passed to the server in an opaque, unparsed form within
the COPY command.

Personally, the solution I find cleaner is the server side not having
any end-of-data marker for CSV. So backslash-dot would never be
special. psql could allow for a custom ending boundary for in-script
data, and users could set that to backslash-dot if they want, but that
would be their choice.
That would be clearly not backward compatible, and I believe it wouldn't
work with the v2 protocol, so I'm not sure it would have much chance of
approval.


Best regards,
--
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite


Re: backslash-dot quoting in COPY CSV

From
Bruce Momjian
Date:
On Wed, Jan 30, 2019 at 06:32:11PM +0100, Daniel Verite wrote:
>     Bruce Momjian wrote:
> > Well, these all kind of require a change to the COPY format, which
> > hasn't changed in many years.
> 
> Not for the first two. As an example of solution #2, it could look like this:
> 
> =# \set INLINE_COPY_BOUNDARY ==JuQW3gc2mQjXuvmJ32TlOLhJ3F2Eh2LcsBup0oH7==
> =# COPY table FROM STDIN CSV;
> somevalue,"foo
> \.
> bar"
> ==JuQW3gc2mQjXuvmJ32TlOLhJ3F2Eh2LcsBup0oH7==
> 
> Instead of looking for \. on a line by itself, psql would look for the
> boundary to know where the data ends.
> The boundary is not transmitted to the server, it has no need to know
> about it.

Wow, that is an odd API, as you stated below.

> > > - psql could be told somehow that the next piece of inline data is in
> > > the CSV format, and then pass it through a CSV parser.
> > 
> > That might be the cleanest solution, but how would we actually input
> > multi-line data in CSV mode with \. alone on a line?
> 
> With this solution, the content doesn't change at all.
> The weird part would be the user interface, because the information
> psql needs is not only "CSV", it's also the options DELIMITER, QUOTE,
> ESCAPE and possibly ENCODING. Currently it doesn't know any of these,
> they're passed to the server in an opaque, unparsed form within
> the COPY command.
> 
> Personally, the solution I find cleaner is the server side not having
> any end-of-data marker for CSV. So backslash-dot would never be
> special. psql could allow for a custom ending boundary for in-script
> data, and users could set that to backslash-dot if they want, but that
> would be their choice.
> That would be clearly not backward compatible, and I believe it wouldn't
> work with the v2 protocol, so I'm not sure it would have much chance of
> approval.

I had forgotten that the DELIMITER and QUOTE can be changed --- that
kills the idea of adding a simple CSV parser into psql because we would
have to parse the COPY SQL command as well.

I am wondering if we should just disallow CSV from STDIN, on security
grounds.  How big a problem would that be for people?  Would we have to
disable to STDOUT as well since it could not be restored?  Should we
issue some kind of security warning in such cases?  Should we document
this?

In hindsight, I am not sure how we could have designed this more
securly.  I guess we could have required some special text to start all
CSV continuation lines that were not end-of-file, but that would have
been very unportable, which is the goal of CSV.

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +


Re: backslash-dot quoting in COPY CSV

From
Pavel Stehule
Date:


st 30. 1. 2019 18:51 odesílatel Bruce Momjian <bruce@momjian.us> napsal:
On Wed, Jan 30, 2019 at 06:32:11PM +0100, Daniel Verite wrote:
>       Bruce Momjian wrote:
> > Well, these all kind of require a change to the COPY format, which
> > hasn't changed in many years.
>
> Not for the first two. As an example of solution #2, it could look like this:
>
> =# \set INLINE_COPY_BOUNDARY ==JuQW3gc2mQjXuvmJ32TlOLhJ3F2Eh2LcsBup0oH7==
> =# COPY table FROM STDIN CSV;
> somevalue,"foo
> \.
> bar"
> ==JuQW3gc2mQjXuvmJ32TlOLhJ3F2Eh2LcsBup0oH7==
>
> Instead of looking for \. on a line by itself, psql would look for the
> boundary to know where the data ends.
> The boundary is not transmitted to the server, it has no need to know
> about it.

Wow, that is an odd API, as you stated below.

> > > - psql could be told somehow that the next piece of inline data is in
> > > the CSV format, and then pass it through a CSV parser.
> >
> > That might be the cleanest solution, but how would we actually input
> > multi-line data in CSV mode with \. alone on a line?
>
> With this solution, the content doesn't change at all.
> The weird part would be the user interface, because the information
> psql needs is not only "CSV", it's also the options DELIMITER, QUOTE,
> ESCAPE and possibly ENCODING. Currently it doesn't know any of these,
> they're passed to the server in an opaque, unparsed form within
> the COPY command.
>
> Personally, the solution I find cleaner is the server side not having
> any end-of-data marker for CSV. So backslash-dot would never be
> special. psql could allow for a custom ending boundary for in-script
> data, and users could set that to backslash-dot if they want, but that
> would be their choice.
> That would be clearly not backward compatible, and I believe it wouldn't
> work with the v2 protocol, so I'm not sure it would have much chance of
> approval.

I had forgotten that the DELIMITER and QUOTE can be changed --- that
kills the idea of adding a simple CSV parser into psql because we would
have to parse the COPY SQL command as well.

I am wondering if we should just disallow CSV from STDIN, on security
grounds.  How big a problem would that be for people?  Would we have to
disable to STDOUT as well since it could not be restored?  Should we
issue some kind of security warning in such cases?  Should we document
this?

it is pretty common pattern for etl, copy from stdin. I am thinking it can be big problem



In hindsight, I am not sure how we could have designed this more
securly.  I guess we could have required some special text to start all
CSV continuation lines that were not end-of-file, but that would have
been very unportable, which is the goal of CSV.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +

Re: backslash-dot quoting in COPY CSV

From
Tom Lane
Date:
Pavel Stehule <pavel.stehule@gmail.com> writes:
> st 30. 1. 2019 18:51 odesílatel Bruce Momjian <bruce@momjian.us> napsal:
>> I am wondering if we should just disallow CSV from STDIN, on security
>> grounds.  How big a problem would that be for people?  Would we have to
>> disable to STDOUT as well since it could not be restored?  Should we
>> issue some kind of security warning in such cases?  Should we document
>> this?

> it is pretty common pattern for etl, copy from stdin. I am thinking it can
> be big problem

Given how long we've had COPY CSV support, and the tiny number of
complaints to date, I do not think it's something to panic over.
Disallowing the functionality altogether is surely an overreaction.

I don't really see an argument for calling it a security problem,
given that pg_dump doesn't use CSV and it isn't the default for
anything else either.  Sure, you can imagine some bad actor hoping
to cause problems by putting crafted data into a table, but how
does that data end up in a script that's using COPY CSV FROM STDIN
(as opposed to copying out-of-line data)?  It's a bit far-fetched.

A documentation warning might be the appropriate response.  I don't
see any plausible way for psql to actually fix the problem, short
of a protocol change to allow the backend to tell it how the data
stream is going to be parsed.

            regards, tom lane


Re: backslash-dot quoting in COPY CSV

From
Michael Paquier
Date:
On Wed, Jan 30, 2019 at 01:20:59PM -0500, Tom Lane wrote:
> Given how long we've had COPY CSV support, and the tiny number of
> complaints to date, I do not think it's something to panic over.
> Disallowing the functionality altogether is surely an overreaction.
>
> A documentation warning might be the appropriate response.  I don't
> see any plausible way for psql to actually fix the problem, short
> of a protocol change to allow the backend to tell it how the data
> stream is going to be parsed.

Yes, agreed.  I looked at this problem a couple of months (year(s)?)
ago and gave up on designing a clear portable solution after a couple
of hours over it, and that's quite a corner case.
--
Michael

Attachment